CN115129949A

CN115129949A - Vector range retrieval method, device, equipment, medium and program product

Info

Publication number: CN115129949A
Application number: CN202210758402.2A
Authority: CN
Inventors: 谢超; 许维芷; 程倩雅; 易小萌
Original assignee: Shanghai Xuyu Intelligent Technology Co ltd
Current assignee: Shanghai Xuyu Intelligent Technology Co ltd
Priority date: 2022-06-30
Filing date: 2022-06-30
Publication date: 2022-09-30

Abstract

The application relates to the field of data retrieval, and discloses a vector range retrieval method, a device, equipment, a medium and a program product. The vector range retrieval method comprises the following steps: acquiring a query vector and a retrieval radius; loading the graph index and the compressed vector from a first storage area with slow read/write speed to a second storage area with fast read/write speed; determining a target representative vector closest to the query vector in the graph index according to the query vector and the retrieval radius in the second storage area; acquiring a compressed vector corresponding to the target representative vector in a second storage area, and determining the target compressed vector according to the distance between the query vector and the compressed vector; and acquiring an original vector corresponding to the target compressed backward vector from the first storage area according to the target compressed vector, and determining the target original vector according to the distance between the query vector and the original vector, so that the speed and efficiency of vector range retrieval can be improved, and the accuracy of vector range retrieval is improved.

Description

Method, apparatus, apparatus, medium and program product for vector range retrieval

技术领域technical field

本申请涉及数据检索领域，特别涉及一种向量范围检索的方法、装置、设备、介质及程序产品。The present application relates to the field of data retrieval, and in particular, to a method, apparatus, device, medium and program product for vector range retrieval.

背景技术Background technique

当前，信息检索技术在图像检索、机器学习、数据挖掘等领域得到广泛应用。通过将信息抽象为高维度的特征向量，信息之间的相似度可以量化为特征向量之间的距离，信息检索可以转化为向量空间中的最近邻检索(Nearest Neighbor Search，NNS)。At present, information retrieval technology is widely used in image retrieval, machine learning, data mining and other fields. By abstracting information into high-dimensional feature vectors, the similarity between information can be quantified as the distance between feature vectors, and information retrieval can be transformed into Nearest Neighbor Search (NNS) in vector space.

由于在海量、高维度的向量检索场景中最近邻检索算法的性能会急剧降低，通常使用近似最近邻检索(Approximate Nearest Neighbor Search，ANNS)作为替代算法，近似最近邻检索能够在可接受范围的精确度损失下，快速得到与查询向量相似的多个结果向量。Since the performance of the nearest neighbor retrieval algorithm will drop sharply in the massive and high-dimensional vector retrieval scenarios, Approximate Nearest Neighbor Search (ANNS) is usually used as an alternative algorithm. Approximate nearest neighbor retrieval can be accurate within an acceptable range. Under the loss of degree, multiple result vectors similar to the query vector are quickly obtained.

近似最近邻检索可以分为KNN检索和范围检索(Range Search)，KNN检索固定返回K个与查询向量最接近的结果向量，范围检索则通过指定检索半径，返回以查询向量为中心，与查询向量的距离在检索半径内的所有向量，范围检索返回的结果向量的数目不确定。Approximate nearest neighbor retrieval can be divided into KNN retrieval and range retrieval. KNN retrieval fixedly returns K result vectors that are closest to the query vector. The distance retrieves all vectors within the radius, and the range retrieval returns an indeterminate number of result vectors.

现有范围检索的方法通常是在KNN的基础上通过增大K值，即增加返回结果向量的数目来实现，由于不同的检索得到的结果向量的数目变化很大，很难为范围检索确定一个固定的K值，如果设置一个较小的K值，在结果向量的数目大于K值时，导致返回结果不完整，会降低准确率，如果设置一个较大的K值，虽然能够得到较完整的结果向量，但是在结果向量的数目较小甚至为零的情况下，会浪费大量的检索时间，导致检索效率降低。The existing range retrieval methods are usually implemented by increasing the K value on the basis of KNN, that is, increasing the number of returned result vectors. Since the number of result vectors obtained by different retrievals varies greatly, it is difficult to determine a fixed range for range retrieval. If you set a smaller K value, when the number of result vectors is greater than the K value, the returned result will be incomplete, which will reduce the accuracy. If you set a larger K value, although a more complete result can be obtained However, when the number of result vectors is small or even zero, a lot of retrieval time will be wasted, resulting in lower retrieval efficiency.

发明内容SUMMARY OF THE INVENTION

本申请实施例提供了一种向量范围检索的方法、装置、设备、介质及程序产品，用于解决现有技术中进行向量的范围检索时检索结果不完整或检索效率低的问题。The embodiments of the present application provide a method, apparatus, device, medium and program product for vector range retrieval, which are used to solve the problems of incomplete retrieval results or low retrieval efficiency when performing vector range retrieval in the prior art.

第一方面，本申请实施例提供了一种向量范围检索的方法，用于电子设备，该方法包括：In a first aspect, an embodiment of the present application provides a vector range retrieval method, which is used in an electronic device, and the method includes:

获取查询向量和检索半径；Get the query vector and retrieval radius;

将图索引和压缩后向量从第一存储区加载到第二存储区，第一存储区的读/写速度小于第二存储区；Load the graph index and the compressed vector from the first storage area to the second storage area, and the read/write speed of the first storage area is lower than that of the second storage area;

在第二存储区至少根据查询向量和检索半径，确定图索引中与查询向量距离最接近的至少一个目标代表向量；In the second storage area, at least one target representative vector whose distance is closest to the query vector in the graph index is determined according to at least the query vector and the retrieval radius;

在第二存储区获取至少一个目标代表向量对应的压缩后向量，并至少根据查询向量与压缩后向量之间的距离确定目标压缩后向量；Obtain a compressed vector corresponding to at least one target representative vector in the second storage area, and determine the target compressed vector at least according to the distance between the query vector and the compressed vector;

根据目标压缩后向量，从第一存储区获取与目标压缩后向量对应的原始向量，并根据查询向量与所述原始向量之间的距离确定目标原始向量。According to the target compressed vector, the original vector corresponding to the target compressed vector is obtained from the first storage area, and the target original vector is determined according to the distance between the query vector and the original vector.

可以理解，通常的向量范围检索的方法仅能通过增大结果向量的数目来实现，存在返回结果不完整或检索效率低的问题。然而通过本申请的向量范围检索方案，能够实现返回结果完整且检索效率较高的方案，避免现有方案的缺点。It can be understood that the usual method of vector range retrieval can only be realized by increasing the number of result vectors, and there are problems that the returned results are incomplete or the retrieval efficiency is low. However, through the vector range retrieval solution of the present application, a solution with complete returned results and high retrieval efficiency can be realized, avoiding the shortcomings of the existing solution.

在上述第一方面的一种可能的实现中，至少根据查询向量和检索半径，确定图索引中与查询向量距离最接近的至少一个目标代表向量，包括：In a possible implementation of the above-mentioned first aspect, at least one target representative vector in the graph index with the closest distance to the query vector is determined according to at least the query vector and the retrieval radius, including:

确定查询向量以及检索半径，并且进一步确定向量数据集的最大半径；Determine the query vector and retrieval radius, and further determine the maximum radius of the vector data set;

通过广度优先方式获取图索引中的代表向量，并确定与查询向量的距离小于检索半径与向量数据集的最大半径之和的至少一个目标代表向量。The representative vector in the graph index is obtained in a breadth-first manner, and at least one target representative vector whose distance from the query vector is smaller than the sum of the retrieval radius and the maximum radius of the vector dataset is determined.

通过上述方法，可以将包括可能的检索结果的所有代表向量都确定为目标代表向量，避免目标代表向量的数量不足导致遗漏可能的检索结果。Through the above method, all representative vectors including possible retrieval results can be determined as target representative vectors, so as to avoid omission of possible retrieval results due to insufficient number of target representative vectors.

在上述第一方面的一种可能的实现中，至少根据查询向量与压缩后向量之间的距离确定目标压缩后向量，包括：In a possible implementation of the above first aspect, the target compressed vector is determined at least according to the distance between the query vector and the compressed vector, including:

计算查询向量与压缩后向量之间的乘积量化距离；Calculate the product quantized distance between the query vector and the compressed vector;

确定乘积量化距离小于检索半径的预设倍数的目标压缩后向量。Determines the target compressed vector whose product quantization distance is less than a preset multiple of the retrieval radius.

通过上述方法，可以减少压缩后向量所占用的存储空间，同时提供了一种计算复杂度增加不大的距离计算方法，并且可以根据应用实践经验确定原始向量压缩后的信息损失情况，从而提高范围检索的准确率。Through the above method, the storage space occupied by the compressed vector can be reduced, and a distance calculation method with little increase in computational complexity is provided, and the information loss after compression of the original vector can be determined according to practical experience, thereby improving the range retrieval accuracy.

在上述第一方面的一种可能的实现中，根据查询向量与原始向量之间的距离确定目标原始向量，包括：In a possible implementation of the above first aspect, the target original vector is determined according to the distance between the query vector and the original vector, including:

计算查询向量与原始向量之间的距离；Calculate the distance between the query vector and the original vector;

确定距离小于检索半径的目标原始向量。Determines the target original vector whose distance is less than the retrieval radius.

通过上述方法，直接根据查询向量和检索半径确定检索结果，能够提高检索结果的准确性。Through the above method, the retrieval result is directly determined according to the query vector and the retrieval radius, which can improve the accuracy of the retrieval result.

在上述第一方面的一种可能的实现中，该方法还包括：In a possible implementation of the above-mentioned first aspect, the method further includes:

向第二存储区返回目标原始向量。Returns the destination original vector to the second storage area.

在上述第一方面的一种可能的实现中，返回目标原始向量，包括：In a possible implementation of the above-mentioned first aspect, the target original vector is returned, including:

将目标原始向量转换为非结构化数据，并返回非结构化数据。Convert the target raw vector to unstructured data and return unstructured data.

通过上述方法，可以将用户不易理解的向量转换为用户易于理解的文本、视频、音频等非结构化数据，方便用户阅读或识别，能够提高用户体验。Through the above method, a vector that is not easy for users to understand can be converted into unstructured data such as text, video, and audio that are easy for users to understand, which is convenient for users to read or recognize, and can improve user experience.

本申请实施例中提供了用于电子设备的向量范围检索的方法，该方法通过获取查询向量和检索半径，并将图索引和压缩后向量从读/写速度慢的第一存储区加载到读/写速度快的第二存储区，随后在第二存储区根据查询向量和检索半径确定图索引中与查询向量距离最接近的目标代表向量，并在第二存储区获取该目标代表向量对应的压缩后向量，再根据查询向量与压缩后向量之间的距离确定目标压缩后向量，最后根据目标压缩后向量从第一存储区获取与目标压缩后向量对应的原始向量，并根据查询向量与原始向量之间的距离确定目标原始向量，从而能够提高向量范围检索的速度和效率，同时提高向量范围检索的准确率。The embodiment of the present application provides a method for vector range retrieval of electronic devices. The method obtains a query vector and a retrieval radius, and loads the graph index and the compressed vector from the first storage area with slow read/write speed to the read/write speed. / The second storage area with fast writing speed, then determine the target representative vector in the graph index with the closest distance to the query vector in the second storage area according to the query vector and the retrieval radius, and obtain the corresponding target representative vector in the second storage area. The compressed vector, and then determine the target compressed vector according to the distance between the query vector and the compressed vector, and finally obtain the original vector corresponding to the target compressed vector from the first storage area according to the target compressed vector, and according to the query vector and the original vector. The distance between the vectors determines the target original vector, so that the speed and efficiency of vector range retrieval can be improved, and the accuracy of vector range retrieval can be improved at the same time.

第二方面，本申请实施例提供了一种构建向量索引的方法，用于电子设备，该方法包括：In a second aspect, an embodiment of the present application provides a method for constructing a vector index for use in an electronic device, and the method includes:

获取向量数据集；get vector dataset;

根据向量数据集确定多个聚类及聚类的代表向量；Determine multiple clusters and representative vectors of clusters according to the vector data set;

基于代表向量建立图索引；Create a graph index based on the representative vector;

获取代表向量对应的原始向量，并对原始向量进行压缩得到压缩后向量。Obtain the original vector corresponding to the representative vector, and compress the original vector to obtain the compressed vector.

在上述第二方面的一种可能的实现中，基于代表向量建立图索引，包括：In a possible implementation of the above second aspect, establishing a graph index based on the representative vector includes:

通过分层可导航小世界算法为代表向量建立图索引。Graph indices are built for representative vectors through a hierarchical navigable small-world algorithm.

在上述第二方面的一种可能的实现中，根据向量数据集确定多个聚类及聚类的代表向量之后，还包括：In a possible implementation of the above second aspect, after the multiple clusters and the representative vectors of the clusters are determined according to the vector data set, the method further includes:

获取向量数据集中多个聚类的聚类半径，并将聚类半径的最大值作为向量数据集的最大半径。Get the cluster radius of multiple clusters in the vector dataset, and use the maximum value of the cluster radius as the maximum radius of the vector dataset.

在上述第二方面的一种可能的实现中，获取向量数据集中多个聚类的聚类半径，包括：In a possible implementation of the above-mentioned second aspect, the clustering radii of multiple clusters in the vector data set are obtained, including:

分别计算聚类的代表向量与聚类中多个所述原始向量之间的距离，将多个距离中的最大值作为聚类半径。The distances between the representative vector of the cluster and the multiple original vectors in the cluster are calculated respectively, and the maximum value among the multiple distances is used as the cluster radius.

在上述第二方面的一种可能的实现中，对原始向量进行压缩得到压缩后向量，包括：In a possible implementation of the above second aspect, the compressed vector is obtained by compressing the original vector, including:

通过乘积量化方法对原始向量进行压缩，得到原始向量对应的压缩后向量。The original vector is compressed by the product quantization method, and the compressed vector corresponding to the original vector is obtained.

在上述第二方面的一种可能的实现中，通过乘积量化方法对原始向量进行压缩，包括：In a possible implementation of the above second aspect, the original vector is compressed by a product quantization method, including:

获取向量数据集中多个原始向量的多个第一子向量；Obtain multiple first sub-vectors of multiple original vectors in the vector dataset;

根据多个第一子向量确定多个子向量聚类及子向量聚类对应的子中心点向量；Determine a plurality of sub-vector clusters and sub-center point vectors corresponding to the sub-vector clusters according to the plurality of first sub-vectors;

确定与原始向量的第一子向量的距离最近的第一子中心点向量，并将第一子中心点向量的编号确定为第一子向量对应的压缩后向量；Determine the first sub-center point vector with the closest distance to the first sub-vector of the original vector, and determine the number of the first sub-center point vector as the compressed vector corresponding to the first sub-vector;

将原始向量中每个子向量对应的压缩后向量的组合确定为原始向量对应的压缩后向量。The combination of the compressed vectors corresponding to each sub-vector in the original vector is determined as the compressed vector corresponding to the original vector.

本申请实施例中提供了用于电子设备的构建向量索引的方法，该方法通过获取向量数据集，并根据向量数据集确定多个聚类及聚类的代表向量，再基于代表向量建立图索引，并对代表向量对应的原始向量进行压缩得到压缩后向量，从而为向量范围检索构建了两级索引，为提高向量范围检索的效率和准确性提供了相应的技术基础。The embodiment of the present application provides a method for constructing a vector index for an electronic device. The method obtains a vector data set, determines a plurality of clusters and representative vectors of the clusters according to the vector data set, and then establishes a graph index based on the representative vector. , and compress the original vector corresponding to the representative vector to obtain a compressed vector, thus constructing a two-level index for vector range retrieval, which provides a corresponding technical basis for improving the efficiency and accuracy of vector range retrieval.

第三方面，本申请实施例提供了一种向量范围检索的装置，该装置包括：In a third aspect, an embodiment of the present application provides an apparatus for vector range retrieval, the apparatus comprising:

获取模块，用于获取查询向量和检索半径；The acquisition module is used to obtain the query vector and retrieval radius;

加载模块，用于将图索引和压缩后向量从第一存储区加载到第二存储区，第一存储区的读/写速度小于第二存储区；a loading module for loading the graph index and the compressed vector from the first storage area to the second storage area, and the read/write speed of the first storage area is lower than that of the second storage area;

目标代表向量确定模块，用于在第二存储区至少根据查询向量和检索半径，确定图索引中与查询向量距离最接近的至少一个目标代表向量；A target representative vector determination module, configured to determine at least one target representative vector that is closest to the query vector in the graph index according to at least the query vector and the retrieval radius in the second storage area;

目标压缩后向量确定模块，用于在第二存储区获取至少一个目标代表向量对应的压缩后向量，并根据查询向量与压缩后向量之间的距离确定目标压缩后向量；The target compressed vector determination module is used to obtain the compressed vector corresponding to at least one target representative vector in the second storage area, and determine the target compressed vector according to the distance between the query vector and the compressed vector;

目标原始向量确定模块，用于根据目标压缩后向量，从第一存储区获取目标压缩后向量对应的原始向量，并根据查询向量与原始向量之间的距离确定目标原始向量。The target original vector determination module is used to obtain the original vector corresponding to the target compressed vector from the first storage area according to the target compressed vector, and determine the target original vector according to the distance between the query vector and the original vector.

第四方面，本申请实施例提供了一种构建向量索引的装置，该装置包括：In a fourth aspect, an embodiment of the present application provides an apparatus for constructing a vector index, the apparatus comprising:

获取模块，用于获取向量数据集；The acquisition module is used to acquire the vector dataset;

聚类模块，用于根据向量数据集确定多个聚类及聚类的代表向量；The clustering module is used to determine a plurality of clusters and the representative vectors of the clusters according to the vector data set;

图索引创建模块，用于基于代表向量建立图索引；The graph index creation module is used to establish graph index based on the representative vector;

压缩模块，用于获取代表向量对应的原始向量，并对原始向量进行压缩得到压缩后向量。The compression module is used to obtain the original vector corresponding to the representative vector, and compress the original vector to obtain the compressed vector.

第五方面，本申请实施例提供了一种电子设备，该电子设备包括：In a fifth aspect, an embodiment of the present application provides an electronic device, the electronic device comprising:

存储器，用于存储由电子设备的一个或多个处理器执行的指令，以及memory for storing instructions for execution by one or more processors of the electronic device, and

处理器，是电子设备的处理器之一，用于执行上述第一方面以及第一方面的各种可能实现中的任意一种向量范围检索的方法或执行上述第二方面以及第二方面的各种可能实现中的任意一种构建向量索引的方法。The processor, which is one of the processors of the electronic device, is configured to execute any one of the above-mentioned first aspect and various possible implementations of the vector range retrieval method, or to execute the above-mentioned second aspect and each of the second aspect. Any of the possible implementations of constructing vector indices.

第六方面，本申请实施例提供了一种可读存储介质，可读存储介质上存储有指令，该指令在电子设备上执行时使电子设备执行上述第一方面以及第一方面的各种可能实现中的任意一种向量范围检索的方法或执行上述第二方面以及第二方面的各种可能实现中的任意一种构建向量索引的方法。In a sixth aspect, an embodiment of the present application provides a readable storage medium, where an instruction is stored on the readable storage medium, and when the instruction is executed on an electronic device, the electronic device executes the above-mentioned first aspect and various possibilities of the first aspect Implement any one of the methods for vector range retrieval or perform any one of the above-mentioned second aspect and any of the various possible implementations of the second aspect to construct a vector index.

第七方面，本申请实施例提供了一种计算机程序产品，包括计算机程序/指令，其特征在于，该计算机程序/指令被处理器执行时实现上述第一方面以及第一方面的各种可能实现中的任意一种向量范围检索的方法或执行上述第二方面以及第二方面的各种可能实现中的任意一种构建向量索引的方法。In a seventh aspect, an embodiment of the present application provides a computer program product, including a computer program/instruction, characterized in that, when the computer program/instruction is executed by a processor, the above-mentioned first aspect and various possible implementations of the first aspect are implemented Any one of the vector range retrieval methods in the above or any one of the above-mentioned second aspect and any of the various possible implementations of the second aspect to construct a vector index.

附图说明Description of drawings

图1根据本申请的一些实施例，示出了一种近似向量查询方法的查询过程示意图。FIG. 1 shows a schematic diagram of a query process of an approximate vector query method according to some embodiments of the present application.

图2根据本申请的一些实施例，示出了一种中心点与倒排文件的对应关系示意图。FIG. 2 shows a schematic diagram of the correspondence between a center point and an inverted file according to some embodiments of the present application.

图3根据本申请的一些实施例，示出了一种向量范围检索的方法的场景示意图。Fig. 3 shows a schematic diagram of a scene of a method for vector range retrieval according to some embodiments of the present application.

图4根据本申请的一些实施例，示出了一种电子设备的硬件结构图。FIG. 4 shows a hardware structure diagram of an electronic device according to some embodiments of the present application.

图5根据本申请的一些实施例，示出了一种构建一级索引和二级索引的流程示意图。FIG. 5 shows a schematic flowchart of building a primary index and a secondary index according to some embodiments of the present application.

图6根据本申请的一些实施例，示出了一种根据分层可导航小世界构建的图索引示意图。Fig. 6 shows a schematic diagram of a graph index constructed according to a hierarchical navigable small world according to some embodiments of the present application.

图7根据本申请的一些实施例，示出了一种根据乘积量化算法对原始向量进行压缩的示意图。FIG. 7 shows a schematic diagram of compressing an original vector according to a product quantization algorithm according to some embodiments of the present application.

图8根据本申请的一些实施例，示出了一种向量范围检索的方法的流程示意图。FIG. 8 shows a schematic flowchart of a method for vector range retrieval according to some embodiments of the present application.

图9根据本申请的一些实施例，示出了一种服务器加载图索引和压缩后向量的示意图。FIG. 9 shows a schematic diagram of a server loading a graph index and a compressed vector according to some embodiments of the present application.

图10根据本申请的一些实施例，示出了一种向量范围检索的装置的结构示意图。Fig. 10 shows a schematic structural diagram of an apparatus for vector range retrieval according to some embodiments of the present application.

图11根据本申请的一些实施例，示出了一种构建向量索引的装置的结构示意图。FIG. 11 shows a schematic structural diagram of an apparatus for constructing a vector index according to some embodiments of the present application.

具体实施方式Detailed ways

本申请的说明性实施例包括但不限于向量范围检索的方法、装置、设备、介质及程序产品。Illustrative embodiments of the present application include, but are not limited to, methods, apparatus, devices, media, and program products for vector range retrieval.

以下，对本申请中的部分用语进行解释说明，以便于本领域技术人员理解：Hereinafter, some terms in this application will be explained so as to facilitate the understanding of those skilled in the art:

向量：在不同的学科中有不同的定义，在计算机学科中是指由多个数字组成的序列，数字的数量为向量的维度，例如由N个数字组成的一个序列为N维向量。Vector: There are different definitions in different disciplines. In computer science, it refers to a sequence composed of multiple numbers. The number of numbers is the dimension of the vector. For example, a sequence composed of N numbers is an N-dimensional vector.

向量范围检索：又称向量范围搜索，是指在给定的向量数据集中按照某种度量方式检索与查询向量的相似性在指定的检索半径内的多个目标向量，向量范围检索返回的目标向量的数目根据查询向量和检索半径的不同存在差异。Vector range retrieval: Also known as vector range search, it refers to the retrieval of multiple target vectors whose similarity to the query vector is within a specified retrieval radius in a given vector data set according to a certain metric, and the target vector returned by vector range retrieval The number varies depending on the query vector and retrieval radius.

图索引：通过图算法创建的索引，表现为图的形式，包括节点和节点之间的边，节点对应索引数据，有关联关系的节点之间存在边。在检索时，首先将查询向量在索引数据构成的图中进行检索。Graph index: An index created by a graph algorithm, expressed in the form of a graph, including edges between nodes, nodes corresponding to index data, and edges between associated nodes. During retrieval, the query vector is first retrieved in the graph composed of index data.

向量量化：是对向量进行压缩的有损数据压缩方法，通过对原始向量进行向量量化，可以减少原始向量占用的存储空间，提高向量加载的速度。Vector quantization: It is a lossy data compression method for compressing vectors. By performing vector quantization on the original vector, the storage space occupied by the original vector can be reduced and the speed of vector loading can be improved.

特征向量提取：通过将非结构化数据如文本、图片、音频、视频等的每一种特征抽象为一个属性，每个属性对应一个或多个具体数值，再用所有属性对应的多维特征向量表示该非结构化数据。针对不同类型的非结构化数据进行特征向量提取，可以使用不同的方法，例如对文本进行特征向量提取可以使用文本分词并将单词转换为向量的方法，对图片进行特征向量提取可以使用通过深度神经网络将图片转换为向量的方法等。Feature vector extraction: by abstracting each feature of unstructured data such as text, pictures, audio, video, etc. into an attribute, each attribute corresponds to one or more specific values, and then represented by multi-dimensional feature vectors corresponding to all attributes the unstructured data. For feature vector extraction of different types of unstructured data, different methods can be used, such as text segmentation and word conversion into vectors for text feature vector extraction, and image feature vector extraction. How the network converts pictures to vectors, etc.

倒排索引(Inverted Files Index，IVF)：也称为反向索引、置入档案或反向档案，是一种索引方法，通常包括索引向量和倒排文件，可以根据索引向量快速获取索引向量对应的倒排文件。作为近似最近邻检索中常用的一种算法，向量空间通过聚类算法如K-Means划分为多个聚类(Cluster)，每个聚类有各自的中心点向量，在进行查询时，查询向量分别计算与多个聚类的中心点向量的距离并在距离较小的可能为多个的聚类中查找最接近的向量。如图1所示，聚类后的向量空间包括聚类1和聚类2，聚类1的中心点向量为中心点1，聚类2的中心点向量为中心点2，进行查询时，查询向量分别计算与中心点1和中心点2的距离，然后在距离最小的聚类1中查找，得到的查询结果为最接近向量1。具体来说，向量空间中每个聚类的中心点与一个倒排文件对应，聚类中的向量存储在倒排文件中。Inverted Files Index (IVF): also known as reverse index, insert file or reverse file, is an indexing method, usually including index vector and inverted file, can quickly obtain the corresponding index vector according to the index vector inverted file. As an algorithm commonly used in approximate nearest neighbor retrieval, the vector space is divided into multiple clusters by clustering algorithms such as K-Means, and each cluster has its own center point vector. When querying, the query vector Calculate the distances to the center point vectors of multiple clusters separately and find the closest vector in the possibly multiple clusters with smaller distances. As shown in Figure 1, the clustered vector space includes cluster 1 and cluster 2. The center point vector of cluster 1 is center point 1, and the center point vector of cluster 2 is center point 2. When querying, the query The vector calculates the distance from the center point 1 and the center point 2 respectively, and then searches in the cluster 1 with the smallest distance, and the query result obtained is the closest vector 1. Specifically, the center point of each cluster in the vector space corresponds to an inverted file, and the vectors in the clusters are stored in the inverted file.

倒排文件：在持久化存储设备上存储的文件，用于存储多个向量。如图2所示，聚类1的中心点向量[-3.2,1.7,2.4,…,0,2.1]与倒排文件1对应，聚类2的中心点向量[1.5,-0.3,3.9,…,-2.8,0.1]与倒排文件2对应，查询向量在聚类1中查找相似向量时，是在聚类1对应的倒排文件1中实施查找。Inverted file: A file stored on persistent storage to store multiple vectors. As shown in Figure 2, the center point vector of cluster 1 [-3.2,1.7,2.4,…,0,2.1] corresponds to the inverted file 1, and the center point vector of cluster 2 [1.5,-0.3,3.9,… ,-2.8,0.1] corresponds to the inverted file 2. When the query vector searches for a similar vector in cluster 1, the search is performed in the inverted file 1 corresponding to cluster 1.

聚类算法：一种无监督的算法，用于通过识别出数据间的关系，将多个数据所组成的集合分为若干子集(或类、簇或区等)。有的聚类算法可以指定子集的数量，例如K均值聚类算法(K-means Clustering Algorithm，K-means)或基于随机选择的聚类算法(CLARANS)等，有的聚类算法不需要指定子集的数量，算法根据自身原理和数据情况推算子集数量，例如基于密度的噪声应用空间聚类(Density-Based Spatial Clustering of Applicationswith Noise，DBSCAN)、基于层次结构的均衡迭代还原与聚类(Balanced IterativeReducing and Clustering using Hierarchies，BIRCH)等。Clustering Algorithm: An unsupervised algorithm used to divide a collection of multiple data into subsets (or classes, clusters, or regions, etc.) by identifying relationships among the data. Some clustering algorithms can specify the number of subsets, such as K-means Clustering Algorithm (K-means) or clustering algorithm based on random selection (CLARANS), etc. Some clustering algorithms do not need to be specified. The number of subsets, the algorithm calculates the number of subsets according to its own principles and data conditions, such as density-based spatial clustering of applications with noise (Density-Based Spatial Clustering of Applications with Noise, DBSCAN), hierarchical structure-based balanced iterative reduction and clustering ( Balanced IterativeReducing and Clustering using Hierarchies, BIRCH) and so on.

可以理解，在本申请各实施例中，处理器可以是微处理器、数字信号处理器、微控制器等，和/或其任何组合。根据另一个方面，所述处理器可以是单核处理器，多核处理器等，和/或其任何组合。It can be understood that, in various embodiments of the present application, the processor may be a microprocessor, a digital signal processor, a microcontroller, etc., and/or any combination thereof. According to another aspect, the processor may be a single-core processor, a multi-core processor, etc., and/or any combination thereof.

可以理解，本申请的向量范围检索的方法适用于对用户提交的非结构化数据和检索半径进行向量范围检索的场景。It can be understood that the vector range retrieval method of the present application is suitable for the scenario of performing vector range retrieval on unstructured data and retrieval radius submitted by the user.

下面将结合附图对本申请的实施例作进一步地详细描述。The embodiments of the present application will be described in further detail below with reference to the accompanying drawings.

图3根据本申请的一些实施例，提供了一种进行向量范围检索的场景。如图3所示，该场景包括查询组件100、数据存储设备200和客户端300。查询组件100中包括图索引检索模块101、压缩向量检索模块102和原始向量检索模块103。FIG. 3 provides a scenario for performing vector range retrieval according to some embodiments of the present application. As shown in FIG. 3 , the scenario includes a query component 100 , a data storage device 200 and a client 300 . The query component 100 includes a graph index retrieval module 101 , a compressed vector retrieval module 102 and an original vector retrieval module 103 .

用户通过客户端300提供的检索界面输入查询数据和检索半径，检索半径用于描述检索结果与查询数据之间的相似度，具体来说，检索结果与查询数据之间的距离越小说明相似度越高。The user inputs the query data and the retrieval radius through the retrieval interface provided by the client 300. The retrieval radius is used to describe the similarity between the retrieval result and the query data. Specifically, the smaller the distance between the retrieval result and the query data, the similarity is. higher.

可以理解，查询数据可以是非结构化数据，也可以是查询向量。非结构化数据可以包括：文本数据、图片数据、音频数据和视频数据等。查询向量是高维向量数据，用于描述用户期望进行范围检索的信息。本申请实施例对查询数据的类型不做具体限制。It can be understood that the query data can be unstructured data or a query vector. Unstructured data may include: text data, picture data, audio data, and video data. Query vectors are high-dimensional vector data that describe the information users expect for range retrieval. The embodiments of the present application do not specifically limit the types of query data.

客户端300接收用户输入的查询数据和检索半径，将查询数据和检索半径发送给后端的查询组件100进行相应处理，并接收查询组件100返回的结果数据并向用户展示。The client 300 receives the query data and retrieval radius input by the user, sends the query data and retrieval radius to the query component 100 at the back end for corresponding processing, and receives the result data returned by the query component 100 and displays it to the user.

可以理解，客户端300接收到的查询数据如果为非结构化数据，客户端300可以将非结构化数据直接发送给查询组件100，也可以首先识别非结构化数据的类型，并根据非结构化数据的类型将非结构化数据转换为查询向量，再将转换后的查询向量发送给查询组件100。本申请实施例对客户端300是否进行非结构化数据的转换不做具体限制。It can be understood that if the query data received by the client 300 is unstructured data, the client 300 can directly send the unstructured data to the query component 100, or can first identify the type of the unstructured data, and determine the type of the unstructured data according to the unstructured data. The type of data converts the unstructured data into a query vector, and then sends the converted query vector to the query component 100 . This embodiment of the present application does not specifically limit whether the client 300 performs unstructured data conversion.

查询组件100用于通过网络接收客户端300发送的查询数据和检索半径，并根据查询数据确定相应的查询向量，再根据查询向量和检索半径进行相应的范围检索，确定检索的结果向量。The query component 100 is configured to receive the query data and retrieval radius sent by the client 300 through the network, determine a corresponding query vector according to the query data, and then perform a corresponding range retrieval according to the query vector and retrieval radius to determine the retrieval result vector.

可以理解，查询组件100接收的查询数据是查询向量，则查询组件100直接根据查询向量和检索半径进行范围检索；如果查询数据是非结构化数据，则查询组件100首先识别非结构化数据的类型，并根据非结构化数据的类型将非结构化数据转换为查询向量，再根据得到查询向量和检索半径进行范围检索。本申请实施例对查询组件100是否进行查询向量的转换不做具体限制。It can be understood that if the query data received by the query component 100 is a query vector, the query component 100 directly performs range retrieval according to the query vector and the retrieval radius; if the query data is unstructured data, the query component 100 first identifies the type of unstructured data, And according to the type of unstructured data, the unstructured data is converted into a query vector, and then the range retrieval is carried out according to the obtained query vector and retrieval radius. This embodiment of the present application does not specifically limit whether the query component 100 performs query vector conversion.

查询组件100在进行范围检索时，首先将查询向量和检索半径传递给图索引检索模块101，图索引检索模块101根据查询向量和检索半径在预先建立的图索引中进行检索，确定与查询向量的相似度在与检索半径相关的范围内的目标索引向量。图索引检索模块101将得到的目标索引向量传递给压缩向量检索模块102。压缩向量检索模块102根据目标索引向量确定对应的压缩向量，并根据查询向量和检索半径确定满足范围检索条件的目标压缩向量。压缩向量检索模块102将得到的目标压缩向量传递给原始向量检索模块103。原始向量检索模块103根据目标压缩向量确定对应的原始向量，并根据查询向量和检索半径确定满足范围检索条件的目标原始向量，目标原始向量即为向量范围检索的结果。When the query component 100 performs range retrieval, it first transmits the query vector and the retrieval radius to the graph index retrieval module 101. The graph index retrieval module 101 performs retrieval in the pre-established graph index according to the query vector and retrieval radius, and determines the relationship between the query vector and the query vector. A vector of target indices whose similarity is in the range relative to the retrieval radius. The graph index retrieval module 101 transmits the obtained target index vector to the compressed vector retrieval module 102 . The compression vector retrieval module 102 determines the corresponding compression vector according to the target index vector, and determines the target compression vector satisfying the range retrieval condition according to the query vector and the retrieval radius. The compressed vector retrieval module 102 transmits the obtained target compressed vector to the original vector retrieval module 103 . The original vector retrieval module 103 determines the corresponding original vector according to the target compression vector, and determines the target original vector that satisfies the range retrieval condition according to the query vector and the retrieval radius, and the target original vector is the result of the vector range retrieval.

可以理解，查询组件100可以将目标原始向量作为结果数据返回给客户端300，可以将目标原始向量的标识返回给客户端300，也可以将目标原始向量转换为相应的非结构化数据后返回给客户端300，本申请实施例对此不做具体限制。It can be understood that the query component 100 can return the target raw vector as result data to the client 300, can return the identifier of the target raw vector to the client 300, or can convert the target raw vector into corresponding unstructured data and return it to the client 300. The client 300 is not specifically limited in this embodiment of the present application.

另外，查询组件100还用于为数据存储设备200中存储的向量数据预先建立一级索引和二级索引，进行向量范围检索时通过使用一级索引和二级索引来提高向量范围检索的速度和效率。在此，一级索引是通过对向量数据进行聚类后得到的中心点向量，二级索引是在一级索引的基础上建立的图索引，图索引是在中心点向量的基础上通过图算法构建的索引。此外，查询组件100还通过向量量化方法对向量数据中的原始向量进行了压缩，能够节省存储空间并提高进行范围检索时的计算效率。In addition, the query component 100 is further configured to pre-establish a primary index and a secondary index for the vector data stored in the data storage device 200, and use the primary index and the secondary index during vector range retrieval to improve the speed and speed of vector range retrieval. efficiency. Here, the primary index is the center point vector obtained by clustering the vector data, the secondary index is the graph index established on the basis of the primary index, and the graph index is the graph algorithm based on the center point vector. built index. In addition, the query component 100 also compresses the original vector in the vector data by using the vector quantization method, which can save storage space and improve the computational efficiency during range retrieval.

数据存储设备200用于存储海量的向量数据，存储的向量数据通过预先向量转换的方式对海量非结构化数据进行处理后得到，向量数据可以包括原始向量和对原始向量压缩后的压缩向量。The data storage device 200 is used for storing massive vector data. The stored vector data is obtained by processing massive unstructured data by means of vector conversion in advance. The vector data may include original vectors and compressed vectors after compressing the original vectors.

可以理解，上述查询组件100和客户端300可以是两个分离的电子设备，也可以是一个电子设备，例如任意一个具体形式的终端。上述查询组件100和数据存储设备200可以是两个分离的电子设备，也可以是一个电子设备，例如一个服务器，本申请实施例对此不做具体限制。It can be understood that the above query component 100 and the client 300 may be two separate electronic devices, or may be one electronic device, such as a terminal in any specific form. The query component 100 and the data storage device 200 may be two separate electronic devices, or may be one electronic device, such as a server, which is not specifically limited in this embodiment of the present application.

本申请技术方案提供的方法通过对向量数据进行聚类得到作为一级索引的中心点向量，再在一级索引的基础上建立作为二级索引的图索引，并对中心点向量对应的原始向量通过向量量化进行压缩；在进行向量范围检索时，先根据查询向量和检索半径在二级索引中检索，确定目标中心点向量，再根据目标中心点向量在一级索引中检索，确定对应的目标压缩向量，最后对目标压缩向量对应的原始向量进行检索，确定目标原始向量并将目标原始向量作为检索结果返回，从而能够提高向量范围检索的速度和效率，同时提高向量范围检索的准确率。The method provided by the technical solution of the present application obtains a center point vector as a primary index by clustering vector data, then establishes a graph index as a secondary index on the basis of the primary index, and analyzes the original vector corresponding to the center point vector. Compression is performed by vector quantization; when searching for a vector range, first search in the secondary index according to the query vector and retrieval radius to determine the target center point vector, and then search in the primary index according to the target center point vector to determine the corresponding target Compress the vector, and finally retrieve the original vector corresponding to the target compressed vector, determine the target original vector, and return the target original vector as the retrieval result, which can improve the speed and efficiency of vector range retrieval, and improve the accuracy of vector range retrieval.

图4根据本申请的一些实施例，示出了一种用于向量范围检索的方法的电子设备20的硬件结构框图。在图4所示的实施例中，电子设备20可以包括一个或多个处理器201，与处理器201中的至少一个连接的系统控制逻辑202，与系统控制逻辑202连接的系统内存203，与系统控制逻辑202连接的非易失性存储器(Non-Volatile Memory,NVM)204，以及与系统控制逻辑202连接的网络接口206。FIG. 4 shows a block diagram of the hardware structure of an electronic device 20 for a method for vector range retrieval according to some embodiments of the present application. In the embodiment shown in FIG. 4 , the electronic device 20 may include one or more processors 201 , a system control logic 202 connected to at least one of the processors 201 , a system memory 203 connected to the system control logic 202 , and A non-volatile memory (NVM) 204 is connected to the system control logic 202 , and a network interface 206 is connected to the system control logic 202 .

在一些实施例中，处理器201可以包括一个或多个单核或多核处理器。在一些实施例中，处理器201可以包括通用处理器和专用处理器(例如，图形处理器，应用处理器，基带处理器等)的任意组合。在电子设备20采用增强型基站(Evolved Node B，eNB)或无线接入网(Radio Access Network，RAN)控制器的实施例中，处理器201可以被配置为执行各种符合的实施例。例如，处理器201可以用于实现向量范围检索的方法。In some embodiments, processor 201 may include one or more single-core or multi-core processors. In some embodiments, processor 201 may comprise any combination of general purpose processors and special purpose processors (eg, graphics processors, application processors, baseband processors, etc.). In embodiments where the electronic device 20 employs an enhanced base station (Evolved Node B, eNB) or radio access network (Radio Access Network, RAN) controller, the processor 201 may be configured to perform various conforming embodiments. For example, the processor 201 may be used to implement a method of vector range retrieval.

在一些实施例中，系统控制逻辑202可以包括任意合适的接口控制器，以向处理器201中至少一个与系统控制逻辑202通信的、任意合适的设备或组件提供任意合适的接口。In some embodiments, system control logic 202 may include any suitable interface controller to provide any suitable interface to at least one of processor 201 with any suitable device or component in communication with system control logic 202 .

在一些实施例中，系统控制逻辑202可以包括一个或多个存储器控制器，以提供连接到系统内存203的接口。系统内存203可以用于加载以及存储数据和/或指令。例如，系统内存203可以加载本申请实施例中的图索引和压缩向量，也可以保存接收的查询向量等。In some embodiments, system control logic 202 may include one or more memory controllers to provide an interface to system memory 203 . System memory 203 may be used to load as well as store data and/or instructions. For example, the system memory 203 may load the graph index and the compression vector in the embodiment of the present application, and may also store the received query vector and the like.

在一些实施例中电子设备20的系统内存203可以包括任意合适的易失性存储器，例如合适的动态随机存取存储器(Dynamic Random Access Memory，DRAM)。In some embodiments, the system memory 203 of the electronic device 20 may include any suitable volatile memory, such as a suitable dynamic random access memory (DRAM).

NVM存储器204可以包括用于存储数据和/或指令的一个或多个有形的、非暂时性的计算机可读介质。在一些实施例中，NVM存储器204可以包括闪存等任意合适的非易失性存储器和/或任意合适的非易失性存储设备，例如硬盘驱动器(Hard Disk Drive，HDD)，光盘(Compact Disc，CD)驱动器，数字通用光盘(Digital Versatile Disc，DVD)驱动器中的至少一个。在本申请实施例中，NVM存储器204可以用于存储海量的向量数据。NVM memory 204 may include one or more tangible, non-transitory computer-readable media for storing data and/or instructions. In some embodiments, the NVM memory 204 may include any suitable non-volatile memory such as flash memory and/or any suitable non-volatile storage device, such as a hard disk drive (Hard Disk Drive, HDD), a compact disc (Compact Disc, At least one of a CD) drive and a Digital Versatile Disc (DVD) drive. In this embodiment of the present application, the NVM memory 204 may be used to store massive amounts of vector data.

NVM存储器204可以包括安装电子设备20的装置上的一部分存储资源，或者它可以由设备访问，但不一定是设备的一部分。例如，可以经由网络接口206通过网络访问NVM存储器204。NVM memory 204 may comprise a portion of storage resources on the apparatus in which electronic device 20 is installed, or it may be accessible by, but not necessarily be, part of the device. For example, NVM memory 204 may be accessed over a network via network interface 206 .

特别地，系统内存203和NVM存储器204可以分别包括：指令205的暂时副本和永久副本。指令205可以包括：由处理器201中的至少一个执行时导致电子设备20实施如图2所示的方法的指令。在一些实施例中，指令205、硬件、固件和/或其软件组件可另外地/替代地置于系统控制逻辑202，网络接口206和/或处理器201中。In particular, system memory 203 and NVM memory 204 may include temporary and permanent copies of instructions 205, respectively. The instructions 205 may include instructions that when executed by at least one of the processors 201 cause the electronic device 20 to implement the method shown in FIG. 2 . In some embodiments, instructions 205 , hardware, firmware and/or software components thereof may additionally/alternately reside in system control logic 202 , network interface 206 and/or processor 201 .

网络接口206可以包括收发器，用于为电子设备20提供无线电接口，进而通过一个或多个网络与任意其他合适的设备(如前端模块，天线等)进行通信。在一些实施例中，网络接口206可以集成于电子设备20的其他组件。例如，网络接口206可以集成于处理器201的，系统内存203，NVM存储器204，和具有指令的固件设备(未示出)中的至少一种，当处理器201中的至少一个执行所述指令时，电子设备20实现如方法实施例中示出的方法。在本申请实施例中，网络接口206可以用于接收客户端发送的非结构化数据、检索半径和接收的结果数据等。Network interface 206 may include a transceiver for providing a radio interface for electronic device 20 to communicate with any other suitable device (eg, front-end modules, antennas, etc.) over one or more networks. In some embodiments, network interface 206 may be integrated with other components of electronic device 20 . For example, network interface 206 may be integrated into at least one of processor 201, system memory 203, NVM memory 204, and a firmware device (not shown) having instructions that, when at least one of processors 201 executes the instructions , the electronic device 20 implements the method as shown in the method embodiment. In this embodiment of the present application, the network interface 206 may be used to receive unstructured data sent by the client, retrieval radius, received result data, and the like.

网络接口206可以进一步包括任意合适的硬件和/或固件，以提供多输入多输出无线电接口。例如，网络接口206可以是网络适配器，无线网络适配器，电话调制解调器和/或无线调制解调器。The network interface 206 may further include any suitable hardware and/or firmware to provide a multiple-input multiple-output radio interface. For example, network interface 206 may be a network adapter, wireless network adapter, telephone modem, and/or wireless modem.

在一些实施例中，处理器201中的至少一个可以与用于系统控制逻辑202的一个或多个控制器的逻辑封装在一起，以形成系统封装(System In a Package，SiP)。在一些实施例中，处理器201中的至少一个可以与用于系统控制逻辑202的一个或多个控制器的逻辑集成在同一管芯上，以形成片上系统(System on Chip，SoC)。In some embodiments, at least one of the processors 201 may be packaged with logic for one or more controllers of the system control logic 202 to form a System In a Package (SiP). In some embodiments, at least one of the processors 201 may be integrated on the same die with logic for one or more controllers of the system control logic 202 to form a System on Chip (SoC).

电子设备20可以进一步包括：输入/输出(I/O)设备207。I/O设备207可以包括用户界面，使得用户能够与电子设备20进行交互；外围组件接口的设计使得外围组件也能够与电子设备20交互。在一些实施例中，电子设备20还包括传感器，用于确定与电子设备20相关的环境条件和位置信息的至少一种。The electronic device 20 may further include an input/output (I/O) device 207 . I/O device 207 may include a user interface that enables a user to interact with electronic device 20 ; the peripheral component interface is designed to enable peripheral components to interact with electronic device 20 as well. In some embodiments, the electronic device 20 also includes sensors for determining at least one of environmental conditions and location information associated with the electronic device 20 .

在一些实施例中，用户界面可包括但不限于显示器(例如，液晶显示器，触摸屏显示器等)，扬声器，麦克风，一个或多个相机(例如，静止图像照相机和/或摄像机)，手电筒(例如，发光二极管闪光灯)和键盘。In some embodiments, the user interface may include, but is not limited to, a display (eg, a liquid crystal display, a touch screen display, etc.), a speaker, a microphone, one or more cameras (eg, a still image camera and/or video camera), a flashlight (eg, a LED flash) and keyboard.

在一些实施例中，外围组件接口可以包括但不限于非易失性存储器端口、音频插孔和电源接口。In some embodiments, peripheral component interfaces may include, but are not limited to, non-volatile memory ports, audio jacks, and power connectors.

在一些实施例中，传感器可包括但不限于陀螺仪传感器，加速度计，近程传感器，环境光线传感器和定位单元。定位单元还可以是网络接口206的一部分或与网络接口206交互，以与定位网络的组件(例如，北斗卫星)进行通信。In some embodiments, sensors may include, but are not limited to, gyroscope sensors, accelerometers, proximity sensors, ambient light sensors, and positioning units. The positioning unit may also be part of or interact with the network interface 206 to communicate with components of the positioning network (eg, BeiDou satellites).

可以理解的是，图4示意的结构并不构成对电子设备20的具体限定。在本申请另外一些实施例中电子设备20可以包括比图示更多或更少的部件，或者组合某些部件，或者拆分某些部件，或者不同的部件布置。图示的部件可以由硬件或软件，或软件和硬件的组合实现。It can be understood that the structure shown in FIG. 4 does not constitute a specific limitation on the electronic device 20 . In other embodiments of the present application, the electronic device 20 may include more or less components than shown, or combine some components, or separate some components, or arrange different components. The illustrated components may be implemented in hardware or software, or a combination of software and hardware.

下面根据图5至图8并结合具体场景，详细介绍本申请的技术方案。The technical solutions of the present application will be described in detail below according to FIGS. 5 to 8 in combination with specific scenarios.

以下对为向量数据集建立向量索引的具体过程进行解释说明。图5示出了根据本申请的电子设备实施的向量范围检索的方法中一级索引和二级索引构建流程的示意图。如图5所示，为向量数据集建立一级索引和二级索引可以包括如下步骤：The specific process of establishing a vector index for a vector data set is explained below. FIG. 5 shows a schematic diagram of the construction flow of the primary index and the secondary index in the method for vector range retrieval implemented by the electronic device of the present application. As shown in Figure 5, establishing a primary index and a secondary index for a vector data set may include the following steps:

步骤S501：确定向量数据集。Step S501: Determine the vector data set.

在此，预先对非结构化数据进行向量转换，可以得到由海量向量组成的集合，向量的维度通常为几百维，例如128维，512维等。向量中每个维度的数据表示为浮点数，通常为64位的浮点数，每个维度的数据占用的存储空间为8个字节。Here, vector transformation is performed on the unstructured data in advance, and a set composed of massive vectors can be obtained, and the dimensions of the vectors are usually several hundred dimensions, such as 128 dimensions, 512 dimensions, and so on. The data of each dimension in the vector is represented as a floating-point number, usually a 64-bit floating-point number, and the storage space occupied by the data of each dimension is 8 bytes.

海量向量组成的向量集合为向量空间，其中向量的数量可以从百万级到十亿乃至百亿级，通过为向量空间建立索引，能够极大提高向量范围检索时的效率。A vector set composed of massive vectors is a vector space, in which the number of vectors can range from millions to billions or even tens of billions. By indexing the vector space, the efficiency of vector range retrieval can be greatly improved.

可以理解，为向量空间建立索引时，可以对向量数量较大的向量空间如十亿或百亿级的向量空间进行采样得到向量数据集，根据采样的向量数据集为整个向量空间建立索引；也可以将向量数量较小的向量空间如百万级或千万级的向量空间直接作为向量数据集，根据向量数据集建立索引。本申请实施例对此不做具体限制。It can be understood that when creating an index for a vector space, a vector data set can be obtained by sampling a vector space with a large number of vectors, such as a vector space of billions or tens of billions, and an index is established for the entire vector space according to the sampled vector data set; A vector space with a small number of vectors, such as a million-level or ten-million-level vector space, can be directly used as a vector data set, and an index can be established based on the vector data set. This embodiment of the present application does not specifically limit this.

在一些实施例中，根据向量数据集建立索引，可以通过聚类算法将向量数据集划分为多个向量子集合，每个向量子集合中的向量在某些属性上具有相近的性质。In some embodiments, the index is established according to the vector data set, and the vector data set can be divided into a plurality of vector subsets through a clustering algorithm, and the vectors in each vector subset have similar properties in certain attributes.

在一些实施例中，将整个向量空间确定为向量数据集，即将向量空间中所有向量作为向量数据集中的向量，每个向量都参与聚类算法的分类计算，适用于向量空间中向量的数量级较小的情况，聚类结果比较准确，确定的索引向量的准确性更高。In some embodiments, the entire vector space is determined as a vector data set, that is, all vectors in the vector space are regarded as vectors in the vector data set, and each vector participates in the classification calculation of the clustering algorithm, which is suitable for the order of magnitude comparison of vectors in the vector space. If it is small, the clustering result is more accurate, and the accuracy of the determined index vector is higher.

在一些实施例中，对整个向量空间进行抽样得到向量数据集，即抽取向量空间中部分向量作为向量数据集中的向量，只有向量空间中部分向量会参与聚类算法的分类计算，适用于向量空间中向量的数量级较大的情况，由于是基于部分向量得到的分类结果，存在聚类不准确的情况，确定的索引向量的准确性有一定程度的下降。In some embodiments, the entire vector space is sampled to obtain a vector data set, that is, some vectors in the vector space are extracted as vectors in the vector data set, and only some of the vectors in the vector space will participate in the classification calculation of the clustering algorithm, which is suitable for vector space. In the case where the magnitude of the medium vector is large, because the classification result is obtained based on part of the vector, the clustering is inaccurate, and the accuracy of the determined index vector is reduced to a certain extent.

在一些实施例中，对向量空间中的向量进行抽样可以采用不同的抽样方法，例如随机抽样、分层抽样、整体抽样、系统抽样等。本申请实施例对抽样方法不做具体限制。In some embodiments, different sampling methods can be used for sampling the vectors in the vector space, such as random sampling, stratified sampling, overall sampling, systematic sampling, and the like. The embodiments of the present application do not specifically limit the sampling method.

步骤S502：对向量数据集中向量进行聚类，确定多个聚类。Step S502: Clustering the vectors in the vector data set to determine multiple clusters.

将向量数据集中的向量输入聚类算法，聚类算法对输入的向量进行聚类计算，确定相应的聚类。在聚类算法的聚类计算过程中，每一次迭代输入向量数据集中所有向量，通过计算向量与分类中心点之间的相似度并对向量进行分类，再重新计算每个类别的分类中心点，经过反复迭代以得到满足终止条件的聚类。如图1所示，向量数据集通过聚类算法得到聚类1和聚类2，聚类1的代表向量有中心点1，聚类2的代表向量是中心点2。The vectors in the vector data set are input into the clustering algorithm, and the clustering algorithm performs clustering calculation on the input vectors to determine the corresponding clusters. In the clustering calculation process of the clustering algorithm, all the vectors in the input vector data set are input in each iteration, the similarity between the vectors and the classification center points is calculated and the vectors are classified, and then the classification center points of each category are recalculated. After repeated iterations, clusters that satisfy the termination conditions are obtained. As shown in Figure 1, the vector data set obtains cluster 1 and cluster 2 through the clustering algorithm. The representative vector of cluster 1 has center point 1, and the representative vector of cluster 2 is center point 2.

在一些实施例中，可以使用无需指定聚类数量的聚类算法，例如基于密度的噪声应用空间聚类算法、基于层次结构的均衡迭代还原与聚类算法等。该类聚类算法可以根据自身的计算确定最终划分的聚类数量，从而无需用户的干预。In some embodiments, clustering algorithms that do not require specifying the number of clusters, such as density-based noise-applied spatial clustering algorithms, hierarchical structure-based balanced iterative reduction and clustering algorithms, and the like, may be used. This type of clustering algorithm can determine the final number of clusters according to its own calculation, so that no user intervention is required.

在另一些实施例中，可以使用可指定聚类数量的聚类算法，例如K均值聚类算法、基于随机选择的聚类算法等。该类聚类算法进行聚类计算后，得到的分类的数量为指定的聚类数量，例如指定聚类数量为10，则聚类算法对向量数据集中的向量进行聚类后，所有向量划分到10个分类子集合中。In other embodiments, a clustering algorithm that can specify the number of clusters, such as a K-means clustering algorithm, a random selection-based clustering algorithm, etc., may be used. After this type of clustering algorithm performs clustering calculation, the number of categories obtained is the specified number of clusters. For example, if the specified number of clusters is 10, after the clustering algorithm clusters the vectors in the vector dataset, all vectors are divided into 10 classification sub-collections.

本申请的一些实施例中，使用K-means算法作为聚类算法。K-means算法的执行过程中，首先在向量数据集中确定K个向量作为初始聚类中心，再为向量数据集中剩余的每个向量计算该向量到K个聚类中心的距离，并将该向量划分到与其距离最小的聚类中心对应的类中，剩余向量都分类完毕后，重新计算每个类的聚类中心，将新的聚类中心作为下一次迭代的聚类中心进行重复迭代计算，在满足预设的终止条件后，停止算法的执行。K-means算法使用欧氏距离作为向量之间的距离，欧式距离越小，说明两个向量越相似甚至是相同。K-means算法能够达到局部最优的聚类效果，在处理海量数据集合时能够保证较好的伸缩性，同时算法的复杂度较低，也易于理解。In some embodiments of the present application, the K-means algorithm is used as the clustering algorithm. During the execution of the K-means algorithm, K vectors are first determined in the vector data set as the initial cluster centers, and then the distance from the vector to the K cluster centers is calculated for each remaining vector in the vector data set, and the vector It is divided into the class corresponding to the cluster center with the smallest distance. After the remaining vectors are classified, the cluster center of each class is recalculated, and the new cluster center is used as the cluster center of the next iteration for repeated iteration calculation. After the preset termination condition is met, the execution of the algorithm is stopped. The K-means algorithm uses the Euclidean distance as the distance between vectors. The smaller the Euclidean distance, the more similar or even the same two vectors are. The K-means algorithm can achieve the local optimal clustering effect, and can ensure good scalability when dealing with massive data sets. At the same time, the complexity of the algorithm is low and easy to understand.

步骤S503：确定每个聚类的代表向量。Step S503: Determine the representative vector of each cluster.

对向量数据集中的向量进行聚类后得到多个聚类，每个聚类中包括数量不等的向量，根据每个聚类中的全部向量来确定该聚类的代表向量，该代表向量与同一个聚类中其它向量的距离之和最小。在本申请的实施例中，中心点向量即聚类的代表向量。After clustering the vectors in the vector dataset, multiple clusters are obtained, and each cluster includes different vectors. The representative vector of the cluster is determined according to all the vectors in each cluster. The representative vector is the same as the The sum of the distances of other vectors in the same cluster is the smallest. In the embodiments of the present application, the center point vector is the representative vector of the cluster.

在一些实施例中，聚类的中心点向量可以根据同一个聚类中全部向量的均值确定，中心点向量中每个维度的值为全部向量对应维度的值的算术平均值。例如，某个聚类中包括3个向量，分别为[3.5，7.2，4.7，-2.2，2.5]、[-1.3，4.7，-5.1，3.9，8.5]、[8.6，-0.2，3.4，6.4，-0.5]，则该聚类的中心点向量为[3.6，3.9，1.0，2.7，3.5]。In some embodiments, the center point vector of the cluster may be determined according to the mean value of all vectors in the same cluster, and the value of each dimension in the center point vector is the arithmetic mean of the values of the corresponding dimension of all the vectors. For example, a cluster includes 3 vectors, respectively [3.5, 7.2, 4.7, -2.2, 2.5], [-1.3, 4.7, -5.1, 3.9, 8.5], [8.6, -0.2, 3.4, 6.4 , -0.5], then the center point vector of the cluster is [3.6, 3.9, 1.0, 2.7, 3.5].

可以理解，计算聚类的中心点向量还可以使用其它计算均值的方法，例如几何平均值、均方根平均值、调和平均值、加权平均值等，本申请实施例对中心点向量的确定方法不做具体限制。It can be understood that other methods for calculating the mean value can also be used to calculate the center point vector of the cluster, such as geometric mean, root mean square mean, harmonic mean, weighted mean, etc. The method for determining the center point vector in this embodiment of the present application No specific restrictions are imposed.

另外，聚类的中心点向量可以是真实向量，即聚类中包括的某个具体向量，也可以是虚拟向量，即在聚类中不存在的向量。本申请实施例对中心点向量是真实向量还是虚拟向量不做具体限制。In addition, the center point vector of the cluster may be a real vector, that is, a specific vector included in the cluster, or a virtual vector, that is, a vector that does not exist in the cluster. This embodiment of the present application does not specifically limit whether the center point vector is a real vector or a virtual vector.

作为一个聚类的代表向量，既可以是如上所述的中心点向量，在某些实施例中，代表向量还可以是中心点向量之外的基于其他的预设规则选出的向量。本申请的实施例中，代表向量即为向量空间建立的一级索引。As a representative vector of a cluster, it can be the center point vector as described above. In some embodiments, the representative vector can also be a vector selected based on other preset rules other than the center point vector. In the embodiment of the present application, the representative vector is the first-level index established in the vector space.

步骤S504：确定每个代表向量对应的原始向量。Step S504: Determine the original vector corresponding to each representative vector.

本申请的实施例中，每个聚类的代表向量为该聚类的中心点向量，中心点向量可以对应多个向量空间中的原始向量，通过中心点向量可以对相应的原始向量进行检索。In the embodiment of the present application, the representative vector of each cluster is the center point vector of the cluster, the center point vector may correspond to the original vectors in multiple vector spaces, and the corresponding original vector may be retrieved through the center point vector.

在一些实施例中，代表向量代表了向量数据集中的聚类，将前述步骤中分类该聚类中的原始向量作为该代表向量对应的原始向量。在此，代表向量与对应的原始向量属于同一个聚类，说明该原始向量与该代表向量的相似度高于该原始向量与其它聚类的代表向量的相似度，将该原始向量与属于同一聚类的代表向量进行对应能够更好地发挥索引的作用。In some embodiments, the representative vector represents a cluster in the vector data set, and the original vector in the cluster classified in the foregoing step is used as the original vector corresponding to the representative vector. Here, the representative vector and the corresponding original vector belong to the same cluster, indicating that the similarity between the original vector and the representative vector is higher than the similarity between the original vector and the representative vectors of other clusters, and the original vector belongs to the same Correspondence between the representative vectors of the clusters can better play the role of the index.

在一些实施例中，向量空间就是向量数据集，在前述步骤中向量数据集中的原始向量已经分类到不同的聚类中，并确定了聚类的中心点向量，将每个聚类中的原始向量与相应的中心点向量对应即可确定中心点向量对应的多个原始向量。In some embodiments, the vector space is a vector data set. In the preceding steps, the original vectors in the vector data set have been classified into different clusters, and the center point vectors of the clusters are determined, and the original vectors in each cluster are classified into different clusters. A plurality of original vectors corresponding to the center point vector can be determined by the vector corresponding to the corresponding center point vector.

在一些实施例中，向量数据集是向量空间的抽样子集合，在前述步骤中仅确定了向量数据集中原始向量所属的聚类，因此只完成中心点向量与向量数据集中原始向量的对应。向量空间中剩余的原始向量需要先确定对应的聚类，再与聚类的中心点向量进行对应。具体来说，向量空间中剩余的原始向量可以分别计算其与全部聚类的中心点向量之间的距离，将自身归入与其距离最小的聚类中，从而与该聚类的中心点向量对应。In some embodiments, the vector data set is a sample subset of the vector space, and only the cluster to which the original vector in the vector data set belongs is determined in the foregoing steps, so only the correspondence between the center point vector and the original vector in the vector data set is completed. The remaining original vectors in the vector space need to determine the corresponding clusters first, and then correspond to the center point vectors of the clusters. Specifically, the remaining original vectors in the vector space can calculate the distances between them and the center point vectors of all clusters, and classify themselves into the cluster with the smallest distance from them, so as to correspond to the center point vectors of the cluster. .

可以理解，计算原始向量与多个聚类的中心点向量之间的距离时，使用的距离计算方法与前述步骤中对向量数据集中原始向量进行聚类时使用的距离计算方法一致，从而使得通过聚类算法划分到某个聚类的原始向量与通过与中心点向量之间距离划分到该聚类的原始向量保持较高的相似性。It can be understood that when calculating the distance between the original vector and the center point vectors of multiple clusters, the distance calculation method used is the same as the distance calculation method used when clustering the original vector in the vector data set in the preceding steps, so that through The original vector divided into a cluster by the clustering algorithm maintains a high similarity with the original vector divided into the cluster by the distance from the center point vector.

在一些实施例中，可以将代表向量对应的原始向量存储在倒排文件中。具体来说，每个代表向量可以对应一个倒排文件，倒排文件中存储多个与代表向量对应的原始向量。如图2所示，中心点的集合对应多个倒排文件，其中中心点[-3.2,1.7,2.4,…,0,2.1]对应倒排文件1，与中心点[-3.2,1.7,2.4,…,0,2.1]同属一个聚类的多个原始向量存储在倒排文件1中，中心点[1.5,-0.3,3.9,…,-2.8,0.1]对应倒排文件2，与中心点[1.5,-0.3,3.9,…,-2.8,0.1]同属一个聚类的多个原始向量存储在倒排文件2中。In some embodiments, the original vector corresponding to the representative vector may be stored in the inverted file. Specifically, each representative vector may correspond to an inverted file, and a plurality of original vectors corresponding to the representative vector are stored in the inverted file. As shown in Figure 2, the set of center points corresponds to multiple inverted files, wherein the center point [-3.2, 1.7, 2.4,..., 0, 2.1] corresponds to the inverted file 1, and the center point [-3.2, 1.7, 2.4] ,…,0,2.1] multiple original vectors belonging to the same cluster are stored in the inverted file 1, the center point [1.5,-0.3,3.9,…,-2.8,0.1] corresponds to the inverted file 2, and the center point [1.5,-0.3,3.9,…,-2.8,0.1] Multiple original vectors belonging to the same cluster are stored in inverted file 2.

本申请的一些实施例中，对每个聚类确定该聚类的聚类半径，聚类半径是指该聚类中原始向量与聚类的代表向量之间距离的最大值。通过确定每个聚类的聚类半径，可以确定该聚类中原始向量与作为索引的中心点向量之间的最大距离，从而为根据检索半径进行的范围检索提供相应的距离筛选的支持。In some embodiments of the present application, the cluster radius of the cluster is determined for each cluster, and the cluster radius refers to the maximum distance between the original vector in the cluster and the representative vector of the cluster. By determining the cluster radius of each cluster, the maximum distance between the original vector in the cluster and the center point vector as an index can be determined, thereby providing corresponding distance screening support for range retrieval based on the retrieval radius.

在一些实施例中，根据每个聚类的聚类半径，确定向量空间的最大半径。由于每个聚类的聚类半径不同，将聚类半径中的最大值确定为向量空间的最大半径，在进行范围检索时可以根据最大半径确定相应的检索范围，避免遗漏可能的检索结果向量。In some embodiments, the maximum radius of the vector space is determined from the cluster radius of each cluster. Since the cluster radius of each cluster is different, the maximum value of the cluster radius is determined as the maximum radius of the vector space, and the corresponding search range can be determined according to the maximum radius when performing range retrieval, so as to avoid missing possible search result vectors.

步骤S505：基于代表向量建立对应的图索引。Step S505: Establish a corresponding graph index based on the representative vector.

本申请的实施例中，在得到的每个聚类的代表向量的基础上建立作为二级索引的图索引。具体来说，向量空间中存在海量的原始向量，用作一级索引的中心点向量的数量也非常大，通过为中心点向量建立图索引，能够提高范围检索时索引的检索效率，降低范围检索时的计算时间。In the embodiment of the present application, a graph index serving as a secondary index is established on the basis of the obtained representative vector of each cluster. Specifically, there are a large number of original vectors in the vector space, and the number of center point vectors used as primary indexes is also very large. By establishing a graph index for the center point vector, the retrieval efficiency of the index during range retrieval can be improved, and the range retrieval can be reduced. calculation time.

在一些实施例中，通过分层可导航小世界(Hierarchical Navigable SmallWorlds，HNSW)算法为代表向量建立相应的图索引。构建完成的图索引包括多层可导航小世界，每层的可导航小世界包括若干个代表向量，最底层包括所有的代表向量，越往上层代表向量的数量越少，最上层只有一个作为入口点的代表向量。每层的可导航小世界中两个代表向量之间足够接近的情况下，在这两个代表向量之间存在一条边。In some embodiments, a corresponding graph index is established for the representative vector through a Hierarchical Navigable Small Worlds (HNSW) algorithm. The constructed graph index includes multiple layers of navigable small worlds. Each layer of navigable small worlds includes several representative vectors. The bottom layer includes all representative vectors. The higher the layer, the fewer the number of representative vectors. The top layer has only one entry. A representative vector of points. There is an edge between the two representative vectors in the navigable small world of each layer when the two representative vectors are close enough.

图6示出了一种根据分层可导航小世界构建的图索引示意图。如图6所示，根据分层可导航小世界算法构建的图索引包括3层，第0层为最底层，第2层为最高层，第0层中包括所有代表向量即中心点向量，第1层中包括第0层中的部分中心点向量，第2层中包括第1层中的部分中心点向量，越往上层其中的中心点向量的个数越少，如第0层中包括8个中心点向量，第1层中包括4个中心点向量，第2层中只有1个中心点向量，节点Q为查询向量。其中，每层中存在边的中心点向量之间的平均距离即特征半径是不断减少的，如第1层中4个中心点向量之间的特征半径大于第0层中8个中心点向量之间的特征半径。Fig. 6 shows a schematic diagram of a graph index constructed according to a hierarchical navigable small world. As shown in Figure 6, the graph index constructed according to the hierarchical navigable small world algorithm includes 3 layers, the 0th layer is the bottom layer, the second layer is the highest layer, the 0th layer includes all the representative vectors, that is, the center point vector, and the first layer is the center point vector. Layer 1 includes some center point vectors in layer 0, and layer 2 includes some center point vectors in layer 1. The number of center point vectors in the upper layer is less, for example, layer 0 includes 8 There are four center point vectors in the first layer, and there is only one center point vector in the second layer, and the node Q is the query vector. Among them, the average distance between the center point vectors with edges in each layer, that is, the feature radius is constantly decreasing. For example, the feature radius between the four center point vectors in the first layer is greater than the sum of the eight center point vectors in the 0th layer. feature radius between.

可以理解，使用分层可导航小世界算法为中心点向量建立图索引仅仅是一种示例，并不构成图索引算法的限制，还可以使用其它能够构建图索引的算法如随机图算法、小世界算法等，本申请实施例对此不做具体限制。It can be understood that the use of the hierarchical navigable small world algorithm to build a graph index for the center point vector is only an example, and does not constitute a limitation of the graph index algorithm. Other algorithms that can build graph indexes such as random graph algorithm, small world Algorithms, etc., are not specifically limited in this embodiment of the present application.

通过使用分层可导航小世界算法为中心点向量建立图索引，能够提高查询向量在图索引中进行距离计算的效率，也能够减小与查询向量进行距离计算的中心点向量的数量，进一步提高范围检索的效率。By using the hierarchical navigable small world algorithm to establish a graph index for the center point vector, the efficiency of distance calculation for the query vector in the graph index can be improved, and the number of center point vectors for distance calculation with the query vector can also be reduced. Efficiency of range retrieval.

步骤S506：对原始向量进行压缩，得到压缩后向量。Step S506: Compress the original vector to obtain a compressed vector.

在此，代表向量对应的原始向量通常是维度很大的高维向量，例如维度为128维、256维等，每个维度通常为浮点数，浮点数占用的存储空间至少为4个字节，通常为8个字节，一个维度为128维的原始向量至少占用512个字节的存储空间，而一个代表向量对应的原始向量通常在几万条以上，所占用的存储空间是非常大的，几千乃至数万代表向量对应的原始向量所占用的存储空间更是巨大，几乎不可能同时加载到内存或缓存中，采用部分加载、轮换调度的方式会导致原始向量在内存和磁盘之间进行频繁的调入调出，会严重降低向量范围检索的效率。Here, the original vector corresponding to the representative vector is usually a high-dimensional vector with a large dimension, such as 128-dimensional, 256-dimensional, etc., each dimension is usually a floating-point number, and the storage space occupied by the floating-point number is at least 4 bytes, Usually 8 bytes, an original vector with a dimension of 128 occupies at least 512 bytes of storage space, while the original vector corresponding to a representative vector is usually more than tens of thousands, and the storage space occupied is very large. The storage space occupied by the original vectors corresponding to thousands or even tens of thousands of representative vectors is huge, and it is almost impossible to load them into the memory or cache at the same time. Partial loading and rotation scheduling will cause the original vectors to be stored between memory and disk. Frequent call-in and call-out will seriously reduce the efficiency of vector range retrieval.

本申请的一些实施例中，可以通过对原始向量中的维度数据进行压缩以实现对原始向量的压缩，从而减少原始向量占用的存储空间。In some embodiments of the present application, the original vector can be compressed by compressing the dimension data in the original vector, thereby reducing the storage space occupied by the original vector.

在一些实施例中，通过乘积量化(Product Quantization，PQ)算法对原始向量进行压缩，得到压缩后向量。乘积量化算法是一种检索算法，这里的乘积是笛卡尔积，通过将高维度向量分解为多个低维度向量的笛卡尔积，并对分解得到的多个低维度向量分别做量化，从而高维度向量可以由多个低维度向量的量化编码组合来表示。In some embodiments, the original vector is compressed by a Product Quantization (PQ) algorithm to obtain a compressed vector. The product quantization algorithm is a retrieval algorithm. The product here is a Cartesian product. By decomposing a high-dimensional vector into the Cartesian product of multiple low-dimensional vectors, and quantizing the multiple low-dimensional vectors obtained by the decomposition, the high A dimensional vector can be represented by a quantized coded combination of multiple low-dimensional vectors.

以下以乘积量化方法为例对原始向量的压缩进行具体说明。在此，原始向量的维度为128维，将原始向量划分为8个子向量，每个子向量的维度为16维。图7示出了根据乘积量化算法对原始向量进行压缩的示意图。如图7所示，原始向量划分为8个子向量y₁至y₈，y₁至y₈子向量的维度为16维，将向量空间中每个原始向量的y₁子向量组成一个16维向量的向量集合，对该向量集合通过K-Means算法进行聚类并将聚类的个数设定为256个，进行聚类后得到256个聚类，为每个聚类确定一个中心点向量，从而可以得到256个中心点向量。对这256个中心点向量依次进行编号，编号可以用一个8位的二进制数字表示。计算y₁子向量与256个中心点向量的距离，将与y₁子向量距离最近的中心点向量对应的编号作为y₁子向量对应的压缩编码，即q₁(y₁)，该压缩编码是一个8位的二进制数字。类似地，将向量空间中每个原始向量的y₂子向量组成向量集合并进行聚类，得到另外的256个中心点向量，计算y₂子向量与该256个中心点向量的距离，将与y₂子向量距离最近的中心点向量对应的编号作为y₂子向量对应的压缩编码，即q₂(y₂)。y₃至y₈子向量的压缩编码依次类推。最终，128维的原始向量可以压缩为8个子向量的8位压缩编码的组合即64位压缩编码。The compression of the original vector will be specifically described below by taking the product quantization method as an example. Here, the dimension of the original vector is 128 dimensions, the original vector is divided into 8 sub-vectors, and the dimension of each sub-vector is 16 dimensions. FIG. 7 shows a schematic diagram of compressing the original vector according to the product quantization algorithm. As shown in Figure 7, the original vector is divided into 8 sub-vectors y ₁ to y ₈ , the dimensions of the y ₁ to y ₈ sub-vectors are 16 dimensions, and the y ₁ sub-vectors of each original vector in the vector space are formed into a 16-dimensional vector The vector set is clustered by K-Means algorithm and the number of clusters is set to 256. After clustering, 256 clusters are obtained, and a center point vector is determined for each cluster. Thus, 256 center point vectors can be obtained. The 256 center point vectors are sequentially numbered, and the number can be represented by an 8-bit binary number. Calculate the distance between the y ₁ sub-vector and the 256 center point vectors, and use the number corresponding to the center point vector with the closest distance to the y ₁ sub-vector as the compression code corresponding to the y ₁ sub-vector, that is, q ₁ (y ₁ ), the compression code is an 8-bit binary number. Similarly, the y ₂ sub-vectors of each original vector in the vector space are formed into a vector set and clustered to obtain another 256 center point vectors, and the distance between the y ₂ sub-vector and the 256 center point vectors is calculated. The number corresponding to the center point vector with the closest distance between the y ₂ sub-vectors is used as the compression code corresponding to the y ₂ sub-vectors, that is, q ₂ (y ₂ ). The compression encoding of the y ₃ to y ₈ sub-vectors is analogous. Finally, the 128-dimensional original vector can be compressed into a combination of 8-bit compression codes of 8 sub-vectors, that is, 64-bit compression codes.

可以理解，使用乘积量化算法对原始向量进行压缩仅仅是一种示例，并不构成对原始向量进行压缩所使用算法的限制，还可以使用其它能够对原始向量进行压缩的算法如标量量化(Scalar Quantization，SQ)、加性量化(Additive Quantization，AQ)等，本申请实施例对此不做具体限制。It can be understood that the use of the product quantization algorithm to compress the original vector is only an example, and does not constitute a limitation of the algorithm used to compress the original vector. Other algorithms that can compress the original vector such as Scalar Quantization can also be used. , SQ), additive quantization (Additive Quantization, AQ), etc., which are not specifically limited in this embodiment of the present application.

在一些实施例中，每个代表向量对应的原始向量分别存储在不同的倒排文件中，可以对原始向量进行压缩后得到的压缩后向量分别存储在与代表向量对应的另一个倒排文件中，该倒排文件用于存储与代表文件对应的压缩后向量。In some embodiments, the original vectors corresponding to each representative vector are stored in different inverted files respectively, and the compressed vectors obtained by compressing the original vectors can be stored in another inverted file corresponding to the representative vector respectively. , the inverted file is used to store the compressed vector corresponding to the representative file.

另外，将向量空间的最大半径、图索引、原始向量的压缩后向量和进行乘积量化时使用的多组中心点向量及其编号进行持久化存储，在进行范围检索时使用。In addition, the maximum radius of the vector space, the graph index, the compressed vector of the original vector, and the multiple sets of center point vectors and their numbers used for product quantization are persistently stored, and used for range retrieval.

本申请实施例通过在作为一级索引的向量空间的中心点向量的基础上建立作为二级索引的图索引，能够提高向量范围检索时的检索效率，同时将中心点向量对应的原始向量进行压缩得到压缩后向量，使得向量范围检索时能够将查询向量在压缩后向量中进行初次检索，排除掉不满足预设检索半径的压缩后向量，能够缩小查询向量的后续检索范围，从而在不降低准确率的基础上提高检索效率。In the embodiment of the present application, by establishing a graph index as a secondary index on the basis of the center point vector of the vector space as a primary index, the retrieval efficiency during vector range retrieval can be improved, and the original vector corresponding to the center point vector can be compressed at the same time. The compressed vector is obtained, so that the query vector can be retrieved in the compressed vector for the first time when the vector range is retrieved, and the compressed vector that does not meet the preset retrieval radius can be excluded, and the subsequent retrieval range of the query vector can be narrowed, so as not to reduce the accuracy Improve retrieval efficiency on the basis of rate.

可以理解，本申请的另外一些实施例中，可以对中心点向量进行压缩以减少占用的存储空间，但是会降低向量范围检索的准确率，在向量范围检索的准确率要求不太高的情况下，可以在存储空间有限的设备上对中心点向量进行压缩。It can be understood that in other embodiments of the present application, the center point vector can be compressed to reduce the occupied storage space, but the accuracy of vector range retrieval will be reduced. , the center point vector can be compressed on devices with limited storage space.

以上实施例介绍了为向量空间中向量建立适合向量范围检索的索引的过程，下面实施例给出根据建立的索引进行向量范围检索过程的解释说明。The above embodiment introduces the process of establishing an index suitable for vector range retrieval for a vector in the vector space. The following embodiment provides an explanation of the process of performing vector range retrieval according to the established index.

图8示出了根据本申请的电子设备实施的向量范围检索的方法的流程示意图。下面以服务器100为查询组件，向量数据库为数据存储设备为例进行说明。如图8所示，根据查询向量和检索半径进行向量范围检索可以包括如下步骤：FIG. 8 shows a schematic flowchart of a method for vector range retrieval implemented by an electronic device according to the present application. The following description takes the server 100 as the query component and the vector database as the data storage device as an example for description. As shown in Figure 8, the vector range retrieval based on the query vector and retrieval radius may include the following steps:

步骤S801：获取查询向量和检索半径。Step S801: Obtain a query vector and a retrieval radius.

在此，服务器100可以首先获取查询向量和用于限制检索结果范围的检索半径，检索半径通常使用表示向量相似度差异的数值来表示，查询向量通常是与向量空间中原始向量的维度相同的向量。Here, the server 100 may first obtain a query vector and a retrieval radius used to limit the range of retrieval results, the retrieval radius is usually represented by a numerical value representing the similarity difference between vectors, and the query vector is usually a vector with the same dimension as the original vector in the vector space .

在一些实施例中，服务器100可以接收非结构化数据，再对非结构化数据进行转换以得到对应的查询向量。用于进行范围检索的非结构化数据可以是文本数据、图片数据、音频数据和视频数据中的一种或多种，例如，非结构化数据可以仅是图片，也可以同时包括文本和图片。In some embodiments, the server 100 may receive unstructured data, and then transform the unstructured data to obtain corresponding query vectors. The unstructured data used for range retrieval can be one or more of text data, picture data, audio data and video data. For example, the unstructured data can be only pictures, or can include both text and pictures.

具体来说，服务器100可以对接收的非结构化数据的类型进行检查，再根据非结构化数据的类型使用不同的向量转换方法进行转换。例如，服务器100对非结构化数据的检查结果是图片，则使用图片转换为向量的方法将图片转换为向量；对非结构化数据的检查结果是文本，则使用文本转换为向量的方法将文本转换为向量。在此，图片转换为向量的方法例如可以是通过深度神经网络提取图片特征向量的方法，文本转换为向量的方法例如可以是通过先分词再将分词转换为对应向量的方法等。Specifically, the server 100 may check the type of the received unstructured data, and then use different vector transformation methods to convert according to the type of the unstructured data. For example, if the inspection result of the unstructured data by the server 100 is an image, the image is converted into a vector using the method of converting image to vector; if the inspection result of the unstructured data is text, the text into vector method is used to convert the text into a vector. Convert to vector. Here, the method of converting a picture into a vector may be, for example, a method of extracting a feature vector of a picture through a deep neural network, and the method of converting a text to a vector may be, for example, a method of firstly segmenting words and then converting the segmented words into corresponding vectors.

在一些实施例中，服务器100可以直接接收查询向量，该查询向量由客户端300对用户提交的非结构化数据进行转换得到，具体来说由客户端300对非结构化数据的类型进行检查并根据相应类型进行向量转换得到查询向量。In some embodiments, the server 100 may directly receive a query vector, where the query vector is obtained by converting the unstructured data submitted by the user by the client 300. Specifically, the client 300 checks the type of unstructured data and obtains it. The query vector is obtained by vector conversion according to the corresponding type.

可以理解，服务器100获取的检索半径可以由用户通过客户端300指定，也可以由预设值确定。例如，客户端300仅向服务器提供了查询向量，未提供检索半径，服务器可以通过读取检索半径的默认值来获取检索半径。It can be understood that the retrieval radius obtained by the server 100 may be specified by the user through the client terminal 300, or may be determined by a preset value. For example, the client 300 only provides the query vector to the server and does not provide the retrieval radius, and the server can obtain the retrieval radius by reading the default value of the retrieval radius.

步骤S802：将图索引和压缩后向量从磁盘加载到内存。Step S802: Load the graph index and the compressed vector from the disk into the memory.

在此，图索引和压缩后向量存储在数据存储设备200的磁盘上，要使用图索引和压缩后向量进行范围检索，需要将图索引和压缩后向量从磁盘加载到内存中。内存的容量通常较小但是访问速度快，磁盘通常容量较大但是访问速度慢，在需要进行范围检索时才将图索引和压缩后向量加载到内存，能够充分利用内存和磁盘的各自特点。Here, the graph index and the compressed vector are stored on the disk of the data storage device 200. To use the graph index and the compressed vector for range retrieval, the graph index and the compressed vector need to be loaded from the disk into the memory. The memory capacity is usually small but the access speed is fast, and the disk capacity is usually large but the access speed is slow. The graph index and the compressed vector are loaded into the memory only when the range retrieval is required, which can make full use of the respective characteristics of the memory and the disk.

可以理解，除了将图索引和压缩后向量从磁盘加载到内存，还可以将图索引和压缩后向量从内存加载到比内存访问速度更快的设备如图形处理器(Graphics ProcessingUnit，GPU)等，本申请实施例对此不做具体限制。It can be understood that in addition to loading the graph index and compressed vector from disk to memory, the graph index and compressed vector can also be loaded from memory to a device with a faster access speed than memory, such as a graphics processor (Graphics Processing Unit, GPU), etc., This embodiment of the present application does not specifically limit this.

在一些实施例中，服务器100可以将图索引和压缩后向量从数据存储设备200的磁盘加载到服务器100的内存中，从而在服务器100上进行范围检索。In some embodiments, the server 100 may load the graph index and the compressed vector from the disk of the data storage device 200 into the memory of the server 100 to perform range retrieval on the server 100 .

在一些实施例中，压缩后向量存储在倒排文件中，服务器100对压缩后向量的加载可以依次读取倒排文件中的压缩后向量，将读取的压缩后向量加载到内存中。In some embodiments, the compressed vector is stored in the inverted file, and the server 100 may sequentially read the compressed vector in the inverted file when loading the compressed vector, and load the read compressed vector into the memory.

图9示出了一种服务器100加载图索引和压缩后向量的示意图。如图9所示，加载后，服务器100的内存910中包括由中心点向量构成的图索引911和多个保存有压缩后向量的倒排文件912，磁盘920上存储有多个保存原始向量的倒排文件921。可以理解，磁盘920可以是服务器100上的磁盘，也可以是数据存储设备200的磁盘。FIG. 9 shows a schematic diagram of the server 100 loading a graph index and a compressed vector. As shown in FIG. 9 , after loading, the memory 910 of the server 100 includes a graph index 911 composed of a center point vector and a plurality of inverted files 912 that store the compressed vectors, and the disk 920 stores a plurality of files that store the original vectors. Invert file 921. It can be understood that the disk 920 may be a disk on the server 100 or a disk of the data storage device 200 .

步骤S803：根据查询向量在图索引中检索，确定目标中心点向量。Step S803: Search in the graph index according to the query vector, and determine the target center point vector.

在此，服务器100还需要将前述构建向量空间的索引过程中存储的向量空间的最大半径加载到内存或缓存中，该最大半径用于描述向量空间中中心点向量与对应的原始向量之间的最大距离。Here, the server 100 also needs to load the maximum radius of the vector space stored in the process of constructing the index of the vector space into the memory or the cache, where the maximum radius is used to describe the distance between the center point vector in the vector space and the corresponding original vector. maximum distance.

在一些实施例中，服务器100根据查询向量和检索半径在图索引中检索，确定满足范围检索条件的目标中心点向量。具体来说，图索引中的节点是中心点向量，图索引中的边表明有边连接的两个中心点向量之间较为相似。本申请的实施例中，中心点向量之间的相似度用两者之间的距离来衡量。将查询向量与图索引中的节点进行相似度计算，并确定两者之间的相似度差异是否满足范围检索的条件，例如，这里的范围检索条件是指相似度差异小于检索半径，在满足范围检索条件的情况下，将该节点确定为目标中心点向量，再对该节点的相邻节点即与该节点有边连接的其它节点进行相似度计算，确定相邻节点中是否存在目标中心点向量，如此进行广度优先搜索，直到查询节点与图索引中节点的相似度差异均大于检索半径。In some embodiments, the server 100 searches the graph index according to the query vector and the search radius, and determines the target center point vector that satisfies the range search condition. Specifically, the node in the graph index is the center point vector, and the edge in the graph index indicates that the two center point vectors connected by an edge are relatively similar. In the embodiment of the present application, the similarity between the center point vectors is measured by the distance between them. Calculate the similarity between the query vector and the nodes in the graph index, and determine whether the similarity difference between the two satisfies the conditions for range retrieval. In the case of retrieval conditions, the node is determined as the target center point vector, and then the adjacent nodes of the node, that is, other nodes that are edge-connected to the node, perform similarity calculation to determine whether there is a target center point vector in the adjacent nodes. , and perform a breadth-first search until the similarity difference between the query node and the nodes in the graph index is greater than the retrieval radius.

以下以分层可导航小世界算法构建的图索引为例进行说明。如图6所示，通过分层可导航小世界算法建立的图索引为3层，第0层包括8个中心点向量，第1层包括4个中心点向量，第2层包括一个中心点向量。进行检索时，查询向量Q首先从最上层即第2层开始，由于第2层只有一个中心点向量A₁，与查询向量Q的距离最小的中心点向量就是A₁，将A₁作为目标中心点向量；接下来在下面一层即第1层中进行检索，从中心点向量B₁开始进行检索，B₁是与A₁完全相同的中心点向量，B₁有1个相邻节点B₂，计算查询向量Q与中心点向量B₂的距离，同时再计算B₂的相邻节点与查询向量Q之间的距离，B₂还有另外2个相邻节点B₃和B₄，即再计算查询向量Q分别与中心点向量B₃和B₄的距离，比较这3个距离可知，与查询向量Q的距离最小的是中心点向量B₃，将B₃作为最新的目标中心点向量；接下来在最下层即第0层中进行检索，从中心点向量C₃开始进行检索，C₃是与B₃完全相同的中心点向量，C₃有2个相邻节点C₅和C₇，分别计算查询向量Q与C₅和C₇的距离，再对C₅和C₇的相邻节点进行查询向量Q的距离计算，C₅有另外的相邻节点C₆，C₇有另外的相邻节点C₈，因此再计算查询向量Q与C₆和C₈的距离，比较这些计算出的距离可知C₆是与查询向量Q距离最近且距离小于检索半径与向量空间的最大半径之和的中心点向量，因此将C₆确定为最新的目标中心点向量。另外，通过计算查询向量Q与C₆和C₈的相邻节点的距离可知得到的距离均大于检索半径和向量空间的最大半径之和，因此停止进行广度优先检索，得到的C₆为最终的目标中心点向量。The following description takes the graph index constructed by the hierarchical navigable small world algorithm as an example. As shown in Figure 6, the graph index established by the hierarchical navigable small world algorithm is 3 layers, the 0th layer includes 8 center point vectors, the first layer includes 4 center point vectors, and the second layer includes a center point vector . When searching, the query vector Q starts from the top layer, that is, the second layer. Since the second layer has only one center point vector A ₁ , the center point vector with the smallest distance from the query vector Q is A ₁ , and A ₁ is used as the target center. Point vector; then search in the next layer, that is, the first layer, starting from the center point vector B ₁ , B ₁ is the exact same center point vector as A ₁ , and B ₁ has 1 adjacent node B ₂ , calculate the distance between the query vector Q and the center point vector B ₂ , and at the same time calculate the distance between the adjacent nodes of B ₂ and the query vector Q, B ₂ has two other adjacent nodes B ₃ and B ₄ , that is, again Calculate the distances between the query vector Q and the center point vectors B ₃ and B ₄ respectively. Comparing these three distances, it can be seen that the center point vector B ₃ has the smallest distance from the query vector Q, and B ₃ is used as the latest target center point vector; Next, the search is carried out in the lowest layer, the 0th layer, starting from the center point vector C ₃ , which is the exact same center point vector as B ₃ , and C ₃ has ₂ adjacent nodes C ₅ and C ₇ , Calculate the distance between the query vector Q and C ₅ and C ₇ respectively, and then calculate the distance of the query vector Q for the adjacent nodes of C ₅ and C _7. C ₅ has another adjacent node C ₆ , and C ₇ has another phase. Adjacent node C ₈ , so the distance between query vector Q and C ₆ and C ₈ is calculated again. Comparing these calculated distances, it can be seen that C ₆ is the closest to the query vector Q and the distance is less than the sum of the retrieval radius and the maximum radius of the vector space. center point vector, so C ₆ is determined as the latest target center point vector. _In addition, by calculating the distance between the query vector Q and the adjacent nodes of _C6 and C8, it can be seen that the obtained distance is greater than the sum of the retrieval radius and the maximum radius of the vector space, so stop the breadth-first retrieval, and the obtained _C6 is the final Target center point vector.

在一些实施例中，在分层可导航小世界算法构建的图索引中进行范围检索时，目标中心点向量与查询向量之间的距离小于检索半径与向量空间的最大半径之和。通过将目标中心点向量与查询向量之间的距离扩大到检索半径与向量空间的最大半径之和的范围，能够避免遗漏目标中心点向量，进而避免遗漏可能的检索结果。In some embodiments, when performing range retrieval in a graph index constructed by a hierarchical navigable small world algorithm, the distance between the target center point vector and the query vector is less than the sum of the retrieval radius and the maximum radius of the vector space. By expanding the distance between the target center point vector and the query vector to the range of the sum of the retrieval radius and the maximum radius of the vector space, it is possible to avoid missing the target center point vector, thereby avoiding missing possible retrieval results.

在一些实施例中，在分层可导航小世界算法构建的图索引中进行范围检索时，以查询向量与通过广度优先得到的中心点向量之间的距离均大于检索半径和向量空间的最大半径之和作为停止检索的条件。通过上述条件来确定在图索引中停止检索的时间，能够避免扩大范围检索的范围，减少查询向量与中心点向量的计算量，提高范围检索的效率，同时也能够保证检索结果的准确性，避免遗漏可能的检索结果。In some embodiments, when performing range retrieval in the graph index constructed by the hierarchical navigable small world algorithm, the distance between the query vector and the center point vector obtained by breadth first is greater than the retrieval radius and the maximum radius of the vector space. The sum is used as the condition to stop the retrieval. Using the above conditions to determine the time to stop the retrieval in the graph index can avoid expanding the scope of the range retrieval, reduce the calculation amount of the query vector and the center point vector, improve the efficiency of the range retrieval, and also ensure the accuracy of the retrieval results. Missing possible search results.

可以理解，得到的目标中心点向量可以是一个或多个，本申请实施例对此不做具体限制。It can be understood that there may be one or more target center point vectors, which are not specifically limited in this embodiment of the present application.

步骤S804：获取目标中心点向量对应的压缩后向量，并确定满足范围检索条件的目标压缩后向量。Step S804: Obtain the compressed vector corresponding to the target center point vector, and determine the target compressed vector that satisfies the range retrieval condition.

在一些实施例中，可以将查询向量与压缩后向量进行距离计算，并将满足预设的距离阈值条件的压缩后向量确定为目标压缩后向量。在此，预设的距离阈值可以是检索半径的alpha倍，alpha是一个经验值，可以根据生产环境中的经验来设定。In some embodiments, a distance calculation may be performed between the query vector and the compressed vector, and a compressed vector satisfying a preset distance threshold condition is determined as the target compressed vector. Here, the preset distance threshold can be alpha times the retrieval radius, and alpha is an empirical value that can be set based on experience in a production environment.

具体来说，将查询向量与通过乘积量化算法得到的压缩后向量进行乘积量化距离计算，将满足乘积量化距离小于与检索半径相关的距离阈值的压缩后向量作为目标压缩后向量。Specifically, the product quantization distance is calculated between the query vector and the compressed vector obtained by the product quantization algorithm, and the compressed vector that satisfies the product quantization distance less than the distance threshold related to the retrieval radius is used as the target compressed vector.

通过上述方法可以对压缩后向量进行筛选，挑选出可能在范围检索的检索范围内的压缩后向量，从而减小了后续检索的检索范围，能够提高范围检索的效率。Through the above method, the compressed vectors can be screened, and the compressed vectors that may be within the retrieval range of the range retrieval can be selected, thereby reducing the retrieval scope of the subsequent retrieval and improving the efficiency of the scope retrieval.

在一些实施例中，可以直接将查询向量与通过乘积量化算法得到的压缩后向量进行乘积量化距离计算。In some embodiments, the query vector and the compressed vector obtained by the product quantization algorithm may be directly used for product quantization distance calculation.

例如，将维度为128维的原始向量划分为8个16维的子向量，将多个原始向量中相同位置的子向量进行聚类得到256个16维的中心点向量，将原始向量的8个子向量分别使用距离最近的中心点向量的标识进行替换得到压缩后向量，压缩后向量包括8个中心点向量的标识，在查询向量与压缩后向量进行乘积量化距离计算时，将压缩后向量根据中心点向量的标识还原为由8个中心点向量组成的向量，最后将该向量与查询向量进行距离计算。在此，由于中心点向量在根据查询向量进行范围查询前就已经确定，因此可以预先计算中心点向量之间的距离，进行范围查询时可以通过查询距离表来加速相关的距离计算。For example, divide a 128-dimensional original vector into 8 16-dimensional sub-vectors, cluster the sub-vectors at the same position in multiple original vectors to obtain 256 16-dimensional center point vectors, and divide the 8 sub-vectors of the original vector into The vectors are respectively replaced with the identifier of the center point vector with the closest distance to obtain the compressed vector. The compressed vector includes the identifiers of 8 center point vectors. When the query vector and the compressed vector are multiplied and quantized for distance calculation, the compressed vector is calculated according to the center point vector. The identification of the point vector is restored to a vector consisting of 8 center point vectors, and finally the distance between the vector and the query vector is calculated. Here, since the center point vector has been determined before the range query is performed according to the query vector, the distance between the center point vectors can be pre-calculated, and the distance table can be queried to speed up the related distance calculation when performing the range query.

在另外一些实施例中，可以将查询向量按照与压缩后向量一致的子向量数量、中心点向量数量及对应编号等参数进行乘积量化，再将量化后的查询向量与压缩后向量进行乘积量化距离计算。具体计算方法与上述直接将查询向量与压缩后向量进行乘积量化距离计算的方法类似。In other embodiments, the query vector may be multiplied and quantized according to parameters such as the number of sub-vectors consistent with the compressed vector, the number of center point vectors, and the corresponding number, and then the quantized query vector and the compressed vector may be multiplied to quantize the distance. calculate. The specific calculation method is similar to the above-mentioned method of directly calculating the product quantization distance between the query vector and the compressed vector.

步骤S805：获取目标压缩后向量对应的原始向量，并确定满足范围检索条件的目标原始向量。Step S805: Obtain the original vector corresponding to the target compressed vector, and determine the target original vector that satisfies the range retrieval condition.

在此，服务器100在确定目标压缩向量后，再从磁盘上加载目标压缩后向量对应的原始向量，从而避免将占用较大存储空间的原始向量进行预先加载，减少内存的过度占用。Here, after determining the target compression vector, the server 100 loads the original vector corresponding to the target compressed vector from the disk, so as to avoid preloading the original vector that occupies a large storage space and reduce excessive memory usage.

在一些实施例中，服务器100获取压缩后向量对应的原始向量，可以通过压缩后向量对应的中心点向量来确定该压缩后向量对应的原始向量，例如，中心点向量有对应的2个倒排文件，一个倒排文件存储压缩后向量，另一个倒排文件存储原始向量，根据压缩后向量在倒排文件中的序号将另一个倒排文件中相同序号的原始向量确定为该压缩后向量对应的原始向量。In some embodiments, the server 100 obtains the original vector corresponding to the compressed vector, and the original vector corresponding to the compressed vector may be determined by the center point vector corresponding to the compressed vector. For example, the center point vector has two corresponding inversions file, one inverted file stores the compressed vector, and the other inverted file stores the original vector. According to the sequence number of the compressed vector in the inverted file, the original vector with the same sequence number in the other inverted file is determined as the corresponding compressed vector. the original vector.

在一些实施例中，将查询向量与原始向量进行距离计算，将距离小于检索半径的原始向量确定为目标原始向量。这里的范围检索条件是查询向量与原始向量的距离小于检索半径。在此，将查询向量与可能在范围检索的检索半径内的原始向量直接进行距离计算，可以直接确定原始向量是否满足范围检索的检索半径要求，能够提高范围检索的准确性，避免将不在检索半径内的原始向量作为检索结果。In some embodiments, the distance between the query vector and the original vector is calculated, and the original vector whose distance is smaller than the retrieval radius is determined as the target original vector. The range retrieval condition here is that the distance between the query vector and the original vector is less than the retrieval radius. Here, the distance calculation between the query vector and the original vector that may be within the retrieval radius of the range retrieval can directly determine whether the original vector meets the retrieval radius requirement of the range retrieval, which can improve the accuracy of the range retrieval and avoid the possibility of not being in the retrieval radius. The original vector within is used as the retrieval result.

步骤S806：返回目标原始向量。在此，将目标原始向量作为范围检索的检索结果返回给客户端300。Step S806: Return to the original target vector. Here, the target original vector is returned to the client 300 as the retrieval result of the range retrieval.

可以理解，服务器100可以将目标原始向量直接返回给客户端300，也可以将目标原始向量转换为非结构化数据，将非结构化数据返回给客户端300，本申请实施例对此不做具体限制。It can be understood that the server 100 can directly return the target original vector to the client 300, or can convert the target original vector into unstructured data, and return the unstructured data to the client 300, which is not specifically described in this embodiment of the present application. limit.

在一些实施例中，服务器100可以将目标原始向量转换为客户端300提交的非结构化数据的类型，例如，用户通过客户端300提交的非结构化数据的类型是图片，则服务器100将目标原始向量转换为图片，即实现向量到图片的转换；用户通过客户端300提交的非结构化数据的类型是视频，则服务器100将目标原始向量转换为视频，即实现向量到视频的转换。In some embodiments, the server 100 can convert the target original vector into the type of unstructured data submitted by the client 300. For example, if the type of unstructured data submitted by the user through the client 300 is a picture, the server 100 converts the target The original vector is converted into a picture, that is, the conversion from vector to picture is realized; the type of unstructured data submitted by the user through the client 300 is video, then the server 100 converts the target original vector into a video, that is, realizes the conversion from vector to video.

在一些实施例中，用户可以通过客户端300指定返回范围检索结果的类型，服务器100得到目标原始向量后，根据用户指定的返回检索结果的类型对目标原始向量进行转换，例如，用户通过客户端300指定返回检索结果的类型为图片，即使用户提交的非结构化数据的类型为文本，服务器100仍会将目标原始向量转换为图片并返回给客户端300。In some embodiments, the user can specify the type of the returned range retrieval result through the client 300. After the server 100 obtains the target original vector, it converts the target original vector according to the type of the returned retrieval result specified by the user. 300 specifies that the type of the returned retrieval result is an image. Even if the type of unstructured data submitted by the user is text, the server 100 will still convert the target original vector into an image and return it to the client 300 .

可以理解，用户指定的返回检索结果可以为多种类型，服务器100将目标原始向量转换为相应的多种非结构化数据的类型后返回至客户端300。It can be understood that the returned retrieval result specified by the user can be of various types, and the server 100 converts the target original vector into corresponding various types of unstructured data and returns it to the client 300 .

可以理解，服务器100将目标原始向量直接返回给客户端300，客户端300可以完成上述将目标原始向量转换为相应的非结构化数据的处理工作。It can be understood that the server 100 directly returns the target original vector to the client 300, and the client 300 can complete the above processing work of converting the target original vector into corresponding unstructured data.

根据本申请的一个实施例，还提供了一种向量范围检索的装置。如图10所示，向量范围检索的装置1000包括：According to an embodiment of the present application, an apparatus for vector range retrieval is also provided. As shown in Figure 10, the apparatus 1000 for vector range retrieval includes:

获取模块1001，用于获取查询向量和检索半径；an acquisition module 1001, used to acquire a query vector and a retrieval radius;

加载模块1002，用于将图索引和压缩后向量从第一存储区加载到第二存储区，第一存储区的读/写速度小于第二存储区；The loading module 1002 is used to load the graph index and the compressed vector from the first storage area to the second storage area, and the read/write speed of the first storage area is lower than that of the second storage area;

目标代表向量确定模块1003，用于在第二存储区至少根据查询向量和检索半径，确定图索引中与查询向量距离最接近的至少一个目标代表向量；The target representative vector determination module 1003 is used to determine, in the second storage area, at least one target representative vector whose distance is closest to the query vector in the graph index at least according to the query vector and the retrieval radius;

目标压缩后向量确定模块1004，用于在第二存储区获取至少一个目标代表向量对应的压缩后向量，并根据查询向量与压缩后向量之间的距离确定目标压缩后向量；The target compressed vector determination module 1004 is used to obtain the compressed vector corresponding to at least one target representative vector in the second storage area, and determine the target compressed vector according to the distance between the query vector and the compressed vector;

目标原始向量确定模块1005，用于根据目标压缩后向量，从第一存储区获取目标压缩后向量对应的原始向量，并根据查询向量与原始向量之间的距离确定目标原始向量。The target original vector determination module 1005 is configured to obtain the original vector corresponding to the target compressed vector from the first storage area according to the target compressed vector, and determine the target original vector according to the distance between the query vector and the original vector.

根据本申请的一个实施例，还提供了一种构建向量索引的装置。如图11所示，构建向量索引的装置1100包括：According to an embodiment of the present application, an apparatus for constructing a vector index is also provided. As shown in FIG. 11, the apparatus 1100 for constructing a vector index includes:

获取模块1101，用于获取向量数据集；an acquisition module 1101, used to acquire a vector data set;

聚类模块1102，用于根据向量数据集确定多个聚类及聚类的代表向量；Clustering module 1102, configured to determine a plurality of clusters and representative vectors of clusters according to the vector data set;

图索引创建模块1103，用于基于代表向量建立图索引；The graph index creation module 1103 is used for establishing graph index based on the representative vector;

压缩模块1104，用于获取代表向量对应的原始向量，并对原始向量进行压缩得到压缩后向量。The compression module 1104 is configured to obtain the original vector corresponding to the representative vector, and compress the original vector to obtain a compressed vector.

本申请公开的机制的各实施例可以被实现在硬件、软件、固件或这些实现方法的组合中。本申请的实施例可实现为在可编程系统上执行的计算机程序或程序代码，该可编程系统包括至少一个处理器、存储系统(包括易失性和非易失性存储器和/或存储元件)、至少一个输入设备以及至少一个输出设备。Embodiments of the mechanisms disclosed herein may be implemented in hardware, software, firmware, or a combination of these implementation methods. Embodiments of the present application may be implemented as a computer program or program code executing on a programmable system including at least one processor, a storage system (including volatile and nonvolatile memory and/or storage elements) , at least one input device, and at least one output device.

可将程序代码应用于输入指令，以执行本申请描述的各功能并生成输出信息。可以按已知方式将输出信息应用于一个或多个输出设备。为了本申请的目的，处理系统包括具有诸如例如数字信号处理器(Digital Signal Processor，DSP)、微控制器、专用集成电路(Application Specific Integrated Circuit，ASIC)或微处理器之类的处理器的任何系统。Program code may be applied to input instructions to perform the functions described herein and to generate output information. The output information can be applied to one or more output devices in a known manner. For the purposes of this application, a processing system includes any processor having a processor such as, for example, a Digital Signal Processor (DSP), a microcontroller, an Application Specific Integrated Circuit (ASIC), or a microprocessor system.

程序代码可以用高级程序化语言或面向对象的编程语言来实现，以便与处理系统通信。在需要时，也可用汇编语言或机器语言来实现程序代码。事实上，本申请中描述的机制不限于任何特定编程语言的范围。在任一情形下，该语言可以是编译语言或解释语言。The program code may be implemented in a high-level procedural language or an object-oriented programming language to communicate with the processing system. The program code may also be implemented in assembly or machine language, if desired. In fact, the mechanisms described in this application are not limited in scope to any particular programming language. In either case, the language may be a compiled language or an interpreted language.

至少一个实施例的一个或多个方面可以由存储在计算机可读存储介质上的表示性指令来实现，指令表示处理器中的各种逻辑，指令在被机器读取时使得该机器制作用于执行本文所述的技术的逻辑。被称为“IP核”的这些表示可以被存储在有形的计算机可读存储介质上，并被提供给多个客户或生产设施以加载到实际制造该逻辑或处理器的制造机器中。One or more aspects of at least one embodiment may be implemented by representative instructions stored on a computer-readable storage medium, the instructions representing various logic in a processor, the instructions, when read by a machine, cause the machine to make Logic that implements the techniques described herein. These representations, referred to as "IP cores," may be stored on tangible computer-readable storage media and provided to multiple customers or production facilities for loading into the manufacturing machines that actually manufacture the logic or processors.

在一些情况下，所公开的实施例可以以硬件、固件、软件或其任何组合来实现。所公开的实施例还可以被实现为由一个或多个暂时或非暂时性机器可读(例如，计算机可读)存储介质承载或存储在其上的指令，其可以由一个或多个处理器读取和执行。例如，指令可以通过网络或通过其他计算机可读介质分发。因此，机器可读介质可以包括用于以机器(例如，计算机)可读的形式存储或传输信息的任何机制，包括但不限于，软盘、光盘、光碟、只读存储器(CD-ROMs)、磁光盘、只读存储器(Read Only Memory，ROM)、随机存取存储器(RandomAccess Memory，RAM)、可擦除可编程只读存储器(Erasable Programmable Read OnlyMemory，EPROM)、电可擦除可编程只读存储器(Electrically Erasable ProgrammableRead-Only Memory，EEPROM)、磁卡或光卡、闪存、或用于利用因特网以电、光、声或其他形式的传播信号来传输信息(例如，载波、红外信号数字信号等)的有形的机器可读存储器。因此，机器可读介质包括适合于以机器(例如计算机)可读的形式存储或传输电子指令或信息的任何类型的机器可读介质。In some cases, the disclosed embodiments may be implemented in hardware, firmware, software, or any combination thereof. The disclosed embodiments can also be implemented as instructions carried by or stored on one or more transitory or non-transitory machine-readable (eg, computer-readable) storage media, which can be executed by one or more processors read and execute. For example, the instructions may be distributed over a network or over other computer-readable media. Thus, a machine-readable medium can include any mechanism for storing or transmitting information in a form readable by a machine (eg, a computer), including, but not limited to, floppy disks, optical disks, optical disks, read only memories (CD-ROMs), magnetic Optical disc, Read Only Memory (ROM), Random Access Memory (RAM), Erasable Programmable Read Only Memory (EPROM), Electrically Erasable Programmable Read Only Memory (Electrically Erasable Programmable Read-Only Memory, EEPROM), magnetic or optical cards, flash memory, or devices used to transmit information (eg, carrier waves, infrared digital signals, etc.) using the Internet in electrical, optical, acoustic, or other forms of propagation signals Tangible machine-readable storage. Thus, machine-readable media includes any type of machine-readable media suitable for storing or transmitting electronic instructions or information in a form readable by a machine (eg, a computer).

在附图中，可以以特定布置和/或顺序示出一些结构或方法特征。然而，应该理解，可能不需要这样的特定布置和/或排序。而是，在一些实施例中，这些特征可以以不同于说明性附图中所示的方式和/或顺序来布置。另外，在特定图中包括结构或方法特征并不意味着暗示在所有实施例中都需要这样的特征，并且在一些实施例中，可以不包括这些特征或者可以与其他特征组合。In the drawings, some structural or method features may be shown in specific arrangements and/or sequences. It should be understood, however, that such specific arrangements and/or orderings may not be required. Rather, in some embodiments, the features may be arranged in a manner and/or order different from that shown in the illustrative figures. Additionally, the inclusion of structural or method features in a particular figure is not meant to imply that such features are required in all embodiments, and in some embodiments such features may not be included or may be combined with other features.

需要说明的是，本申请各设备实施例中提到的各单元/模块都是逻辑单元/模块，在物理上，一个逻辑单元/模块可以是一个物理单元/模块，也可以是一个物理单元/模块的一部分，还可以以多个物理单元/模块的组合实现，这些逻辑单元/模块本身的物理实现方式并不是最重要的，这些逻辑单元/模块所实现的功能的组合才是解决本申请所提出的技术问题的关键。此外，为了突出本申请的创新部分，本申请上述各设备实施例并没有将与解决本申请所提出的技术问题关系不太密切的单元/模块引入，这并不表明上述设备实施例并不存在其它的单元/模块。It should be noted that each unit/module mentioned in each device embodiment of this application is a logical unit/module. Physically, a logical unit/module may be a physical unit/module or a physical unit/module. A part of a module can also be implemented by a combination of multiple physical units/modules. The physical implementation of these logical units/modules is not the most important, and the combination of functions implemented by these logical units/modules is the solution to the problem of this application. The crux of the technical question raised. In addition, in order to highlight the innovative part of the present application, the above-mentioned device embodiments of the present application do not introduce units/modules that are not closely related to solving the technical problems raised in the present application, which does not mean that the above-mentioned device embodiments do not exist. other units/modules.

需要说明的是，在本专利的示例和说明书中，诸如第一和第二等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来，而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且，术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含，从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素，而且还包括没有明确列出的其他要素，或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的情况下，由语句“包括一个”限定的要素，并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。It should be noted that, in the examples and specification of this patent, relational terms such as first and second, etc. are only used to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply that Any such actual relationship or sequence exists between these entities or operations. Moreover, the terms "comprising", "comprising" or any other variation thereof are intended to encompass a non-exclusive inclusion such that a process, method, article or device that includes a list of elements includes not only those elements, but also includes not explicitly listed or other elements inherent to such a process, method, article or apparatus. Without further limitation, an element qualified by the phrase "comprising a" does not preclude the presence of additional identical elements in a process, method, article, or device that includes the element.

虽然通过参照本申请的某些优选实施例，已经对本申请进行了图示和描述，但本领域的普通技术人员应该明白，可以在形式上和细节上对其作各种改变，而不偏离本申请的精神和范围。Although the present application has been illustrated and described with reference to certain preferred embodiments thereof, it will be understood by those of ordinary skill in the art that various changes in form and detail may be made therein without departing from the present disclosure The spirit and scope of the application.

Claims

1. a method for vector range retrieval, for electronic equipment, is characterized in that, the method comprises:

Get the query vector and retrieval radius;

Loading the graph index and the compressed vector from the first storage area to the second storage area, wherein the read/write speed of the first storage area is lower than that of the second storage area;

In the second storage area, at least one target representative vector whose distance is closest to the query vector in the graph index is determined according to at least the query vector and the retrieval radius;

Obtain the compressed vector corresponding to the at least one target representative vector in the second storage area, and determine the target compressed vector at least according to the distance between the query vector and the compressed vector;

According to the target compressed vector, the original vector corresponding to the target compressed vector is obtained from the first storage area, and the target original vector is determined according to the distance between the query vector and the original vector.

2. The method according to claim 1, wherein, at least according to the query vector and the retrieval radius, determining at least one target representative vector that is closest to the query vector distance in the graph index, comprising:

determining the query vector and the retrieval radius, and further determining the maximum radius of the vector data set;

The representative vector in the graph index is acquired in a breadth-first manner, and at least one target representative vector whose distance from the query vector is smaller than the sum of the retrieval radius and the maximum radius of the vector data set is determined.

3. The method according to claim 1, wherein determining the target compressed vector at least according to the distance between the query vector and the compressed vector, comprising:

calculating the product quantized distance between the query vector and the compressed vector;

A target compressed vector whose product quantization distance is smaller than a preset multiple of the retrieval radius is determined.

4. The method according to claim 1, wherein determining the target original vector according to the distance between the query vector and the original vector, comprising:

calculating the distance between the query vector and the original vector;

A target raw vector whose distance is less than the retrieval radius is determined.

5. The method according to claim 1, wherein the method further comprises:

Convert the target raw vector to unstructured data, and return the unstructured data.

6. A method for constructing a vector index for electronic equipment, characterized in that the method comprises:

get vector dataset;

Determine a plurality of clusters and representative vectors of the clusters according to the vector data set;

establishing a graph index based on the representative vector;

Obtain the original vector corresponding to the representative vector, and compress the original vector to obtain a compressed vector.

7. The method according to claim 6, wherein after determining a plurality of clusters and representative vectors of the clusters according to the vector data set, the method further comprises:

The clustering radii of a plurality of the clusters in the vector data set are obtained, and the maximum value of the clustering radii is used as the maximum radius of the vector data set.

8. The method according to claim 7, wherein obtaining the cluster radius of a plurality of the clusters in the vector data set comprises:

The distances between the representative vector of the cluster and the plurality of original vectors in the cluster are respectively calculated, and the maximum value among the plurality of distances is used as the cluster radius.

9. The method according to claim 6, wherein the compressed vector is obtained by compressing the original vector, comprising:

obtaining a plurality of first sub-vectors of a plurality of the original vectors in the vector data set;

Determine a plurality of sub-vector clusters and sub-center point vectors corresponding to the sub-vector clusters according to the plurality of first sub-vectors;

Determine the first sub-center point vector with the closest distance to the first sub-vector of the original vector, and determine the number of the first sub-center point vector as the compressed vector corresponding to the first sub-vector;

A combination of compressed vectors corresponding to each sub-vector in the original vector is determined as a compressed vector corresponding to the original vector.

10. A device for vector range retrieval, characterized in that the device comprises:

The acquisition module is used to obtain the query vector and retrieval radius;

a loading module for loading the graph index and the compressed vector from the first storage area to the second storage area, wherein the read/write speed of the first storage area is lower than that of the second storage area;

a target representative vector determination module, configured to determine, in the second storage area, at least one target representative vector that is closest to the query vector in the graph index according to at least the query vector and the retrieval radius;

A target compressed vector determination module, configured to obtain the compressed vector corresponding to the at least one target representative vector in the second storage area, and determine the compressed target according to the distance between the query vector and the compressed vector vector;

A target original vector determination module, configured to obtain the original vector corresponding to the target compressed vector from the first storage area according to the target compressed vector, and according to the distance between the query vector and the original vector Determine the target original vector.

11. A device for constructing a vector index, characterized in that the device comprises:

The acquisition module is used to acquire the vector dataset;

a clustering module for determining a plurality of clusters and representative vectors of the clusters according to the vector data set;

A graph index creation module for establishing graph indexes based on the representative vector;

A compression module, configured to obtain an original vector corresponding to the representative vector, and compress the original vector to obtain a compressed vector.

12. An electronic device, characterized in that, comprising:

memory for storing instructions for execution by one or more processors of the electronic device, and

The processor is one of the processors of the electronic device, and is used to execute the method for vector range retrieval described in any one of claims 1-5 or the method for constructing a vector index according to any one of claims 6-9 .

13. A readable storage medium, wherein an instruction is stored on the readable storage medium, and when the instruction is executed on an electronic device, the electronic device executes the vector range described in any one of claims 1-5 The method of retrieval or the method of constructing a vector index according to any one of claims 6-9.

14. A computer program product, comprising a computer program/instruction, characterized in that, when the computer program/instruction is executed by a processor, the method for vector range retrieval described in any one of claims 1-5 or claim 6 is implemented The method for constructing a vector index according to any one of -9.