CN110046268B

CN110046268B - High-dimensional space kNN query method based on inverted position sensitive hash index

Info

Publication number: CN110046268B
Application number: CN201910325257.7A
Authority: CN
Inventors: 季长清; 汪祖民
Original assignee: Dalian University
Current assignee: Dalian University
Priority date: 2016-02-05
Filing date: 2016-02-05
Publication date: 2024-04-05
Anticipated expiration: 2036-02-05
Also published as: CN105760469A; CN105760469B; CN110046268A; CN110059208A

Abstract

The divisional application discloses a method for establishing a high-dimensional space kNN query based on an inverted position-sensitive hash index, belongs to the fields of big data and mobile application, solves the problem of reducing the cost of the high-dimensional index, requires more features for indexing pictures to determine a specific picture, requires similar data of other features, given another point P, the same data table is built, R is taken as radius, P is taken as circle center, adjacent data points are found, the same point M possibly appears in the nearest neighbors of P and q points, the probability that M is the characteristic data point of the index is particularly high, and the like, and hashing is carried out until final data are found, so that the query efficiency is improved.

Description

High-dimensional space kNN query method based on inverted position sensitive hash index

The application is a divisional application of the invention patent application with the application number of 201610083263.2 and the application date of 2016-02-05, and the invention is named as a high-dimensional approximate image retrieval method based on inverted LSH in a cloud computing environment.

Technical Field

The invention belongs to the application field of large-scale space-time data processing and mobile technology, and relates to a high-dimensional approximate image retrieval method based on inverted LSH in a cloud computing environment

Background

At present, the network basically covers the life of people, and the mobile phone surfing becomes a main surfing mode. By 2014 and 6 months, in the Internet surfing equipment of the national citizens, the mobile phone utilization rate reaches 83.4 percent, the mobile phone is superior to the integral utilization rate of 80.9 percent of the traditional PC (using a desktop computer and a notebook computer) for the first time, and the position of the mobile phone serving as the first large Internet surfing terminal equipment is more consolidated. The development of information technology is very rapid, the quantity of various forms of information is also rapidly increasing, along with the diversification and complexity of the user search requirements, the user is not satisfied with simple text search any more, and the image is taken as an important information carrier, so that the daily life is full of rich and various image information. For example, the user can see the favorite beautiful head portrait, want to find similar head portrait, see a piece of clothes or skirt, want to find similar money, etc. which is inconvenient to express by words, and has the searching requirement of picture reference.

The development of information technology is very rapid today, the amount of information in various forms is also rapidly increasing, and as the requirements for user retrieval are diversified and complicated, users are no longer satisfied with simple text retrieval, but are more inclined to the information retrieval of images. The daily life is full of rich and diverse image information. For example, the user may see a favorite avatar, want to find a similar style, or see a garment or skirt, want to find a similar try, etc. The situation is inconvenient to express by words, but the retrieval requirement of a user can be met very quickly by using pictures. Smart phones are of course indispensable as collectors of images. From the new data it is known that 2015 global smart phone users will reach 19.1 billion and 2016 the index will increase by 12.6% to 21.6 billion. Smartphones will gradually occupy the information communication market.

It has become a crucial research hotspot in today's image retrieval field to investigate how a user should quickly find his own desired information from so many choices of images, and how to provide a quick and efficient way to retrieve images. In the existing research work, it is common practice to extract high-dimensional features from high-dimensional data of an image according to a specific method (such as a sift operator commonly used in the image), and then build an index according to the features to increase the query speed. However, under different data characteristics, the vector dimension is usually up to tens or hundreds of dimensions, and the data volume of each dimension is very large, which requires that the high-dimensional index structure has better dimension expansibility, that is, the index can still maintain better performance with the increase of the dimension. Unfortunately, most conventional spatial indexing techniques now suffer from dimensional disasters, such as rtrees and Voronoi indexes, and in general, the current high-dimensional feature indexing techniques suffer from the following disadvantages: (1) Most traditional index structures have poor expansibility and encounter dimension disaster problems; (2) Most traditional indexing mechanisms make certain assumptions (such as uniform distribution) about data distribution when dividing data space, and are usually different from real distribution (such as oblique distribution, zipf normal distribution, etc.) of data; (3) Most high-dimensional index structures have high spatial and temporal complexity and poor accuracy.

Disclosure of Invention

In order to solve the problem that the existing position-sensitive hash index cannot adapt to the distributed index, the invention provides a high-dimensional approximate image retrieval method based on inverted LSH in a cloud computing environment, and the position-sensitive hash index can adapt to the distributed index.

In order to achieve the above purpose, the present invention adopts the following technical scheme: a high-dimensional approximate image retrieval method based on inverted LSH in cloud computing environment comprises the following steps: the client acquires and extracts picture features and communicates with the cloud center service system; the cloud center service system establishes a distributed inverted index based on the position sensitive hash and inquires a neighbor image corresponding to the acquired picture.

The beneficial effects are that: because the cloud center service system establishes the inverted position-sensitive hash index, the position-sensitive hash index can adapt to distributed query, so that the problems of overlarge information quantity, inconsistent required information and display pictures and the like are solved, a user is helped to save the time of searching and querying as much as possible.

Drawings

FIG. 1 is a schematic diagram of a large-scale high-dimensional image retrieval based on inverted position-sensitive hash index in a cloud computing environment;

FIG. 2 illustrates a kNN algorithm process based on distributed inverted grid indexes of the present invention;

FIG. 3 is a functional block diagram of the present invention;

FIG. 4 is a flow chart of image retrieval of the present invention;

FIG. 5 is a flow chart for implementing image lookup in accordance with the present invention.

Detailed Description

Example 1:a high-dimensional approximate image retrieval method based on inverted LSH in cloud computing environment comprises the following steps: the client acquires and extracts picture features and communicates with the cloud center service system; cloud center service system establishes reverse-position-sensitive-based hash index and queries and inquires by utilizing cloud strong distribution computing capabilityAnd collecting neighbor images corresponding to the pictures.

Generally, the kNN algorithm is based on a distributed inverted index, but the position-sensitive hash-based index is not a distributed index, and in order to adapt to the distributed index, the position-sensitive hash-based distributed inverted index is established: in the technical scheme, when a position-sensitive Hash index is established, a plurality of Hash buckets are separated, the Hash buckets are used as keys, point sets in the Hash buckets are used as values, and MapReduce is used for distributed solving. According to the technical scheme, the index based on the position sensitive hash is analyzed into a Key-Value structure to adapt to the distributed index, so that the index can use a kNN algorithm to realize inquiry; and the distributed index can realize that adjacent points are collected to adjacent Hash buckets (similar data are hashed to the same area), so that the query speed can be increased.

In this technical solution, as a preferred option, the LSH regards the objects in the high-dimensional space as spatial data points with position information, and maps all the spatial object points to m hash tables T through a family of hash functions F () _i In the above, m= |f|, that is, each hash function F e F corresponds to one hash table, and each hash table stores all object points in the space. Giving a query point q, and respectively calculating the result value of each q point in the hash function: { f _1(q) ,f _2(q) …f _m(q) ,f _i E F, i=1, 2 … m }, all F _i(q) Fall into hash table T _i And taking the points in the bucket as candidate sets, calculating the distance between the points and q, and finally sorting out k nearest points to obtain a kNN result set.

The technical scheme is used for solving the problem of high-dimensional data retrieval in a mass data environment, a high-dimensional data vector is actually abstract of a text, then the high-dimensional vector is subjected to dimension reduction by utilizing a hash projection technology by combining the characteristics of a position sensitive hash function, the high-dimensional vector is used as a hash index value, the corresponding high-dimensional vector is used as a data record, the same hash operation is performed on other data, and the obtained result is subjected to screening and query optimization to obtain a final candidate result set.

Traditional LSH algorithms can only be performed on a single machine, but are limited by the lack of performance, computing and storage resources of a single computer node. This limitation is more severe with the advent of dimension disasters, especially at high dimensions.

Under large-scale data, the original spatial index designed for a single machine and a typical query algorithm also need to be redesigned and optimized under a distributed environment. The research on how to introduce the distributed processing and the spatial index optimization technology into the cloud computing is combined, and the problem of effectively searching the large-scale space is a new research point which needs to be further developed.

In the distributed technology, the distributed file system can scientifically store all multimedia data in all different computer nodes according to different network environments and other factors, and the multimedia data are interconnected through a network, so that the whole is realized, and the problem of large-scale data storage is solved. In addition, as all data are stored in different computer nodes, the data security, feasibility, read-write efficiency and other aspects of the whole system are greatly improved. The distributed technology has a mature model in terms of calculation, and can improve the calculation speed and corresponding capability of the system, such as the Google distributed file system which is already proposed and applied.

Based on the above points, the performance of the LSH algorithm can be well improved by using the distributed technology, and the application of the distributed technology is also required by the reality of modern network multimedia development.

Example 2:the present embodiment has the same technical scheme as that of embodiment 1, and more specifically is: the embodiment discloses a specific method for establishing a position sensitive hash-based distributed inverted index, which is characterized in that a data set is stored in an HDFS distributed file system in advance, when a task is started, a plurality of configuration file LSH hash function families are read in through a distributed caching mechanism, each Map task reads in a data fragment designated by a JobTracker as input, then hash mapping dimension reduction is carried out on each data object according to a given hash function, a hash value is obtained after a high-dimensional vector is subjected to hash mapping, the hash value is used as an index value, for example, for the high-dimensional vector v, a hash value hi (v) is obtained after the i-th hash function hi () is mapped, and finally<i#hi(v),v.id>The key value pair is output, so that the binary group can be obtained after calculation for each high-dimensional data vector, and the same operation is performed for each piece of data by using each hash function. The output of the Map process serves as the input of Reduce, all data objects of the same hash are collected together in Reduce, the data objects are separated by spaces, and finally the data objects are output to the HDFS distributed file system for storage as a result. In the process of constructing the distributed index, a Combine optimizing process can be added between the Map and the Reduce to Reduce transmission of intermediate data, and detailed pseudo codes for constructing the distributed index are as follows:

algorithm: LSH-based distributed index architecture

Input: high-dimensional dataset set S, hash function family H

And (3) outputting: high-dimensional index file

Step 1: mapper process

Step 2: combine process// optimization process

Step 3: reducer process

Example 3:this embodiment has the same technical scheme as embodiment 1 or 2, more specifically: the embodiment discloses a query method, wherein the query is a kNN query based on a position-sensitive hash distributed inverted index, and the steps are as follows: let the high-dimensional data set be S, S being a gallery of a large number of plants already in the image retrieval system, for example, each image in the gallery holding 128-dimensional Gao WeiteAnd (3) sign. The query object set is Q, Q is a query image object, for example, a shot image of a group of flowers, the feature set is formed after high-dimensional feature extraction, each query object Q belongs to Q, an initialization association function h, h belongs to G, G is a hash family, LSH is a multi-round hash algorithm, and different hash functions can obtain different hash results. h corresponds to the set of similarity points of q, the radius set r= getCandidates (hashvalue), the hash value is a hash value, and the hash algorithm maps binary values of arbitrary length to smaller binary values of fixed length, which are called hash values. Different hash values can yield different hash results, where R is the bucket width.

A hash collision with Guan Haxi function is performed in a region with a radius r, each object of the hash is hash value=computer (q, h), and each hash value is obtained through a hash calculation function, that is, the Computer calculates. For example, taking the remainder is the simplest calculation method, collision is performed for multiple times until all data are hashed, and an object d=computer (q, c) with relatively high collision times is screened out, wherein the probability that the object d is a nearby point of the object q is highest. I.e. d is the object with the highest probability of collision. To reduce the amount of intermediate data of MapReduce, we have adopted a method based on hash collision count, replacing the actual value with collision count statistics. The number of times of collision is counted as a final sorting result, so that the intermediate data volume can be greatly reduced, and the processing speed of MapReduce is increased.

Preferably, the collision region is given an error calibration range (1+θ) r. When hash collision is carried out, the number of collisions is counted as a final sorting basis for the hash function F (h), and the more the number of collisions is, the more the sorting is, the earlier the collision is, and the collision region is given an error calibration range (1+theta) r because the r region has a certain error. And due to the existence of a system error, the collision needs to be carried out for a plurality of times to screen the final result: the collision occurs between q and adjacent data, and at the same time, a small part of non-adjacent data collides with the data; then, performing second collision and third collision until all data are hashed, screening out points d=computer (q, c) with relatively high collision times, wherein the probability of the points being the adjacent points of q is highest, and then sorting out the data for integration.

Example 4:this embodiment has the same technical scheme as embodiment 1 or 2 or 3, more specifically: and the client extracts the high-dimensional characteristics of the picture, generates high-dimensional characteristic data of the picture, and transmits the high-dimensional characteristic data of the picture to the cloud center service system. Dimensions are typically extracted in more than 128 dimensions, experiments are performed at 128X128 dimensions, and large scale typically refers to greater than 10G-several hundred T. The experiment is a sample test performed under 16G data.

The scheme has the following effects:

(1) The application adopts a design mode of intelligent selection of the connecting line, so that the mobile phone can select a more proper cloud computing server to transmit data. The software is installed on the mobile phone and the server, the client is installed on the software of the smart mobile phone, once a user transmits a certain required similar picture to the mobile phone software, the software firstly extracts the data characteristics of the picture, and then the data is simply matched by utilizing partial data stored in the memory of the mobile phone. If the memory of the mobile phone contains the needed data, the software converts the data into corresponding pictures to be displayed on the software; if the corresponding data does not exist in the memory data of the mobile phone, the software is connected with the server by using the 2G, the 3G and the 4G, WIFI, the picture characteristic data is transmitted to the server, the LSRP-tree retrieval index is carried out in the server, and the data is fed back to the mobile phone software after the completion, and the result is displayed.

(2) Cloud computing is a resource sharing platform based on the Internet of things, and the platform can be provided for computers and other devices as required through shared software and hardware resources and information, so that a large-scale high-dimensional image retrieval system based on inverted position-sensitive hash indexes

The cloud computing characteristic is utilized to provide a powerful data processing system for the image characteristic query, and the support of the powerful data processing system can enable the mobile phone to present the matched images required by a user under the condition of limited hardware facilities.

(3) The data processed by the mobile phone after the user uploads the picture to the software is simply memory reproduction, and the mobile phone memory is limited after all, so the picture index can not basically meet the user requirement. The real picture index is to transmit the picture feature data to the cloud server by using the network. For the server, a massive data set is divided into a plurality of sub data sets through MapReduce in advance, tasks of the sub data sets are scheduled, then a distributed inversion space high-dimensional index based on LSH is established, and a final result is obtained through carrying out multi-round local sensitive hash dynamic collision detection algorithm calculation on similar data hashed to the same area.

The embodiment discloses a large-scale high-dimensional image retrieval system based on inverted position sensitive hash indexes in a cloud computing environment, and belongs to the application field of large-scale space-time data processing and mobile technology. In the system, a new index structure (LSRP-tree) is formed by combining LSH and RP-tree, so that the index cost is reduced in high-dimensional data query, and the query quality and the query efficiency are improved; the new algorithm (H-c 2 kNN) formed by combining LSH and MapReduce shows good expansibility and high efficiency. Both innovative applications practically solve the problem of approximate retrieval in high-dimensional data space. The collision counting method based on the hash collision is adopted, the collision counting statistics is used for replacing an actual value, the number of collisions is counted, and the number of collisions is used as a final sequencing result, so that the intermediate data volume can be greatly reduced, and the processing speed of MapReduce is increased. Cloud computing provides a very convenient communication platform for data communication under the condition of limited hardware. The invention relates to a system for searching pictures by utilizing an intelligent mobile platform, which comprises a group of cloud servers and a mobile client, in particular to software installed on the intelligent mobile platform (such as a smart phone or a tablet personal computer) for users to use respectively. The client comprises basic functions of gallery, photo shooting, transmission, picture scanning and the like, and the cloud server is responsible for controlling the whole picture searching process and processing related data (including building of LSH indexes, distributed kNN inquiry and the like) and performs image retrieval by extracting feature vectors. The invention effectively and advantageously solves the problems of overlarge information quantity, inconsistent information and pictures and the like, and helps users save time as much as possible, maximally solves the problem of eliminating massive information, and simplifies and rationalizes the utilization of resources. Meets the further demand of people for mobile information retrieval intellectualization.

Example 5:a large-scale high-dimensional image retrieval system based on inverted position sensitive hash index in a cloud computing environment comprises a cloud center service system and an intelligent mobile client. The cloud center service system is used for establishing a hash algorithm and executing inverted grid indexes, the intelligent mobile client is used for collecting pictures and sending the information to the cloud center service system through a wireless network, and the intelligent mobile client is also used for receiving the best pictures returned by the cloud center service system. The invention also improves the defects in the space data indexing and inquiring method in the prior art, solves the problem of mass information screening to the maximum extent, simplifies and rationalizes the utilization of resources, and meets the further desire of people for resource intellectualization. The basis of the application is an LSRP-tree index structure and an H-c2kNN query algorithm which are respectively generated by combining Location Sensitive Hash (LSH) with MapReduce and RP-tree. The system may perform a retrieval method.

Example 6:the technical scheme is the same as that of embodiment 5, when a user takes a picture with a mobile phone or acquires the picture through a wireless network, a corresponding search engine is applied to upload the picture, a cloud center service system automatically extracts the characteristics of color, shape, texture and the like of an image by using an image analysis program, the extracted picture feature vectors are subjected to data analysis and matching by using an established inverted grid index based on a hash algorithm, k nearest neighbors are returned according to the matching precision requirement, and then corresponding k images are found according to the k vectors, and information is fed back to a user side in time.

Example 7:the processing method of the position sensitive hash has the same technical scheme as that of the embodiment 6, and the processing method comprises the following steps: firstly, establishing a distributed inverted space high-dimensional index based on LSH through massive data sets, taking a hashed bucket as a Key, taking a point set in the bucket as a Value, and then carrying out MapReduceAnd (5) carrying out distributed solving. And then, combining a plurality of rounds of local sensitive hash and a dynamic collision frequency detection algorithm, and utilizing MapReduce to screen and query and optimize results. And finally, a result set is obtained, and meanwhile, adjacent points can be collected in adjacent barrels by using the inverted index, and the approximate searching speed can be increased by detecting the adjacent barrels.

In order to reduce the intermediate data volume of MapReduce, a method based on Hash collision counting is adopted, the actual value is replaced by collision counting statistics, the number of collisions is counted, and the final sorting result is used, so that the intermediate data volume can be greatly reduced, and the processing speed of MapReduce is increased.

The embodiment adopts a large-scale distributed hash algorithm and other space data processing algorithms, integrates a large-data processing mode into the query stage of the large-scale high-dimensional image retrieval system by analyzing the characteristics of the color, the shape, the texture and the like of the images, searches the optimal images in mass data, feeds back the image information to a user side, and finally completes the image retrieval problem. The effect is as follows: by analyzing the characteristics of the color, shape, texture and the like of the picture, the best approximate matching picture is provided for the user.

The large-scale high-dimensional image retrieval described in this embodiment has the following structure and benefits:

(1) The application adopts a design mode of intelligent selection of a connecting line, so that a mobile phone selects a more appropriate cloud computing server to transmit data; the software is installed on the mobile phone and the server, the client is installed on the software of the smart mobile phone, once a user transmits a certain required similar picture to the mobile phone software, the software firstly extracts the data characteristics of the picture, and then the data is simply matched by utilizing partial data stored in the memory of the mobile phone. If the memory of the mobile phone contains the needed data, the software converts the data into corresponding pictures to be displayed on the software; if the corresponding data does not exist in the memory data of the mobile phone, the software is connected with the server by using the 2G, the 3G and the 4G, WIFI, the picture characteristic data is transmitted to the server, the LSRP-tree retrieval index is carried out in the server, and the data is fed back to the mobile phone software after the completion, and the result is displayed.

(2) Cloud computing is a resource sharing platform based on the Internet of things, and the platform can be provided for computers and other devices as required through shared software and hardware resources and information. Therefore, the large-scale high-dimensional image retrieval system based on the inverted position-sensitive hash index utilizes the characteristic of cloud computing to provide a powerful data processing system for inquiring the picture characteristics, and the support of the powerful data processing system can enable the mobile phone to present the matched picture required by a user under the condition of limited hardware facilities.

(3) The data processed by the mobile phone after the user uploads the picture to the software is simply memory reproduction, and the mobile phone memory is limited after all, so the picture index can not basically meet the user requirement. The real picture index is to transmit the picture feature data to a cloud server by using a network, and for the server, a massive data set is divided into a plurality of sub data sets by MapReduce in advance, tasks of the sub data sets are scheduled, then a distributed inverted space high-dimensional index based on LSH is established, and a final result is obtained by carrying out multi-round local sensitive hash dynamic collision detection algorithm calculation by hashing similar data to the same area.

Example 8:the present embodiment gives a definition of kNN queries.

The formalized definition of a kNN query is given below: given a set of spatial data points P, query point q and an integer k (k > 0), k-nearest neighbor queries find a set kNN of k data points, and satisfy dist (P ', q). Ltoreq.dist (P, q) for any P' ∈kNN and any P ε P-kNN.

For picture indexes in a high-dimensional environment, the H-c2kNN algorithm formed by combining the LSH index and the MapReduce is not lost as a preferred algorithm. One important source of spatially large-scale data, high-dimensional data, is the high-speed growth of data. How to make kNN queries on large-scale spatial data under high-dimensional conditions is an important direction of interest in recent years.

The following steps are further explanation of hashvalue=computer (q, h) in embodiment 2, r= getCandidates (hashvalue), and algorithm execution is the calculation performed by (1) initialization (2) in step 1 in embodiment 2.

Setting a group of hash functions F (), calculating and mapping all objects in space to m hash tables T _i In the above, m= |f|, that is, each hash function F belongs to one hash table corresponding to F, and each hash table stores all object points in the space. Given a query point q, function result values of each hash function to q point mapping are calculated respectively.

The following steps correspond to the d=computer (q, c) calculation collision algorithm in embodiment 3, which corresponds to step 2 in embodiment 2, and is an optimization process. Example 2 is actually an algorithm execution process, and example 3 is a theoretical analysis of the algorithm, i.e., a mathematical operation process.

In performing a hash collision, for the hash function F (H), any M, M 'is found, and satisfying H (M) =h (M') presents a computational difficulty, then H (M) is referred to as a strong one-way hash function or a collision-free function, also commonly referred to as a hash function. Therefore, we replace the actual value with the hash collision number that is easier to perform. The number of times of collision is counted as a final sorting result, so that the amount of intermediate data can be greatly reduced, the MapReduce processing speed is increased, and the processing result is fed back to a user more quickly.

For server operation, data differentiation and data mirroring are utilized, and indiscriminate data service is provided by a cloud computing providing technology.

Example 9:referring to fig. 1, a schematic diagram of a large-scale high-dimensional image retrieval system based on an inverted position-sensitive hash index in a cloud computing environment is described as follows: establishing a distributed inverted space high-dimensional index based on LSH to perform knN query, setting a data set as S and a query object as Q, initializing a similar point set of a correlation function h, h belongs to G and the corresponding Q for each Q, and performing Guan Haxi function collision in a region with a radius of R, wherein each point of the hash is hashvalue=computer (Q, h), and an error calibration range (1+theta) R is given to the collision region because of a certain error in the R region. And due to the existence of system errors, the collision needs to be carried out for multiple times to screen the final productResults: the collision occurs between q and adjacent data, and at the same time, a small part of non-adjacent data collides with the data; then, performing second collision and third collision until all data are hashed, screening out points d=computer (q, c) with relatively high collision times, wherein the probability of the points being the adjacent points of q is highest, and then sorting out the data for integration.

Example 10:referring to fig. 2, in the kNN query index of the large-scale high-dimensional image, since kNN algorithm is based on a distributed inverted grid index, we set the grid cell size as "g", and set a query point q, and use Aq to represent the grid where q points are located, use r as radius, use q as center of circle to make a circle, and P (xi, yi) represents the nearest distance of the q points in. For a specific hash function, a point q is given, when the adjacent point is calculated, the function is firstly divided into a plurality of 'barrel' keys (referring to the existence area of the specific relation, which is expressed by hv1, hv2, hv3, hvN), then the function is correspondingly hashed into the specific barrel by the relation between the data and the hash function, and meanwhile, q collides with the data in the barrel with the corresponding relation, namely, key i= hvi =g (i), and the data with higher times of collision with q are screened out to integrate, so that the similar data of a certain specific point can be obtained.

For the index of a picture, more features are needed to determine a specific picture, so that similar data of other features are required to be obtained, for example, given another point P, similar data are also built in the same data table, R is taken as a radius, P is taken as a center of a circle, a neighbor data point is found, the nearest neighbor of P and q points possibly has the same point M, the probability that M is the feature data point of the index is particularly high, and the like, and hashing is performed until final data are found.

Example 11:referring to fig. 3, in consideration of the portable characteristic of the mobile terminal, the limitation of software and hardware resources and the advantages of cloud computing, the large-scale high-dimensional image retrieval system based on the inverted position-sensitive hash index uses a thin client mode of a C/S architecture, a cloud server is responsible for main data processing work, a client only needs to simply send a required chart, receive and display a result, and a handheld device client is used for processing the data in a 2G/3G/4G mode or based on the dataThe wireless network of the WIFI is accessed to the mobile internet to establish a connection with the cloud server, the client is responsible for displaying pictures and carrying relevant parameters, such as picture feature data information, to send a request to the cloud server, after a mobile phone logs in, the cloud server sends picture features to the cloud server, the cloud server adopts a high-dimensional space mass data distributed space inverted index technology designed by us, namely, an LSH technology sends picture feature data to the cloud server, and the cloud server rapidly searches pictures required by a user from massive charts by adopting a parallelization kNN query technology. And inquiring the client software interface to which the picture is sent, thereby completing the requirement of the user.

Example 12:referring to fig. 4, the present invention is a system for searching pictures by using an intelligent mobile platform, which includes a cloud server and a mobile client, specifically, software installed on the intelligent mobile platform (such as a smart phone or a tablet computer) for users to use respectively. The client comprises basic functions of gallery, photo shooting, transmission, picture scanning and the like, and the cloud server is responsible for controlling the whole picture searching process and processing related data (including establishing an LSH index, inquiring distributed kNN and the like).

Example 13:referring to fig. 5, the steps involved in implementing the graph lookup by the present invention are as follows: the user obtains the required similar pictures through shooting or other ways, scans through the client, the mobile phone client finds out the picture characteristics and uploads the picture characteristics to the cloud server, the cloud server performs data processing through the uploaded picture characteristic data, finds out the similar characteristic picture data, then returns the data to the client, and the client indexes the pictures to be displayed on a software interface so as to complete tasks. After the pictures are displayed on the client interface, picture numbers, source links and the like are displayed, so that a user can know chart information more deeply.

The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art, who is within the scope of the present invention, should be covered by the protection scope of the present invention by making equivalents and modifications to the technical solution and the inventive concept thereof.

Claims

1. A high-dimensional space kNN query method based on inverted position sensitive hash index is established, which is characterized in that: the method comprises the following steps: setting the size of a grid unit as a front-name, setting a query point q, using Aq to represent the grid where the q point is located, using r as a radius, q as a circle center, for a specific Hash function, setting a point q, when finding adjacent points, dividing the function into a plurality of Hash barrels, using the Hash barrels as keys, using a point set in the Hash barrels as Value, using the Hash barrels to refer to a region with a specific relation, using hv1, hv2 and hv3 to hvN to represent the Hash, using the relation between the data and the Hash function to Hash into the specific barrel, and simultaneously enabling the q to collide with the data in the barrel with the corresponding relation, screening out the data with higher q collision times, and integrating the data with the specific point to obtain the similar data of the specific point; for the index of a picture, more features are needed to determine a specific picture, similar data of other features are required to be obtained, another point P is given, the similar data is also built in the same data table, R is taken as a radius, P is taken as a circle center, adjacent data points are found, the same point M possibly appears in the nearest neighbor of P and q points, the probability that M is the index feature data point is particularly high, and the like is hashed until final data are found;

the method for screening the data with higher q collision times comprises the following steps:

setting a high-dimensional data set as S, wherein S is an existing large-scale image library in an image retrieval system, a query object set is Q, Q is a query image object, a feature set is formed after high-dimensional feature extraction is performed, for each query object Q belongs to Q, an association function h is initialized, h belongs to G, G is a hash family, LSH is a multi-round hash algorithm, different hash functions can obtain different hash results, h corresponds to a Q similarity point set, a radius set R= getCandidates (hashvalue), hash value is a hash value, different hash values are obtained, and R is a bucket width; carrying out hash collision with Guan Haxi function in a certain radius r, wherein each object in the hash is hash value=computer (q, h), each hash value is obtained through a hash calculation function, the hash is carried out for multiple times until all data are hashed, objects d=computer (q, c) with relatively high collision times are screened out, the probability that the object d is the nearest point of the object q is highest, the object d is the object with highest collision probability, the actual value is replaced by collision count statistics, the number of times of collision is taken as the final sequencing result, the intermediate data quantity is reduced, and the processing speed of MapReduce is accelerated;

in order to correct the error of the area with the radius r, an error calibration range (1+theta) r is set for the collision area; when hash collision is carried out, counting the number of times of collision as a final sorting basis for a hash function F (h), wherein the more the number of times of collision is, the more the sorting is, and the collision needs to be carried out for a plurality of times to screen a final result due to the existence of a system error: the collision occurs between q and adjacent data, and at the same time, a small part of non-adjacent data collides with the data; then performing second collision and third collision until all data are hashed, screening out points d=computer (q, c) with relatively high collision times, wherein the probability of the points being the adjacent points of q is highest, and then sorting out the data for integration.