[go: up one dir, main page]

CN110046268B - High-dimensional space kNN query method based on inverted position sensitive hash index - Google Patents

High-dimensional space kNN query method based on inverted position sensitive hash index Download PDF

Info

Publication number
CN110046268B
CN110046268B CN201910325257.7A CN201910325257A CN110046268B CN 110046268 B CN110046268 B CN 110046268B CN 201910325257 A CN201910325257 A CN 201910325257A CN 110046268 B CN110046268 B CN 110046268B
Authority
CN
China
Prior art keywords
data
hash
collision
index
point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910325257.7A
Other languages
Chinese (zh)
Other versions
CN110046268A (en
Inventor
季长清
汪祖民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University
Original Assignee
Dalian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University filed Critical Dalian University
Priority to CN201910325257.7A priority Critical patent/CN110046268B/en
Publication of CN110046268A publication Critical patent/CN110046268A/en
Application granted granted Critical
Publication of CN110046268B publication Critical patent/CN110046268B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53Querying
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The divisional application discloses a method for establishing a high-dimensional space kNN query based on an inverted position-sensitive hash index, belongs to the fields of big data and mobile application, solves the problem of reducing the cost of the high-dimensional index, requires more features for indexing pictures to determine a specific picture, requires similar data of other features, given another point P, the same data table is built, R is taken as radius, P is taken as circle center, adjacent data points are found, the same point M possibly appears in the nearest neighbors of P and q points, the probability that M is the characteristic data point of the index is particularly high, and the like, and hashing is carried out until final data are found, so that the query efficiency is improved.

Description

High-dimensional space kNN query method based on inverted position sensitive hash index
The application is a divisional application of the invention patent application with the application number of 201610083263.2 and the application date of 2016-02-05, and the invention is named as a high-dimensional approximate image retrieval method based on inverted LSH in a cloud computing environment.
Technical Field
The invention belongs to the application field of large-scale space-time data processing and mobile technology, and relates to a high-dimensional approximate image retrieval method based on inverted LSH in a cloud computing environment
Background
At present, the network basically covers the life of people, and the mobile phone surfing becomes a main surfing mode. By 2014 and 6 months, in the Internet surfing equipment of the national citizens, the mobile phone utilization rate reaches 83.4 percent, the mobile phone is superior to the integral utilization rate of 80.9 percent of the traditional PC (using a desktop computer and a notebook computer) for the first time, and the position of the mobile phone serving as the first large Internet surfing terminal equipment is more consolidated. The development of information technology is very rapid, the quantity of various forms of information is also rapidly increasing, along with the diversification and complexity of the user search requirements, the user is not satisfied with simple text search any more, and the image is taken as an important information carrier, so that the daily life is full of rich and various image information. For example, the user can see the favorite beautiful head portrait, want to find similar head portrait, see a piece of clothes or skirt, want to find similar money, etc. which is inconvenient to express by words, and has the searching requirement of picture reference.
The development of information technology is very rapid today, the amount of information in various forms is also rapidly increasing, and as the requirements for user retrieval are diversified and complicated, users are no longer satisfied with simple text retrieval, but are more inclined to the information retrieval of images. The daily life is full of rich and diverse image information. For example, the user may see a favorite avatar, want to find a similar style, or see a garment or skirt, want to find a similar try, etc. The situation is inconvenient to express by words, but the retrieval requirement of a user can be met very quickly by using pictures. Smart phones are of course indispensable as collectors of images. From the new data it is known that 2015 global smart phone users will reach 19.1 billion and 2016 the index will increase by 12.6% to 21.6 billion. Smartphones will gradually occupy the information communication market.
It has become a crucial research hotspot in today's image retrieval field to investigate how a user should quickly find his own desired information from so many choices of images, and how to provide a quick and efficient way to retrieve images. In the existing research work, it is common practice to extract high-dimensional features from high-dimensional data of an image according to a specific method (such as a sift operator commonly used in the image), and then build an index according to the features to increase the query speed. However, under different data characteristics, the vector dimension is usually up to tens or hundreds of dimensions, and the data volume of each dimension is very large, which requires that the high-dimensional index structure has better dimension expansibility, that is, the index can still maintain better performance with the increase of the dimension. Unfortunately, most conventional spatial indexing techniques now suffer from dimensional disasters, such as rtrees and Voronoi indexes, and in general, the current high-dimensional feature indexing techniques suffer from the following disadvantages: (1) Most traditional index structures have poor expansibility and encounter dimension disaster problems; (2) Most traditional indexing mechanisms make certain assumptions (such as uniform distribution) about data distribution when dividing data space, and are usually different from real distribution (such as oblique distribution, zipf normal distribution, etc.) of data; (3) Most high-dimensional index structures have high spatial and temporal complexity and poor accuracy.
Disclosure of Invention
In order to solve the problem that the existing position-sensitive hash index cannot adapt to the distributed index, the invention provides a high-dimensional approximate image retrieval method based on inverted LSH in a cloud computing environment, and the position-sensitive hash index can adapt to the distributed index.
In order to achieve the above purpose, the present invention adopts the following technical scheme: a high-dimensional approximate image retrieval method based on inverted LSH in cloud computing environment comprises the following steps: the client acquires and extracts picture features and communicates with the cloud center service system; the cloud center service system establishes a distributed inverted index based on the position sensitive hash and inquires a neighbor image corresponding to the acquired picture.
The beneficial effects are that: because the cloud center service system establishes the inverted position-sensitive hash index, the position-sensitive hash index can adapt to distributed query, so that the problems of overlarge information quantity, inconsistent required information and display pictures and the like are solved, a user is helped to save the time of searching and querying as much as possible.
Drawings
FIG. 1 is a schematic diagram of a large-scale high-dimensional image retrieval based on inverted position-sensitive hash index in a cloud computing environment;
FIG. 2 illustrates a kNN algorithm process based on distributed inverted grid indexes of the present invention;
FIG. 3 is a functional block diagram of the present invention;
FIG. 4 is a flow chart of image retrieval of the present invention;
FIG. 5 is a flow chart for implementing image lookup in accordance with the present invention.
Detailed Description
Example 1:a high-dimensional approximate image retrieval method based on inverted LSH in cloud computing environment comprises the following steps: the client acquires and extracts picture features and communicates with the cloud center service system; cloud center service system establishes reverse-position-sensitive-based hash index and queries and inquires by utilizing cloud strong distribution computing capabilityAnd collecting neighbor images corresponding to the pictures.
Generally, the kNN algorithm is based on a distributed inverted index, but the position-sensitive hash-based index is not a distributed index, and in order to adapt to the distributed index, the position-sensitive hash-based distributed inverted index is established: in the technical scheme, when a position-sensitive Hash index is established, a plurality of Hash buckets are separated, the Hash buckets are used as keys, point sets in the Hash buckets are used as values, and MapReduce is used for distributed solving. According to the technical scheme, the index based on the position sensitive hash is analyzed into a Key-Value structure to adapt to the distributed index, so that the index can use a kNN algorithm to realize inquiry; and the distributed index can realize that adjacent points are collected to adjacent Hash buckets (similar data are hashed to the same area), so that the query speed can be increased.
In this technical solution, as a preferred option, the LSH regards the objects in the high-dimensional space as spatial data points with position information, and maps all the spatial object points to m hash tables T through a family of hash functions F () i In the above, m= |f|, that is, each hash function F e F corresponds to one hash table, and each hash table stores all object points in the space. Giving a query point q, and respectively calculating the result value of each q point in the hash function: { f 1(q) ,f 2(q) …f m(q) ,f i E F, i=1, 2 … m }, all F i(q) Fall into hash table T i And taking the points in the bucket as candidate sets, calculating the distance between the points and q, and finally sorting out k nearest points to obtain a kNN result set.
The technical scheme is used for solving the problem of high-dimensional data retrieval in a mass data environment, a high-dimensional data vector is actually abstract of a text, then the high-dimensional vector is subjected to dimension reduction by utilizing a hash projection technology by combining the characteristics of a position sensitive hash function, the high-dimensional vector is used as a hash index value, the corresponding high-dimensional vector is used as a data record, the same hash operation is performed on other data, and the obtained result is subjected to screening and query optimization to obtain a final candidate result set.
Traditional LSH algorithms can only be performed on a single machine, but are limited by the lack of performance, computing and storage resources of a single computer node. This limitation is more severe with the advent of dimension disasters, especially at high dimensions.
Under large-scale data, the original spatial index designed for a single machine and a typical query algorithm also need to be redesigned and optimized under a distributed environment. The research on how to introduce the distributed processing and the spatial index optimization technology into the cloud computing is combined, and the problem of effectively searching the large-scale space is a new research point which needs to be further developed.
In the distributed technology, the distributed file system can scientifically store all multimedia data in all different computer nodes according to different network environments and other factors, and the multimedia data are interconnected through a network, so that the whole is realized, and the problem of large-scale data storage is solved. In addition, as all data are stored in different computer nodes, the data security, feasibility, read-write efficiency and other aspects of the whole system are greatly improved. The distributed technology has a mature model in terms of calculation, and can improve the calculation speed and corresponding capability of the system, such as the Google distributed file system which is already proposed and applied.
Based on the above points, the performance of the LSH algorithm can be well improved by using the distributed technology, and the application of the distributed technology is also required by the reality of modern network multimedia development.
Example 2:the present embodiment has the same technical scheme as that of embodiment 1, and more specifically is: the embodiment discloses a specific method for establishing a position sensitive hash-based distributed inverted index, which is characterized in that a data set is stored in an HDFS distributed file system in advance, when a task is started, a plurality of configuration file LSH hash function families are read in through a distributed caching mechanism, each Map task reads in a data fragment designated by a JobTracker as input, then hash mapping dimension reduction is carried out on each data object according to a given hash function, a hash value is obtained after a high-dimensional vector is subjected to hash mapping, the hash value is used as an index value, for example, for the high-dimensional vector v, a hash value hi (v) is obtained after the i-th hash function hi () is mapped, and finally<i#hi(v),v.id>The key value pair is output, so that the binary group can be obtained after calculation for each high-dimensional data vector, and the same operation is performed for each piece of data by using each hash function. The output of the Map process serves as the input of Reduce, all data objects of the same hash are collected together in Reduce, the data objects are separated by spaces, and finally the data objects are output to the HDFS distributed file system for storage as a result. In the process of constructing the distributed index, a Combine optimizing process can be added between the Map and the Reduce to Reduce transmission of intermediate data, and detailed pseudo codes for constructing the distributed index are as follows:
algorithm: LSH-based distributed index architecture
Input: high-dimensional dataset set S, hash function family H
And (3) outputting: high-dimensional index file
Step 1: mapper process
Step 2: combine process// optimization process
Step 3: reducer process
Example 3:this embodiment has the same technical scheme as embodiment 1 or 2, more specifically: the embodiment discloses a query method, wherein the query is a kNN query based on a position-sensitive hash distributed inverted index, and the steps are as follows: let the high-dimensional data set be S, S being a gallery of a large number of plants already in the image retrieval system, for example, each image in the gallery holding 128-dimensional Gao WeiteAnd (3) sign. The query object set is Q, Q is a query image object, for example, a shot image of a group of flowers, the feature set is formed after high-dimensional feature extraction, each query object Q belongs to Q, an initialization association function h, h belongs to G, G is a hash family, LSH is a multi-round hash algorithm, and different hash functions can obtain different hash results. h corresponds to the set of similarity points of q, the radius set r= getCandidates (hashvalue), the hash value is a hash value, and the hash algorithm maps binary values of arbitrary length to smaller binary values of fixed length, which are called hash values. Different hash values can yield different hash results, where R is the bucket width.
A hash collision with Guan Haxi function is performed in a region with a radius r, each object of the hash is hash value=computer (q, h), and each hash value is obtained through a hash calculation function, that is, the Computer calculates. For example, taking the remainder is the simplest calculation method, collision is performed for multiple times until all data are hashed, and an object d=computer (q, c) with relatively high collision times is screened out, wherein the probability that the object d is a nearby point of the object q is highest. I.e. d is the object with the highest probability of collision. To reduce the amount of intermediate data of MapReduce, we have adopted a method based on hash collision count, replacing the actual value with collision count statistics. The number of times of collision is counted as a final sorting result, so that the intermediate data volume can be greatly reduced, and the processing speed of MapReduce is increased.
Preferably, the collision region is given an error calibration range (1+θ) r. When hash collision is carried out, the number of collisions is counted as a final sorting basis for the hash function F (h), and the more the number of collisions is, the more the sorting is, the earlier the collision is, and the collision region is given an error calibration range (1+theta) r because the r region has a certain error. And due to the existence of a system error, the collision needs to be carried out for a plurality of times to screen the final result: the collision occurs between q and adjacent data, and at the same time, a small part of non-adjacent data collides with the data; then, performing second collision and third collision until all data are hashed, screening out points d=computer (q, c) with relatively high collision times, wherein the probability of the points being the adjacent points of q is highest, and then sorting out the data for integration.
Example 4:this embodiment has the same technical scheme as embodiment 1 or 2 or 3, more specifically: and the client extracts the high-dimensional characteristics of the picture, generates high-dimensional characteristic data of the picture, and transmits the high-dimensional characteristic data of the picture to the cloud center service system. Dimensions are typically extracted in more than 128 dimensions, experiments are performed at 128X128 dimensions, and large scale typically refers to greater than 10G-several hundred T. The experiment is a sample test performed under 16G data.
The scheme has the following effects:
(1) The application adopts a design mode of intelligent selection of the connecting line, so that the mobile phone can select a more proper cloud computing server to transmit data. The software is installed on the mobile phone and the server, the client is installed on the software of the smart mobile phone, once a user transmits a certain required similar picture to the mobile phone software, the software firstly extracts the data characteristics of the picture, and then the data is simply matched by utilizing partial data stored in the memory of the mobile phone. If the memory of the mobile phone contains the needed data, the software converts the data into corresponding pictures to be displayed on the software; if the corresponding data does not exist in the memory data of the mobile phone, the software is connected with the server by using the 2G, the 3G and the 4G, WIFI, the picture characteristic data is transmitted to the server, the LSRP-tree retrieval index is carried out in the server, and the data is fed back to the mobile phone software after the completion, and the result is displayed.
(2) Cloud computing is a resource sharing platform based on the Internet of things, and the platform can be provided for computers and other devices as required through shared software and hardware resources and information, so that a large-scale high-dimensional image retrieval system based on inverted position-sensitive hash indexes
The cloud computing characteristic is utilized to provide a powerful data processing system for the image characteristic query, and the support of the powerful data processing system can enable the mobile phone to present the matched images required by a user under the condition of limited hardware facilities.
(3) The data processed by the mobile phone after the user uploads the picture to the software is simply memory reproduction, and the mobile phone memory is limited after all, so the picture index can not basically meet the user requirement. The real picture index is to transmit the picture feature data to the cloud server by using the network. For the server, a massive data set is divided into a plurality of sub data sets through MapReduce in advance, tasks of the sub data sets are scheduled, then a distributed inversion space high-dimensional index based on LSH is established, and a final result is obtained through carrying out multi-round local sensitive hash dynamic collision detection algorithm calculation on similar data hashed to the same area.
The embodiment discloses a large-scale high-dimensional image retrieval system based on inverted position sensitive hash indexes in a cloud computing environment, and belongs to the application field of large-scale space-time data processing and mobile technology. In the system, a new index structure (LSRP-tree) is formed by combining LSH and RP-tree, so that the index cost is reduced in high-dimensional data query, and the query quality and the query efficiency are improved; the new algorithm (H-c 2 kNN) formed by combining LSH and MapReduce shows good expansibility and high efficiency. Both innovative applications practically solve the problem of approximate retrieval in high-dimensional data space. The collision counting method based on the hash collision is adopted, the collision counting statistics is used for replacing an actual value, the number of collisions is counted, and the number of collisions is used as a final sequencing result, so that the intermediate data volume can be greatly reduced, and the processing speed of MapReduce is increased. Cloud computing provides a very convenient communication platform for data communication under the condition of limited hardware. The invention relates to a system for searching pictures by utilizing an intelligent mobile platform, which comprises a group of cloud servers and a mobile client, in particular to software installed on the intelligent mobile platform (such as a smart phone or a tablet personal computer) for users to use respectively. The client comprises basic functions of gallery, photo shooting, transmission, picture scanning and the like, and the cloud server is responsible for controlling the whole picture searching process and processing related data (including building of LSH indexes, distributed kNN inquiry and the like) and performs image retrieval by extracting feature vectors. The invention effectively and advantageously solves the problems of overlarge information quantity, inconsistent information and pictures and the like, and helps users save time as much as possible, maximally solves the problem of eliminating massive information, and simplifies and rationalizes the utilization of resources. Meets the further demand of people for mobile information retrieval intellectualization.
Example 5:a large-scale high-dimensional image retrieval system based on inverted position sensitive hash index in a cloud computing environment comprises a cloud center service system and an intelligent mobile client. The cloud center service system is used for establishing a hash algorithm and executing inverted grid indexes, the intelligent mobile client is used for collecting pictures and sending the information to the cloud center service system through a wireless network, and the intelligent mobile client is also used for receiving the best pictures returned by the cloud center service system. The invention also improves the defects in the space data indexing and inquiring method in the prior art, solves the problem of mass information screening to the maximum extent, simplifies and rationalizes the utilization of resources, and meets the further desire of people for resource intellectualization. The basis of the application is an LSRP-tree index structure and an H-c2kNN query algorithm which are respectively generated by combining Location Sensitive Hash (LSH) with MapReduce and RP-tree. The system may perform a retrieval method.
Example 6:the technical scheme is the same as that of embodiment 5, when a user takes a picture with a mobile phone or acquires the picture through a wireless network, a corresponding search engine is applied to upload the picture, a cloud center service system automatically extracts the characteristics of color, shape, texture and the like of an image by using an image analysis program, the extracted picture feature vectors are subjected to data analysis and matching by using an established inverted grid index based on a hash algorithm, k nearest neighbors are returned according to the matching precision requirement, and then corresponding k images are found according to the k vectors, and information is fed back to a user side in time.
Example 7:the processing method of the position sensitive hash has the same technical scheme as that of the embodiment 6, and the processing method comprises the following steps: firstly, establishing a distributed inverted space high-dimensional index based on LSH through massive data sets, taking a hashed bucket as a Key, taking a point set in the bucket as a Value, and then carrying out MapReduceAnd (5) carrying out distributed solving. And then, combining a plurality of rounds of local sensitive hash and a dynamic collision frequency detection algorithm, and utilizing MapReduce to screen and query and optimize results. And finally, a result set is obtained, and meanwhile, adjacent points can be collected in adjacent barrels by using the inverted index, and the approximate searching speed can be increased by detecting the adjacent barrels.
In order to reduce the intermediate data volume of MapReduce, a method based on Hash collision counting is adopted, the actual value is replaced by collision counting statistics, the number of collisions is counted, and the final sorting result is used, so that the intermediate data volume can be greatly reduced, and the processing speed of MapReduce is increased.
The embodiment adopts a large-scale distributed hash algorithm and other space data processing algorithms, integrates a large-data processing mode into the query stage of the large-scale high-dimensional image retrieval system by analyzing the characteristics of the color, the shape, the texture and the like of the images, searches the optimal images in mass data, feeds back the image information to a user side, and finally completes the image retrieval problem. The effect is as follows: by analyzing the characteristics of the color, shape, texture and the like of the picture, the best approximate matching picture is provided for the user.
The large-scale high-dimensional image retrieval described in this embodiment has the following structure and benefits:
(1) The application adopts a design mode of intelligent selection of a connecting line, so that a mobile phone selects a more appropriate cloud computing server to transmit data; the software is installed on the mobile phone and the server, the client is installed on the software of the smart mobile phone, once a user transmits a certain required similar picture to the mobile phone software, the software firstly extracts the data characteristics of the picture, and then the data is simply matched by utilizing partial data stored in the memory of the mobile phone. If the memory of the mobile phone contains the needed data, the software converts the data into corresponding pictures to be displayed on the software; if the corresponding data does not exist in the memory data of the mobile phone, the software is connected with the server by using the 2G, the 3G and the 4G, WIFI, the picture characteristic data is transmitted to the server, the LSRP-tree retrieval index is carried out in the server, and the data is fed back to the mobile phone software after the completion, and the result is displayed.
(2) Cloud computing is a resource sharing platform based on the Internet of things, and the platform can be provided for computers and other devices as required through shared software and hardware resources and information. Therefore, the large-scale high-dimensional image retrieval system based on the inverted position-sensitive hash index utilizes the characteristic of cloud computing to provide a powerful data processing system for inquiring the picture characteristics, and the support of the powerful data processing system can enable the mobile phone to present the matched picture required by a user under the condition of limited hardware facilities.
(3) The data processed by the mobile phone after the user uploads the picture to the software is simply memory reproduction, and the mobile phone memory is limited after all, so the picture index can not basically meet the user requirement. The real picture index is to transmit the picture feature data to a cloud server by using a network, and for the server, a massive data set is divided into a plurality of sub data sets by MapReduce in advance, tasks of the sub data sets are scheduled, then a distributed inverted space high-dimensional index based on LSH is established, and a final result is obtained by carrying out multi-round local sensitive hash dynamic collision detection algorithm calculation by hashing similar data to the same area.
Example 8:the present embodiment gives a definition of kNN queries.
The formalized definition of a kNN query is given below: given a set of spatial data points P, query point q and an integer k (k > 0), k-nearest neighbor queries find a set kNN of k data points, and satisfy dist (P ', q). Ltoreq.dist (P, q) for any P' ∈kNN and any P ε P-kNN.
For picture indexes in a high-dimensional environment, the H-c2kNN algorithm formed by combining the LSH index and the MapReduce is not lost as a preferred algorithm. One important source of spatially large-scale data, high-dimensional data, is the high-speed growth of data. How to make kNN queries on large-scale spatial data under high-dimensional conditions is an important direction of interest in recent years.
The following steps are further explanation of hashvalue=computer (q, h) in embodiment 2, r= getCandidates (hashvalue), and algorithm execution is the calculation performed by (1) initialization (2) in step 1 in embodiment 2.
Setting a group of hash functions F (), calculating and mapping all objects in space to m hash tables T i In the above, m= |f|, that is, each hash function F belongs to one hash table corresponding to F, and each hash table stores all object points in the space. Given a query point q, function result values of each hash function to q point mapping are calculated respectively.
The following steps correspond to the d=computer (q, c) calculation collision algorithm in embodiment 3, which corresponds to step 2 in embodiment 2, and is an optimization process. Example 2 is actually an algorithm execution process, and example 3 is a theoretical analysis of the algorithm, i.e., a mathematical operation process.
In performing a hash collision, for the hash function F (H), any M, M 'is found, and satisfying H (M) =h (M') presents a computational difficulty, then H (M) is referred to as a strong one-way hash function or a collision-free function, also commonly referred to as a hash function. Therefore, we replace the actual value with the hash collision number that is easier to perform. The number of times of collision is counted as a final sorting result, so that the amount of intermediate data can be greatly reduced, the MapReduce processing speed is increased, and the processing result is fed back to a user more quickly.
For server operation, data differentiation and data mirroring are utilized, and indiscriminate data service is provided by a cloud computing providing technology.
Example 9:referring to fig. 1, a schematic diagram of a large-scale high-dimensional image retrieval system based on an inverted position-sensitive hash index in a cloud computing environment is described as follows: establishing a distributed inverted space high-dimensional index based on LSH to perform knN query, setting a data set as S and a query object as Q, initializing a similar point set of a correlation function h, h belongs to G and the corresponding Q for each Q, and performing Guan Haxi function collision in a region with a radius of R, wherein each point of the hash is hashvalue=computer (Q, h), and an error calibration range (1+theta) R is given to the collision region because of a certain error in the R region. And due to the existence of system errors, the collision needs to be carried out for multiple times to screen the final productResults: the collision occurs between q and adjacent data, and at the same time, a small part of non-adjacent data collides with the data; then, performing second collision and third collision until all data are hashed, screening out points d=computer (q, c) with relatively high collision times, wherein the probability of the points being the adjacent points of q is highest, and then sorting out the data for integration.
Example 10:referring to fig. 2, in the kNN query index of the large-scale high-dimensional image, since kNN algorithm is based on a distributed inverted grid index, we set the grid cell size as "g", and set a query point q, and use Aq to represent the grid where q points are located, use r as radius, use q as center of circle to make a circle, and P (xi, yi) represents the nearest distance of the q points in. For a specific hash function, a point q is given, when the adjacent point is calculated, the function is firstly divided into a plurality of 'barrel' keys (referring to the existence area of the specific relation, which is expressed by hv1, hv2, hv3, hvN), then the function is correspondingly hashed into the specific barrel by the relation between the data and the hash function, and meanwhile, q collides with the data in the barrel with the corresponding relation, namely, key i= hvi =g (i), and the data with higher times of collision with q are screened out to integrate, so that the similar data of a certain specific point can be obtained.
For the index of a picture, more features are needed to determine a specific picture, so that similar data of other features are required to be obtained, for example, given another point P, similar data are also built in the same data table, R is taken as a radius, P is taken as a center of a circle, a neighbor data point is found, the nearest neighbor of P and q points possibly has the same point M, the probability that M is the feature data point of the index is particularly high, and the like, and hashing is performed until final data are found.
Example 11:referring to fig. 3, in consideration of the portable characteristic of the mobile terminal, the limitation of software and hardware resources and the advantages of cloud computing, the large-scale high-dimensional image retrieval system based on the inverted position-sensitive hash index uses a thin client mode of a C/S architecture, a cloud server is responsible for main data processing work, a client only needs to simply send a required chart, receive and display a result, and a handheld device client is used for processing the data in a 2G/3G/4G mode or based on the dataThe wireless network of the WIFI is accessed to the mobile internet to establish a connection with the cloud server, the client is responsible for displaying pictures and carrying relevant parameters, such as picture feature data information, to send a request to the cloud server, after a mobile phone logs in, the cloud server sends picture features to the cloud server, the cloud server adopts a high-dimensional space mass data distributed space inverted index technology designed by us, namely, an LSH technology sends picture feature data to the cloud server, and the cloud server rapidly searches pictures required by a user from massive charts by adopting a parallelization kNN query technology. And inquiring the client software interface to which the picture is sent, thereby completing the requirement of the user.
Example 12:referring to fig. 4, the present invention is a system for searching pictures by using an intelligent mobile platform, which includes a cloud server and a mobile client, specifically, software installed on the intelligent mobile platform (such as a smart phone or a tablet computer) for users to use respectively. The client comprises basic functions of gallery, photo shooting, transmission, picture scanning and the like, and the cloud server is responsible for controlling the whole picture searching process and processing related data (including establishing an LSH index, inquiring distributed kNN and the like).
Example 13:referring to fig. 5, the steps involved in implementing the graph lookup by the present invention are as follows: the user obtains the required similar pictures through shooting or other ways, scans through the client, the mobile phone client finds out the picture characteristics and uploads the picture characteristics to the cloud server, the cloud server performs data processing through the uploaded picture characteristic data, finds out the similar characteristic picture data, then returns the data to the client, and the client indexes the pictures to be displayed on a software interface so as to complete tasks. After the pictures are displayed on the client interface, picture numbers, source links and the like are displayed, so that a user can know chart information more deeply.
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art, who is within the scope of the present invention, should be covered by the protection scope of the present invention by making equivalents and modifications to the technical solution and the inventive concept thereof.

Claims (1)

1. A high-dimensional space kNN query method based on inverted position sensitive hash index is established, which is characterized in that: the method comprises the following steps: setting the size of a grid unit as a front-name, setting a query point q, using Aq to represent the grid where the q point is located, using r as a radius, q as a circle center, for a specific Hash function, setting a point q, when finding adjacent points, dividing the function into a plurality of Hash barrels, using the Hash barrels as keys, using a point set in the Hash barrels as Value, using the Hash barrels to refer to a region with a specific relation, using hv1, hv2 and hv3 to hvN to represent the Hash, using the relation between the data and the Hash function to Hash into the specific barrel, and simultaneously enabling the q to collide with the data in the barrel with the corresponding relation, screening out the data with higher q collision times, and integrating the data with the specific point to obtain the similar data of the specific point; for the index of a picture, more features are needed to determine a specific picture, similar data of other features are required to be obtained, another point P is given, the similar data is also built in the same data table, R is taken as a radius, P is taken as a circle center, adjacent data points are found, the same point M possibly appears in the nearest neighbor of P and q points, the probability that M is the index feature data point is particularly high, and the like is hashed until final data are found;
the method for screening the data with higher q collision times comprises the following steps:
setting a high-dimensional data set as S, wherein S is an existing large-scale image library in an image retrieval system, a query object set is Q, Q is a query image object, a feature set is formed after high-dimensional feature extraction is performed, for each query object Q belongs to Q, an association function h is initialized, h belongs to G, G is a hash family, LSH is a multi-round hash algorithm, different hash functions can obtain different hash results, h corresponds to a Q similarity point set, a radius set R= getCandidates (hashvalue), hash value is a hash value, different hash values are obtained, and R is a bucket width; carrying out hash collision with Guan Haxi function in a certain radius r, wherein each object in the hash is hash value=computer (q, h), each hash value is obtained through a hash calculation function, the hash is carried out for multiple times until all data are hashed, objects d=computer (q, c) with relatively high collision times are screened out, the probability that the object d is the nearest point of the object q is highest, the object d is the object with highest collision probability, the actual value is replaced by collision count statistics, the number of times of collision is taken as the final sequencing result, the intermediate data quantity is reduced, and the processing speed of MapReduce is accelerated;
in order to correct the error of the area with the radius r, an error calibration range (1+theta) r is set for the collision area; when hash collision is carried out, counting the number of times of collision as a final sorting basis for a hash function F (h), wherein the more the number of times of collision is, the more the sorting is, and the collision needs to be carried out for a plurality of times to screen a final result due to the existence of a system error: the collision occurs between q and adjacent data, and at the same time, a small part of non-adjacent data collides with the data; then performing second collision and third collision until all data are hashed, screening out points d=computer (q, c) with relatively high collision times, wherein the probability of the points being the adjacent points of q is highest, and then sorting out the data for integration.
CN201910325257.7A 2016-02-05 2016-02-05 High-dimensional space kNN query method based on inverted position sensitive hash index Active CN110046268B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910325257.7A CN110046268B (en) 2016-02-05 2016-02-05 High-dimensional space kNN query method based on inverted position sensitive hash index

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910325257.7A CN110046268B (en) 2016-02-05 2016-02-05 High-dimensional space kNN query method based on inverted position sensitive hash index
CN201610083263.2A CN105760469B (en) 2016-02-05 2016-02-05 Higher-dimension approximation image retrieval method based on the row of falling LSH under cloud computing environment

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201610083263.2A Division CN105760469B (en) 2016-02-05 2016-02-05 Higher-dimension approximation image retrieval method based on the row of falling LSH under cloud computing environment

Publications (2)

Publication Number Publication Date
CN110046268A CN110046268A (en) 2019-07-23
CN110046268B true CN110046268B (en) 2024-04-05

Family

ID=56329766

Family Applications (3)

Application Number Title Priority Date Filing Date
CN201610083263.2A Active CN105760469B (en) 2016-02-05 2016-02-05 Higher-dimension approximation image retrieval method based on the row of falling LSH under cloud computing environment
CN201910325257.7A Active CN110046268B (en) 2016-02-05 2016-02-05 High-dimensional space kNN query method based on inverted position sensitive hash index
CN201910324441.XA Pending CN110059208A (en) 2016-02-05 2016-02-05 It is filtered out and the higher distributed data processing method of query point collision frequency using inverted index

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN201610083263.2A Active CN105760469B (en) 2016-02-05 2016-02-05 Higher-dimension approximation image retrieval method based on the row of falling LSH under cloud computing environment

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN201910324441.XA Pending CN110059208A (en) 2016-02-05 2016-02-05 It is filtered out and the higher distributed data processing method of query point collision frequency using inverted index

Country Status (1)

Country Link
CN (3) CN105760469B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106777130B (en) * 2016-12-16 2020-05-12 西安电子科技大学 A kind of index generation method, data retrieval method and device
CN107391554B (en) * 2017-06-07 2021-10-01 中国人民解放军国防科学技术大学 Efficient Distributed Locality-Sensitive Hashing Method
CN107818147A (en) * 2017-10-19 2018-03-20 大连大学 Distributed temporal index system based on Voronoi diagram
CN109271437A (en) * 2018-09-27 2019-01-25 智庭(北京)智能科技有限公司 A kind of Query method in real time of magnanimity rent information
CN110059634B (en) * 2019-04-19 2023-04-18 山东博昂信息科技有限公司 Large-scene face snapshot method
CN110222775B (en) * 2019-06-10 2021-05-25 北京字节跳动网络技术有限公司 Image processing method, image processing device, electronic equipment and computer readable storage medium
EP3779733A1 (en) * 2019-08-12 2021-02-17 Universität Bern Information retrieval method
CN110569244A (en) * 2019-08-30 2019-12-13 深圳计算科学研究院 Hamming space approximate query method and storage medium
CN110795432B (en) * 2019-10-29 2024-08-30 腾讯云计算(北京)有限责任公司 Retrieval method and device of feature data and storage medium
CN113010525B (en) * 2021-04-01 2023-08-01 东北大学 A PID-based Parallel KNN Query Processing Method for Ocean Spatiotemporal Big Data
CN113407749B (en) * 2021-06-28 2024-04-30 北京百度网讯科技有限公司 Picture index construction method and device, electronic equipment and storage medium
CN115129921B (en) * 2022-06-30 2023-05-26 重庆紫光华山智安科技有限公司 Picture retrieval method, apparatus, electronic device, and computer-readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103324650A (en) * 2012-10-23 2013-09-25 深圳市宜搜科技发展有限公司 Image retrieval method and system
CN103455531A (en) * 2013-02-01 2013-12-18 深圳信息职业技术学院 Parallel indexing method supporting real-time biased query of high dimensional data
CN104035949A (en) * 2013-12-10 2014-09-10 南京信息工程大学 Similarity data retrieval method based on locality sensitive hashing (LASH) improved algorithm
CN104199827A (en) * 2014-07-24 2014-12-10 北京大学 Locality-sensitive-hashing-based high-dimensional indexing method for large-scale multimedia data
CN104391908A (en) * 2014-11-17 2015-03-04 南京邮电大学 Locality sensitive hashing based indexing method for multiple keywords on graphs

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102722554B (en) * 2012-05-28 2014-07-02 中国人民解放军信息工程大学 Randomness weakening method of location-sensitive hash
CN103488679A (en) * 2013-08-14 2014-01-01 大连大学 Inverted grid index-based car-sharing system under mobile cloud computing environment
CN104699701A (en) * 2013-12-05 2015-06-10 深圳先进技术研究院 Parallel nearest node computing method and distributed system based on sensitive hashing
CN103744934A (en) * 2013-12-30 2014-04-23 南京大学 Distributed index method based on LSH (Locality Sensitive Hashing)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103324650A (en) * 2012-10-23 2013-09-25 深圳市宜搜科技发展有限公司 Image retrieval method and system
CN103455531A (en) * 2013-02-01 2013-12-18 深圳信息职业技术学院 Parallel indexing method supporting real-time biased query of high dimensional data
CN104035949A (en) * 2013-12-10 2014-09-10 南京信息工程大学 Similarity data retrieval method based on locality sensitive hashing (LASH) improved algorithm
CN104199827A (en) * 2014-07-24 2014-12-10 北京大学 Locality-sensitive-hashing-based high-dimensional indexing method for large-scale multimedia data
CN104391908A (en) * 2014-11-17 2015-03-04 南京邮电大学 Locality sensitive hashing based indexing method for multiple keywords on graphs

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
pingfei zhu 等.Efficient k- Nearest Neighbors Search in High Dimensions using MapReduce.2015 IEEE Fifth International Conference on Big Data and Cloud Computing.2015,23-30. *
邱文明.基于位置敏感哈希的近似kNN查询算法研究.中国优秀硕士学位论文数据库 信息科技辑.2015,(第7期),33-42. *

Also Published As

Publication number Publication date
CN105760469A (en) 2016-07-13
CN105760469B (en) 2019-05-31
CN110046268A (en) 2019-07-23
CN110059208A (en) 2019-07-26

Similar Documents

Publication Publication Date Title
CN110046268B (en) High-dimensional space kNN query method based on inverted position sensitive hash index
CN110175258B (en) Mobile perception data query method based on position sensitive hash index
US10102227B2 (en) Image-based faceted system and method
TWI745818B (en) Method and electronic equipment for visual positioning and computer readable storage medium thereof
WO2021057744A1 (en) Positioning method and apparatus, and device and storage medium
JP5282658B2 (en) Image learning, automatic annotation, search method and apparatus
TWI761851B (en) Image processing method, image processing apparatus, electronic device, and computer-readable storage medium
RU2533441C2 (en) Method and apparatus for facilitating content-based image search
CN102831405B (en) Method and system for outdoor large-scale object identification on basis of distributed and brute-force matching
TWI785267B (en) Method and electronic apparatus for image processing and storage medium thereof
KR20100068468A (en) Method, apparatus and computer program product for performing a visual search using grid-based feature organization
CN105824928A (en) Mobile terminal, server, content-based image recognition search method and system
JP2023520625A (en) IMAGE FEATURE MATCHING METHOD AND RELATED DEVICE, DEVICE AND STORAGE MEDIUM
WO2019080908A1 (en) Image processing method and apparatus for implementing image recognition, and electronic device
CN102880854A (en) Distributed processing and Hash mapping-based outdoor massive object identification method and system
WO2022142049A1 (en) Map construction method and apparatus, device, storage medium, and computer program product
WO2023221790A1 (en) Image encoder training method and apparatus, device, and medium
CN103999097A (en) Systems and methods for compact descriptors for visual search
CN113378005B (en) Event processing method, device, electronic equipment and storage medium
KR20130036839A (en) Apparatus and method for image matching in augmented reality service system
CN116457795A (en) Apparatus and method for providing computationally efficient neural networks
KR101896177B1 (en) An Image Searching System Providing Multiplex Result
CN112581516B (en) Image matching method, device, electronic device and storage medium
CN112307829B (en) Digital retina mass target retrieval space-time matrix presentation method
Im et al. User-assisted OCR on outdoor images for approximate positioning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant