
WO2017082875A1 - Data allocation based on secure information retrieval - Google Patents


Info

Publication number
WO2017082875A1
Authority
WO
WIPO (PCT)
Prior art keywords
term, secure, query, processor, query term
Application number
PCT/US2015/059934
Other languages
French (fr)
Inventor
Mehran KAFAI
Manav DAS
Original Assignee
Hewlett Packard Enterprise Development Lp
Application filed by Hewlett Packard Enterprise Development LP
Priority to PCT/US2015/059934
Priority to US15/774,708 (granted as US10783268B2)
Publication of WO2017082875A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60: Protecting data
    • G06F21/62: Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218: Protecting access to data via a platform, e.g. using keys or access control rules, to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6227: Protecting access to data via a platform, to a system of files or objects, where protection concerns the structure of data, e.g. records, types, queries
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval of structured data, e.g. relational data
    • G06F16/24: Querying
    • G06F16/245: Query processing
    • G06F16/2455: Query execution
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval of structured data, e.g. relational data
    • G06F16/24: Querying
    • G06F16/245: Query processing
    • G06F16/2457: Query processing with adaptation to user needs
    • G06F16/24578: Query processing with adaptation to user needs using ranking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60: Protecting data
    • G06F21/62: Protecting access to data via a platform, e.g. using keys or access control rules

Definitions

  • Figure 1 is a functional block diagram illustrating one example of a system for data allocation based on secure information retrieval.
  • Figure 4 is a flow diagram illustrating one example of a method for providing a query term to a target dataset in data allocation based on secure information retrieval.
  • Data storage products are often used to store large amounts of similar data.
  • In some instances, human error or system error outside the backup device may result in a data item being erroneously copied to a device other than the one for which it was intended. This may result in data loss and/or accidental exposure of the data item to a third party, potentially with serious legal and/or commercial ramifications.
  • Data allocation based on secure information retrieval is a secure protocol that allows one party to retrieve information from a plurality of second parties without revealing the data that supports the information.
  • One example is a system including an information processor communicatively linked to a query processor and a plurality of data processors respectively associated with a plurality of datasets. The information processor receives a request from the query processor for identification of a target dataset to be associated with a query term.
  • U is a very large integer relative to H. In one example, U is a power of 2.
  • the information processor 106 receives, from the query processor 102, the request 104 for identification of a target dataset 116 to be associated with the query term.
  • the query term may be an N-dimensional vector with numerical, real-valued components.
  • identification generally refers to identifying a target dataset that may be a suitable destination for the query term. For example, identification may mean identifying a dataset that includes terms that are most similar to the query term.
  • the term “similar” may be used broadly to include any type of similarity for data terms.
  • as described herein, in some examples, system 100 may be provided with values for hash count H and hash universe size U.
  • the query processor 102 and the plurality of data processors respectively apply a predetermined orthogonal transform to the query term and the data terms in the respective datasets, and select the top hashes.
  • the transformation of the query term and the plurality of terms may be based on the hash number H.
  • the hash transform may be an orthogonal transformation.
  • the term “orthogonal transformation” as used herein generally refers to a linear transformation between two linear spaces that preserves their respective linear structure (e.g., preserves an inner product).
  • the hash transform may be a Walsh-Hadamard transformation (“WHT”).
  • hash transform may be the WHT applied to the query term and the plurality of terms to provide coefficients of the WHT.
  • a WHT is an orthogonal, non-sinusoidal transform that takes a signal as input and outputs a set of basis functions. The output functions are known as Walsh functions.
  • a Walsh function takes two values: +1 and -1.
  • performing an orthogonal transform on a data term provides a set of coefficients associated with the data term.
  • the information processor 106 may provide a collection of coefficients
  • Alternative forms of orthogonal transforms may be utilized.
  • the largest H coefficients (based on the hash number) of the WHT may comprise a transformed query term and a plurality of transformed data terms.
  • the H largest coefficients may be selected.
  • the information processor 106 may determine an index of hash positions based on the orthogonal transform.
  • positions 1 and 5 are indicative of data terms A and C since these data terms have hashes at the hash positions 1 and 5, as illustrated in Table 1.
  • position 13 is indicative of data terms A and B since these data terms have hashes at the hash position 13, as illustrated in Table 1.
  • the query term and the plurality of terms in the datasets may be transformed based on the random permutation.
  • the transformation may comprise an extension of a numerical vector by concatenating it with itself to generate a vector of length U.
  • the permutation may be utilized to generate a transformed query term.
  • the query term e.g. a numerical vector
  • the permutation may be applied to the extended vector, and then the orthogonal transform may be applied to generate a transformed query term.
  • the result is an H-dimensional feature vector that represents a secure version of the query term. Secure versions of other data terms may be generated in like manner in the respective datasets.
  • the data terms A and B have position 13 in common in their respective sets of hashes. Accordingly, the similarity score for the pair (A,B), denoted as S(A,B) may be determined to be 1. Also, for example, as illustrated in Table 2, the data terms A and C have positions 1 and 5 in common in their respective sets of hashes. Accordingly, the similarity score for the pair (A,C), denoted as S(A,C) may be determined to be 2. As another example, as illustrated in Table 2, the data terms B and C have position 7 in common in their respective sets of hashes. Accordingly, the similarity score for the pair (B,C), denoted as S(B,C) may be determined to be 1.
  • the information processor 106 receives, from the query processor 102, the secure version of the query term, where the secure version is based on the hash number and the permutation, as described herein. Likewise, the information processor 106 receives, from each of the plurality of data processors (e.g., Data Processor 1 110(1), Data Processor 2 110(2),..., Data Processor Y 110(y)), secure versions of a collection of candidate terms, where each candidate term represents a cluster of similar terms in the associated dataset, and where the secure versions are based on the hash number and the permutation.
  • the information processor 106 determines similarity scores between the secure version of the query term and the secure versions of the candidate terms, where the similarity score is indicative of proximity of the secure versions of the candidate terms to the secure version of the query term, and based on shared data elements between the secure version of the query term and the secure versions of the candidate terms.
  • the secure version of the query term and each secure candidate term may be of same length
  • the similarity score may be a ratio of a number of shared data elements between the secure version of the query term and a secure candidate term to that length.
  • the information processor 106 identifies the target dataset 116 of the plurality of datasets based on the determined similarity scores. In some examples, the information processor 106 selects, for each of the plurality of data processors (e.g., Data Processor 1 110(1), Data Processor 2 110(2),..., Data Processor Y 110(y)), a representative term of the collection of candidate terms. For example, the information processor 106 may compute the similarity between the secure version of the query term and the secure versions of the candidate terms received from all the data processors by measuring overlaps of the hashes. The information processor 106 may then determine, for each data processor, the top candidate term defined as the candidate term with the highest similarity to the query term, and select this top candidate term as the representative term for the data processor. In some examples, more than one representative term may be selected for a data processor. Thereafter, the information processor 106 provides, to each of the plurality of data processors, the respective representative terms.
  • the data processor may determine the comparative statistic between the secure version of the query term and the secure data terms in the cluster associated with the representative term without knowledge of the query term, where the determination is based on the similarity scores (determined at the information processor 106) between the secure version of the query term and the secure version of the representative term.
  • the main idea behind the selection of the target dataset 116 is to estimate the similarity between the secure version of the query term and all secure terms in the data processor, Data Processor 1 110(1), in the cluster associated with the representative term by only knowing the similarity score between the secure version of the query term and the secure version of the representative term from the data processor, Data Processor 1 110(1).
  • the data processors each share a secure version of the representative term with the information processor 106, and the query processor 102 only shares a secure version of the query term with the information processor 106.
  • the information processor 106 is not privy to the actual composition of the query term in the query processor 102, and the plurality of terms in the dataset associated with a data processor, say Data Processor 1 110(1).
  • the query processor 102 has no knowledge of the actual composition of the plurality of terms in the dataset associated with a data processor, say Data Processor 1 110(1).
  • the dataset associated with a data processor, say Data Processor 1 110(1) has no knowledge of the actual composition of the query term in the query processor 102.
  • the information processor 106 computes similarity scores between the secure version of the query term and the secure versions of the representative terms, and provides the determined similarity scores to the data processor.
  • the data processor utilizes the comparative distribution techniques disclosed herein, to determine similarity scores between the secure version of the query term and the plurality of secure versions of data terms based on the similarity scores between the secure version of the query term and the secure versions of the representative terms received from the information processor 106.
  • the information processor 106 provides the query term to the identified target dataset 116.
  • the plurality of datasets may be a respective plurality of secure storage containers, and the query term may be a data term to be stored in a target storage container associated with the target dataset 116. Accordingly, the information processor 106 is able to provide the query term to the data container that has data elements that are most similar to the query term, thereby minimizing errors in data allocation.
  • the information processor 106 provides the identity of the target dataset 116 to the query processor 102, which, in turn, sends the query term directly to the identified target dataset 116.
  • the information processor 106 may select a target cluster of the target dataset 116 based on the determined similarity scores, and may associate the query term with the target cluster in the target dataset 116.
  • the target cluster may be the cluster associated with the representative term.
  • the target dataset 116 may be a secure storage container, and the information processor 106 may associate the query term with a cluster in the secure storage container.
  • the secure storage container selected as the target dataset 116 may comprise partitions for data storage, and the information processor 106 may associate the query term with a partition of the secure storage container.
  • the information processor 106 provides the identity of the target dataset 116 to the query processor 102, which, in turn, sends the query term directly to the identified target dataset 116. In some examples, the information processor 106 provides the identity of the target cluster in the target dataset 116 to the query processor 102, which, in turn, sends the query term directly to the identified target cluster in the target dataset 116.
  • the information processor 106 may receive a second query term from the query processor 102. In some examples, the information processor 106 may determine if the second query term is similar to the query term, and upon a determination that the second query term is similar to the query term, provide the second query term to the identified target dataset 116. In some examples, upon a determination that the second query term is not similar to the query term, the information processor 106 may identify a second target dataset, and provide the second query term to the identified second target dataset.
  • the information processor 106 may rank the plurality of datasets based on the determined similarity scores, and/or the comparative statistic. For example, the representative terms may be ranked based on respective similarity scores. Accordingly, the associated datasets may be ranked based on the ranking for the representative terms.
  • the comparative statistic may be utilized to rank the plurality of datasets, and the information processor 106 may provide a list of top-k datasets to the query processor 102. The query processor 102 may then prompt the information processor 106 to provide the query term to a sub-plurality of the top-k datasets.
  • the information processor 106 may rank the clusters in a given dataset, and may provide the query term to a sub-plurality of the clusters based on the determined ranking.
  • the components of system 100 may be computing resources, each including a suitable combination of a physical computing device, a virtual computing device, a network, software, a cloud infrastructure, a hybrid cloud infrastructure that may include a first cloud infrastructure and a second cloud infrastructure that is different from the first cloud infrastructure, and so forth.
  • the components of system 100 may be a combination of hardware and programming for performing a designated visualization function.
  • each component may include a processor and a memory, while programming code is stored on that memory and executable by a processor to perform a designated visualization function.
  • the information processor 106 may be a combination of hardware and programming for performing a designated function.
  • the information processor 106 may include programming to receive the query term and the candidate terms, and determine similarity scores for the query term and the candidate terms.
  • the information processor 106 may include hardware to physically store the similarity scores, and processors to physically process the received terms and determined similarity scores.
  • information processor 106 may include software programming to dynamically interact with the other components of system 100.
  • the components of system 100 may include programming and/or physical networks to be communicatively linked to other components of system 100.
  • the components of system 100 may include a processor and a memory, while programming code is stored on that memory and executable by a processor to perform designated functions.
  • FIG. 2 is a block diagram illustrating one example of a computer readable medium for data allocation based on secure information retrieval.
  • Processing system 200 includes a processor 202, a computer readable medium 208, input devices 204, and output devices 206.
  • Processor 202, computer readable medium 208, input devices 204, and output devices 206 are coupled to each other through a communication link (e.g., a bus).
  • Processor 202 executes instructions included in the computer readable medium 208.
  • Computer readable medium 208 includes request receipt instructions 210 to receive, from a query processor, a request for identification of a target dataset to be associated with a query term, the request including a hash length and a hash number.
  • Computer readable medium 208 includes permutation generation instructions 212 to generate a random permutation based on the hash length.
  • Computer readable medium 208 includes secure version of the query term receipt instructions 214 to receive, from the query processor, a secure version of the query term, the secure version based on the hash number and the permutation.
  • Computer readable medium 208 includes candidate term receipt instructions 216 to receive, from each of the plurality of data processors associated with a plurality of datasets, secure versions of a collection of candidate terms, where each candidate term represents a cluster of similar terms in the associated dataset, and where the secure versions are based on the hash number and the permutation.
  • Computer readable medium 208 includes similarity score determination instructions 218 to determine similarity scores between the secure version of the query term and secure versions of the candidate terms.
  • Computer readable medium 208 includes target dataset identification instructions 220 to identify the target dataset of the plurality of datasets based on the determined similarity scores.
  • Computer readable medium 208 includes query term provide instructions 222 to provide the query term to the identified target dataset.
  • Input devices 204 include a keyboard, mouse, data ports, and/or other suitable devices for inputting information into processing system 200.
  • Input devices 204, such as a computing device, are used by the interaction processor to receive a query term.
  • Output devices 206 include a monitor, speakers, data ports, and/or other suitable devices for outputting information from processing system 200. In some examples, output devices 206 are used to provide the query term to the target dataset.
  • a “computer readable medium” may be any electronic, magnetic, optical, or other physical storage apparatus to contain or store information such as executable instructions, data, and the like.
  • any computer readable storage medium described herein may be any of Random Access Memory (RAM), volatile memory, non-volatile memory, flash memory, a storage drive (e.g., a hard drive), a solid state drive, and the like, or a combination thereof.
  • the programming may be processor executable instructions stored on tangible computer readable medium 208, and the hardware may include Processor 202 for executing those instructions.
  • computer readable medium 208 may store program instructions that, when executed by Processor 202, implement the various components of the processing system 200.
  • Such computer readable storage medium or media is (are) considered to be part of an article (or article of manufacture).
  • An article or article of manufacture can refer to any manufactured single component or multiple components.
  • the storage medium or media can be located either in the machine running the machine-readable instructions, or located at a remote site from which machine-readable instructions can be downloaded over a network for execution.
  • Computer readable medium 208 may be any of a number of memory components capable of storing instructions that can be executed by processor 202. Computer readable medium 208 may be non-transitory in the sense that it does not encompass a transitory signal but instead is made up of one or more memory components configured to store the relevant instructions. Computer readable medium 208 may be implemented in a single device or distributed across devices. Likewise, processor 202 represents any number of processors capable of executing instructions stored by computer readable medium 208. Processor 202 may be integrated in a single device or distributed across devices. Further, computer readable medium 208 may be fully or partially integrated in the same device as processor 202 (as illustrated), or it may be separate but accessible to that device and processor 202. In some examples, computer readable medium 208 may be a machine-readable storage medium.
  • Figure 3 is a flow diagram illustrating one example of a method for data allocation based on secure information retrieval.
  • a request may be received from a query processor, for identification of a target dataset to be associated with a query term, the request including a hash length and a hash number.
  • a random permutation may be generated based on the hash length.
  • secure versions of a collection of candidate terms may be received from each of the plurality of data processors associated with a plurality of datasets, where each candidate term represents a cluster of similar terms in the associated dataset, and where the secure versions are based on the hash number and the permutation.
  • similarity scores may be determined between the secure version of the query term and secure versions of the candidate terms.
  • a representative term of the collection of candidate terms may be selected for each of the plurality of data processors.
  • a comparative statistic between the representative term and its cluster of similar terms may be received from each of the plurality of data processors.
  • the plurality of datasets may be a respective plurality of secure storage containers
  • the query term may be a data term to be stored in a target storage container associated with the target dataset.
  • Figure 4 is a flow diagram illustrating one example of a method for providing a query term to a target dataset in data allocation based on secure information retrieval.
  • a request may be received from a query processor, for identification of a target dataset to be associated with a query term, the request including a hash length and a hash number.
  • a random permutation may be generated based on the hash length.
  • a secure version of the query term may be received from the query processor, the secure version based on the hash number and the permutation.
  • secure versions of a collection of candidate terms may be received from each of the plurality of data processors associated with a plurality of datasets, where each candidate term represents a cluster of similar terms in the associated dataset, and where the secure versions are based on the hash number and the permutation.
  • similarity scores may be determined between the secure version of the query term and secure versions of the candidate terms.
  • a representative term of the collection of candidate terms may be selected for each of the plurality of data processors.
  • a comparative statistic between the representative term and its cluster of similar terms may be received from each of the plurality of data processors.
  • the query term may be provided to the identified target dataset.
  • Figure 5 is a flow diagram illustrating one example of a method for associating a query term with a target cluster in a target dataset in data allocation based on secure information retrieval.
  • a random permutation may be generated based on the hash length.
  • the target dataset of the plurality of datasets may be identified based on the comparative statistic.
  • a first target dataset associated with a first query term may be identified.
  • a second target dataset may be identified based on methods described herein, and the second query term may be associated with the second target dataset. In some examples, the second query term may be provided to the second target dataset.
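The secure-version construction described in the snippets above (extend the numerical term to length U by concatenating it with itself, apply the shared random permutation, apply the Walsh-Hadamard transform, and keep the positions of the H largest coefficients) can be sketched in a few lines. This is an illustrative Python sketch, not code from the patent; the function names and the use of a set of index positions as the "secure version" are assumptions.

```python
import heapq

def fwht(vec):
    """Iterative fast Walsh-Hadamard transform; len(vec) must be a power of 2."""
    a = list(vec)
    h = 1
    while h < len(a):
        for i in range(0, len(a), h * 2):
            for j in range(i, i + h):
                x, y = a[j], a[j + h]
                a[j], a[j + h] = x + y, x - y
        h *= 2
    return a

def secure_version(term, permutation, hash_count):
    """Return the positions of the H largest WHT coefficients of the
    extended, permuted term: a secure stand-in for the term itself."""
    U = len(permutation)                            # hash universe size, a power of 2
    extended = (term * (U // len(term) + 1))[:U]    # concatenate term with itself to length U
    permuted = [extended[p] for p in permutation]   # apply the shared random permutation
    coeffs = fwht(permuted)
    # the indices of the H largest coefficients act as the hashes
    return set(heapq.nlargest(hash_count, range(U), key=lambda i: coeffs[i]))
```

Because only these H index positions leave the query processor or a data processor, an intermediary can compare terms by position overlap without ever seeing the underlying vectors.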
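The worked similarity scores in the snippets (S(A,B) = 1, S(A,C) = 2, S(B,C) = 1, from the shared positions in Table 2) are simply counts of shared hash positions, and the ratio variant divides by the common length H. A minimal sketch, assuming secure versions are represented as Python sets; the concrete position values below are chosen only to reproduce the scores stated in the text.

```python
def similarity(secure_a, secure_b):
    """Similarity score: number of hash positions two secure versions share."""
    return len(secure_a & secure_b)

def similarity_ratio(secure_a, secure_b, hash_count):
    """Ratio of shared positions to the common secure-version length H."""
    return len(secure_a & secure_b) / hash_count

# Illustrative hash-position sets consistent with the scores in the text
A, B, C = {1, 5, 13}, {7, 13, 20}, {1, 5, 7}
assert similarity(A, B) == 1   # share position 13
assert similarity(A, C) == 2   # share positions 1 and 5
assert similarity(B, C) == 1   # share position 7
```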
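The ranking step described above (rank datasets by similarity score and return a top-k list to the query processor) reduces to an ordinary sort. A hedged sketch, with the dictionary-of-scores shape an assumption:

```python
def top_k_datasets(scores, k):
    """Rank dataset names by similarity score, highest first, and keep the top k."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# e.g. scores determined by the information processor for three datasets
ranks = top_k_datasets({"ds1": 2, "ds2": 5, "ds3": 3}, 2)   # ['ds2', 'ds3']
```

The query processor could then prompt for the query term to be provided to any sub-plurality of this list.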

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Data allocation based on secure information retrieval is disclosed. One example is a system including an information processor communicatively linked to a query processor and a plurality of data processors respectively associated with a plurality of datasets. The information processor receives a request from the query processor for identification of a target dataset to be associated with a query term. The information processor generates a random permutation, receives a secure version of the query term from the query processor, and receives secure versions of a collection of candidate terms from each of the plurality of data processors, each candidate term representing a cluster of similar terms in the associated dataset. The information processor determines similarity scores between the secure version of the query term and the secure versions of the candidate terms, and identifies the target dataset of the plurality of datasets based on the determined similarity scores.

Description

DATA ALLOCATION BASED ON SECURE INFORMATION RETRIEVAL

Background
[0001] Data storage products, such as data backup devices, may be used to store data that are similar. Separate storage products may store distinct types of data, while a given storage device may store similar data. In some examples, such storage devices may be secured. Data terms may be allocated to, or retrieved from, such storage devices.

Brief Description of the Drawings
[0002] Figure 1 is a functional block diagram illustrating one example of a system for data allocation based on secure information retrieval.
[0003] Figure 2 is a block diagram illustrating one example of a computer readable medium for data allocation based on secure information retrieval.
[0004] Figure 3 is a flow diagram illustrating one example of a method for data allocation based on secure information retrieval.
[0005] Figure 4 is a flow diagram illustrating one example of a method for providing a query term to a target dataset in data allocation based on secure information retrieval.
[0006] Figure 5 is a flow diagram illustrating one example of a method for associating a query term with a target cluster in a target dataset in data allocation based on secure information retrieval.
[0007] Figure 6 is a flow diagram illustrating one example of a method for providing a second query term to a target dataset in data allocation based on secure information retrieval.

Detailed Description
[0008] Data storage products, especially data backup devices, are often used to store large amounts of similar data. In some instances, human error, or system error outside the backup device, may result in a data item being erroneously copied to a device other than the one for which it was intended. This may result in data loss, and/or accidental exposure of the data item to a third party, potentially with serious legal and/or commercial ramifications.
[0009] Data terms may be allocated to, or retrieved from, such storage devices. In some examples, the storage devices may be secured. In some examples, such storage devices may require data to be encrypted before being stored. In some examples, different storage devices may require different encryptions. Error in data storage may lead to inadvertent security loopholes and/or breaches.
[0010] In some instances, a first party may desire to securely store information in a plurality of storage devices. Based on a volume of data, such situations may result in an increase in a number of secure computations and inter-party data exchanges. Also, for example, there may be intermediate information processors that may not be secure, and/or may have unreliable data protection mechanisms. In such instances, there is a need to not expose all the data from one or more of the parties. Accordingly, there is a need to compute similarity between data distributed over multiple parties, without exposing all the data from any party, and without a need for secure intermediaries.
[0011] Existing systems are generally directed to addressing the need for identifying storage devices with content similar to an incoming data element. However, such systems focus on identifying similarities with the data content after the data element has been stored in a storage device. Accordingly, there is a need to identify an appropriate data storage device prior to storing the incoming data element, while maintaining the anonymity of the data stored in the storage devices and the incoming data element.
[0012] As described in various examples herein, data allocation based on secure information retrieval is disclosed. Data allocation based on secure information retrieval is a secure protocol that allows one party to retrieve information from a plurality of second parties without revealing the data that supports the information. One example is a system including an information processor communicatively linked to a query processor and a plurality of data processors respectively associated with a plurality of datasets. The information processor receives a request from the query processor for identification of a target dataset to be associated with a query term. The information processor generates a random permutation, receives a secure version of the query term from the query processor, and receives secure versions of a collection of candidate terms from each of the plurality of data processors, each candidate term representing a cluster of similar terms in the associated dataset. The information processor determines similarity scores between the secure version of the query term and the secure versions of the candidate terms, and identifies the target dataset of the plurality of datasets based on the determined similarity scores.
[0013] In the following detailed description, reference is made to the accompanying drawings which form a part hereof, and in which is shown by way of illustration specific examples in which the disclosure may be practiced. It is to be understood that other examples may be utilized, and structural or logical changes may be made without departing from the scope of the present disclosure. The following detailed description, therefore, is not to be taken in a limiting sense, and the scope of the present disclosure is defined by the appended claims. It is to be understood that features of the various examples described herein may be combined, in part or whole, with each other, unless specifically noted otherwise.
[0014] Figure 1 is a functional block diagram illustrating one example of a system 100 for data allocation based on secure information retrieval. System 100 is shown to include an information processor 106 communicatively linked to a query processor 102 and a plurality of data processors (e.g., Data Processor 1 110(1), Data Processor 2 110(2),…, Data Processor Y 110(y)), respectively associated with a plurality of datasets or data containers (not shown). The query processor 102, the information processor 106, and the plurality of data processors are communicatively linked to one another via a network.
[0015] The term "system" may be used to refer to a single computing device or multiple computing devices that communicate with each other (e.g., via a network) and operate together to provide a unified service. In some examples, the components of system 100 may communicate with one another over a network. As described herein, the network may be any wired or wireless network, and may include any number of hubs, routers, switches, cell towers, and so forth. Such a network may be, for example, part of a cellular network, part of the internet, part of an intranet, and/or any other type of network. In some examples, the network may be a secured network.
[0016] In some examples, the datasets may include collections of data terms that may be represented as d-dimensional real-valued vectors. A query dataset associated with the query processor 102 may include a query term that it may want to move/copy to one of the plurality of datasets. The goal of the secure storage container allocation process described herein is to minimize information leakage while transferring the query term from the query dataset to a target dataset 116, where the target dataset 116 has a subset of data terms that have high similarity to the query term. Generally, the datasets may not want to share their information with other parties in the system 100.
[0017] To facilitate the secure storage container allocation process between all parties, a single intermediary processing node may be utilized, such as the information processor 106. For the purposes of this description, the information processor 106 may be assumed to be a semi-honest node, i.e., the information processor 106 follows the protocol as described herein; however, in some instances, it may utilize the messages that it receives to extract more information about the data terms. The query processor 102 sends a request 104 to the information processor 106. The request includes two parameters, a hash length U, and a hash number H indicative of a number of hashes per data term. The integer H may be experimentally determined based on the type and number of data terms in the incoming data stream. Generally, U is a very large integer relative to H. In one example, U is a power of 2. The information processor 106 receives, from the query processor 102, the request 104 for identification of a target dataset 116 to be associated with the query term. In some examples, the query term may be an N-dimensional vector with numerical, real-valued components. The term "identification" as used herein generally refers to identifying a target dataset that may be a suitable destination for the query term. For example, identification may mean identifying a dataset that includes terms that are most similar to the query term. The term "similar" may be used broadly to include any type of similarity for data terms.

[0018] As described herein, in some examples, system 100 may be provided with values for hash count H and hash universe size U. Generally, U is a very large integer relative to H and N. In some examples, U is a power of 2. In some examples, each 6000-dimensional vector (N = 6000) may be associated with 100 integers (H = 100) selected from the set {1, 2, 3,…, 2^18} (U = 2^18). Accordingly, the hash transform may transform a higher dimensional data term (e.g.
with 6000 dimensions) into a lower dimensional transformed data term (e.g. with 100 dimensions). The information processor 106 generates a random permutation based on the hash length U, and sends the permutation to the query processor 102 and the plurality of data processors (e.g., Data Processor 1 110(1), Data Processor 2 110(2),…, Data Processor Y 110(y)). Each of the plurality of data processors also receives the hash number H.
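As an illustrative sketch (not part of the disclosed protocol), the shared random permutation of this step may be generated as follows; the function name, the seeding, and the 0-based positions are assumptions made for the example:

```python
import random

def generate_permutation(u, seed=None):
    """Generate a shared random permutation of the u hash positions.

    The information processor would send this same permutation to the
    query processor and to every data processor, so that all parties
    hash their terms consistently. (0-based positions and explicit
    seeding are illustrative choices.)
    """
    rng = random.Random(seed)
    positions = list(range(u))
    rng.shuffle(positions)
    return positions
```

For a fixed seed the permutation is reproducible, which models the fact that every party must receive the same permutation.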
[0019] The query processor 102 and the plurality of data processors (e.g., Data Processor 1 110(1), Data Processor 2 110(2),…, Data Processor Y 110(y)) respectively apply a predetermined orthogonal transform to the query term and the data terms in the respective datasets, and select the top hashes. For example, the transformation of the query term and the plurality of terms may be based on the hash number H. In some examples, the hash transform may be an orthogonal transformation. The term "orthogonal transformation" as used herein generally refers to a linear transformation between two linear spaces that preserves their respective linear structure (e.g., preserves an inner product). In some examples, the hash transform may be a Walsh-Hadamard transformation ("WHT"). In some examples, the hash transform may be the WHT applied to the query term and the plurality of terms to provide coefficients of the WHT. A WHT is an orthogonal, non-sinusoidal transform that decomposes an input signal into a set of basis functions known as Walsh functions. A Walsh function takes only two values: +1 and -1.
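The Walsh-Hadamard transformation described above can be illustrated with a minimal, unnormalized fast WHT; this sketch assumes the input length is a power of 2 and is not a reference implementation of the disclosure:

```python
def fwht(vec):
    """Unnormalized fast Walsh-Hadamard transform.

    Returns the WHT coefficients of `vec`; len(vec) must be a power
    of 2. Uses O(n log n) in-place butterfly passes.
    """
    a = list(vec)
    h = 1
    while h < len(a):
        for i in range(0, len(a), h * 2):
            for j in range(i, i + h):
                x, y = a[j], a[j + h]
                a[j], a[j + h] = x + y, x - y
        h *= 2
    return a
```

For example, `fwht([1, 0, 1, 0])` yields `[2, 2, 0, 0]` in this (Hadamard) ordering.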
[0020] Generally, performing an orthogonal transform on a data term provides a set of coefficients associated with the data term. For example, after application of the WHT to the data term, the information processor 106 may provide a collection of coefficients of the WHT. Alternative forms of orthogonal transforms may be utilized as well. For example, a cosine transform may be utilized. In some examples, the largest H coefficients (based on the hash number) of the WHT may comprise a transformed query term and a plurality of transformed data terms.
[0021] In some examples, the H largest coefficients may be selected as the subset of hashes associated with a data term, say A. Table 1 illustrates an example association of data terms A, B, and C with sets of hashes:

[Table 1, listing an example set of H = 5 hashes for each of the data terms A, B, and C, is not reproduced in this extract.]
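The selection of the H largest coefficients as a hash set can be sketched as follows (1-based positions to match the tables; breaking ties by lowest position is an assumption of this example):

```python
def top_h_hashes(coeffs, h):
    """Return the 1-based positions of the h largest coefficients.

    These positions form the hash set associated with a data term,
    as in Table 1. Python's stable sort breaks ties in favor of the
    lower position.
    """
    order = sorted(range(len(coeffs)),
                   key=lambda i: coeffs[i], reverse=True)
    return {i + 1 for i in order[:h]}
```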
[0022] In some examples, the information processor 106 may determine an index of hash positions based on the orthogonal transform. Table 2 illustrates an example index of hash positions for data terms with U = 2^4 = 16 hash positions {1, 2,…, 16} for data terms A, B, and C, based on the sets of H = 5 hashes in Table 1:

[Table 2, mapping each of the 16 hash positions to the data terms that have a hash at that position, is not reproduced in this extract.]
[0023] As illustrated, positions 1 and 5 are indicative of data terms A and C since these data terms have hashes at the hash positions 1 and 5, as illustrated in Table 1. Likewise, position 13 is indicative of data terms A and B since these data terms have hashes at the hash position 13, as illustrated in Table 1.
[0024] In some examples, as described herein, the query term and the plurality of terms in the datasets may be transformed based on the random permutation. For example, the transformation may comprise an extension of a numerical vector by concatenating it with itself to generate a vector of length U. In some examples, the permutation may be utilized to generate a transformed query term. For example, the query term (e.g. a numerical vector) may be extended, the permutation may be applied to the extended vector, and then the orthogonal transform may be applied to generate a transformed query term. Accordingly, the result is an H-dimensional feature vector that represents a secure version of the query term. Secure versions of other data terms may be generated in like manner in the respective datasets. The term "secure version" as used herein generally refers to a transformed version of a data term (e.g., the transformed query term as described herein) that includes hashes that encode significant information about the data term. Such a transformed version is secure because it is generally difficult to identify the components of the original data term from the hashed version. At the same time, the secure version offers an easy way to compare data terms for significant overlaps that are indicative of a high degree of similarity, without compromising the contents of the individual data terms.
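A minimal sketch of the secure-version pipeline described in this paragraph — extend, permute, transform, keep the top H positions — is shown below; the 0-based positions, the unnormalized WHT, and the tie-breaking are illustrative assumptions rather than details from the disclosure:

```python
def secure_version(term, u, h, perm):
    """Compute a secure version (hash set) of a numeric data term.

    Steps, following the text above:
      1. Extend the term to length u by concatenating it with itself.
      2. Apply the shared random permutation.
      3. Apply the Walsh-Hadamard transform (u must be a power of 2).
      4. Keep the positions of the h largest coefficients as hashes.
    """
    # Step 1: self-concatenation up to length u.
    ext = (term * (u // len(term) + 1))[:u]
    # Step 2: reorder the extended vector by the shared permutation.
    a = [ext[p] for p in perm]
    # Step 3: in-place fast Walsh-Hadamard transform.
    step = 1
    while step < u:
        for i in range(0, u, step * 2):
            for j in range(i, i + step):
                x, y = a[j], a[j + step]
                a[j], a[j + step] = x + y, x - y
        step *= 2
    # Step 4: the hash set is the positions of the h largest coefficients.
    order = sorted(range(u), key=lambda i: a[i], reverse=True)
    return set(order[:h])
```

Because all parties use the same permutation, two similar terms tend to produce hash sets with large overlap, while the original vector components are not recoverable from the set alone.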
[0025] In some examples, the plurality of data processors (e.g., Data Processor 1 110(1), Data Processor 2 110(2),…, Data Processor Y 110(y)) generate clusters of a plurality of terms in a dataset associated with a respective data processor, the clusters being based on similarity scores for pairs of terms, and the data processor selects a candidate term from each cluster. For example, Data Processor 1 110(1) may be associated with a Dataset 1 including a plurality of terms (e.g., term 1, term 2,…, term X), and may generate clusters of the plurality of terms (e.g., term 1, term 2,…, term X).
[0026] A similarity score between two data terms may be determined based on a number of common hashes. This provides an approximate measure of similar data terms. The similarity score may be based on a number of overlaps between respective sets of hashes, and is indicative of proximity of the pair of data terms. Table 3 illustrates an example determination of similarity scores for pairs formed from the data terms A, B, and C:

[Table 3, listing the similarity scores for the pairs (A,B), (A,C), and (B,C), is not reproduced in this extract.]
[0027] As illustrated in Table 2, the data terms A and B have position 13 in common in their respective sets of hashes. Accordingly, the similarity score for the pair (A,B), denoted as S(A,B) may be determined to be 1. Also, for example, as illustrated in Table 2, the data terms A and C have positions 1 and 5 in common in their respective sets of hashes. Accordingly, the similarity score for the pair (A,C), denoted as S(A,C) may be determined to be 2. As another example, as illustrated in Table 2, the data terms B and C have position 7 in common in their respective sets of hashes. Accordingly, the similarity score for the pair (B,C), denoted as S(B,C) may be determined to be 1.
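The overlap-based similarity score can be computed directly on hash sets. The sets below are hypothetical (the original tables are not reproduced here) but are chosen to be consistent with the overlaps described above: A and B share position 13, A and C share positions 1 and 5, and B and C share position 7:

```python
def similarity(hashes_1, hashes_2):
    """Similarity score: the number of shared hash positions."""
    return len(hashes_1 & hashes_2)

# Hypothetical hash sets (H = 5 each), consistent with the text.
A = {1, 2, 3, 5, 13}
B = {4, 6, 7, 8, 13}
C = {1, 5, 7, 9, 10}

scores = {("A", "B"): similarity(A, B),   # shares position 13
          ("A", "C"): similarity(A, C),   # shares positions 1 and 5
          ("B", "C"): similarity(B, C)}   # shares position 7
```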
[0028] A dataset may be partitioned into, say, G clusters by grouping similar data terms together, where similarity is based on the similarity scores. In some examples, a Partitioning Around Medoids ("PAM") algorithm may be utilized to obtain G candidate terms, one from each of the G clusters. In some examples, G may be chosen based on heuristics, for example, G = 2V.
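A simplified medoid-based clustering loop conveys the flavor of this step; actual PAM performs a costlier swap phase, and the naive initialization and similarity function here are illustrative assumptions:

```python
def cluster_medoids(terms, g, sim, iters=10):
    """Simplified medoid-based clustering (a stand-in for PAM).

    `terms` is a list of hash sets, `sim` a similarity function;
    returns the g medoid terms, which serve as the candidate terms
    shared with the information processor.
    """
    medoids = terms[:g]  # naive initialization
    for _ in range(iters):
        # Assign each term to its most similar medoid.
        clusters = [[] for _ in range(g)]
        for t in terms:
            best = max(range(g), key=lambda i: sim(t, medoids[i]))
            clusters[best].append(t)
        # Re-pick each medoid as the member most similar to its cluster.
        new = []
        for i, members in enumerate(clusters):
            if not members:
                new.append(medoids[i])
                continue
            new.append(max(members,
                           key=lambda c: sum(sim(c, t) for t in members)))
        if new == medoids:
            break
        medoids = new
    return medoids
```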
[0029] The information processor 106 receives, from the query processor 102, the secure version of the query term, where the secure version is based on the hash number and the permutation, as described herein. Likewise, the information processor 106 receives, from each of the plurality of data processors (e.g., Data Processor 1 110(1), Data Processor 2 110(2),…, Data Processor Y 110(y)), secure versions of a collection of candidate terms, where each candidate term represents a cluster of similar terms in the associated dataset, and where the secure versions are based on the hash number and the permutation.
[0030] As described herein, the information processor 106 determines similarity scores between the secure version of the query term and the secure versions of the candidate terms, where the similarity score is indicative of proximity of the secure versions of the candidate terms to the secure version of the query term, and based on shared data elements between the secure version of the query term and the secure versions of the candidate terms. In some examples, the secure version of the query term and each secure candidate term may be of the same length, and the similarity score may be a ratio of the number of shared data elements between the secure version of the query term and a secure candidate term to that length.
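The ratio form of the similarity score described in this paragraph may be sketched as:

```python
def similarity_ratio(secure_query, secure_candidate):
    """Ratio of shared hash positions to the common length of the
    two secure versions (both are assumed to contain H hashes)."""
    assert len(secure_query) == len(secure_candidate)
    return len(secure_query & secure_candidate) / len(secure_query)
```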
[0031] The information processor 106 identifies the target dataset 116 of the plurality of datasets based on the determined similarity scores. In some examples, the information processor 106 selects, for each of the plurality of data processors (e.g., Data Processor 1 110(1), Data Processor 2 110(2),…, Data Processor Y 110(y)), a representative term of the collection of candidate terms. For example, the information processor 106 may compute the similarity between the secure version of the query term and the secure versions of the candidate terms received from all the data processors by measuring overlaps of the hashes. The information processor 106 may then determine, for each data processor, the top candidate term defined as the candidate term with the highest similarity to the query term, and select this top candidate term as the representative term for the data processor. In some examples, more than one representative term may be selected for a data processor. Thereafter, the information processor 106 provides, to each of the plurality of data processors, the respective representative terms.
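The selection of one representative term per data processor, described above, may be sketched as follows (the dictionary-based interface is an assumption made for illustration):

```python
def pick_representatives(query_hashes, candidates_by_processor):
    """For each data processor, pick the candidate term whose secure
    version has the highest overlap with the secure query term."""
    return {proc: max(candidates, key=lambda c: len(query_hashes & c))
            for proc, candidates in candidates_by_processor.items()}
```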
[0032] In some examples, each data processor may approximate the similarity between the secure version of the query term and the data terms from the cluster represented by the representative term determined by the information processor 106. Each data processor then computes a comparative statistic of all the approximated similarity scores between the query term and the terms in the cluster associated with the representative term. The term "comparative statistic" as used herein may be any statistical representation that captures a summary of the data (e.g., approximated similarity scores). In some examples, the comparative statistic may be one of a mean, median, or mode of the approximated similarity scores. In some examples, each data processor provides the comparative statistic between the representative term and its cluster of similar terms to the information processor 106. The information processor 106 compares the received statistics and identifies the target dataset 116 accordingly. For example, the information processor 106 may identify a dataset with the highest statistic.
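A small sketch of the comparative statistic (mean or median, as the text suggests; mode is omitted here and the choice of statistic is left to the deployment):

```python
def comparative_statistic(approx_scores, kind="mean"):
    """Summarize the approximated similarity scores between the
    query term and the members of a cluster."""
    scores = sorted(approx_scores)
    if kind == "median":
        n, mid = len(scores), len(scores) // 2
        return scores[mid] if n % 2 else (scores[mid - 1] + scores[mid]) / 2
    return sum(scores) / len(scores)
```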
[0033] In some examples, the data processor may determine the comparative statistic between the secure version of the query term and the secure data terms in the cluster associated with the representative term without knowledge of the query term, where the determination is based on the similarity scores (determined at the information processor 106) between the secure version of the query term and the secure version of the representative term. The main idea behind the selection of the target dataset 116 is to estimate the similarity between the secure version of the query term and all secure terms in the data processor, Data Processor 1 110(1), in the cluster associated with the representative term by only knowing the similarity score between the secure version of the query term and the secure version of the representative term from the data processor, Data Processor 1 110(1).
[0034] Accordingly, in the secure information retrieval described herein, the data processors each share a secure version of the representative term with the information processor 106, and the query processor 102 only shares a secure version of the query term with the information processor 106. Accordingly, the information processor 106 is not privy to the actual composition of the query term in the query processor 102, and the plurality of terms in the dataset associated with a data processor, say Data Processor 1 110(1). Also, for example, the query processor 102 has no knowledge of the actual composition of the plurality of terms in the dataset associated with a data processor, say Data Processor 1 110(1). Likewise, the dataset associated with a data processor, say Data Processor 1 110(1), has no knowledge of the actual composition of the query term in the query processor 102. The information processor 106 computes similarity scores between the secure version of the query term and the secure versions of the representative terms, and provides the determined similarity scores to the data processor. The data processor, in turn, utilizes the comparative distribution techniques disclosed herein, to determine similarity scores between the secure version of the query term and the plurality of secure versions of data terms based on the similarity scores between the secure version of the query term and the secure versions of the representative terms received from the information processor 106.
[0035] As described herein, another advantage of such indirect determination of similarity scores is that if the query processor 102 requests an additional target dataset for a second query term, the same secure versions of the representative terms may be utilized again to select the additional target dataset.
[0036] In some examples, the information processor 106 provides the query term to the identified target dataset 116. For example, the plurality of datasets may be a respective plurality of secure storage containers, and the query term may be a data term to be stored in a target storage container associated with the target dataset 116. Accordingly, the information processor 106 is able to provide the query term to the data container that has data elements that are most similar to the query term, thereby minimizing errors in data allocation. In some examples, the information processor 106 provides the identity of the target dataset 116 to the query processor 102, which, in turn, sends the query term directly to the identified target dataset 116.
[0037] In some examples, the information processor 106 may select a target cluster of the target dataset 116 based on the determined similarity scores, and may associate the query term with the target cluster in the target dataset 116. For example, the target cluster may be the cluster associated with the representative term. In some examples, the target dataset 116 may be a secure storage container, and the information processor 106 may associate the query term with a cluster in the secure storage container. In some examples, the secure storage container selected as the target dataset 116 may comprise partitions for data storage, and the information processor 106 may associate the query term with a partition of the secure storage container.
[0038] In some examples, the information processor 106 provides the identity of the target dataset 116 to the query processor 102, which, in turn, sends the query term directly to the identified target dataset 116. In some examples, the information processor 106 provides the identity of the target cluster in the target dataset 116 to the query processor 102, which, in turn, sends the query term directly to the identified target cluster in the target dataset 116.
[0039] In some examples, the information processor 106 may receive a second query term from the query processor 102. In some examples, the information processor 106 may determine if the second query term is similar to the query term, and upon a determination that the second query term is similar to the query term, provide the second query term to the identified target dataset 116. In some examples, upon a determination that the second query term is not similar to the query term, the information processor 106 may identify a second target dataset, and provide the second query term to the identified second target dataset.
[0040] In some examples, the information processor 106 may rank the plurality of datasets based on the determined similarity scores, and/or the comparative statistic. For example, the representative terms may be ranked based on respective similarity scores. Accordingly, the associated datasets may be ranked based on the ranking for the representative terms. In some examples, the comparative statistic may be utilized to rank the plurality of datasets, and the information processor 106 may provide a list of top-k datasets to the query processor 102. The query processor 102 may then prompt the information processor 106 to provide the query term to a sub-plurality of the top-k datasets. In some examples, the information processor 106 may rank the clusters in a given dataset, and may provide the query term to a sub-plurality of the clusters based on the determined ranking.
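The top-k ranking of datasets by their comparative statistic, described above, may be sketched as:

```python
def top_k_datasets(statistic_by_dataset, k):
    """Rank datasets by comparative statistic, highest first, and
    return the identities of the top k."""
    ranked = sorted(statistic_by_dataset.items(),
                    key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]
```

The query processor could then prompt the information processor to provide the query term to some sub-plurality of the returned datasets.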
[0041] The components of system 100 may be computing resources, each including a suitable combination of a physical computing device, a virtual computing device, a network, software, a cloud infrastructure, a hybrid cloud infrastructure that may include a first cloud infrastructure and a second cloud infrastructure that is different from the first cloud infrastructure, and so forth. The components of system 100 may be a combination of hardware and programming for performing a designated function. In some instances, each component may include a processor and a memory, while programming code is stored on that memory and executable by a processor to perform a designated function.
[0042] For example, the information processor 106 may be a combination of hardware and programming for performing a designated function. For example, the information processor 106 may include programming to receive the query term and the candidate terms, and determine similarity scores for the query term and the candidate terms. The information processor 106 may include hardware to physically store the similarity scores, and processors to physically process the received terms and determined similarity scores. Also, for example, information processor 106 may include software programming to dynamically interact with the other components of system 100.
[0043] Generally, the components of system 100 may include programming and/or physical networks to be communicatively linked to other components of system 100. In some instances, the components of system 100 may include a processor and a memory, while programming code is stored on that memory and executable by a processor to perform designated functions.
[0044] Generally, the query processor 102 and the plurality of data processors (e.g., Data Processor 1 110(1), Data Processor 2 110(2),…, Data Processor Y 110(y)) may be communicatively linked to computing devices. A computing device, as used herein, may be, for example, a web-based server, a local area network server, a cloud-based server, a notebook computer, a desktop computer, an all-in-one system, a tablet computing device, a mobile phone, an electronic book reader, or any other electronic device suitable for provisioning a computing resource to perform a unified visualization interface. The computing device may include a processor and a computer-readable storage medium.
[0045] Figure 2 is a block diagram illustrating one example of a computer readable medium for data allocation based on secure information retrieval. Processing system 200 includes a processor 202, a computer readable medium 208, input devices 204, and output devices 206. Processor 202, computer readable medium 208, input devices 204, and output devices 206 are coupled to each other through a communication link (e.g., a bus).
[0046] Processor 202 executes instructions included in the computer readable medium 208. Computer readable medium 208 includes request receipt instructions 210 to receive, from a query processor, a request for identification of a target dataset to be associated with a query term, the request including a hash length and a hash number.
[0047] Computer readable medium 208 includes permutation generation instructions 212 to generate a random permutation based on the hash length.
[0048] Computer readable medium 208 includes secure version of the query term receipt instructions 214 to receive, from the query processor, a secure version of the query term, the secure version based on the hash number and the permutation.
[0049] Computer readable medium 208 includes candidate term receipt instructions 216 to receive, from each of the plurality of data processors associated with a plurality of datasets, secure versions of a collection of candidate terms, where each candidate term represents a cluster of similar terms in the associated dataset, and where the secure versions are based on the hash number and the permutation.
[0050] Computer readable medium 208 includes similarity score determination instructions 218 to determine similarity scores between the secure version of the query term and secure versions of the candidate terms.
[0051] Computer readable medium 208 includes target dataset identification instructions 220 to identify the target dataset of the plurality of datasets based on the determined similarity scores.
[0052] Computer readable medium 208 includes query term provide instructions 222 to provide the query term to the identified target dataset.
[0053] Input devices 204 include a keyboard, mouse, data ports, and/or other suitable devices for inputting information into processing system 200. In some examples, input devices 204, such as a computing device, are used by the information processor to receive a query term. Output devices 206 include a monitor, speakers, data ports, and/or other suitable devices for outputting information from processing system 200. In some examples, output devices 206 are used to provide the query term to the target dataset.
[0054] As used herein, a "computer readable medium" may be any electronic, magnetic, optical, or other physical storage apparatus to contain or store information such as executable instructions, data, and the like. For example, any computer readable storage medium described herein may be any of Random Access Memory (RAM), volatile memory, non-volatile memory, flash memory, a storage drive (e.g., a hard drive), a solid state drive, and the like, or a combination thereof. For example, the computer readable medium 208 can include one of or multiple different forms of memory including semiconductor memory devices such as dynamic or static random access memories (DRAMs or SRAMs), erasable and programmable read-only memories (EPROMs), electrically erasable and programmable read-only memories (EEPROMs) and flash memories; magnetic disks such as fixed, floppy and removable disks; other magnetic media including tape; optical media such as compact disks (CDs) or digital video disks (DVDs); or other types of storage devices.

[0055] As described herein, various components of the processing system 200 are identified and refer to a combination of hardware and programming configured to perform a designated function. As illustrated in Figure 2, the programming may be processor executable instructions stored on tangible computer readable medium 208, and the hardware may include Processor 202 for executing those instructions. Thus, computer readable medium 208 may store program instructions that, when executed by Processor 202, implement the various components of the processing system 200.
[0056] Such computer readable storage medium or media is (are) considered to be part of an article (or article of manufacture). An article or article of manufacture can refer to any manufactured single component or multiple components. The storage medium or media can be located either in the machine running the machine-readable instructions, or located at a remote site from which machine-readable instructions can be downloaded over a network for execution.
[0057] Computer readable medium 208 may be any of a number of memory components capable of storing instructions that can be executed by processor 202. Computer readable medium 208 may be non-transitory in the sense that it does not encompass a transitory signal but instead is made up of one or more memory components configured to store the relevant instructions. Computer readable medium 208 may be implemented in a single device or distributed across devices. Likewise, processor 202 represents any number of processors capable of executing instructions stored by computer readable medium 208. Processor 202 may be integrated in a single device or distributed across devices. Further, computer readable medium 208 may be fully or partially integrated in the same device as processor 202 (as illustrated), or it may be separate but accessible to that device and processor 202. In some examples, computer readable medium 208 may be a machine-readable storage medium.
[0058] Figure 3 is a flow diagram illustrating one example of a method for data allocation based on secure information retrieval.
[0059] At 300, a request may be received from a query processor, for identification of a target dataset to be associated with a query term, the request including a hash length and a hash number.

[0060] At 302, a random permutation may be generated based on the hash length.
[0061] At 304, a secure version of the query term may be received from the query processor, the secure version based on the hash number and the permutation.
[0062] At 306, secure versions of a collection of candidate terms may be received from each of the plurality of data processors associated with a plurality of datasets, where each candidate term represents a cluster of similar terms in the associated dataset, and where the secure versions are based on the hash number and the permutation.
[0063] At 308, similarity scores may be determined between the secure version of the query term and secure versions of the candidate terms.
[0064] At 310, a representative term of the collection of candidate terms may be selected for each of the plurality of data processors.
[0065] At 312, a comparative statistic between the representative term and its cluster of similar terms may be received from each of the plurality of data processors.
[0066] At 314, the target dataset of the plurality of datasets may be identified based on the comparative statistic.
[0067] In some examples, the plurality of datasets may be a respective plurality of secure storage containers, and the query term may be a data term to be stored in a target storage container associated with the target dataset.
[0068] Figure 4 is a flow diagram illustrating one example of a method for providing a query term to a target dataset in data allocation based on secure information retrieval.
[0069] At 400, a request may be received from a query processor, for identification of a target dataset to be associated with a query term, the request including a hash length and a hash number.
[0070] At 402, a random permutation may be generated based on the hash length.
[0071] At 404, a secure version of the query term may be received from the query processor, the secure version based on the hash number and the permutation.
[0072] At 406, secure versions of a collection of candidate terms may be received from each of the plurality of data processors associated with a plurality of datasets, where each candidate term represents a cluster of similar terms in the associated dataset, and where the secure versions are based on the hash number and the permutation.
[0073] At 408, similarity scores may be determined between the secure version of the query term and secure versions of the candidate terms.
[0074] At 410, a representative term of the collection of candidate terms may be selected for each of the plurality of data processors.
[0075] At 412, a comparative statistic between the representative term and its cluster of similar terms may be received from each of the plurality of data processors.
[0076] At 414, the target dataset of the plurality of datasets may be identified based on the comparative statistic.
[0077] At 416, the query term may be provided to the identified target dataset.
[0078] Figure 5 is a flow diagram illustrating one example of a method for associating a query term with a target cluster in a target dataset in data allocation based on secure information retrieval.
[0079] At 500, a request may be received from a query processor, for identification of a target dataset to be associated with a query term, the request including a hash length and a hash number.
[0080] At 502, a random permutation may be generated based on the hash length.
[0081] At 504, a secure version of the query term may be received from the query processor, the secure version based on the hash number and the permutation.
[0082] At 506, secure versions of a collection of candidate terms may be received from each of the plurality of data processors associated with a plurality of datasets, where each candidate term represents a cluster of similar terms in the associated dataset, and where the secure versions are based on the hash number and the permutation.
[0083] At 508, similarity scores may be determined between the secure version of the query term and secure versions of the candidate terms.
[0084] At 510, a representative term of the collection of candidate terms may be selected for each of the plurality of data processors.

[0085] At 512, a comparative statistic between the representative term and its cluster of similar terms may be received from each of the plurality of data processors.
[0086] At 514, the target dataset of the plurality of datasets may be identified based on the comparative statistic.
[0087] At 516, a target cluster of the target dataset may be selected based on the determined similarity scores, and the query term may be associated with the target cluster in the target dataset.
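The cluster association at 516 might be sketched as follows; the function, its arguments, and the in-memory cluster map are hypothetical stand-ins for the target data processor's storage:

```python
def assign_to_cluster(secure_query, cluster_candidates, cluster_members,
                      similarity):
    # cluster_candidates maps a cluster id to the secure version of that
    # cluster's candidate term within the already-identified target dataset;
    # cluster_members holds the terms already assigned to each cluster.
    target_cluster = max(
        cluster_candidates,
        key=lambda c: similarity(secure_query, cluster_candidates[c]),
    )
    # Associate the query term with the most similar cluster.
    cluster_members[target_cluster].append(secure_query)
    return target_cluster
```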
[0088] Figure 6 is a flow diagram illustrating one example of a method for providing a second query term to a target dataset in data allocation based on secure information retrieval.
[0089] At 600, a first target dataset associated with a first query term may be identified.
[0090] At 602, a second query term may be received.
[0091] At 604, it may be determined if the second query term is similar to the first query term.
[0092] At 606, upon a determination that the second query term is similar to the first query term, the second query term may be associated with the first target dataset. In some examples, the second query term may be provided to the first target dataset.
[0093] At 608, upon a determination that the second query term is not similar to the first query term, a second target dataset may be identified based on methods described herein, and the second query term may be associated with the second target dataset. In some examples, the second query term may be provided to the second target dataset.
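Steps 604 through 608 amount to a reuse-or-recompute decision, which can be sketched as below; all names and the callback signatures are assumptions for illustration:

```python
def route_second_term(second_term, first_term, first_target,
                      similar, identify_target):
    # Steps 604-608: when the second query term is similar to the first,
    # reuse the first term's target dataset; otherwise run the full
    # identification protocol again for the second term.
    if similar(second_term, first_term):
        return first_target
    return identify_target(second_term)
```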
[0094] Examples of the disclosure provide a generalized system for data allocation based on secure information retrieval. The generalized system provides a protocol for identifying storage devices with content similar to an incoming data element in a secure and anonymized manner. The present disclosure focuses on identifying an appropriate data storage device prior to storing the incoming data element, while maintaining the anonymity of both the data stored in the storage devices and the incoming data element.

[0095] Although the examples are described with a single query term in a query dataset, the techniques disclosed herein may be applied to more than one query term in the query dataset. Generation of secure terms based on the transformed data terms ensures that the information processor does not have the complete data, so information cannot be leaked by the information processor. Additionally, the hash transformation ensures that the information processor only has the hashes. Accordingly, the information processor is unable to regenerate the original data terms in the datasets (i.e., hash-to-data is not possible).
[0096] Although specific examples have been illustrated and described herein, especially as related to numerical data, the examples illustrate applications to any dataset. Accordingly, there may be a variety of alternate and/or equivalent implementations that may be substituted for the specific examples shown and described without departing from the scope of the present disclosure. This application is intended to cover any adaptations or variations of the specific examples discussed herein.

Claims

1. A system for data allocation based on secure information retrieval, the system comprising:
an information processor communicatively linked to a query processor and a plurality of data processors respectively associated with a plurality of datasets, wherein the information processor is to:
receive, from the query processor, a request for identification of a target dataset to be associated with a query term, the request including a hash length and a hash number;
generate a random permutation based on the hash length;
receive, from the query processor, a secure version of the query term, the secure version based on the hash number and the permutation;
receive, from each of the plurality of data processors, secure versions of a collection of candidate terms, wherein each candidate term represents a cluster of similar terms in the associated dataset, and wherein the secure versions are based on the hash number and the permutation;
determine similarity scores between the secure version of the query term and secure versions of the candidate terms; and
identify the target dataset of the plurality of datasets based on the determined similarity scores.
2. The system of claim 1, wherein the information processor is to provide the query term to the identified target dataset.
3. The system of claim 1, wherein the information processor is to identify the target dataset by:
selecting, for each of the plurality of data processors, a representative term of the collection of candidate terms;
providing, to each of the plurality of data processors, the respective representative term;
receiving, from each of the plurality of data processors, a comparative statistic between the representative term and its cluster of similar terms; and
identifying the target dataset based on the comparative statistic.
4. The system of claim 1, wherein the secure versions are based on applying orthogonal transforms to the respective terms.
5. The system of claim 1, wherein the plurality of datasets is a respective plurality of secure storage containers, and the query term is a data term to be stored in a target storage container associated with the target dataset.
6. The system of claim 1, wherein the information processor is to:
rank the plurality of datasets based on the determined similarity scores; and
provide, to the query processor, the ranked list of datasets.
7. The system of claim 6, wherein the information processor is to identify the target dataset based on the ranking.
8. The system of claim 1, wherein the information processor is to select a target cluster of the target dataset based on the determined similarity scores, and is to associate the query term with the target cluster in the target dataset.
9. The system of claim 1, wherein the information processor is to:
receive a second query term from the query processor;
determine if the second query term is similar to the query term;
upon a determination that the second query term is similar to the query term, provide the second query term to the identified target dataset; and
upon a determination that the second query term is not similar to the query term, identify a second target dataset, and provide the second query term to the identified second target dataset.
10. A method for data allocation based on secure information retrieval, the method comprising:
receiving, from a query processor, a request for identification of a target dataset to be associated with a query term, the request including a hash length and a hash number;
generating a random permutation based on the hash length;
receiving, from the query processor, a secure version of the query term, the secure version based on the hash number and the permutation;
receiving, from each of a plurality of data processors associated with a plurality of datasets, secure versions of a collection of candidate terms, wherein each candidate term represents a cluster of similar terms in the associated dataset, and wherein the secure versions are based on the hash number and the permutation;
determining similarity scores between the secure version of the query term and secure versions of the candidate terms;
selecting, for each of the plurality of data processors, a representative term of the collection of candidate terms;
receiving, from each of the plurality of data processors, a comparative statistic between the representative term and its cluster of similar terms; and
identifying the target dataset of the plurality of datasets based on the comparative statistic.
11. The method of claim 10, comprising providing the query term to the identified target dataset.
12. The method of claim 10, comprising selecting a target cluster of the target dataset based on the determined similarity scores, and associating the query term with the target cluster in the target dataset.
13. The method of claim 10, wherein the plurality of datasets is a respective plurality of secure storage containers, and the query term is a data term to be stored in a target storage container associated with the target dataset.
14. A non-transitory computer readable medium comprising executable instructions to:
receive, from a query processor, a request for identification of a target dataset to be associated with a query term, the request including a hash length and a hash number;
generate a random permutation based on the hash length;
receive, from the query processor, a secure version of the query term, the secure version based on the hash number and the permutation;
receive, from each of a plurality of data processors associated with a plurality of datasets, secure versions of a collection of candidate terms, wherein each candidate term represents a cluster of similar terms in the associated dataset, and wherein the secure versions are based on the hash number and the permutation;
determine similarity scores between the secure version of the query term and secure versions of the candidate terms;
identify the target dataset of the plurality of datasets based on the determined similarity scores; and
provide the query term to the identified target dataset.
15. The computer readable medium of claim 14, comprising executable instructions to:
select a target cluster of the target dataset based on the determined similarity scores; and
associate the query term with the target cluster in the target dataset.
Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/US2015/059934 WO2017082875A1 (en) 2015-11-10 2015-11-10 Data allocation based on secure information retrieval
US15/774,708 US10783268B2 (en) 2015-11-10 2015-11-10 Data allocation based on secure information retrieval


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050055345A1 (en) * 2002-02-14 2005-03-10 Infoglide Software Corporation Similarity search engine for use with relational databases
US20070094511A1 (en) * 2002-09-06 2007-04-26 The United States Postal Service Method and system for efficiently retrieving secured data by securely pre-processing provided access information
US20100325095A1 (en) * 2009-06-23 2010-12-23 Bryan Stephenson Permuting records in a database for leak detection and tracing
WO2013062941A1 (en) * 2011-10-23 2013-05-02 Microsoft Corporation Telemetry file hash and conflict detection
US20130173917A1 (en) * 2011-12-30 2013-07-04 Christopher J. Clifton Secure search and retrieval



Also Published As

Publication number Publication date
US10783268B2 (en) 2020-09-22
US20180322304A1 (en) 2018-11-08


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 15908423; Country of ref document: EP; Kind code of ref document: A1)
WWE Wipo information: entry into national phase (Ref document number: 15774708; Country of ref document: US)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 15908423; Country of ref document: EP; Kind code of ref document: A1)