
WO2017082875A1 - Data allocation based on secure information retrieval - Google Patents


Info

Publication number
WO2017082875A1
Authority
WO
WIPO (PCT)
Prior art keywords
term, secure, query, processor, query term
Application number
PCT/US2015/059934
Other languages
French (fr)
Inventor
Mehran KAFAI
Manav DAS
Original Assignee
Hewlett Packard Enterprise Development Lp
Application filed by Hewlett Packard Enterprise Development LP
Priority to PCT/US2015/059934
Priority to US15/774,708 (granted as US10783268B2)
Publication of WO2017082875A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60: Protecting data
    • G06F21/62: Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218: Protecting access to data via a platform, e.g. using keys or access control rules, to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6227: Protecting access to data via a platform, to a system of files or objects, where protection concerns the structure of data, e.g. records, types, queries
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval of structured data, e.g. relational data
    • G06F16/24: Querying
    • G06F16/245: Query processing
    • G06F16/2455: Query execution
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval of structured data, e.g. relational data
    • G06F16/24: Querying
    • G06F16/245: Query processing
    • G06F16/2457: Query processing with adaptation to user needs
    • G06F16/24578: Query processing with adaptation to user needs using ranking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60: Protecting data
    • G06F21/62: Protecting access to data via a platform, e.g. using keys or access control rules

Definitions

  • Figure 1 is a functional block diagram illustrating one example of a system for data allocation based on secure information retrieval.
  • Figure 4 is a flow diagram illustrating one example of a method for providing a query term to a target dataset in data allocation based on secure information retrieval.
  • Data storage products are often used to store large amounts of similar data.
  • In some instances, human error or system error outside the backup device may result in a data item being erroneously copied to a device other than the one for which it was intended. This may result in data loss and/or accidental exposure of the data item to a third party, potentially with serious legal and/or commercial ramifications.
  • Data allocation based on secure information retrieval is a secure protocol that allows one party to retrieve information from a plurality of second parties without revealing the data that supports the information.
  • One example is a system including an information processor communicatively linked to a query processor and a plurality of data processors respectively associated with a plurality of datasets. The information processor receives a request from the query processor for identification of a target dataset to be associated with a query term.
  • U is a very large integer relative to H. In one example, U is a power of 2.
  • the information processor 106 receives, from the query processor 102, the request 104 for identification of a target dataset 116 to be associated with the query term.
  • the query term may be an N-dimensional vector with numerical, real-valued components.
  • identification generally refers to identifying a target dataset that may be a suitable destination for the query term. For example, identification may mean identifying a dataset that includes terms that are most similar to the query term.
  • the term “similar” may be used broadly to include any type of similarity for data terms.
  • as described herein, in some examples, system 100 may be provided with values for hash count H and hash universe size U.
  • the query processor 102 and the plurality of data processors respectively apply a predetermined orthogonal transform to the query term and the data terms in the respective datasets, and select the top hashes.
  • the transformation of the query term and the plurality of terms may be based on the hash number H.
  • the hash transform may be an orthogonal transformation.
  • the term “orthogonal transformation” as used herein generally refers to a linear transformation between two linear spaces that preserves their respective linear structure (e.g., preserves an inner product).
  • the hash transform may be a Walsh-Hadamard transformation (“WHT”).
  • hash transform may be the WHT applied to the query term and the plurality of terms to provide coefficients of the WHT.
  • a WHT is an orthogonal, non-sinusoidal transform that takes a signal as input and outputs a set of basis functions. The output functions are known as Walsh functions.
  • a Walsh function takes two values: +1 and -1.
  • performing an orthogonal transform on a data term provides a set of coefficients associated with the data term.
  • the information processor 106 may provide a collection of coefficients
  • Alternative forms of orthogonal transforms may be utilized.
  • the largest H coefficients (based on the hash number) of the WHT may comprise a transformed query term and a plurality of transformed data terms.
  • the H largest coefficients may be selected.
  • the information processor 106 may determine an index of hash positions based on the orthogonal transform.
  • positions 1 and 5 are indicative of data terms A and C since these data terms have hashes at the hash positions 1 and 5, as illustrated in Table 1.
  • position 13 is indicative of data terms A and B since these data terms have hashes at the hash position 13, as illustrated in Table 1.
  • the query term and the plurality of terms in the datasets may be transformed based on the random permutation.
  • the transformation may comprise an extension of a numerical vector by concatenating it with itself to generate a vector of length U.
  • the permutation may be utilized to generate a transformed query term.
  • the query term e.g. a numerical vector
  • the permutation may be applied to the extended vector, and then the orthogonal transform may be applied to generate a transformed query term.
  • the result is an H-dimensional feature vector that represents a secure version of the query term. Secure versions of other data terms may be generated in like manner in the respective datasets.
  • the data terms A and B have position 13 in common in their respective sets of hashes. Accordingly, the similarity score for the pair (A,B), denoted as S(A,B) may be determined to be 1. Also, for example, as illustrated in Table 2, the data terms A and C have positions 1 and 5 in common in their respective sets of hashes. Accordingly, the similarity score for the pair (A,C), denoted as S(A,C) may be determined to be 2. As another example, as illustrated in Table 2, the data terms B and C have position 7 in common in their respective sets of hashes. Accordingly, the similarity score for the pair (B,C), denoted as S(B,C) may be determined to be 1.
  • the information processor 106 receives, from the query processor 102, the secure version of the query term, where the secure version is based on the hash number and the permutation, as described herein. Likewise, the information processor 106 receives, from each of the plurality of data processors (e.g., Data Processor 1 110(1), Data Processor 2 110(2),..., Data Processor Y 110(y)), secure versions of a collection of candidate terms, where each candidate term represents a cluster of similar terms in the associated dataset, and where the secure versions are based on the hash number and the permutation.
  • the information processor 106 determines similarity scores between the secure version of the query term and the secure versions of the candidate terms, where the similarity score is indicative of proximity of the secure versions of the candidate terms to the secure version of the query term, and based on shared data elements between the secure version of the query term and the secure versions of the candidate terms.
  • the secure version of the query term and each secure candidate term may be of same length
  • the similarity score may be a ratio of a number of shared data elements between the secure version of the query term and a secure candidate term to that length.
  • the information processor 106 identifies the target dataset 116 of the plurality of datasets based on the determined similarity scores. In some examples, the information processor 106 selects, for each of the plurality of data processors (e.g., Data Processor 1 110(1), Data Processor 2 110(2),..., Data Processor Y 110(y)), a representative term of the collection of candidate terms. For example, the information processor 106 may compute the similarity between the secure version of the query term and the secure versions of the candidate terms received from all the data processors by measuring overlaps of the hashes. The information processor 106 may then determine, for each data processor, the top candidate term defined as the candidate term with the highest similarity to the query term, and select this top candidate term as the representative term for the data processor. In some examples, more than one representative term may be selected for a data processor. Thereafter, the information processor 106 provides, to each of the plurality of data processors, the respective representative terms.
  • the data processor may determine the comparative statistic between the secure version of the query term and the secure data terms in the cluster associated with the representative term without knowledge of the query term, where the determination is based on the similarity scores (determined at the information processor 106) between the secure version of the query term and the secure version of the representative term.
  • the main idea behind the selection of the target dataset 116 is to estimate the similarity between the secure version of the query term and all secure terms in the data processor, Data Processor 1 110(1), in the cluster associated with the representative term by only knowing the similarity score between the secure version of the query term and the secure version of the representative term from the data processor, Data Processor 1 110(1).
  • the data processors each share a secure version of the representative term with the information processor 106, and the query processor 102 only shares a secure version of the query term with the information processor 106.
  • the information processor 106 is not privy to the actual composition of the query term in the query processor 102, and the plurality of terms in the dataset associated with a data processor, say Data Processor 1 110(1).
  • the query processor 102 has no knowledge of the actual composition of the plurality of terms in the dataset associated with a data processor, say Data Processor 1 110(1).
  • the dataset associated with a data processor, say Data Processor 1 110(1) has no knowledge of the actual composition of the query term in the query processor 102.
  • the information processor 106 computes similarity scores between the secure version of the query term and the secure versions of the representative terms, and provides the determined similarity scores to the data processor.
  • the data processor utilizes the comparative distribution techniques disclosed herein, to determine similarity scores between the secure version of the query term and the plurality of secure versions of data terms based on the similarity scores between the secure version of the query term and the secure versions of the representative terms received from the information processor 106.
  • the information processor 106 provides the query term to the identified target dataset 116.
  • the plurality of datasets may be a respective plurality of secure storage containers, and the query term may be a data term to be stored in a target storage container associated with the target dataset 116. Accordingly, the information processor 106 is able to provide the query term to the data container that has data elements that are most similar to the query term, thereby minimizing errors in data allocation.
  • the information processor 106 provides the identity of the target dataset 116 to the query processor 102, which, in turn, sends the query term directly to the identified target dataset 116.
  • the information processor 106 may select a target cluster of the target dataset 116 based on the determined similarity scores, and may associate the query term with the target cluster in the target dataset 116.
  • the target cluster may be the cluster associated with the representative term.
  • the target dataset 116 may be a secure storage container, and the information processor 106 may associate the query term with a cluster in the secure storage container.
  • the secure storage container selected as the target dataset 116 may comprise partitions for data storage, and the information processor 106 may associate the query term with a partition of the secure storage container.
  • the information processor 106 provides the identity of the target dataset 116 to the query processor 102, which, in turn, sends the query term directly to the identified target dataset 116. In some examples, the information processor 106 provides the identity of the target cluster in the target dataset 116 to the query processor 102, which, in turn, sends the query term directly to the identified target cluster in the target dataset 116.
  • the information processor 106 may receive a second query term from the query processor 102. In some examples, the information processor 106 may determine if the second query term is similar to the query term, and upon a determination that the second query term is similar to the query term, provide the second query term to the identified target dataset 116. In some examples, upon a determination that the second query term is not similar to the query term, the information processor 106 may identify a second target dataset, and provide the second query term to the identified second target dataset.
  • the information processor 106 may rank the plurality of datasets based on the determined similarity scores, and/or the comparative statistic. For example, the representative terms may be ranked based on respective similarity scores. Accordingly, the associated datasets may be ranked based on the ranking for the representative terms.
  • the comparative statistic may be utilized to rank the plurality of datasets, and the information processor 106 may provide a list of top-k datasets to the query processor 102. The query processor 102 may then prompt the information processor 106 to provide the query term to a sub-plurality of the top-k datasets.
  • the information processor 106 may rank the clusters in a given dataset, and may provide the query term to a sub-plurality of the clusters based on the determined ranking.
  • the components of system 100 may be computing resources, each including a suitable combination of a physical computing device, a virtual computing device, a network, software, a cloud infrastructure, a hybrid cloud infrastructure that may include a first cloud infrastructure and a second cloud infrastructure that is different from the first cloud infrastructure, and so forth.
  • the components of system 100 may be a combination of hardware and programming for performing a designated visualization function.
  • each component may include a processor and a memory, while programming code is stored on that memory and executable by a processor to perform a designated visualization function.
  • the information processor 106 may be a combination of hardware and programming for performing a designated function.
  • the information processor 106 may include programming to receive the query term and the candidate terms, and determine similarity scores for the query term and the candidate terms.
  • the information processor 106 may include hardware to physically store the similarity scores, and processors to physically process the received terms and determined similarity scores.
  • information processor 106 may include software programming to dynamically interact with the other components of system 100.
  • the components of system 100 may include programming and/or physical networks to be communicatively linked to other components of system 100.
  • the components of system 100 may include a processor and a memory, while programming code is stored on that memory and executable by a processor to perform designated functions.
  • FIG. 2 is a block diagram illustrating one example of a computer readable medium for data allocation based on secure information retrieval.
  • Processing system 200 includes a processor 202, a computer readable medium 208, input devices 204, and output devices 206.
  • Processor 202, computer readable medium 208, input devices 204, and output devices 206 are coupled to each other through a communication link (e.g., a bus).
  • Processor 202 executes instructions included in the computer readable medium 208.
  • Computer readable medium 208 includes request receipt instructions 210 to receive, from a query processor, a request for identification of a target dataset to be associated with a query term, the request including a hash length and a hash number.
  • Computer readable medium 208 includes permutation generation instructions 212 to generate a random permutation based on the hash length.
  • Computer readable medium 208 includes secure version of the query term receipt instructions 214 to receive, from the query processor, a secure version of the query term, the secure version based on the hash number and the permutation.
  • Computer readable medium 208 includes candidate term receipt instructions 216 to receive, from each of the plurality of data processors associated with a plurality of datasets, secure versions of a collection of candidate terms, where each candidate term represents a cluster of similar terms in the associated dataset, and where the secure versions are based on the hash number and the permutation.
  • Computer readable medium 208 includes similarity score determination instructions 218 to determine similarity scores between the secure version of the query term and secure versions of the candidate terms.
  • Computer readable medium 208 includes target dataset identification instructions 220 to identify the target dataset of the plurality of datasets based on the determined similarity scores.
  • Computer readable medium 208 includes query term provide instructions 222 to provide the query term to the identified target dataset.
  • Input devices 204 include a keyboard, mouse, data ports, and/or other suitable devices for inputting information into processing system 200.
  • Input devices 204, such as a computing device, are used by the interaction processor to receive a query term.
  • Output devices 206 include a monitor, speakers, data ports, and/or other suitable devices for outputting information from processing system 200. In some examples, output devices 206 are used to provide the query term to the target dataset.
  • a “computer readable medium” may be any electronic, magnetic, optical, or other physical storage apparatus to contain or store information such as executable instructions, data, and the like.
  • any computer readable storage medium described herein may be any of Random Access Memory (RAM), volatile memory, non-volatile memory, flash memory, a storage drive (e.g., a hard drive), a solid state drive, and the like, or a combination thereof.
  • the programming may be processor executable instructions stored on tangible computer readable medium 208, and the hardware may include Processor 202 for executing those instructions.
  • computer readable medium 208 may store program instructions that, when executed by Processor 202, implement the various components of the processing system 200.
  • Such computer readable storage medium or media is (are) considered to be part of an article (or article of manufacture).
  • An article or article of manufacture can refer to any manufactured single component or multiple components.
  • the storage medium or media can be located either in the machine running the machine-readable instructions, or located at a remote site from which machine-readable instructions can be downloaded over a network for execution.
  • Computer readable medium 208 may be any of a number of memory components capable of storing instructions that can be executed by processor 202. Computer readable medium 208 may be non-transitory in the sense that it does not encompass a transitory signal but instead is made up of one or more memory components configured to store the relevant instructions. Computer readable medium 208 may be implemented in a single device or distributed across devices. Likewise, processor 202 represents any number of processors capable of executing instructions stored by computer readable medium 208. Processor 202 may be integrated in a single device or distributed across devices. Further, computer readable medium 208 may be fully or partially integrated in the same device as processor 202 (as illustrated), or it may be separate but accessible to that device and processor 202. In some examples, computer readable medium 208 may be a machine-readable storage medium.
  • Figure 3 is a flow diagram illustrating one example of a method for data allocation based on secure information retrieval.
  • a request may be received from a query processor, for identification of a target dataset to be associated with a query term, the request including a hash length and a hash number.
  • a random permutation may be generated based on the hash length.
  • secure versions of a collection of candidate terms may be received from each of the plurality of data processors associated with a plurality of datasets, where each candidate term represents a cluster of similar terms in the associated dataset, and where the secure versions are based on the hash number and the permutation.
  • similarity scores may be determined between the secure version of the query term and secure versions of the candidate terms.
  • a representative term of the collection of candidate terms may be selected for each of the plurality of data processors.
  • a comparative statistic between the representative term and its cluster of similar terms may be received from each of the plurality of data processors.
  • the plurality of datasets may be a respective plurality of secure storage containers
  • the query term may be a data term to be stored in a target storage container associated with the target dataset.
  • Figure 4 is a flow diagram illustrating one example of a method for providing a query term to a target dataset in data allocation based on secure information retrieval.
  • a request may be received from a query processor, for identification of a target dataset to be associated with a query term, the request including a hash length and a hash number.
  • a random permutation may be generated based on the hash length.
  • a secure version of the query term may be received from the query processor, the secure version based on the hash number and the permutation.
  • secure versions of a collection of candidate terms may be received from each of the plurality of data processors associated with a plurality of datasets, where each candidate term represents a cluster of similar terms in the associated dataset, and where the secure versions are based on the hash number and the permutation.
  • similarity scores may be determined between the secure version of the query term and secure versions of the candidate terms.
  • a representative term of the collection of candidate terms may be selected for each of the plurality of data processors.
  • a comparative statistic between the representative term and its cluster of similar terms may be received from each of the plurality of data processors.
  • the query term may be provided to the identified target dataset.
  • Figure 5 is a flow diagram illustrating one example of a method for associating a query term with a target cluster in a target dataset in data allocation based on secure information retrieval.
  • a random permutation may be generated based on the hash length.
  • the target dataset of the plurality of datasets may be identified based on the comparative statistic.
  • a first target dataset associated with a first query term may be identified.
  • a second target dataset may be identified based on methods described herein, and the second query term may be associated with the second target dataset. In some examples, the second query term may be provided to the second target dataset.
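The secure-version construction described in the snippets above (extend the numerical term to length U by concatenating it with itself, apply the shared random permutation, apply the Walsh-Hadamard transform, and keep the positions of the H largest coefficients) can be sketched in a few lines. This is an illustrative Python sketch, not code from the patent; the function names and the use of a set of index positions as the "secure version" are assumptions.

```python
import heapq

def fwht(vec):
    """Iterative fast Walsh-Hadamard transform; len(vec) must be a power of 2."""
    a = list(vec)
    h = 1
    while h < len(a):
        for i in range(0, len(a), h * 2):
            for j in range(i, i + h):
                x, y = a[j], a[j + h]
                a[j], a[j + h] = x + y, x - y
        h *= 2
    return a

def secure_version(term, permutation, hash_count):
    """Return the positions of the H largest WHT coefficients of the
    extended, permuted term: a secure stand-in for the term itself."""
    U = len(permutation)                            # hash universe size, a power of 2
    extended = (term * (U // len(term) + 1))[:U]    # concatenate term with itself to length U
    permuted = [extended[p] for p in permutation]   # apply the shared random permutation
    coeffs = fwht(permuted)
    # the indices of the H largest coefficients act as the hashes
    return set(heapq.nlargest(hash_count, range(U), key=lambda i: coeffs[i]))
```

Because only these H index positions leave the query processor or a data processor, an intermediary can compare terms by position overlap without ever seeing the underlying vectors.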
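The worked similarity scores in the snippets (S(A,B) = 1, S(A,C) = 2, S(B,C) = 1, from the shared positions in Table 2) are simply counts of shared hash positions, and the ratio variant divides by the common length H. A minimal sketch, assuming secure versions are represented as Python sets; the concrete position values below are chosen only to reproduce the scores stated in the text.

```python
def similarity(secure_a, secure_b):
    """Similarity score: number of hash positions two secure versions share."""
    return len(secure_a & secure_b)

def similarity_ratio(secure_a, secure_b, hash_count):
    """Ratio of shared positions to the common secure-version length H."""
    return len(secure_a & secure_b) / hash_count

# Illustrative hash-position sets consistent with the scores in the text
A, B, C = {1, 5, 13}, {7, 13, 20}, {1, 5, 7}
assert similarity(A, B) == 1   # share position 13
assert similarity(A, C) == 2   # share positions 1 and 5
assert similarity(B, C) == 1   # share position 7
```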
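The ranking step described above (rank datasets by similarity score and return a top-k list to the query processor) reduces to an ordinary sort. A hedged sketch, with the dictionary-of-scores shape an assumption:

```python
def top_k_datasets(scores, k):
    """Rank dataset names by similarity score, highest first, and keep the top k."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# e.g. scores determined by the information processor for three datasets
ranks = top_k_datasets({"ds1": 2, "ds2": 5, "ds3": 3}, 2)   # ['ds2', 'ds3']
```

The query processor could then prompt for the query term to be provided to any sub-plurality of this list.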

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Data allocation based on secure information retrieval is disclosed. One example is a system including an information processor communicatively linked to a query processor and a plurality of data processors respectively associated with a plurality of datasets. The information processor receives a request from the query processor for identification of a target dataset to be associated with a query term. The information processor generates a random permutation, receives a secure version of the query term from the query processor, and receives secure versions of a collection of candidate terms from each of the plurality of data processors, each candidate term representing a cluster of similar terms in the associated dataset. The information processor determines similarity scores between the secure version of the query term and the secure versions of the candidate terms, and identifies the target dataset of the plurality of datasets based on the determined similarity scores.

Description

DATA ALLOCATION BASED ON SECURE INFORMATION RETRIEVAL

Background
[0001] Data storage products, such as data backup devices, may be used to store data that are similar. Separate storage products may store distinct types of data, while a given storage device may store similar data. In some examples, such storage devices may be secured. Data terms may be allocated to, or retrieved from, such storage devices.

Brief Description of the Drawings
[0002] Figure 1 is a functional block diagram illustrating one example of a system for data allocation based on secure information retrieval.
[0003] Figure 2 is a block diagram illustrating one example of a computer readable medium for data allocation based on secure information retrieval.
[0004] Figure 3 is a flow diagram illustrating one example of a method for data allocation based on secure information retrieval.
[0005] Figure 4 is a flow diagram illustrating one example of a method for providing a query term to a target dataset in data allocation based on secure information retrieval.
[0006] Figure 5 is a flow diagram illustrating one example of a method for associating a query term with a target cluster in a target dataset in data allocation based on secure information retrieval.
[0007] Figure 6 is a flow diagram illustrating one example of a method for providing a second query term to a target dataset in data allocation based on secure information retrieval.

Detailed Description
[0008] Data storage products, especially data backup devices, are often used to store large amounts of similar data. In some instances, human error, or system error outside the backup device, may result in a data item being erroneously copied to a device other than the one for which it was intended. This may result in data loss, and/or accidental exposure of the data item to a third party, potentially with serious legal and/or commercial ramifications.
[0009] Data terms may be allocated to, or retrieved from, such storage devices. In some examples, the storage devices may be secured. In some examples, such storage devices may require data to be encrypted before being stored. In some examples, different storage devices may require different encryptions. Error in data storage may lead to inadvertent security loopholes and/or breaches.
[0010] In some instances, a first party may desire to securely store information in a plurality of storage devices. Based on a volume of data, such situations may result in an increase in a number of secure computations and inter-party data exchanges. Also, for example, there may be intermediate information processors that may not be secure, and/or may have unreliable data protection mechanisms. In such instances, there is a need to not expose all the data from one or more of the parties. Accordingly, there is a need to compute similarity between data distributed over multiple parties, without exposing all the data from any party, and without a need for secure intermediaries.
[0011] Existing systems are generally directed to addressing the need for identifying storage devices with content similar to an incoming data element. However, such systems focus on identifying similarities with the data content after the data element has been stored in a storage device. Accordingly, there is a need to identify an appropriate data storage device prior to storing the incoming data element, while maintaining the anonymity of the data stored in the storage devices and the incoming data element.
[0012] As described in various examples herein, data allocation based on secure information retrieval is disclosed. Data allocation based on secure information retrieval is a secure protocol that allows one party to retrieve information from a plurality of second parties without revealing the data that supports the information. One example is a system including an information processor communicatively linked to a query processor and a plurality of data processors respectively associated with a plurality of datasets. The information processor receives a request from the query processor for identification of a target dataset to be associated with a query term. The information processor generates a random permutation, receives a secure version of the query term from the query processor, and receives secure versions of a collection of candidate terms from each of the plurality of data processors, each candidate term representing a cluster of similar terms in the associated dataset. The information processor determines similarity scores between the secure version of the query term and the secure versions of the candidate terms, and identifies the target dataset of the plurality of datasets based on the determined similarity scores.
[0013] In the following detailed description, reference is made to the accompanying drawings which form a part hereof, and in which is shown by way of illustration specific examples in which the disclosure may be practiced. It is to be understood that other examples may be utilized, and structural or logical changes may be made without departing from the scope of the present disclosure. The following detailed description, therefore, is not to be taken in a limiting sense, and the scope of the present disclosure is defined by the appended claims. It is to be understood that features of the various examples described herein may be combined, in part or whole, with each other, unless specifically noted otherwise.
[0014] Figure 1 is a functional block diagram illustrating one example of a system 100 for data allocation based on secure information retrieval. System 100 is shown to include an information processor 106 communicatively linked to a query processor 102 and a plurality of data processors (e.g., Data Processor 1 110(1), Data Processor 2 110(2),…, Data Processor Y 110(y)), respectively associated with a plurality of datasets or data containers (not shown). The query processor 102, the information processor 106, and the plurality of data processors are communicatively linked to one another via a network.
[0015] The term "system" may be used to refer to a single computing device or multiple computing devices that communicate with each other (e.g., via a network) and operate together to provide a unified service. In some examples, the components of system 100 may communicate with one another over a network. As described herein, the network may be any wired or wireless network, and may include any number of hubs, routers, switches, cell towers, and so forth. Such a network may be, for example, part of a cellular network, part of the internet, part of an intranet, and/or any other type of network. In some examples, the network may be a secured network.
[0016] In some examples, the datasets may include collections of data terms that may be represented as d-dimensional real-valued vectors. A query dataset associated with the query processor 102 may include a query term that it may want to move/copy to one of the plurality of datasets. The goal of the secure storage container allocation process described herein is to minimize information leakage while transferring the query term from the query dataset to a target dataset 116, where the target dataset 116 has a subset of data terms that have high similarity to the query term. Generally, the datasets may not want to share their information with other parties in the system 100.
[0017] To facilitate the secure storage container allocation process between all parties, a single intermediary processing node may be utilized, such as the information processor 106. For the purposes of this description, the information processor 106 may be assumed to be a semi-honest node, i.e., the information processor 106 follows the protocol as described herein; however, in some instances, it may utilize the messages that it receives to extract more information about the data terms. The query processor 102 sends a request 104 to the information processor 106. The request includes two parameters, a hash length U, and a hash number H indicative of a number of hashes per data term. The integer H may be experimentally determined based on the type and number of data terms in the incoming data stream. Generally, U is a very large integer relative to H. In one example, U is a power of 2. The information processor 106 receives, from the query processor 102, the request 104 for identification of a target dataset 116 to be associated with the query term. In some examples, the query term may be an N-dimensional vector with numerical, real-valued components. The term "identification" as used herein generally refers to identifying a target dataset that may be a suitable destination for the query term. For example, identification may mean identifying a dataset that includes terms that are most similar to the query term. The term "similar" may be used broadly to include any type of similarity for data terms.

[0018] As described herein, in some examples, system 100 may be provided with values for hash count H and hash universe size U. Generally, U is a very large integer relative to H and N. In some examples, U is a power of 2. In some examples, each 6000-dimensional vector (N = 6000) may be associated with 100 integers (H = 100) selected from the set {1, 2, 3,…, 2^18} (U = 2^18). Accordingly, the hash transform may transform a higher dimensional data term (e.g.
with 6000 dimensions) into a lower dimensional transformed data term (e.g. with 100 dimensions). The information processor 106 generates a random permutation based on the hash length U, and sends the permutation to the query processor 102 and the plurality of data processors (e.g., Data Processor 1 110(1), Data Processor 2 110(2),…, Data Processor Y 110(y)). Each of the plurality of data processors also receives the hash number H.
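As an illustrative sketch (not part of the disclosed protocol), the shared random permutation of this step may be generated as follows; the function name, the seeding, and the 0-based positions are assumptions made for the example:

```python
import random

def generate_permutation(u, seed=None):
    """Generate a shared random permutation of the u hash positions.

    The information processor would send this same permutation to the
    query processor and to every data processor, so that all parties
    hash their terms consistently. (0-based positions and explicit
    seeding are illustrative choices.)
    """
    rng = random.Random(seed)
    positions = list(range(u))
    rng.shuffle(positions)
    return positions
```

For a fixed seed the permutation is reproducible, which models the fact that every party must receive the same permutation.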
[0019] The query processor 102 and the plurality of data processors (e.g., Data Processor 1 110(1), Data Processor 2 110(2),…, Data Processor Y 110(y)) respectively apply a predetermined orthogonal transform to the query term and the data terms in the respective datasets, and select the top hashes. For example, the transformation of the query term and the plurality of terms may be based on the hash number H. In some examples, the hash transform may be an orthogonal transformation. The term "orthogonal transformation" as used herein generally refers to a linear transformation between two linear spaces that preserves their respective linear structure (e.g., preserves an inner product). In some examples, the hash transform may be a Walsh-Hadamard transformation ("WHT"). In some examples, the hash transform may be the WHT applied to the query term and the plurality of terms to provide coefficients of the WHT. A WHT is an orthogonal, non-sinusoidal transform that decomposes an input signal into a set of basis functions known as Walsh functions. A Walsh function takes only two values: +1 and -1.
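The Walsh-Hadamard transformation described above can be illustrated with a minimal, unnormalized fast WHT; this sketch assumes the input length is a power of 2 and is not a reference implementation of the disclosure:

```python
def fwht(vec):
    """Unnormalized fast Walsh-Hadamard transform.

    Returns the WHT coefficients of `vec`; len(vec) must be a power
    of 2. Uses O(n log n) in-place butterfly passes.
    """
    a = list(vec)
    h = 1
    while h < len(a):
        for i in range(0, len(a), h * 2):
            for j in range(i, i + h):
                x, y = a[j], a[j + h]
                a[j], a[j + h] = x + y, x - y
        h *= 2
    return a
```

For example, `fwht([1, 0, 1, 0])` yields `[2, 2, 0, 0]` in this (Hadamard) ordering.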
[0020] Generally, performing an orthogonal transform on a data term provides a set of coefficients associated with the data term. For example, after application of the WHT to the data term, the information processor 106 may provide a collection of coefficients of the WHT. Alternative forms of orthogonal transforms may be utilized as well. For example, a cosine transform may be utilized. In some examples, the largest H coefficients (based on the hash number) of the WHT may comprise a transformed query term and a plurality of transformed data terms.
[0021] In some examples, the H largest coefficients may be selected as the subset of hashes associated with a data term, say A. Table 1 illustrates an example association of data terms A, B, and C with sets of hashes:

[Table 1, listing an example set of H = 5 hashes for each of the data terms A, B, and C, is not reproduced in this extract.]
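The selection of the H largest coefficients as a hash set can be sketched as follows (1-based positions to match the tables; breaking ties by lowest position is an assumption of this example):

```python
def top_h_hashes(coeffs, h):
    """Return the 1-based positions of the h largest coefficients.

    These positions form the hash set associated with a data term,
    as in Table 1. Python's stable sort breaks ties in favor of the
    lower position.
    """
    order = sorted(range(len(coeffs)),
                   key=lambda i: coeffs[i], reverse=True)
    return {i + 1 for i in order[:h]}
```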
[0022] In some examples, the information processor 106 may determine an index of hash positions based on the orthogonal transform. Table 2 illustrates an example index of hash positions for data terms with U = 2^4 = 16 hash positions {1, 2,…, 16} for data terms A, B, and C, based on the sets of H = 5 hashes in Table 1:

[Table 2, mapping each of the 16 hash positions to the data terms that have a hash at that position, is not reproduced in this extract.]
[0023] As illustrated, positions 1 and 5 are indicative of data terms A and C since these data terms have hashes at the hash positions 1 and 5, as illustrated in Table 1. Likewise, position 13 is indicative of data terms A and B since these data terms have hashes at the hash position 13, as illustrated in Table 1.
[0024] In some examples, as described herein, the query term and the plurality of terms in the datasets may be transformed based on the random permutation. For example, the transformation may comprise an extension of a numerical vector by concatenating it with itself to generate a vector of length U. In some examples, the permutation may be utilized to generate a transformed query term. For example, the query term (e.g. a numerical vector) may be extended, the permutation may be applied to the extended vector, and then the orthogonal transform may be applied to generate a transformed query term. Accordingly, the result is an H-dimensional feature vector that represents a secure version of the query term. Secure versions of other data terms may be generated in like manner in the respective datasets. The term "secure version" as used herein generally refers to a transformed version of a data term (e.g., the transformed query term as described herein) that includes hashes that encode significant information about the data term. Such a transformed version is secure because it is generally difficult to identify the components of the original data term from the hashed version. At the same time, the secure version offers an easy way to compare data terms for significant overlaps that are indicative of a high degree of similarity, without compromising the contents of the individual data terms.
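A minimal sketch of the secure-version pipeline described in this paragraph — extend, permute, transform, keep the top H positions — is shown below; the 0-based positions, the unnormalized WHT, and the tie-breaking are illustrative assumptions rather than details from the disclosure:

```python
def secure_version(term, u, h, perm):
    """Compute a secure version (hash set) of a numeric data term.

    Steps, following the text above:
      1. Extend the term to length u by concatenating it with itself.
      2. Apply the shared random permutation.
      3. Apply the Walsh-Hadamard transform (u must be a power of 2).
      4. Keep the positions of the h largest coefficients as hashes.
    """
    # Step 1: self-concatenation up to length u.
    ext = (term * (u // len(term) + 1))[:u]
    # Step 2: reorder the extended vector by the shared permutation.
    a = [ext[p] for p in perm]
    # Step 3: in-place fast Walsh-Hadamard transform.
    step = 1
    while step < u:
        for i in range(0, u, step * 2):
            for j in range(i, i + step):
                x, y = a[j], a[j + step]
                a[j], a[j + step] = x + y, x - y
        step *= 2
    # Step 4: the hash set is the positions of the h largest coefficients.
    order = sorted(range(u), key=lambda i: a[i], reverse=True)
    return set(order[:h])
```

Because all parties use the same permutation, two similar terms tend to produce hash sets with large overlap, while the original vector components are not recoverable from the set alone.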
[0025] In some examples, the plurality of data processors (e.g., Data Processor 1 110(1), Data Processor 2 110(2),…, Data Processor Y 110(y)) generate clusters of a plurality of terms in a dataset associated with a respective data processor, the clusters being based on similarity scores for pairs of terms, and the data processor selects a candidate term from each cluster. For example, Data Processor 1 110(1) may be associated with a Dataset 1 including a plurality of terms (e.g., term 1, term 2,…, term X), and may generate clusters of the plurality of terms (e.g., term 1, term 2,…, term X).
[0026] A similarity score between two data terms may be determined based on a number of common hashes. This provides an approximate measure of similar data terms. The similarity score may be based on a number of overlaps between respective sets of hashes, and is indicative of proximity of the pair of data terms. Table 3 illustrates an example determination of similarity scores for pairs formed from the data terms A, B, and C:

[Table 3, listing the similarity scores for the pairs (A,B), (A,C), and (B,C), is not reproduced in this extract.]
[0027] As illustrated in Table 2, the data terms A and B have position 13 in common in their respective sets of hashes. Accordingly, the similarity score for the pair (A,B), denoted as S(A,B) may be determined to be 1. Also, for example, as illustrated in Table 2, the data terms A and C have positions 1 and 5 in common in their respective sets of hashes. Accordingly, the similarity score for the pair (A,C), denoted as S(A,C) may be determined to be 2. As another example, as illustrated in Table 2, the data terms B and C have position 7 in common in their respective sets of hashes. Accordingly, the similarity score for the pair (B,C), denoted as S(B,C) may be determined to be 1.
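The overlap-based similarity score can be computed directly on hash sets. The sets below are hypothetical (the original tables are not reproduced here) but are chosen to be consistent with the overlaps described above: A and B share position 13, A and C share positions 1 and 5, and B and C share position 7:

```python
def similarity(hashes_1, hashes_2):
    """Similarity score: the number of shared hash positions."""
    return len(hashes_1 & hashes_2)

# Hypothetical hash sets (H = 5 each), consistent with the text.
A = {1, 2, 3, 5, 13}
B = {4, 6, 7, 8, 13}
C = {1, 5, 7, 9, 10}

scores = {("A", "B"): similarity(A, B),   # shares position 13
          ("A", "C"): similarity(A, C),   # shares positions 1 and 5
          ("B", "C"): similarity(B, C)}   # shares position 7
```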
[0028] A dataset may be partitioned into, say, G clusters by grouping similar data terms together, where similarity is based on the similarity scores. In some examples, a Partitioning Around Medoids ("PAM") algorithm may be utilized to obtain G candidate terms, one from each of the G clusters. In some examples, G may be chosen based on heuristics, for example, G = 2V.
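A simplified medoid-based clustering loop conveys the flavor of this step; actual PAM performs a costlier swap phase, and the naive initialization and similarity function here are illustrative assumptions:

```python
def cluster_medoids(terms, g, sim, iters=10):
    """Simplified medoid-based clustering (a stand-in for PAM).

    `terms` is a list of hash sets, `sim` a similarity function;
    returns the g medoid terms, which serve as the candidate terms
    shared with the information processor.
    """
    medoids = terms[:g]  # naive initialization
    for _ in range(iters):
        # Assign each term to its most similar medoid.
        clusters = [[] for _ in range(g)]
        for t in terms:
            best = max(range(g), key=lambda i: sim(t, medoids[i]))
            clusters[best].append(t)
        # Re-pick each medoid as the member most similar to its cluster.
        new = []
        for i, members in enumerate(clusters):
            if not members:
                new.append(medoids[i])
                continue
            new.append(max(members,
                           key=lambda c: sum(sim(c, t) for t in members)))
        if new == medoids:
            break
        medoids = new
    return medoids
```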
[0029] The information processor 106 receives, from the query processor 102, the secure version of the query term, where the secure version is based on the hash number and the permutation, as described herein. Likewise, the information processor 106 receives, from each of the plurality of data processors (e.g., Data Processor 1 110(1), Data Processor 2 110(2),…, Data Processor Y 110(y)), secure versions of a collection of candidate terms, where each candidate term represents a cluster of similar terms in the associated dataset, and where the secure versions are based on the hash number and the permutation.
[0030] As described herein, the information processor 106 determines similarity scores between the secure version of the query term and the secure versions of the candidate terms, where the similarity score is indicative of proximity of the secure versions of the candidate terms to the secure version of the query term, and based on shared data elements between the secure version of the query term and the secure versions of the candidate terms. In some examples, the secure version of the query term and each secure candidate term may be of the same length, and the similarity score may be a ratio of the number of shared data elements between the secure version of the query term and a secure candidate term to that length.
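The ratio form of the similarity score described in this paragraph may be sketched as:

```python
def similarity_ratio(secure_query, secure_candidate):
    """Ratio of shared hash positions to the common length of the
    two secure versions (both are assumed to contain H hashes)."""
    assert len(secure_query) == len(secure_candidate)
    return len(secure_query & secure_candidate) / len(secure_query)
```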
[0031] The information processor 106 identifies the target dataset 116 of the plurality of datasets based on the determined similarity scores. In some examples, the information processor 106 selects, for each of the plurality of data processors (e.g., Data Processor 1 110(1), Data Processor 2 110(2),…, Data Processor Y 110(y)), a representative term of the collection of candidate terms. For example, the information processor 106 may compute the similarity between the secure version of the query term and the secure versions of the candidate terms received from all the data processors by measuring overlaps of the hashes. The information processor 106 may then determine, for each data processor, the top candidate term defined as the candidate term with the highest similarity to the query term, and select this top candidate term as the representative term for the data processor. In some examples, more than one representative term may be selected for a data processor. Thereafter, the information processor 106 provides, to each of the plurality of data processors, the respective representative terms.
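The selection of one representative term per data processor, described above, may be sketched as follows (the dictionary-based interface is an assumption made for illustration):

```python
def pick_representatives(query_hashes, candidates_by_processor):
    """For each data processor, pick the candidate term whose secure
    version has the highest overlap with the secure query term."""
    return {proc: max(candidates, key=lambda c: len(query_hashes & c))
            for proc, candidates in candidates_by_processor.items()}
```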
[0032] In some examples, each data processor may approximate the similarity between the secure version of the query term and the data terms from the cluster represented by the representative term determined by the information processor 106. Each data processor then computes a comparative statistic of all the approximated similarity scores between the query term and the terms in the cluster associated with the representative term. The term "comparative statistic" as used herein may be any statistical representation that captures a summary of the data (e.g., approximated similarity scores). In some examples, the comparative statistic may be one of a mean, median, or mode of the approximated similarity scores. In some examples, each data processor provides the comparative statistic between the representative term and its cluster of similar terms to the information processor 106. The information processor 106 compares the received statistics and identifies the target dataset 116 accordingly. For example, the information processor 106 may identify a dataset with the highest statistic.
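A small sketch of the comparative statistic (mean or median, as the text suggests; mode is omitted here and the choice of statistic is left to the deployment):

```python
def comparative_statistic(approx_scores, kind="mean"):
    """Summarize the approximated similarity scores between the
    query term and the members of a cluster."""
    scores = sorted(approx_scores)
    if kind == "median":
        n, mid = len(scores), len(scores) // 2
        return scores[mid] if n % 2 else (scores[mid - 1] + scores[mid]) / 2
    return sum(scores) / len(scores)
```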
[0033] In some examples, the data processor may determine the comparative statistic between the secure version of the query term and the secure data terms in the cluster associated with the representative term without knowledge of the query term, where the determination is based on the similarity scores (determined at the information processor 106) between the secure version of the query term and the secure version of the representative term. The main idea behind the selection of the target dataset 116 is to estimate the similarity between the secure version of the query term and all secure terms in the data processor, Data Processor 1 110(1), in the cluster associated with the representative term by only knowing the similarity score between the secure version of the query term and the secure version of the representative term from the data processor, Data Processor 1 110(1).
[0034] Accordingly, in the secure information retrieval described herein, the data processors each share a secure version of the representative term with the information processor 106, and the query processor 102 only shares a secure version of the query term with the information processor 106. Accordingly, the information processor 106 is not privy to the actual composition of the query term in the query processor 102, and the plurality of terms in the dataset associated with a data processor, say Data Processor 1 110(1). Also, for example, the query processor 102 has no knowledge of the actual composition of the plurality of terms in the dataset associated with a data processor, say Data Processor 1 110(1). Likewise, the dataset associated with a data processor, say Data Processor 1 110(1), has no knowledge of the actual composition of the query term in the query processor 102. The information processor 106 computes similarity scores between the secure version of the query term and the secure versions of the representative terms, and provides the determined similarity scores to the data processor. The data processor, in turn, utilizes the comparative distribution techniques disclosed herein, to determine similarity scores between the secure version of the query term and the plurality of secure versions of data terms based on the similarity scores between the secure version of the query term and the secure versions of the representative terms received from the information processor 106.
[0035] As described herein, another advantage of such indirect determination of similarity scores is that if the query processor 102 requests an additional target dataset for a second query term, the same secure versions of the representative terms may be utilized again to select the additional target dataset.
[0036] In some examples, the information processor 106 provides the query term to the identified target dataset 116. For example, the plurality of datasets may be a respective plurality of secure storage containers, and the query term may be a data term to be stored in a target storage container associated with the target dataset 116. Accordingly, the information processor 106 is able to provide the query term to the data container that has data elements that are most similar to the query term, thereby minimizing errors in data allocation. In some examples, the information processor 106 provides the identity of the target dataset 116 to the query processor 102, which, in turn, sends the query term directly to the identified target dataset 116.
[0037] In some examples, the information processor 106 may select a target cluster of the target dataset 116 based on the determined similarity scores, and may associate the query term with the target cluster in the target dataset 116. For example, the target cluster may be the cluster associated with the representative term. In some examples, the target dataset 116 may be a secure storage container, and the information processor 106 may associate the query term with a cluster in the secure storage container. In some examples, the secure storage container selected as the target dataset 116 may comprise partitions for data storage, and the information processor 106 may associate the query term with a partition of the secure storage container.
[0038] In some examples, the information processor 106 provides the identity of the target dataset 116 to the query processor 102, which, in turn, sends the query term directly to the identified target dataset 116. In some examples, the information processor 106 provides the identity of the target cluster in the target dataset 116 to the query processor 102, which, in turn, sends the query term directly to the identified target cluster in the target dataset 116.
[0039] In some examples, the information processor 106 may receive a second query term from the query processor 102. In some examples, the information processor 106 may determine if the second query term is similar to the query term, and upon a determination that the second query term is similar to the query term, provide the second query term to the identified target dataset 116. In some examples, upon a determination that the second query term is not similar to the query term, the information processor 106 may identify a second target dataset, and provide the second query term to the identified second target dataset.
[0040] In some examples, the information processor 106 may rank the plurality of datasets based on the determined similarity scores, and/or the comparative statistic. For example, the representative terms may be ranked based on respective similarity scores. Accordingly, the associated datasets may be ranked based on the ranking for the representative terms. In some examples, the comparative statistic may be utilized to rank the plurality of datasets, and the information processor 106 may provide a list of top-k datasets to the query processor 102. The query processor 102 may then prompt the information processor 106 to provide the query term to a sub-plurality of the top-k datasets. In some examples, the information processor 106 may rank the clusters in a given dataset, and may provide the query term to a sub-plurality of the clusters based on the determined ranking.
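The top-k ranking of datasets by their comparative statistic, described above, may be sketched as:

```python
def top_k_datasets(statistic_by_dataset, k):
    """Rank datasets by comparative statistic, highest first, and
    return the identities of the top k."""
    ranked = sorted(statistic_by_dataset.items(),
                    key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]
```

The query processor could then prompt the information processor to provide the query term to some sub-plurality of the returned datasets.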
[0041] The components of system 100 may be computing resources, each including a suitable combination of a physical computing device, a virtual computing device, a network, software, a cloud infrastructure, a hybrid cloud infrastructure that may include a first cloud infrastructure and a second cloud infrastructure that is different from the first cloud infrastructure, and so forth. The components of system 100 may be a combination of hardware and programming for performing a designated function. In some instances, each component may include a processor and a memory, while programming code is stored on that memory and executable by a processor to perform a designated function.
[0042] For example, the information processor 106 may be a combination of hardware and programming for performing a designated function. For example, the information processor 106 may include programming to receive the query term and the candidate terms, and determine similarity scores for the query term and the candidate terms. The information processor 106 may include hardware to physically store the similarity scores, and processors to physically process the received terms and determined similarity scores. Also, for example, information processor 106 may include software programming to dynamically interact with the other components of system 100.
[0043] Generally, the components of system 100 may include programming and/or physical networks to be communicatively linked to other components of system 100. In some instances, the components of system 100 may include a processor and a memory, while programming code is stored on that memory and executable by a processor to perform designated functions.
[0044] Generally, the query processor 102 and the plurality of data processors (e.g., Data Processor 1 110(1), Data Processor 2 110(2),…, Data Processor Y 110(y)) may be communicatively linked to computing devices. A computing device, as used herein, may be, for example, a web-based server, a local area network server, a cloud-based server, a notebook computer, a desktop computer, an all-in-one system, a tablet computing device, a mobile phone, an electronic book reader, or any other electronic device suitable for provisioning a computing resource to perform a unified visualization interface. The computing device may include a processor and a computer-readable storage medium.
[0045] Figure 2 is a block diagram illustrating one example of a computer readable medium for data allocation based on secure information retrieval. Processing system 200 includes a processor 202, a computer readable medium 208, input devices 204, and output devices 206. Processor 202, computer readable medium 208, input devices 204, and output devices 206 are coupled to each other through a communication link (e.g., a bus).
[0046] Processor 202 executes instructions included in the computer readable medium 208. Computer readable medium 208 includes request receipt instructions 210 to receive, from a query processor, a request for identification of a target dataset to be associated with a query term, the request including a hash length and a hash number.
[0047] Computer readable medium 208 includes permutation generation instructions 212 to generate a random permutation based on the hash length.
[0048] Computer readable medium 208 includes secure version of the query term receipt instructions 214 to receive, from the query processor, a secure version of the query term, the secure version based on the hash number and the permutation.
[0049] Computer readable medium 208 includes candidate term receipt instructions 216 to receive, from each of the plurality of data processors associated with a plurality of datasets, secure versions of a collection of candidate terms, where each candidate term represents a cluster of similar terms in the associated dataset, and where the secure versions are based on the hash number and the permutation.
[0050] Computer readable medium 208 includes similarity score determination instructions 218 to determine similarity scores between the secure version of the query term and secure versions of the candidate terms.
[0051] Computer readable medium 208 includes target dataset identification instructions 220 to identify the target dataset of the plurality of datasets based on the determined similarity scores.
[0052] Computer readable medium 208 includes query term provide instructions 222 to provide the query term to the identified target dataset.
[0053] Input devices 204 include a keyboard, mouse, data ports, and/or other suitable devices for inputting information into processing system 200. In some examples, input devices 204, such as a computing device, are used by the information processor to receive a query term. Output devices 206 include a monitor, speakers, data ports, and/or other suitable devices for outputting information from processing system 200. In some examples, output devices 206 are used to provide the query term to the target dataset.
[0054] As used herein, a "computer readable medium" may be any electronic, magnetic, optical, or other physical storage apparatus to contain or store information such as executable instructions, data, and the like. For example, any computer readable storage medium described herein may be any of Random Access Memory (RAM), volatile memory, non-volatile memory, flash memory, a storage drive (e.g., a hard drive), a solid state drive, and the like, or a combination thereof. For example, the computer readable medium 208 can include one of or multiple different forms of memory including semiconductor memory devices such as dynamic or static random access memories (DRAMs or SRAMs), erasable and programmable read-only memories (EPROMs), electrically erasable and programmable read-only memories (EEPROMs) and flash memories; magnetic disks such as fixed, floppy and removable disks; other magnetic media including tape; optical media such as compact disks (CDs) or digital video disks (DVDs); or other types of storage devices.

[0055] As described herein, various components of the processing system 200 are identified and refer to a combination of hardware and programming configured to perform a designated function. As illustrated in Figure 2, the programming may be processor executable instructions stored on tangible computer readable medium 208, and the hardware may include Processor 202 for executing those instructions. Thus, computer readable medium 208 may store program instructions that, when executed by Processor 202, implement the various components of the processing system 200.
[0056] Such computer readable storage medium or media is (are) considered to be part of an article (or article of manufacture). An article or article of manufacture can refer to any manufactured single component or multiple components. The storage medium or media can be located either in the machine running the machine-readable instructions, or located at a remote site from which machine-readable instructions can be downloaded over a network for execution.
[0057] Computer readable medium 208 may be any of a number of memory components capable of storing instructions that can be executed by processor 202. Computer readable medium 208 may be non-transitory in the sense that it does not encompass a transitory signal but instead is made up of one or more memory components configured to store the relevant instructions. Computer readable medium 208 may be implemented in a single device or distributed across devices. Likewise, processor 202 represents any number of processors capable of executing instructions stored by computer readable medium 208. Processor 202 may be integrated in a single device or distributed across devices. Further, computer readable medium 208 may be fully or partially integrated in the same device as processor 202 (as illustrated), or it may be separate but accessible to that device and processor 202. In some examples, computer readable medium 208 may be a machine-readable storage medium.
[0058] Figure 3 is a flow diagram illustrating one example of a method for data allocation based on secure information retrieval.
[0059] At 300, a request may be received from a query processor, for identification of a target dataset to be associated with a query term, the request including a hash length and a hash number.

[0060] At 302, a random permutation may be generated based on the hash length.
[0061] At 304, a secure version of the query term may be received from the query processor, the secure version based on the hash number and the permutation.
[0062] At 306, secure versions of a collection of candidate terms may be received from each of the plurality of data processors associated with a plurality of datasets, where each candidate term represents a cluster of similar terms in the associated dataset, and where the secure versions are based on the hash number and the permutation.
[0063] At 308, similarity scores may be determined between the secure version of the query term and secure versions of the candidate terms.
[0064] At 310, a representative term of the collection of candidate terms may be selected for each of the plurality of data processors.
[0065] At 312, a comparative statistic between the representative term and its cluster of similar terms may be received from each of the plurality of data processors.
[0066] At 314, the target dataset of the plurality of datasets may be identified based on the comparative statistic.
[0067] In some examples, the plurality of datasets may be a respective plurality of secure storage containers, and the query term may be a data term to be stored in a target storage container associated with the target dataset.
[0068] Figure 4 is a flow diagram illustrating one example of a method for providing a query term to a target dataset in data allocation based on secure information retrieval.
[0069] At 400, a request may be received from a query processor, for identification of a target dataset to be associated with a query term, the request including a hash length and a hash number.
[0070] At 402, a random permutation may be generated based on the hash length.
[0071] At 404, a secure version of the query term may be received from the query processor, the secure version based on the hash number and the permutation.
[0072] At 406, secure versions of a collection of candidate terms may be received from each of the plurality of data processors associated with a plurality of datasets, where each candidate term represents a cluster of similar terms in the associated dataset, and where the secure versions are based on the hash number and the permutation.
[0073] At 408, similarity scores may be determined between the secure version of the query term and secure versions of the candidate terms.
[0074] At 410, a representative term of the collection of candidate terms may be selected for each of the plurality of data processors.
[0075] At 412, a comparative statistic between the representative term and its cluster of similar terms may be received from each of the plurality of data processors.
[0076] At 414, the target dataset of the plurality of datasets may be identified based on the comparative statistic.
[0077] At 416, the query term may be provided to the identified target dataset.
[0078] Figure 5 is a flow diagram illustrating one example of a method for associating a query term with a target cluster in a target dataset in data allocation based on secure information retrieval.
[0079] At 500, a request may be received from a query processor, for identification of a target dataset to be associated with a query term, the request including a hash length and a hash number.
[0080] At 502, a random permutation may be generated based on the hash length.
[0081] At 504, a secure version of the query term may be received from the query processor, the secure version based on the hash number and the permutation.
[0082] At 506, secure versions of a collection of candidate terms may be received from each of the plurality of data processors associated with a plurality of datasets, where each candidate term represents a cluster of similar terms in the associated dataset, and where the secure versions are based on the hash number and the permutation.
[0083] At 508, similarity scores may be determined between the secure version of the query term and secure versions of the candidate terms.
[0084] At 510, a representative term of the collection of candidate terms may be selected for each of the plurality of data processors.

[0085] At 512, a comparative statistic between the representative term and its cluster of similar terms may be received from each of the plurality of data processors.
[0086] At 514, the target dataset of the plurality of datasets may be identified based on the comparative statistic.
[0087] At 516, a target cluster of the target dataset may be selected based on the determined similarity scores, and the query term may be associated with the target cluster in the target dataset.
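The cluster association at 516 might be sketched as follows; the function, its arguments, and the in-memory cluster map are hypothetical stand-ins for the target data processor's storage:

```python
def assign_to_cluster(secure_query, cluster_candidates, cluster_members,
                      similarity):
    # cluster_candidates maps a cluster id to the secure version of that
    # cluster's candidate term within the already-identified target dataset;
    # cluster_members holds the terms already assigned to each cluster.
    target_cluster = max(
        cluster_candidates,
        key=lambda c: similarity(secure_query, cluster_candidates[c]),
    )
    # Associate the query term with the most similar cluster.
    cluster_members[target_cluster].append(secure_query)
    return target_cluster
```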
[0088] Figure 6 is a flow diagram illustrating one example of a method for providing a second query term to a target dataset in data allocation based on secure information retrieval.
[0089] At 600, a first target dataset associated with a first query term may be identified.
[0090] At 602, a second query term may be received.
[0091] At 604, it may be determined if the second query term is similar to the first query term.
[0092] At 606, upon a determination that the second query term is similar to the first query term, the second query term may be associated with the first target dataset. In some examples, the second query term may be provided to the first target dataset.
[0093] At 608, upon a determination that the second query term is not similar to the first query term, a second target dataset may be identified based on methods described herein, and the second query term may be associated with the second target dataset. In some examples, the second query term may be provided to the second target dataset.
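Steps 604 through 608 amount to a reuse-or-recompute decision, which can be sketched as below; all names and the callback signatures are assumptions for illustration:

```python
def route_second_term(second_term, first_term, first_target,
                      similar, identify_target):
    # Steps 604-608: when the second query term is similar to the first,
    # reuse the first term's target dataset; otherwise run the full
    # identification protocol again for the second term.
    if similar(second_term, first_term):
        return first_target
    return identify_target(second_term)
```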
[0094] Examples of the disclosure provide a generalized system for data allocation based on secure information retrieval. The generalized system provides a protocol for identifying storage devices with content similar to an incoming data element in a secure and anonymized manner. The present disclosure focuses on identifying an appropriate data storage device prior to storing the incoming data element, while maintaining the anonymity of both the data stored in the storage devices and the incoming data element.

[0095] Although the examples are described with a single query term in a query dataset, the techniques disclosed herein may be applied to more than one query term in the query dataset. Generation of secure terms based on the transformed data terms ensures that the information processor does not have the complete data, so information cannot be leaked by the information processor. Additionally, the hash transformation ensures that the information processor only has the hashes. Accordingly, the information processor is unable to regenerate the original data terms in the datasets (i.e., hash-to-data is not possible).
[0096] Although specific examples have been illustrated and described herein, especially as related to numerical data, the examples illustrate applications to any dataset. Accordingly, there may be a variety of alternate and/or equivalent implementations that may be substituted for the specific examples shown and described without departing from the scope of the present disclosure. This application is intended to cover any adaptations or variations of the specific examples discussed herein.

Claims

1. A system for data allocation based on secure information retrieval, the system comprising:
an information processor communicatively linked to a query processor and a plurality of data processors respectively associated with a plurality of datasets, wherein the information processor is to:
receive, from the query processor, a request for identification of a target dataset to be associated with a query term, the request including a hash length and a hash number;
generate a random permutation based on the hash length;
receive, from the query processor, a secure version of the query term, the secure version based on the hash number and the permutation;
receive, from each of the plurality of data processors, secure versions of a collection of candidate terms, wherein each candidate term represents a cluster of similar terms in the associated dataset, and wherein the secure versions are based on the hash number and the permutation;
determine similarity scores between the secure version of the query term and secure versions of the candidate terms; and
identify the target dataset of the plurality of datasets based on the determined similarity scores.
2. The system of claim 1, wherein the information processor is to provide the query term to the identified target dataset.
3. The system of claim 1, wherein the information processor is to identify the target dataset by:
selecting, for each of the plurality of data processors, a representative term of the collection of candidate terms;
providing, to each of the plurality of data processors, the respective representative term;
receiving, from each of the plurality of data processors, a comparative statistic between the representative term and its cluster of similar terms; and
identifying the target dataset based on the comparative statistic.
4. The system of claim 1, wherein the secure versions are based on applying orthogonal transforms to the respective terms.
5. The system of claim 1, wherein the plurality of datasets is a respective plurality of secure storage containers, and the query term is a data term to be stored in a target storage container associated with the target dataset.
6. The system of claim 1, wherein the information processor is to:
rank the plurality of datasets based on the determined similarity scores; and
provide, to the query processor, the ranked list of datasets.
7. The system of claim 6, wherein the information processor is to identify the target dataset based on the ranking.
8. The system of claim 1, wherein the information processor is to select a target cluster of the target dataset based on the determined similarity scores, and is to associate the query term with the target cluster in the target dataset.
9. The system of claim 1, wherein the information processor is to:
receive a second query term from the query processor;
determine if the second query term is similar to the query term;
upon a determination that the second query term is similar to the query term, provide the second query term to the identified target dataset; and
upon a determination that the second query term is not similar to the query term, identify a second target dataset, and provide the second query term to the identified second target dataset.
10. A method for data allocation based on secure information retrieval, the method comprising:
receiving, from a query processor, a request for identification of a target dataset to be associated with a query term, the request including a hash length and a hash number;
generating a random permutation based on the hash length;
receiving, from the query processor, a secure version of the query term, the secure version based on the hash number and the permutation;
receiving, from each of a plurality of data processors associated with a plurality of datasets, secure versions of a collection of candidate terms, wherein each candidate term represents a cluster of similar terms in the associated dataset, and wherein the secure versions are based on the hash number and the permutation;
determining similarity scores between the secure version of the query term and secure versions of the candidate terms;
selecting, for each of the plurality of data processors, a representative term of the collection of candidate terms;
receiving, from each of the plurality of data processors, a comparative statistic between the representative term and its cluster of similar terms; and
identifying the target dataset of the plurality of datasets based on the comparative statistic.
11. The method of claim 10, comprising providing the query term to the identified target dataset.
12. The method of claim 10, comprising selecting a target cluster of the target dataset based on the determined similarity scores, and associating the query term with the target cluster in the target dataset.
13. The method of claim 10, wherein the plurality of datasets is a respective plurality of secure storage containers, and the query term is a data term to be stored in a target storage container associated with the target dataset.
14. A non-transitory computer readable medium comprising executable instructions to:
receive, from a query processor, a request for identification of a target dataset to be associated with a query term, the request including a hash length and a hash number;
generate a random permutation based on the hash length;
receive, from the query processor, a secure version of the query term, the secure version based on the hash number and the permutation;
receive, from each of a plurality of data processors associated with a plurality of datasets, secure versions of a collection of candidate terms, wherein each candidate term represents a cluster of similar terms in the associated dataset, and wherein the secure versions are based on the hash number and the permutation;
determine similarity scores between the secure version of the query term and secure versions of the candidate terms;
identify the target dataset of the plurality of datasets based on the determined similarity scores; and
provide the query term to the identified target dataset.
15. The computer readable medium of claim 14, comprising executable instructions to:
select a target cluster of the target dataset based on the determined similarity scores; and
associate the query term with the target cluster in the target dataset.
Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/US2015/059934 WO2017082875A1 (en) 2015-11-10 2015-11-10 Data allocation based on secure information retrieval
US15/774,708 US10783268B2 (en) 2015-11-10 2015-11-10 Data allocation based on secure information retrieval


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050055345A1 (en) * 2002-02-14 2005-03-10 Infoglide Software Corporation Similarity search engine for use with relational databases
US20070094511A1 (en) * 2002-09-06 2007-04-26 The United States Postal Service Method and system for efficiently retrieving secured data by securely pre-processing provided access information
US20100325095A1 (en) * 2009-06-23 2010-12-23 Bryan Stephenson Permuting records in a database for leak detection and tracing
WO2013062941A1 (en) * 2011-10-23 2013-05-02 Microsoft Corporation Telemetry file hash and conflict detection
US20130173917A1 (en) * 2011-12-30 2013-07-04 Christopher J. Clifton Secure search and retrieval



Also Published As

Publication number Publication date
US10783268B2 (en) 2020-09-22
US20180322304A1 (en) 2018-11-08


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 15908423; Country of ref document: EP; Kind code of ref document: A1)
WWE Wipo information: entry into national phase (Ref document number: 15774708; Country of ref document: US)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 15908423; Country of ref document: EP; Kind code of ref document: A1)