WO2018080857A1 - Systems and methods for creating, storing, and analyzing secure data - Google Patents
Systems and methods for creating, storing, and analyzing secure data
- Publication number
- WO2018080857A1 (PCT/US2017/057075)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- secure
- query
- client
- schema
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/27—Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6227—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database where protection concerns the structure of data, e.g. records, types, queries
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/02—Network architectures or network communication protocols for network security for separating internal from external traffic, e.g. firewalls
- H04L63/0281—Proxies
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/04—Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks
- H04L63/0428—Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks wherein the data content is protected, e.g. by encrypting or encapsulating the payload
- H04L63/0471—Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks wherein the data content is protected, e.g. by encrypting or encapsulating the payload applying encryption by an intermediary, e.g. receiving clear information at the intermediary and encrypting the received information at the intermediary before forwarding
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/12—Applying verification of the received information
- H04L63/123—Applying verification of the received information received data contents, e.g. message integrity
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L9/00—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
- H04L9/008—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols involving homomorphic encryption
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L9/00—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
- H04L9/06—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols the encryption apparatus using shift registers or memories for block-wise or stream coding, e.g. DES systems or RC4; Hash functions; Pseudorandom sequence generators
- H04L9/0643—Hash functions, e.g. MD5, SHA, HMAC or f9 MAC
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L9/00—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
- H04L9/08—Key distribution or management, e.g. generation, sharing or updating, of cryptographic keys or passwords
- H04L9/0894—Escrow, recovery or storing of secret information, e.g. secret key escrow or cryptographic key storage
Definitions
- This invention relates to systems and methods for creating, storing, and analyzing secure data stores.
- A user uploads a set of client data and a set of correlations between data fields from a client subsystem to a gateway subsystem.
- the gateway subsystem applies a schema to generate a data structure from the client data that is then transformed.
- the transformed data structure is then transmitted and stored in a secure database on a server subsystem that is administratively separate from the gateway subsystem.
- Coded keys and schema solutions are only stored on the gateway subsystem, and are not accessible by the client subsystem or the server subsystem.
- Hashing and encryption are two methods for securing information when it is being transmitted on the Internet or stored at rest. They can both help satisfy regulatory requirements such as those under PCI DSS, HIPAA-HITECH, GLBA, ITAR, and the EU GDPR. While hashing and encryption are both effective data obfuscation technologies, they are not the same thing, and each technology has its own strengths and weaknesses. In some cases, such as with electronic payment data, both encryption and hashing are used to secure the end-to-end process.
- Hashing, when applied to data security, is the process of substituting a sensitive data element with a non-sensitive equivalent from which it is mathematically infeasible to determine the semantic content of the original data element.
- Tokenization is the process of substituting a sensitive data element with a non-sensitive equivalent, referred to as a token, that has no extrinsic or exploitable meaning or value.
- U.S. Patent Number 9,336,256 to Boukobza shows a data tokenization system that includes a token vault.
- Boukobza discloses a database network router ("DNR"), which serves as an intermediary node between an application and a tokenized database.
- the database communicates directly with a token vault through the use of a DNR software agent running on the database that parses and executes commands received as part of database access requests from the DNR.
- the application can be decoupled from the database, and the burden of integrating tokenization APIs and performing tokenization or de-tokenization functions can be shifted to the DNR and the DNR software agent.
- multiple tokenization vendors can utilize the DNR and a DNR software agent as the interface between their own databases and applications.
- Tokenization methods like that disclosed in Boukobza have drawbacks. Such methods require a lookup table or token store/vault that contains mappings between tokens and actual values. These token stores or vaults may contain sensitive data in cleartext and represent an alternate breach target. And, for large databases, translation tables can become costly to scale and maintain due to the need for synchronization and high availability. It is also difficult (and slow) to exchange data, since such methods require direct access to a token vault mapping token values.
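The vault dependency described above can be illustrated with a minimal sketch. The `TokenVault` class and its method names are hypothetical and are not taken from Boukobza; the point is only that de-tokenization requires a stored mapping that holds the cleartext, which is why the vault becomes an alternate breach target and a synchronization burden.

```python
import secrets


class TokenVault:
    """Hypothetical in-memory token vault: maps random tokens to cleartext."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        # Reuse an existing token so that equal values map to equal tokens.
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = secrets.token_hex(8)  # random; carries no information about value
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        # Only possible with access to the vault's mapping table.
        return self._token_to_value[token]


vault = TokenVault()
card = "4111-1111-1111-1111"  # illustrative test value
token = vault.tokenize(card)
```

Every party that needs the original value must reach this table, and the table itself stores the sensitive values in cleartext.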
- Hashing is the process of transforming a string of characters or other value into a (usually) shorter fixed-length value or key that represents the original string. Hashing can be used to determine if there has been a minor change in the original since every such change will result in a different hashed value. Hashing can also be used to index and retrieve items in a database because it is faster to find the item using the shorter hashed key than to find it using the original value.
- a hash function is any function that can be used to map data of arbitrary size to data of fixed size.
- The values returned by a hash function are called hash values, hash codes, digests, or simply "hashes."
- One use of hashed values is a data structure called a hash table, widely used in computer software for rapid data lookup.
- Hashing is a one-way function that scrambles plain text to produce a unique message digest. With a properly designed algorithm, there is no way to reverse the hashing process to reveal the original data.
- Popular hashing algorithms include MD5, SHA-0, SHA-1, and SHA-2.
- Encryption is a two-way function; what is encrypted can be decrypted with the proper key.
- hashing thus encrypts data in a one-way fashion. If the universe of values being hashed is known, an attacker that has access to the same hashing algorithm can build rainbow tables (i.e., pre-calculated tables for a specific hash) of possible known values to search for target data. Hashing, thus, can be insecure where there is a reasonably finite quantity of input values and a known hash function. Hashing can be secure, however, when either the hash function is unknown or the universe of input values is too large for the creation of a useful rainbow table.
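The weakness described above can be sketched for a small, known input universe. The example below uses a plain precomputed lookup table as a simplified stand-in for a rainbow table, and uses a keyed hash (HMAC-SHA-256) as one possible way to approximate an "unknown" hash function; the key, the PIN universe, and the function names are assumptions made for illustration only.

```python
import hashlib
import hmac


def plain_hash(value: str) -> str:
    """Unkeyed SHA-256: anyone who knows the algorithm can recompute it."""
    return hashlib.sha256(value.encode()).hexdigest()


def keyed_hash(key: bytes, value: str) -> str:
    """HMAC-SHA-256: cannot be recomputed without the secret key."""
    return hmac.new(key, value.encode(), hashlib.sha256).hexdigest()


# A small, fully known universe of inputs: all four-digit PINs.
universe = [f"{n:04d}" for n in range(10000)]

# The attacker precomputes hash -> value for the whole universe.
lookup = {plain_hash(v): v for v in universe}

# An unkeyed hash of a value from the universe is recovered by lookup.
stolen = plain_hash("4821")
recovered = lookup.get(stolen)

# The same value under a keyed hash does not appear in the table.
secret_key = b"hypothetical-gateway-secret"
stolen_keyed = keyed_hash(secret_key, "4821")
recovered_keyed = lookup.get(stolen_keyed)
```

Here `recovered` is the original PIN, while `recovered_keyed` is `None` because the attacker's table was built without the key.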
- Encryption is the process of using an algorithm to transform plain text information into a non-readable form called ciphertext. An algorithm and an encryption key are required to decrypt the information and return it to its original plain text format.
- SSL encryption is commonly used to protect information as it is transmitted on the Internet. Using built-in encryption capabilities of operating systems or third-party encryption tools, millions of people encrypt data on their computers to protect against the accidental loss of sensitive data in the event their computer is stolen. Encryption can be used to thwart government surveillance and theft of sensitive corporate data when a secure algorithm and reasonable encryption keys are used.
- Encryption can be performed with software such as PGP, or with custom software or functions built into operating systems or encryption APIs.
- U.S. Patent Number 9,087,212 to Balakrishnan et al. discusses a sequence of steps for implementing database confidentiality.
- a database proxy intercepts SQL queries and rewrites the queries to execute on encrypted data.
- the system guards against an external attacker with full access to the data stored in a database management system ("DBMS") server.
- the attacker is assumed to be passive, i.e., wants to learn confidential data, but does not change queries issued by the application, query results, or the data in the DBMS. That threat includes DBMS software compromises, root access to DBMS machines, and even access to the RAM of physical machines.
- the proxy stores a secret master key MK, a database schema, and the current encryption layers of all columns.
- the system executes SQL queries over encrypted data.
- the DBMS machines and administrators are not trusted, but the application and the proxy are trusted.
- the system enables the DBMS server to execute SQL queries on encrypted data almost as if it were executing the same queries on plaintext data so that existing DBMSes do not need to be changed.
- the DBMS query plan for an encrypted query is typically the same as for the original query, except that the operators comprising the query, such as selections, projections, joins, aggregates, and orderings, are performed on ciphertexts, and use modified operators in some cases.
- the DBMS server returns the (encrypted) query result, which the proxy decrypts and returns to the application.
- Homomorphic encryption is a form of encryption that allows computations to be carried out on ciphertext, thus generating an encrypted result which, when decrypted, matches the result of operations performed on the plaintext.
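The defining property can be shown with a toy example. Unpadded textbook RSA is multiplicatively homomorphic: multiplying two ciphertexts yields a ciphertext of the product of the plaintexts. The parameters below are deliberately tiny and the scheme is insecure as written; it is a sketch of the property only, not of any scheme used in the cited references.

```python
# Toy textbook RSA (unpadded, tiny primes): for illustration only.
p, q = 61, 53
n = p * q                 # public modulus
phi = (p - 1) * (q - 1)
e = 17                    # public exponent
d = pow(e, -1, phi)       # private exponent (modular inverse, Python 3.8+)


def encrypt(m: int) -> int:
    return pow(m, e, n)


def decrypt(c: int) -> int:
    return pow(c, d, n)


a, b = 7, 3
# Multiply the ciphertexts without ever decrypting the operands.
product_ciphertext = (encrypt(a) * encrypt(b)) % n
product_plaintext = decrypt(product_ciphertext)
```

Decrypting `product_ciphertext` gives `a * b`, provided the product is smaller than the modulus `n`.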
- These techniques imply the requirement of one or more (partially) homomorphic encryption schemes to manipulate column (field) values in a meaningful way while encrypted.
- Boneh et al. discloses a private database query system.
- a database server is split into two entities (called “server” and “proxy”), and privacy holds only so long as these two entities do not collude.
- Boneh encodes the database as one or more polynomials, manipulating these polynomials using a client's query so as to obtain a new polynomial whose roots are the indexes of the matching records.
- tg is a tag
- A(x) is a polynomial whose roots are exactly the record indexes r that contain this attribute-value pair.
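The encoding idea behind A(x) can be sketched in cleartext. The helper names below are hypothetical, and the sketch omits the encryption entirely: in Boneh the polynomials are manipulated under encryption, whereas this example only shows how a set of matching record indexes can be represented as the roots of a polynomial.

```python
def poly_from_roots(roots):
    """Return coefficients (lowest degree first) of the product of (x - r)."""
    coeffs = [1]  # the constant polynomial 1
    for r in roots:
        # Multiply the current polynomial by (x - r).
        new = [0] * (len(coeffs) + 1)
        for i, c in enumerate(coeffs):
            new[i] += -r * c      # contribution of the -r term
            new[i + 1] += c       # contribution of the x term
        coeffs = new
    return coeffs


def evaluate(coeffs, x):
    """Evaluate the polynomial at x using Horner's rule."""
    result = 0
    for c in reversed(coeffs):
        result = result * x + c
    return result


# Suppose records 2, 5, and 7 contain a given attribute-value pair.
A = poly_from_roots([2, 5, 7])

# A record index r matches exactly when A(r) == 0.
matching = [r for r in range(10) if evaluate(A, r) == 0]
```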
- Homomorphic encryption techniques like that described in Boneh also have several drawbacks. Computations are highly compute-intensive and have been demonstrated to perform poorly even at small scale. Further, values in these techniques are mutable and may be changed or have encrypted values computed as the result of homomorphic computations.
- a security model that does not suffer from these and the foregoing drawbacks is therefore desired, particularly for use in data analytics applications. More particularly, a model that relies on proven, strong encryption algorithms, includes simple logic operations that have been demonstrated to work efficiently at massive scale so that they can be easily distributed for parallel computing across many systems, and does not operate on the original data, but rather, a network of encrypted values that represent the original data, is desired.
- Embodiments of solutions to the problems described above include improved systems and methods of performing data analytics.
- a system according to the present invention comprises a database designed to support data encryption in transit, at rest, and in use.
- the system applies a transformation process to represent user data as a network of ciphered facts.
- This network of facts can be queried and analyzed using instructions that have been ciphered using the same process that ciphered the data itself.
- the transformed data is a complete representation of the original pre-transformed data and can be transformed back into its original cleartext format without the need to maintain cumbersome translation tables such as those required by data masking and tokenization mechanisms.
- A system according to the present invention is divided into three administratively distinct subsystems, which ensures that no single domain breach can result in the loss of usable data. The subsystems are referred to as "client," "gateway," and "server."
- the client subsystem creates and submits queries and analytical routines in user-readable format.
- The gateway subsystem, for example, stores schema and secret information, performs data transformation processes, and enforces access control policies.
- The server subsystem, for example, writes the transformed encrypted data to storage, and executes distributed read queries and analysis operations on encrypted data.
- Queries and analysis routines originate in the client subsystem, are transformed in the gateway subsystem, and are then executed by the server subsystem.
- The server subsystem collects the results of read queries and analyses and relays them to the gateway subsystem, where they may be transformed and filtered using data access policies before being delivered in user-readable format to the requesting client subsystem.
- a system comprises client, gateway, and server subsystems.
- the client subsystem is the connection point between users or external applications and the system.
- the gateway subsystem stores and secures a database schema and secret keys, performs data transformation and encryption, and enforces data access policies.
- The server subsystem provides bulk data storage and distributed-computing resources to perform advanced queries and analysis upon stored, secure data.
- the gateway subsystem serves as the interconnection service between the client subsystem and server subsystem.
- The users and applications accessing the client subsystem create queries and analysis routines and forward them to the gateway subsystem.
- The gateway subsystem transforms queries and analysis routines received from the client subsystem into a ciphered representation and relays the representation to the server subsystem.
- the server subsystem executes the queries and analysis operations and returns one or more result sets to the gateway subsystem.
- the gateway subsystem may transform the ciphered data back into its original form, apply any fine-grained data access policies that are configured, and relay the policy-filtered results to the client subsystem and on to the requesting users and applications.
- the three-subsystem architecture makes it possible to distribute the components necessary for a threat actor to obtain sensitive data into separate administrative domains.
- the gateway subsystem does not share secret keys or schema information with the server subsystem. While such an embodiment can work with one client, multiple clients each having access to a subset of the source data can also be used. In such embodiments, each client subsystem has access to a subset of the schema, but never has access to the secret keys. All operations within the server subsystem are performed using a ciphered network of facts, ciphered-query and analysis parameters, and pre-defined collections of logic instructions.
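The separation of duties described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented transformation: HMAC-SHA-256 is used as a stand-in for the ciphering process, record identifiers are left in the clear for simplicity, and the `Gateway` and `Server` classes and their methods are hypothetical names.

```python
import hashlib
import hmac


class Gateway:
    """Holds the secret key; ciphers both data and queries the same way."""

    def __init__(self, secret_key: bytes):
        self._key = secret_key  # never shared with the client or the server

    def cipher(self, field: str, value: str) -> str:
        message = f"{field}={value}".encode()
        return hmac.new(self._key, message, hashlib.sha256).hexdigest()

    def transform_record(self, record: dict) -> set:
        # Each field/value pair becomes one ciphered "fact".
        return {self.cipher(f, str(v)) for f, v in record.items()}


class Server:
    """Stores and matches ciphered facts; has no key and no schema."""

    def __init__(self):
        self._facts = {}  # record id -> set of ciphered facts

    def store(self, record_id: int, facts: set) -> None:
        self._facts[record_id] = facts

    def query(self, ciphered_fact: str) -> list:
        return sorted(rid for rid, facts in self._facts.items()
                      if ciphered_fact in facts)


gateway = Gateway(b"hypothetical-gateway-secret")
server = Server()

records = {
    1: {"name": "Ada", "city": "Paris"},
    2: {"name": "Bob", "city": "Lyon"},
    3: {"name": "Cy", "city": "Paris"},
}
for rid, rec in records.items():
    server.store(rid, gateway.transform_record(rec))

# The query is ciphered by the same process that ciphered the data.
hits = server.query(gateway.cipher("city", "Paris"))
```

The server finds records 1 and 3 by comparing ciphered values only; it never sees "city", "Paris", or the key.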
- systems according to the present invention may provide audit reporting in both the gateway and server subsystems.
- the gateway subsystem audit may report the original query and a non-repudiated query signature provided by the originating client.
- the query signature is included with the transformed query when it is relayed to the server subsystem.
- the server subsystem report includes the query signature and the identity of the gateway subsystem that relayed it. Reports from the two subsystems can be correlated to identify and alert security staff of discrepancies or unusual activity.
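The correlation step can be sketched as a comparison of the two audit logs keyed on the query signature. The log entry layout and field names below are assumptions, and the signatures are treated as opaque strings; how the client produces a non-repudiable signature is outside the scope of this sketch.

```python
def correlate(gateway_log, server_log):
    """Compare audit entries from the two subsystems by query signature."""
    gateway_sigs = {entry["signature"] for entry in gateway_log}
    server_sigs = {entry["signature"] for entry in server_log}
    return {
        # Logged at the gateway but never seen by the server.
        "not_executed": sorted(gateway_sigs - server_sigs),
        # Executed on the server with no matching gateway record.
        "unknown_to_gateway": sorted(server_sigs - gateway_sigs),
    }


# Hypothetical audit entries.
gateway_log = [
    {"signature": "sig-001", "query": "SELECT name WHERE city = 'Paris'"},
    {"signature": "sig-002", "query": "SELECT name WHERE city = 'Lyon'"},
]
server_log = [
    {"signature": "sig-001", "gateway_id": "gw-a"},
    {"signature": "sig-999", "gateway_id": "gw-a"},
]

discrepancies = correlate(gateway_log, server_log)
```

A non-empty `unknown_to_gateway` list is the kind of discrepancy that would warrant an alert to security staff.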
- the system may include a gateway subsystem that possesses one or more schemas (correlations between data fields within a set of data) in a schema store and secret key information.
- the gateway subsystem comprises a schema that provides a mapping of cleartext data field descriptors together with a set of encryption keys.
- the gateway subsystem further includes a transformation function, which uses a schema to create a schema-based data structure and then one-way encrypts the structure without use of a token table to create secure data.
- the secure data has semantic meaning only within the confines of the gateway subsystem, where the schema and keys are present, digitally stored, and/or accessible. The gateway subsystem does not need to otherwise store the cleartext data after it has been transformed.
- the gateway subsystem communicates the secure data to a server subsystem that is administratively separate from the gateway subsystem.
- the server subsystem does not have access to the schema or secret keys.
- the server subsystem may be used to perform analytics in a secure manner by analyzing the secure data for patterns, which can correspond to patterns in the cleartext data. In this manner, the analytics are performed on a dataset that comprises secure data (in particular, transformed values) and need not comprise any cleartext data. Were a breach of the server subsystem and secure data to occur, the secure data would likely be useless to the attacker.
- the system of the present invention allows a user to upload client data to a gateway subsystem, and either be provided with, or select, a schema.
- the gateway subsystem has a transformation function that takes the client data, along with the data fields, and identification values assigned to designated correlations, and generates a schema-based data structure.
- the transformation function then one-way encrypts the schema-based unique data structure (i.e., makes it mathematically infeasible to recover the cleartext data from the encrypted values without semantic meaning) to create secure data.
- the encryption may in some instances be the result of a secure hashing algorithm applied to the schema-based data structure.
- the values that are hashed may not be the entire original data object, but instead, some selected subset of the original data object (i.e., some combination of fields).
- the exact fields that are extracted and discretized prior to hashing would be defined as part of the schema.
- the secure data is then uploaded to a separate but secure server subsystem for storage in a secured database (e.g., a cloud storage service) and/or analysis.
- the input values may be single values, strings, or n-tuples of values, preferably discretized as part of the transformation process.
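A schema-driven transformation of this kind might look like the sketch below. The schema layout, the field names, the bin width, and the use of HMAC-SHA-256 as the secure hash are all assumptions made for illustration; the source states only that the schema defines which fields are extracted and discretized before hashing.

```python
import hashlib
import hmac

# Hypothetical schema: which fields form the n-tuple, and how to discretize.
SCHEMA = {
    "fields": ["age", "zip"],
    "bin_width": {"age": 10},  # ages are quantized into 10-year bins
}


def discretize(field, value, schema):
    """Quantize a numeric field to the lower edge of its bin, if configured."""
    width = schema["bin_width"].get(field)
    if width is None:
        return str(value)
    return str((int(value) // width) * width)


def transform(record, schema, key):
    """Extract the schema's fields, discretize them, and hash the n-tuple."""
    parts = [f"{f}={discretize(f, record[f], schema)}" for f in schema["fields"]]
    message = "|".join(parts).encode()
    return hmac.new(key, message, hashlib.sha256).hexdigest()


key = b"hypothetical-gateway-secret"
a = transform({"name": "Ada", "age": 34, "zip": "02139"}, SCHEMA, key)
b = transform({"name": "Bob", "age": 38, "zip": "02139"}, SCHEMA, key)
c = transform({"name": "Cy", "age": 41, "zip": "02139"}, SCHEMA, key)
```

Records `a` and `b` fall in the same age bin and share a zip code, so they produce the same secure value, while `c` does not. The `name` field is not part of the schema and never reaches the hash.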
- The gateway subsystem, which has the schema, can further include a query engine that receives a query from a user, and transforms the query into a query of the secure data. This may be referred to as the "transformation" of the user's query (as opposed to transformation of the input data), and results in a query format consistent with the format of the secure data as it is stored in the secure database of the server subsystem.
- the original raw or cleartext input client data is not available for that query, because the schema and keys are only available on the gateway subsystem where the data and queries were transformed.
- The results of the analysis performed on the secure data are then returned to the gateway subsystem, which can then either "reverse transform" the results, to provide raw data correlation to the querying user, or pass the results to the querying user without decoding them. Correlation is possible because each normalized set of transformed values has a unique identifier that can be used by the gateway subsystem to retrieve the cleartext data that resulted in the transformed data in question. Because the database of the server subsystem contains only secure data, the database may be stored in a less secure cloud-hosting or other location if desired, while the gateway subsystem and client subsystem are in different security domains.
- FIG. 1 is a system diagram of an exemplary embodiment according to an aspect of the invention.
- FIG. 2 illustrates a process flow for transforming and storing data in a database according to one embodiment of the invention.
- FIG. 3 illustrates a process flow for querying data stored in a database according to one embodiment of the invention.
- FIG. 4 is a system diagram of an exemplary embodiment for querying already-stored data according to an aspect of the invention.
- FIG. 5 is a system diagram of another exemplary embodiment according to an aspect of the invention.
- In FIG. 1, an exemplary embodiment of a system according to an aspect of the invention is illustrated by a system having a client subsystem 100, a gateway subsystem 101, and a server subsystem 102.
- Client subsystem 100 is a subsystem that may be used directly by clients for querying and analytic operations. Through the client subsystem 100, users are capable of sending queries to the gateway subsystem 101 in order to read or write data.
- the client subsystem 100 may be accessed, for example, by a user on the Internet using a client application. Alternatively, the client subsystem 100 may be accessed by a user using a client console deployed at the user's premises.
- the client subsystem 100 comprises a client agent that is used to establish a connection between the client subsystem and the gateway subsystem for interactions to take place.
- The client agent is able to send query requests for writing and reading to the gateway subsystem. In one embodiment, this may take place after an authentication service has verified that the user trying to establish the session has access to that gateway subsystem.
- the authentication service may be an external service with connectors built into the system for different functions that require authentication.
- each service has an authentication client, which allows these services to request the individual authentication service required by each service, which would be specified in the configuration of that service.
- the authentication service is run by a third party to make the system more secure.
- Client subsystem 100 may include one or more client instances 110.
- Client instance 110 may be adapted for a user to receive and upload data (e.g., client data or input data) to be securely stored and made query-able by the system.
- Client instance 110 comprises a client data gathering agent 104, a data store 103, a client query engine 105, a user interface 106, a display (not illustrated), a communication mechanism for communicating information (not illustrated), and inputs (e.g., keyboard and mouse) (not illustrated).
- Data store 103 may include a memory coupled to a bus for storing client data. This memory may also be used for storing variables or other intermediate information during execution of instructions to be executed by agents or engines. Possible implementations of this memory may be, but are not limited to, random access memory (RAM), read only memory (ROM), or both. Databases and file systems may also be used to store and retrieve data from data store 103. Common physical forms of data store 103 include, for example, a hard drive, a magnetic disk, an optical disk, a CD-ROM, a DVD, a flash memory, a USB memory card, or any other medium from which a computer can read.
- Client instance 110 is in communication with client service 120 in gateway subsystem 101.
- Client subsystem 100 and gateway subsystem 101 each include a network interface (not illustrated).
- The network interfaces may provide two-way data communication between, e.g., client instance 110 and client service 120, as well as client service 120 and server subsystem 102.
- the network interface sends and receives electrical, electromagnetic, or optical signals that carry digital data streams representing various types of information across a local network, an Intranet, or the Internet.
- Client instance 110 may communicate with a plurality of other computer machines.
- Software components or services may reside on multiple different computer systems or servers across a network.
- the processes described here may be implemented on one or more servers, for example.
- A server may transmit actions or messages from one component, through the Internet, local network, and network interfaces, to a component on client instance 110.
- the software components and processes described herein may be implemented on any computer system (including without limitation on custom hardware devices or specially programmed computers) and send and/or receive information across a network, for example.
- the dashed lines shown on Fig. 1 indicate administrative separation of subsystems, but are not intended to specify architecture or deployment requirements, which may be adjusted depending on the needs and circumstances of the system provider and clients.
- An alternative embodiment is illustrated in Fig. 5, for example, which shows that the gateway subsystem 101 and the server subsystem 102 may be deployed, for example, through a common cloud provider, so long as they remain administratively separate.
- a user may access and/or create a client data set using data gathering agent 104, which may have access to unsecure data having fields corresponding to unsecure data field definitions that are stored in, e.g., data store 103.
- data gathering agent 104 is adapted to define a first client data set comprising at least a first data structure.
- client instance 110 may encrypt the client data set, or configure (and optionally normalize) the client data set based on a schema that defines associations between different fields or categories within the data set.
- user interface 106 may also provide a user with schema options that allow the user to customize a schema based on a client data set or a desired query.
- Client instance 110 then transmits the client data set (in either a clear format or a decryptable format) to client service 120 in gateway subsystem 101 using, e.g., a secure tunnel.
- client instance 110 may have multiple data gathering agents 104.
- the client instance 110 that generates the client data set may be distinct from the client instance that communicates with client service 120; for example, and without limitation, the two may be at different physical locations.
- the first client instance may be a business system that simply returns raw data.
- the second client instance may be a system that receives or gathers such data, creates a client data set, and communicates it to client service 120.
- the system may comprise a plurality of client instances 110 in communication with client service 120, as shown in Fig. 1.
- the gateway subsystem 101 links the client subsystem 100 and the server subsystem 102 as a means of sending information back and forth between the two subsystems.
- the gateway subsystem 101 is deployed on a secure public cloud managed by a cloud provider.
- the server subsystem 102 may be deployed on the same or different secure public cloud, and separated from the gateway subsystem 101 by a virtual firewall (e.g., see Fig. 5).
- the gateway subsystem 101 may be responsible for performing a transformation function, detailed below.
- the gateway subsystem 101 may further comprise a group of services designed to ensure that data is ciphered securely. These services may include one or more of a policy service, a schema service, a key service, and an authentication service.
- the policy service may hold all the policies used by the gateway subsystem 101. These policies are applied to data returned from the server subsystem 102 and used to filter the data to remove any data inaccessible to the user or invisible to the user. These policies may be created and maintained by system administrators. Policies can be used to ensure that users cannot see any data that they are not permitted to have access to, while allowing them to use this invisible data to query for other necessary information.
- the schema service, in one aspect, is responsible for storing schema information. These schemas are used to format queries that are sent by the client subsystem 100, as detailed further below.
- the key service, in one aspect, is responsible for holding keys used to access other services and their respective configuration files in the gateway subsystem. This service may also be responsible for decrypting keys. In one embodiment, the key service does not contain the keys responsible for encrypting the data.
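- By way of illustration only, the filtering role of the policy service might be sketched as follows. The `Policy` class, its `visible_fields` attribute, and the sample records are hypothetical names introduced for this sketch; the source does not specify how policies are represented.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Policy:
    """Hypothetical policy: the field names a given user may see."""
    user: str
    visible_fields: frozenset = field(default_factory=frozenset)


def apply_policy(policy, records):
    """Drop every field that the policy does not expose to this user."""
    return [
        {name: value for name, value in record.items()
         if name in policy.visible_fields}
        for record in records
    ]


# Results as they might look after being returned from the server subsystem
# and converted back to cleartext (sample values are invented).
results = [
    {"callingParty": "2125550100", "calledParty": "6465550199", "duration": 42},
]
analyst = Policy(user="analyst",
                 visible_fields=frozenset({"calledParty", "duration"}))
filtered = apply_policy(analyst, results)
```

- In this sketch the callingParty field could still have been used as a query criterion, yet it is removed before the results reach the user.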
- the gateway subsystem 101 may further contain secret information such as secret salts and keys used to, e.g., access configuration for the services and data within the server subsystem 102.
- gateway subsystem 101 may comprise a client service 120, transformation function 130, secure query generator 140, and schema store 150. As discussed above, gateway subsystem 101 may be administratively-isolated from client subsystem 100 and server subsystem 102.
- Client service 120 receives a first client data set from client instance 110.
- Transformation function 130 applies a schema from, e.g., schema store 150, to the first client data set to create a second data set, referred to as a schema-based data structure.
- the invention includes a discretization function as part of the first client data set transformation process that produces a complete semantic description of the client data set in the form of multiple hashed values.
- Transformation function 130 then one-way encrypts or hashes the schema-based data structure to generate a third data set, referred to as secure data.
- the hashing portion of the transformation process will be performed using a secure, one-way hashing algorithm such as SHA-2.
- a salt may be used in the hashing function to improve the security of the hashed result.
- transformation function 130 causes a separate hashing function to hash the schema-based data structure to create the secure data.
- transformation function 130 may also generate a unique identifier (e.g., a value) to provide for mapping from the secure data back to the client data set.
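- The hashing and identifier steps described above might be sketched as follows. This is a minimal sketch under stated assumptions: JSON is used as the serialization, SHA-256 stands in for "a secure, one-way hashing algorithm such as SHA-2," the salt is appended to the serialized value, and a random UUID serves as the unique identifier. The function name `transform` is hypothetical.

```python
import hashlib
import json
import uuid
from typing import Tuple


def transform(schema_based: dict, salt: bytes) -> Tuple[str, str]:
    """Return (secure data, unique identifier) for one schema-based structure."""
    # Deterministic serialization so that equal structures hash equally
    # (the serialization format is an assumption of this sketch).
    serialized = json.dumps(schema_based, sort_keys=True,
                            separators=(",", ":")).encode("utf-8")
    # One-way step: SHA-256 (a SHA-2 family hash) over serialized value + salt.
    secure_data = hashlib.sha256(serialized + salt).hexdigest()
    # Identifier used to map the secure data back to the client data set.
    identifier = uuid.uuid4().hex
    return secure_data, identifier


digest, identifier = transform(
    {"type": "TelephoneCall", "callingParty": "2125550100"},
    salt=b"example-secret-salt",  # placeholder; a real salt stays in the gateway
)
```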
- Schema store 150 stores a digital representation of the schema.
- the schema may comprise a generalization hierarchy adapted to define identification values to portions of a client data set based on analysis parameters related to the generalization hierarchy. Where generalization hierarchies are used, the schema is agnostic as to queries run during data analytics processes.
- a specific schema may be generated in a manner that is optimized for a specific type of analysis. For example, a specific schema may be generated based on a specific type of data set, such as, without limitation, in a format and/or in a manner that is optimized for specific analytics queries.
- the schema-based structure comprises a primary data structure type, a vector of first secondary-data structures, and a vector of second secondary-data structures.
- Each first secondary-data structure comprises an identity, type, and value.
- Each second secondary-data structure comprises an identity, type, and value derived from a first data subset.
- the first data subset comprises a type and a vector of a second data subset.
- Each second data subset comprises an identity, a type, and a subset value.
- the first data subset corresponds to the primary data structure type. It is derived from at least one first secondary data structure value. Data from the first data subset of the same first data subset structure type have identical vectors as the second data subsets.
- the primary data structure is capable of deriving at least one said first data subset(s).
- the schema is a customer-provided definition of the data structures ("StructuredDef"), data descriptors ("QualityDef"), and rules that describe how data descriptors are to be derived from data structures ("QualitizerDef," "QualitizedDef," and "DiscretizerDef"). These definitions may provide the structural information needed for the system of the invention to transform, persist, and query user data.
- Each StructuredDef, for example, describes a "StructuredType" for one or more "Structured" instances that contain user data.
- Each QualityDef describes a "QualityType" for one or more "Quality" instances that contain descriptor data.
- each user operates an independent gateway subsystem.
- Each gateway subsystem instance has its own schema, which is created by the user.
- the schema has at least one QualityDef, one QualitizerDef, one DiscretizerDef, one QualitizedDef, and one StructuredDef. Operational deployments, however, may include many instances of each definition type.
- a schema may be created by assembling a collection of data types that the system stores, and a collection of descriptors may be used to characterize each data type.
- a common data type in the telecommunications industry is the telephone call.
- a telephone call consists of a calling party telephone number, a called party telephone number, the date and time that the call started and the duration of the call.
- the descriptors that could be used to characterize a telephone call include: a telephone number (the calling party), another telephone number (the called party) and one or more representations of the date and time the call started, and the date and time span over which the call was active.
- the telephone call example could thus be represented in a schema as follows:
- TYPE TelephoneNumber
- Quantization is the process of deriving one or more qualitative descriptors, known as "Quality(s)," from a user data structure, known as a "Structured.”
- Quality(s) derived from a Structured may comprise a usable representation of the data contained within the Structured instance.
- Discretization is the process that, in some embodiments, the system may use to convert one or more input field values from a Structured instance into a set of output field values, in which each output value corresponds to one field in a distinct descriptor instance.
- Each "Discretizer" can take one or more input fields from a Structured instance, and produce values for multiple Quality instances (i.e., each output from a Discretizer becomes a field value in a separate Quality instance).
- Many Discretizer implementations are known in the art, and may be used to discretize different types of values and represent them in different ways.
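- As one hedged illustration of a Discretizer, the sketch below converts the start time of a telephone call into several output values of decreasing generality, each of which could become a field value in a separate Quality instance. The function name and the choice of year, month, day, and hour levels are assumptions of this sketch and are not taken from the source.

```python
from datetime import datetime


def discretize_start_time(start: datetime) -> list:
    """Return one value per generalization level, from coarsest to finest."""
    return [
        start.strftime("%Y"),            # year
        start.strftime("%Y-%m"),         # year and month
        start.strftime("%Y-%m-%d"),      # calendar date
        start.strftime("%Y-%m-%dT%H"),   # date and hour
    ]


levels = discretize_start_time(datetime(2017, 10, 18, 14, 30))
```

- Because each level is a discrete value, a query for "calls in October 2017" can be expressed as an equality test against the second level, which is the only kind of comparison that survives one-way hashing.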
- the rules to derive Quality(s) from a Structured instance are specified in the form of one QualitizerDef, one DiscretizerDef, and one QualitizedDef.
- the QualitizerDef defines a mapping between one or more inputs and the fields of a QualityDef.
- Each QualitizerDef contains one or more DiscretizerDef(s).
- the DiscretizerDef defines a mapping between the inputs defined in its parent QualitizerDef, and the parameters of a specific Discretizer function implementation.
- there is one DiscretizerDef for each field in the QualityType specified in the QualitizerDef, and the mappings are performed in corresponding order (i.e., the first DiscretizerDef maps to the first field of the QualityType, the second DiscretizerDef maps to the second field of the QualityType, etc.).
- the QualitizedDef maps the fields of a StructuredDef to the inputs of a QualitizerDef, and associates the Quality that is output by the Quantization process with an identity that is distinct within the scope of its associated StructuredType.
- TelephoneCall StructuredDef has two fields that could be used to construct TelephoneNumber Quality(s): the first field is the "calling party,” and the second is the "called party.”
- the rules to derive Quality(s) from these two fields may be defined by one QualitizerDef, one DiscretizerDef and two QualitizedDef(s), which could be specified in a schema of the invention as follows:
- TYPE TelephoneNumber
- DiscretizerDef {NAME: NumericStringDiscretizer, INPUTS: [telephoneNumber]}
- TYPE TelephoneCall
- the schema may be used in the transformation process (detailed herein) to validate data and provide the Quantization rules to derive Quality(s) from data or query parameters prior to encryption.
- Data and query parameters may be validated by checking their structure and values against definitions present in the schema.
- the following process steps may be used to derive Quality(s):
- the fields are first compared to those in the StructuredDef for type TelephoneCall to ensure that all required fields are present, and that each field type in the submitted data corresponds to the type specified in the FieldDef for that field in the StructuredDef.
- the QualitizedDef(s) specified in the QUALITIZED_DEF attribute of the corresponding StructuredDef are loaded, e.g., one at a time, and each is used to obtain the QualitizerDef specified by its TYPE attribute and the field values that correspond to the field names provided in the FIELDS attribute of the QualitizedDef.
- the field values obtained by the QualitizedDef serve as the values for each input specified in the INPUTS attribute of the QualitizerDef.
- the field values and inputs are each organized as ordered items in an array. They may be then mapped by matching the field value at each index in the fields array to the input with the same index in the inputs array and placing each corresponding pair into a key-value dictionary, in which the key is the name of the input and the value is the field value.
- Each DiscretizerDef in the DISCRETIZER_DEFS attribute of the QualitizerDef may be used to discretize one or more of the inputs to one or more outputs, each of which will be placed into a field in a distinct Quality.
- the Quality will be of the type specified by the QUALITY_TYPE attribute of the QualitizerDef.
- Each resulting Quality that corresponds to each QualitizedDef is packaged within a Quantized data structure that also includes the identity value specified by the IDENTITY attribute of the QualitizedDef. This makes it possible to distinguish values of similar structure, type, or appearance that were derived from different fields, or using different Qualization or Discretization processes (e.g. the ability to differentiate between the callingParty and the calledParty, in the foregoing example).
- QualitizedDef with identity callingParty specifies QualitizerType TelephoneNumber.
- QualitizedDef with identity callingParty specifies callingParty field from StructuredType TelephoneCall should be mapped as the first and only input to QualitizerType TelephoneNumber
- QualitizerType TelephoneNumber specifies that the first and only input should be mapped as the first and only parameter to the NumericStringDiscretizer
- QualitizedDef specifies that each Quality output by its Quantization process will be packaged within a Qualitized instance with identity callingParty
- QualitizerType TelephoneNumber specifies that the first and only input should be mapped as the first and only parameter to the NumericStringDiscretizer
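- The derivation steps above might be sketched as follows for the TelephoneCall example. This is a simplified sketch under stated assumptions: the definitions are held in plain dictionaries whose layout is invented here, each Quality has a single field, and `numeric_string_discretizer` is only a stand-in (it keeps the digits of its input), since the behavior of the NumericStringDiscretizer is not specified in the source.

```python
import re


def numeric_string_discretizer(value):
    """Stand-in only: keep the digits of the input and return one output."""
    return [re.sub(r"\D", "", value)]


DISCRETIZERS = {"NumericStringDiscretizer": numeric_string_discretizer}

QUALITIZER_DEFS = {
    "TelephoneNumber": {
        "QUALITY_TYPE": "TelephoneNumber",
        "INPUTS": ["telephoneNumber"],
        "DISCRETIZER_DEFS": [
            {"NAME": "NumericStringDiscretizer", "INPUTS": ["telephoneNumber"]},
        ],
    },
}

QUALITIZED_DEFS = [
    {"IDENTITY": "callingParty", "TYPE": "TelephoneNumber",
     "FIELDS": ["callingParty"]},
    {"IDENTITY": "calledParty", "TYPE": "TelephoneNumber",
     "FIELDS": ["calledParty"]},
]


def derive_qualities(structured):
    """Derive identity-tagged Quality values from one Structured instance."""
    derived = []
    for qualitized in QUALITIZED_DEFS:
        qualitizer = QUALITIZER_DEFS[qualitized["TYPE"]]
        # Match field values to inputs by index, then key them by input name.
        field_values = [structured[name] for name in qualitized["FIELDS"]]
        inputs = dict(zip(qualitizer["INPUTS"], field_values))
        for discretizer_def in qualitizer["DISCRETIZER_DEFS"]:
            discretize = DISCRETIZERS[discretizer_def["NAME"]]
            arguments = [inputs[name] for name in discretizer_def["INPUTS"]]
            # Each output becomes a field value in a separate Quality.
            for output in discretize(*arguments):
                derived.append({
                    "identity": qualitized["IDENTITY"],
                    "quality_type": qualitizer["QUALITY_TYPE"],
                    "value": output,
                })
    return derived


call = {"callingParty": "(212) 555-0100", "calledParty": "646-555-0199"}
qualities = derive_qualities(call)
```

- The identity attached to each result is what allows a calling-party number to be told apart from an otherwise identical called-party number.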
- gateway subsystem 101 is adapted to transmit the secure data (together with the unique identifier) to server subsystem 102.
- the server subsystem 102 may itself hold, or be in communication with, a hold for encrypted data stored by users.
- the server subsystem 102 may further perform query operations that are sent by the client subsystem 100. These operations filter the data requested by the user and return the results to the gateway subsystem 101 so that they can be transformed back into cleartext format.
- Server subsystem 102 may include a secure server service 160, secure data store 170, and may preferably have a secure connection (e.g., over a network) with gateway subsystem 101, e.g., interface 180.
- Interface 180 may provide a secure connection between gateway subsystem 101 and server subsystem 102 so that the first client data set that is stored temporarily in client service 120 before the schema is applied by transformation function 130 (and the results one-way encrypted) is not accessible through server subsystem 102.
- the secret keys and secret salts are only present within the gateway subsystem 101 and never exposed in unencrypted form to the server subsystem 102.
- Unsecured client data is preferably only available in temporary memory storage in the gateway subsystem 101 before it is transformed and one-way encrypted. After the unsecured client data has been transformed, it may cease to exist in the gateway subsystem 101, and at that point, the system would only be storing the secure data in the secure data store 170.
- Server service 160 may provide connection to and management of secure data store 170, which comprises compute cluster 171 and database 172, and which may be, in one form, a NoSQL cloud-based database.
- secure data store 170 may be deployed, for example, on the same or a different secure cloud.
- query routines consist of (a) hierarchies of logic operations and (b) criteria parameters that have been transformed using, e.g., the same secret keys and secret salts that were used to transform the first client data set.
- client query engine 105 is adapted to create queries to be run on secure data (e.g., secure data stored in database 172).
- Client query engine 105 may use an object query database language or any appropriate query language known to those of skill in the art.
- Query types may include "read” and "write,” in order to, for example, read data from storage, perform a manipulation on the data (e.g., filter, extract, join), or return data as a set of results.
- a user may use the user interface 106 to access client query engine 105 to create and deliver queries to client service 120, and in particular, secure query generator 140.
- the queries can include search requests to locate data within database 172 that meet one or more search parameters.
- Each search parameter can be a value, or a range of values where the search request is to locate database entries that satisfy or are likely to satisfy the value or range of values.
- each search parameter can be a time or time interval.
- a database entry satisfies a single value search parameter when the database entry has a corresponding time field that contains the single value.
- a database entry satisfies a range value search parameter when the database entry has a corresponding time field that contains a time value that is within the range of values specified by the search parameter, which can include the outer boundaries of the range of values.
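- The two matching rules for time-based search parameters might be expressed as follows. This is a minimal sketch; the function name `satisfies` and the use of a two-element tuple to represent a range are assumptions made for illustration.

```python
from datetime import datetime


def satisfies(entry_time, parameter):
    """Test one database entry's time field against one search parameter.

    A parameter is either a single datetime or a (start, end) tuple;
    the outer boundaries of a range are treated as part of the range.
    """
    if isinstance(parameter, tuple):
        start, end = parameter
        return start <= entry_time <= end
    return entry_time == parameter


noon = datetime(2017, 10, 18, 12, 0)
window = (datetime(2017, 10, 18, 9, 0), datetime(2017, 10, 18, 12, 0))
in_window = satisfies(noon, window)
exact = satisfies(noon, datetime(2017, 10, 18, 12, 0))
```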
- the search parameters can be any appropriate parameter evident to one of ordinary skill in the art, e.g., a unit of distance or some other unit of measurement.
- secure query generator 140 may apply a schema to the client query, thereby creating a converted query. Secure query generator 140 may then hash the converted query to create a transformed query.
- Client service 120 transmits the transformed query to server subsystem 102 to be executed by compute cluster 171.
- Compute cluster 171 accesses the relevant secure data in secure database 172 and executes the query.
- the query may produce results.
- Compute cluster 171 returns the results to the secure query generator 140 in client service 120.
- Secure query generator 140 may then return the secure results directly to the client instance 110.
- secure query generator 140 uses the unique identifier(s) associated with the secure data to "reverse transform" the secure results in order to identify the cleartext values that correlate to the secure results prior to transmission to client instance 110.
- the various descriptors that are used in the query and analysis process in the server subsystem are one-way encrypted using a hash function and hence cannot be decrypted. While those values are used for query and analysis processing in the server subsystem, they are replaced with the symmetrically encrypted complete data records with which they are associated prior to being returned as results.
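- The query path described above might be sketched as follows: the logic operations of a query are left intact, the criteria are hashed with the same salt that was used for the stored descriptors, the server matches hashed values by equality, and the hashed matches are replaced by the encrypted records with which they are associated. The tuple-based query representation, the `cipher` helper, the index layout, and the opaque byte strings standing in for symmetrically encrypted records are all assumptions of this sketch.

```python
import hashlib

SALT = b"gateway-secret-salt"  # placeholder; held only by the gateway


def cipher(identity, value, salt=SALT):
    """One-way transform of a single descriptor (identity plus value)."""
    return hashlib.sha256(f"{identity}={value}".encode("utf-8") + salt).hexdigest()


def transform_query(node):
    """Hash the criteria of a query tree; keep AND/OR operations as they are."""
    if node[0] in ("AND", "OR"):
        return (node[0],) + tuple(transform_query(child) for child in node[1:])
    _, identity, value = node  # leaf of the form ("EQ", identity, value)
    return ("EQ", cipher(identity, value))


def execute(node, index):
    """Server side: resolve a transformed query to a set of record ids."""
    if node[0] == "EQ":
        return set(index.get(node[1], ()))
    child_sets = [execute(child, index) for child in node[1:]]
    if node[0] == "AND":
        return set.intersection(*child_sets)
    return set.union(*child_sets)


# Server-side state: encrypted records plus an index of hashed descriptors.
# The index is built here for brevity; the hashes would come from the gateway.
records = {1: b"<ciphertext 1>", 2: b"<ciphertext 2>"}
descriptors = {
    1: [("callingParty", "2125550100"), ("calledParty", "6465550199")],
    2: [("callingParty", "2125550100"), ("calledParty", "3105550123")],
}
index = {}
for record_id, pairs in descriptors.items():
    for identity, value in pairs:
        index.setdefault(cipher(identity, value), set()).add(record_id)

query = ("AND",
         ("EQ", "callingParty", "2125550100"),
         ("EQ", "calledParty", "3105550123"))
secure_query = transform_query(query)
matches = execute(secure_query, index)
secure_results = [records[record_id] for record_id in sorted(matches)]
```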
- the client instance 110 that receives the results may be distinct from the initial client instance that uploaded the client data set. In this case, only correlation information could be made available to the distinct client instance, and not cleartext data. This arrangement may conveniently allow the viewing of analysis results by users of the distinct client instance who never had access to the raw data of the first client data set.
- a user accesses a user interface of a client instance, and generates a first client data set using a data gathering agent, which may have access to unsecure data having fields corresponding to unsecure data field definitions that are stored in, e.g., a data store.
- Step 201 may optionally include discretization of the values in the first client data set as part of the gathering process.
- the first client data set is then optionally encrypted.
- the user optionally configures the first client data set based on a schema that defines associations between different fields or categories within the data set.
- the user may also customize the schema based upon first client data set, or a desired query.
- the first client data set (in either a clear format or a decryptable format) is transmitted to a client service using a secure tunnel.
- a transformation function applies a schema to the first client data set to create a schema-based data structure, as shown in step 205.
- the transformation function one-way encrypts (e.g., hashes) the schema-based data structure to generate secure data.
- the transformation function generates the secure data by applying a (preferably secure) hashing algorithm to the schema-based data structure as part of the transformation process.
- the transformation function causes a separate encrypter to hash the schema-based data structure to create the secure data.
- the transformation function generates a unique identifier to provide a mapping from the secure data back to the data that resulted in the hashed values.
- the secure data (together with the unique identifier) is transmitted to a server subsystem through a secure connection, and then channeled to secure storage in a database, as shown in step 209.
- An exemplary method 300 for querying secure data is shown in Fig. 3.
- a user accesses a client query engine to create a query in step 301 to be run on secure data.
- the client query is transmitted to a secure query generator in step 302.
- a non-repudiated signature is transmitted along with the queries. Both the query and the corresponding signature are then forwarded to an externally-administered audit service.
- After receiving the client query, the secure query generator applies a schema to the client query, as shown in step 303, thereby creating a converted query.
- Secure query generator then one-way encrypts the parameters of the converted query (e.g., data types, condition values, and field identities) to create a transformed query in step 304 and transmits the transformed query in step 305 to be executed by a compute cluster on relevant secure data in step 306.
- the transformed query is accompanied by the corresponding signature.
- a notification is sent to an audit service; the audit service verifies that the query was previously forwarded. Any discrepancies or inconsistencies between audit information reported by the two subsystems results in a security notification.
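- The audit cross-check might be sketched as follows. The `AuditService` class and its method names are hypothetical, and the signature is treated as an opaque string; how a non-repudiated signature is produced and verified is outside the scope of this sketch.

```python
class AuditService:
    """Hypothetical externally administered audit service."""

    def __init__(self):
        self.forwarded = set()   # (query id, signature) pairs seen first
        self.alerts = []         # query ids that triggered a notification

    def record_forwarded(self, query_id, signature):
        """Called when a client query and its signature are forwarded."""
        self.forwarded.add((query_id, signature))

    def notify_executed(self, query_id, signature):
        """Called when the transformed query arrives for execution.

        Returns True if the same query and signature were forwarded earlier;
        otherwise records a security notification and returns False.
        """
        if (query_id, signature) in self.forwarded:
            return True
        self.alerts.append(query_id)
        return False


audit = AuditService()
audit.record_forwarded("q-1", "sig-abc")
consistent = audit.notify_executed("q-1", "sig-abc")
tampered = audit.notify_executed("q-2", "sig-xyz")
```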
- If secure results are generated from the execution of the query, as shown in step 306, the secure results are returned to the secure query generator in step 307.
- Secure query generator may then return the secure results directly to the client instance as shown in step 309, or, alternatively, in step 308, use a unique identifier to "reverse transform" the secure results in order to identify the cleartext values that correlate to the secure results, prior to transmission to a user in step 308.
- the user is presented with the two-way encrypted original record that was associated with the encrypted values.
- the system may be configured as a system 400 for querying already-stored data.
- Such an exemplary system for searching secure data may comprise a schema store 401, a transformation function 402, a secure query generator 403, and a secure data store 404.
- the schema store 401 may comprise a plurality of unsecure data field definitions and an identifier.
- the transformation function 402 may be adapted to receive unsecure data 405 and generate a transformed data set 406 comprising a token and an identifier and to store the transformed data set in the secure data store 404.
- the secure data store may be adapted to receive secure pattern-matching queries 407, and additionally in some embodiments, the transformed data sets.
- the secure query generator 403 may be adapted to receive unsecure queries 408 based on the unsecure data field definitions and convert the unsecure data field definitions into secure queries 407, and then transmit the secure queries 407 to the secure data store 404.
- Secure data store 404 may execute the secure queries on the secure data 406, and generate results. Secure data store 404 may then return the results 409 to secure query generator 403, which may then channel the results to the client subsystem, or, using the identifier, "reverse transform" the results and return, e.g., cleartext results 410 to the client instance.
- the system may comprise a system for securely storing data for data analytics.
- the system may comprise a schema store 401, a transformation function 402, and a secure data store 404.
- the schema store 401 may comprise a plurality of unsecure data field definitions with the transformation function being adapted to receive unsecure data 405 based on those unsecure data field definitions.
- the transformation function may generate a one-way encrypted data set by securely hashing the unsecure data, generate an identifier for the one-way encrypted data, and then store both the one-way encrypted data and the identifier in the secure data store 404.
- the secure data store 404 may be searched for patterns among the one-way encrypted data set without the secure data store containing any unsecure data.
- the secret keys and secret salts are only present within the gateway subsystem and never exposed in unencrypted form to other subsystems. And after the unsecured client data has been one-way encrypted, it may cease to exist in the gateway or server subsystems, and at that point, those subsystems would only be storing the secure data in the secure data store 404.
- a hashing key used for the transformation process may be stored in the gateway subsystem, and thus is never available to either the client subsystem or the server subsystem, and may preferably prevent any decoding of the secure data without gaining access to the hashing key from the gateway subsystem.
- a user may choose to store encrypted raw data within the system along with the secure data. In such an instance, both data would remain securely stored, because the encryption key and the hashing keys may be stored exclusively in the gateway subsystem.
- Such embodiments allow querying of the transformed data, but return of the encrypted data with the transformed data as part of the query results.
- the schema applied by client service 120 is comprised of a data organization schema analogous to a biological olfactory system, in which case such a schema may be referred to as the olfactic model.
- in olfaction (the process of smelling), a scent is detected when molecules pass through the nasal cavity and bind to matching receptor cells. This binding causes the neurons that correspond to the receptor to fire.
- a neuron firing is analogous to setting a bit to one in a binary sequence. A unique combination of firing neurons indicates a distinct scent to the brain.
- Such an olfactic model can comprise "aromatics,” “arotopes,” “odorants,” “odotopes,” and “odorizers.”
- An aromatic is analogous to a data record or row in a relational database.
- An arotope is a sub-secondary data structure (e.g., an attribute of an aromatic) comprising an arotope identity value, arotope data type, and arotope value.
- An odorant is another sub-secondary data structure comprising an odorant identity, odorant type, and odorant value, and represents a single descriptive characteristic derived from an aromatic.
- one or more odorants are derived from every aromatic, and collectively, they form a complete description of the aromatic from which they were derived.
- Odorants can be one-way ciphered such that the ciphered values of any two odorants with the exact same odorant type and identical vectors of odotopes will be equal and can be efficiently tested for such equality.
- the odorant may be derived from the arotope values, wherein the odorants of the same odorant type have identical vectors of odotopes.
- An odotope is another sub-secondary data structure comprising an odotope identity, odotope type, and odotope value.
- An odotope may correspond to a category of data fields within an initial client data structure.
- An odorizer is another sub-secondary data structure comprising an odorizer identity value and a vector of category functions, such as, without limitation, discretizer functions adapted to create a discrete generalization hierarchy of the data fields within a first client data set.
- the odorizers may comprise a second set of data identifiers, and the generalization hierarchy may correspond to category functions.
- the category functions may be customized or user-defined.
- the client data set comprises at least one aromatic.
- An aromatic represents the combination of data type and field values in the client data set.
- Each record in the client data set is treated like an aromatic molecule.
- Two aromatics possess identical "shapes" if, and only if, they are of the same type and have identical field values.
- An aromatic can be characterized by odorants, each of which is a discrete representation of one distinct characteristic of the aromatic and can be tested for equality with other odorants.
- the aromatic designates a data record and the odorant designates a descriptive characteristic of the data record. This is analogous to the binding of molecules to receptor cells in biological olfaction, as two surface sections of an aromatic molecule must be identical in shape in order to be considered a match.
- Each aromatic is stored in symmetrically-encrypted form, which must be decrypted before it can be used in any meaningful way. Encrypted query and analysis is made possible by the association of one or more odorants with each aromatic.
- odorants derived from an aromatic effectively describe it as a collection of discrete values that can be compared with other odorants as criteria values using one or more logic operations. Because odorant values are only tested for equality with other odorants, odorants can be transformed using any available cipher algorithm that will consistently produce the same output value from a given input value.
- Each encrypted aromatic is indexed using one or more odorants that have been encrypted using a one-way cipher algorithm.
- a reverse index is available so that a known encrypted aromatic value can be used to obtain all encrypted odorants that were derived from it.
- Query and analysis operations can be performed upon the aromatic by using the associated encrypted odorant values as a ciphered network of facts.
- the ciphered network of facts that describe and represent an aromatic serves as its proxy for query and analysis operations that must be performed while the aromatic remains encrypted.
- one or more aromatics are returned as the result set.
- the aromatics in the database and their corresponding odorants (as well as the odorants that were provided as criteria), remain encrypted.
- an embodiment of the invention comprises a transformation process whereby a client data set is transformed to a representation in the form of a ciphered network of facts.
- This representation of the client data set can be queried and analyzed using instructions and parameters that have undergone the same transformation process, without the need to decrypt the data being analyzed.
- the transformed client data set is a complete representation of the original pre-transformed client data set, and can be transformed back into its original cleartext format without the need to maintain cumbersome translation tables such as those required by data masking and traditional tokenization mechanisms.
- Arotopes from an aromatic that are specified for an odorizer within a schema are input in an odorizer.
- the odorizer outputs one or more odorants.
- a single odorizer may output more than one odorant.
- Each odorant output by a single odorizer will have the same odorant type and odorizer identity, which is an identifier that is unique within the scope of an aromatic-type and uniquely identifies the collection of arotopes and the odorizer that is discretizing them into odorant(s).
- the original aromatic along with all of the outputted odorants are put into an odorized-aromatic object.
- the aromatic is then serialized.
- a secret salt that is stored with a set of secret salts (used in the transformation process for each aromatic type) is appended to the serialized aromatic.
- the resulting value is ciphered using a secure hash function to form a value called the Ate.
- the aromatic is serialized.
- a symmetric cipher is applied to the serialized aromatic to form a variable-length binary value called the Ac.
- the aromatic type is then serialized.
- a secret salt is appended to the serialized aromatic type.
- the resulting value is ciphered using a secure hash function to form a value called the AtO.
- Each odorant is then serialized.
- a secret salt that is stored with a set of secret salts used in the transformation process for each odorant type is appended to the serialized odorant.
- the resulting value is ciphered using a secure hash function to form a value called the Ot.
- the odorant type is then serialized.
- a secret salt is appended to the serialized odorant type.
- the resulting value is ciphered using a secure hash function to form a value called the OtO.
- the odorizer identity is then serialized.
- a secret salt that is stored with a set of secret salts used in the transformation process for each aromatic type is appended to the serialized odorizer identity.
- the resulting value is ciphered using a secure hash function to form a value called the Psi.
- the odorant discretization order is obtained from the odorant as an array of one-byte signed values called the Rho.
- Each Ot, OtO, Psi and Rho are put into an Aot.Vt object.
- the Ate, Ac, AtO, and all Aot.Vt(s) are put into an Aot object.
- the Aot object is the final result of the transformation process from query Q c object to query Q s object.
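The steps listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: canonical JSON stands in for the unspecified serialization, SHA-256 stands in for the secure hash function, the symmetric cipher is passed in by the caller, and the names `serialize`, `salted_hash`, `transform`, `placeholder_cipher`, and the dictionary keys are hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Callable, Dict, List


def serialize(obj) -> bytes:
    # Assumption: canonical JSON stands in for the unspecified serialization.
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def salted_hash(serialized: bytes, salt: bytes) -> bytes:
    # Append the secret salt to the serialized value, then apply a secure hash.
    return hashlib.sha256(serialized + salt).digest()


@dataclass
class AotVt:
    ot: bytes        # hashed odorant
    oto: bytes       # hashed odorant type
    psi: bytes       # hashed odorizer identity
    rho: List[int]   # discretization order, one-byte signed values


@dataclass
class Aot:
    ate: bytes       # hashed aromatic
    ac: bytes        # symmetrically ciphered aromatic
    ato: bytes       # hashed aromatic type
    vts: List[AotVt]


def transform(aromatic: dict, aromatic_type: str, odorants: List[dict],
              salts: Dict[str, bytes], cipher: Callable[[bytes], bytes]) -> Aot:
    serialized_aromatic = serialize(aromatic)
    vts = []
    for odorant in odorants:
        rho = list(odorant["order"])
        if any(not -128 <= r <= 127 for r in rho):
            raise ValueError("Rho entries must fit in one signed byte")
        vts.append(AotVt(
            ot=salted_hash(serialize(odorant["value"]), salts["odorant"]),
            oto=salted_hash(serialize(odorant["type"]), salts["odorant_type"]),
            psi=salted_hash(serialize(odorant["odorizer_id"]), salts["odorizer"]),
            rho=rho,
        ))
    return Aot(
        ate=salted_hash(serialized_aromatic, salts["aromatic"]),
        ac=cipher(serialized_aromatic),
        ato=salted_hash(serialize(aromatic_type), salts["aromatic_type"]),
        vts=vts,
    )


def placeholder_cipher(data: bytes) -> bytes:
    # NOT secure: a reversible stand-in so the sketch runs without third-party
    # packages. A real deployment would call a vetted symmetric cipher here.
    return bytes(b ^ 0x5A for b in data)


# Illustrative salts only; real salts would be random secrets held by the gateway.
SALTS = {name: hashlib.sha256(name.encode()).digest()
         for name in ("aromatic", "aromatic_type", "odorant", "odorant_type", "odorizer")}
ODORANTS = [{"value": "40-49", "type": "age_band",
             "odorizer_id": "age-odorizer", "order": [0]}]
aot = transform({"name": "Ada", "age": 42}, "person", ODORANTS, SALTS, placeholder_cipher)
```

Because every hash is deterministic for a given salt, two aromatics that yield the same odorant produce the same Ot value, which is what allows equality tests to run on ciphered data.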
Landscapes
- Engineering & Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- Computer Networks & Wireless Communication (AREA)
- General Engineering & Computer Science (AREA)
- Signal Processing (AREA)
- Theoretical Computer Science (AREA)
- Computer Hardware Design (AREA)
- Computing Systems (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Bioethics (AREA)
- General Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- Power Engineering (AREA)
- Medical Informatics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
This invention relates to systems and methods for creating, storing, and analyzing secure data.
Description
SYSTEMS AND METHODS FOR CREATING, STORING, AND ANALYZING SECURE DATA
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims priority from U.S. Provisional Patent Application Ser. No. 62/414,081, filed October 28, 2016, which is hereby incorporated by reference in its entirety.
FIELD OF THE INVENTION
[0002] This invention relates to systems and methods for creating, storing, and analyzing secure data stores. In one form of a method embodiment, a user uploads a set of client data and set of correlations between data fields from a client subsystem to a gateway subsystem. The gateway subsystem applies a schema to generate a data structure from the client data that is then transformed. The transformed data structure is then transmitted and stored in a secure database on a server subsystem that is administratively separate from the gateway subsystem. Coded keys and schema solutions are only stored on the gateway subsystem, and are not accessible by the client subsystem or the server subsystem.
BACKGROUND OF THE INVENTION
[0003] Data is being collected and stored at unprecedented volumes. This trend is forecast to continue for the foreseeable future. Further, this data and the information within can be the lifeblood of the organizations that generate and use it. Thus, data analytics has become an important component of both everyday operations and the research and development work that enables organizations to evolve as demands change.
[0004] Complicating the issue of data analytics is the fact that the data to be analyzed may contain personal, financial, or other types of information that raise confidentiality concerns. Further, government regulations regarding data privacy are becoming increasingly strict, and security breaches of data subject to such regulations can be extremely costly and continue to grow in frequency and severity from an expanding variety of attack vectors. Added to these issues are insider threats, which render existing data security practices vulnerable. Insider threats have been exposed as a critical weakness with limited well-tested methods for proactive defense.
[0005] To balance the need to utilize data analytics against the need to protect the confidentiality and privacy of the data to be analyzed, improved systems and methods of performing analytics on big data sets are needed. Organizations are seeking methods that enable them to perform meaningful and timely analytics on massive-scale data sets without compromising security or violating privacy restrictions. This creates a conundrum as data hiding and meaningful analysis have generally been mutually exclusive. The problem this presents is to provide systems and methods for performing analytics on data that has been secured prior to the commencement of the analytics process so that a breach of the secure data will not compromise the confidentiality and privacy of the underlying data.
[0006] Several methods for securing data are known. For example, hashing and encryption are two methods for securing information when it is being transmitted on the Internet or stored at rest. They can both help satisfy regulatory requirements such as those under PCI DSS, HIPAA-HITECH, GLBA, ITAR, and the EU GDPR. While hashing and encryption are both effective data obfuscation technologies, they are not the same thing, and each technology has its own strengths and weaknesses. In some cases, such as with electronic payment data, both encryption and hashing are used to secure the end-to-end process.
Tokenization/Data Hashing
[0007] Traditionally, hashing, when applied to data security, is the process of substituting a sensitive data element with a non-sensitive equivalent from which it is mathematically infeasible to reverse engineer the equivalent to determine the semantic content of the original data element. In traditional hashing, a very minor change in a data element is likely to result in a dramatic change to the hashed value. As a result, where secure hashing is used, two values with the same semantic meaning, but slight variations in form, will result in very different hashed values. Tokenization is the process of substituting a sensitive data element with a non-sensitive equivalent, referred to as a token, that has no extrinsic or exploitable meaning or value. Where such techniques are used, creation and maintenance of a token lookup table (a token store or token vault) has historically been required to store mappings between tokens and actual values for de-tokenization to be possible.
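The sensitivity of a secure hash to small changes in form can be illustrated with a short sketch. SHA-256 and the two address strings are assumptions chosen for illustration; `sha256_hex` and `differing_bits` are hypothetical helper names.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Return the SHA-256 digest of a UTF-8 string as hexadecimal."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def differing_bits(a_hex: str, b_hex: str) -> int:
    """Count the bit positions in which two hex digests differ."""
    return bin(int(a_hex, 16) ^ int(b_hex, 16)).count("1")


# Two spellings of the same address: same semantic meaning, different form.
digest_long = sha256_hex("123 Main Street")
digest_short = sha256_hex("123 Main St.")

print(digest_long)
print(digest_short)
print("bits that differ out of 256:", differing_bits(digest_long, digest_short))
```

The two digests share no useful structure, which is why values generally need to be normalized or discretized before hashing if semantically equal inputs are meant to compare as equal.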
[0008] U.S. Patent Number 9,336,256 to Boukobza shows a data tokenization system that includes a token vault. Boukobza discloses a database network router ("DNR"), which serves as an intermediary node between an application and a tokenized database. The database communicates directly with a token vault through the use of a DNR software agent running on the database that parses and executes commands received as part of database access requests from the DNR. Through the use of the DNR and the DNR software agent, the application can be decoupled from the database, and the burden of integrating tokenization APIs and performing tokenization or de-tokenization functions can be shifted to the DNR and the DNR software agent. Additionally, by removing the tokenization and de-tokenization functions from the application, multiple tokenization vendors can utilize the DNR and a DNR software agent as the interface between their own databases and applications.
[0009] Tokenization methods like that disclosed in Boukobza have drawbacks. Such methods require a lookup table or token store/vault that contains mappings between tokens and actual values. These token stores or vaults may contain sensitive data in cleartext and represent an alternate breach target. And, for large databases, translation tables can become costly to scale and maintain due to the need for synchronization and high availability. It is also difficult (and slow) to exchange data, since such methods require direct access to a token vault mapping token values.
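A generic vault-based tokenizer makes these drawbacks concrete. The sketch below is not Boukobza's system; the `TokenVault` class and its method names are hypothetical, and the in-memory dictionaries stand in for the persistent, synchronized store a real vault would need.

```python
import secrets
from typing import Dict


class TokenVault:
    """Hypothetical token vault: random tokens plus a cleartext lookup table."""

    def __init__(self) -> None:
        self._token_to_value: Dict[str, str] = {}
        self._value_to_token: Dict[str, str] = {}

    def tokenize(self, value: str) -> str:
        # Reuse the existing token so equal values map to equal tokens.
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = secrets.token_hex(16)  # random, so no meaning can be derived from it
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        # De-tokenization is only possible with direct access to the vault.
        return self._token_to_value[token]

    def size(self) -> int:
        # The table grows with every distinct value that is tokenized.
        return len(self._token_to_value)


vault = TokenVault()
card_token = vault.tokenize("4111 1111 1111 1111")
```

Note that the vault holds every sensitive value in cleartext and must be consulted for each de-tokenization, which is the breach-target and scaling concern described above.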
[0010] As is described above, hashing is the process of transforming a string of characters or other value into a (usually) shorter fixed-length value or key that represents the original string. Hashing can be used to determine if there has been a minor change in the original since every such change will result in a different hashed value. Hashing can also be used to index and retrieve items in a database because it is faster to find the item using the shorter hashed key than to find it using the original value.
[0011] A hash function is any function that can be used to map data of arbitrary size to data of fixed size. The values returned by a hash function are called hash values, hash codes, digests, or simply "hashes." One use of hashed values is a data structure called a hash table, widely used in computer software for rapid data lookup. Hashing, however, is a one-way function that scrambles plain text to produce a unique message digest. With a properly designed algorithm, there is no way to reverse the hashing process to reveal the original data. Popular hashing algorithms include MD5, SHA-0, SHA-1, and SHA-2. Encryption, as discussed below, is a two-way function; what is encrypted can be decrypted with the proper key. Using hashing thus encrypts data in a one-way fashion. If the universe of values being hashed is known, an attacker that has access to the same hashing algorithm can build rainbow tables (i.e., pre-calculated tables for a specific hash) of possible known values to search for target data. Hashing, thus, can be insecure where there is a reasonably finite quantity of input values and a known hash function. Hashing can be secure, however, when either the hash function is unknown or the universe of input values is too large for the creation of a useful rainbow table.
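The weakness of an unkeyed hash over a small input universe can be sketched as follows. For brevity the example uses a plain precomputed lookup table rather than a true rainbow table (which compresses the table using hash chains); the four-digit PIN universe, the secret key, and the function names are illustrative assumptions. The keyed variant uses HMAC-SHA-256 as one common way of making the effective hash function unknown to an attacker.

```python
import hashlib
import hmac
from typing import Dict, Iterable


def plain_hash(value: str) -> str:
    """Unkeyed SHA-256: anyone with the algorithm can recompute it."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def build_lookup_table(universe: Iterable[str]) -> Dict[str, str]:
    """Precompute hash -> cleartext for every value in a known universe."""
    return {plain_hash(v): v for v in universe}


def keyed_hash(value: str, secret_key: bytes) -> str:
    """HMAC-SHA-256: cannot be recomputed without the secret key."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# The attacker knows the universe (all four-digit PINs) and the hash function.
pins = [f"{n:04d}" for n in range(10000)]
table = build_lookup_table(pins)

leaked = plain_hash("4821")
recovered = table.get(leaked)          # the cleartext PIN is recovered

# With a secret key held elsewhere, the same table is useless.
protected = keyed_hash("4821", b"illustrative secret key")
lookup_of_protected = table.get(protected)
```

Here `recovered` equals the original PIN, while `lookup_of_protected` is `None` because the attacker cannot precompute keyed digests without the key.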
Encryption/Homomorphic Encryption
[0012] Encryption is the process of using an algorithm to transform plain text information into a non-readable form called ciphertext. An algorithm and an encryption key are required to decrypt the information and return it to its original plain text format. Today, SSL encryption is commonly used to protect information as it is transmitted on the Internet. Using built-in encryption capabilities of operating systems or third-party encryption tools, millions of people encrypt data on their computers to protect against the accidental loss of sensitive data in the event their computer is stolen. Encryption can be used to thwart government surveillance and theft of sensitive corporate data when a secure algorithm and reasonable encryption keys are used. Most encryption techniques involve encrypting data at rest but not in use (i.e., data is encrypted while stored on a disk, but decrypted before or during processing). Popular encryption algorithms include DES, RSA, AES, Blowfish, and Twofish, as well as others known to those of skill in the art. Encryption can be performed with software such as PGP, or with custom software or functions built into operating systems or encryption APIs.
[0013] Traditional encryption methods have drawbacks. For example, these techniques often co-locate the data and the keys in the same security domain, which allows a single breach to access both keys and data. Security methods and systems that do not co-locate data and keys in the same security domain are therefore desired.
[0014] For example, U.S. Patent Number 9,087,212 to Balakrishnan et al. discusses a sequence of steps for implementing database confidentiality. A database proxy intercepts SQL queries and rewrites the queries to execute on encrypted data. When faced with a first threat, the system guards against an external attacker with full access to the data stored in a database management system ("DBMS") server. The attacker is assumed to be passive, i.e., wants to learn confidential data, but does not change queries issued by the application, query results, or the data in the DBMS. That threat includes DBMS software compromises, root access to DBMS machines, and even access to the RAM of physical machines. The proxy stores a secret master key MK, a database schema, and the current encryption layers of all columns. The system executes SQL queries over encrypted data. The DBMS machines and administrators are not trusted, but the application and the proxy are trusted. The system enables the DBMS server to execute SQL queries on encrypted data almost as if it were executing the same queries on plaintext data so that existing DBMSes do not need to be changed. The DBMS query plan for an encrypted query is typically the same as for the original query, except that the operators comprising the query, such as selections, projections, joins, aggregates, and orderings, are performed on ciphertexts, and use modified operators in some cases. The DBMS server returns the (encrypted) query result, which the proxy decrypts and returns to the application.
[0015] Homomorphic encryption is a form of encryption that allows computations to be carried out on ciphertext, thus generating an encrypted result which, when decrypted, matches the result of operations performed on the plaintext. These techniques imply the requirement of one or more (partially) homomorphic encryption schemes to manipulate column (field) values in a meaningful way while encrypted.
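The homomorphic property can be illustrated with a toy example. The sketch below uses unpadded ("textbook") RSA, which is multiplicatively homomorphic: the product of two ciphertexts decrypts to the product of the two plaintexts. The tiny primes are illustrative assumptions that provide no security, unpadded RSA is not suitable for real use, and the example is not drawn from any system discussed here. It assumes Python 3.8 or later for the modular inverse via `pow`.

```python
# Toy textbook-RSA parameters (far too small to be secure).
P, Q = 61, 53
N = P * Q                      # public modulus
PHI = (P - 1) * (Q - 1)
E = 17                         # public exponent
D = pow(E, -1, PHI)            # private exponent (modular inverse of E)


def encrypt(message: int) -> int:
    return pow(message, E, N)


def decrypt(ciphertext: int) -> int:
    return pow(ciphertext, D, N)


# Multiply two ciphertexts without ever decrypting the operands.
c_seven = encrypt(7)
c_three = encrypt(3)
c_product = (c_seven * c_three) % N

product = decrypt(c_product)   # equals 7 * 3, as long as the product is below N
```

The computation on ciphertexts tracks the computation on plaintexts, but each operation is a modular exponentiation or multiplication on large numbers in a real scheme, which hints at the cost concerns raised for homomorphic techniques.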
[0016] For example, the presentation "Private Database Queries Using Somewhat Homomorphic Encryption" by Boneh, et al. discloses a private database query system. A database server is split into two entities (called "server" and "proxy"), and privacy holds only so long as these two entities do not collude. Boneh encodes the database as one or more polynomials, manipulating these polynomials using a client's query so as to obtain a new polynomial whose roots are the indexes of the matching records. For every attribute-value pair (a, v) in the database, the inverted index contains a record (tg, Enc(A(x))) where tg is a tag, computed as tg=Hash("a=v"), and A(x) is a polynomial whose roots are exactly the record indexes r that contain this attribute-value pair. In the basic three-party protocol, given a query SELECT * FROM db WHERE a1=v1 AND ··· AND at=vt, the client (with oblivious help from the server) computes the tags tgi=Hash("ai=vi") and sends them to the proxy.
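The tag-based inverted index can be sketched in simplified form. This illustration deliberately omits the encryption and the three-party protocol: a plain set of record indexes stands in for the encrypted polynomial Enc(A(x)) whose roots are those indexes, and intersecting the sets stands in for combining polynomials to find common roots. SHA-256 as the hash and the sample records are assumptions.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def tag(attribute: str, value: str) -> str:
    """Compute tg = Hash("a=v") for one attribute-value pair."""
    return hashlib.sha256(f"{attribute}={value}".encode("utf-8")).hexdigest()


def build_inverted_index(records: List[Dict[str, str]]) -> Dict[str, Set[int]]:
    """Map each tag to the indexes r of the records containing that pair."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for r, record in enumerate(records):
        for attribute, value in record.items():
            index[tag(attribute, value)].add(r)
    return dict(index)


def conjunctive_query(index: Dict[str, Set[int]],
                      conditions: List[Tuple[str, str]]) -> Set[int]:
    """Evaluate a1=v1 AND ... AND at=vt by intersecting the root sets."""
    result = None
    for attribute, value in conditions:
        roots = index.get(tag(attribute, value), set())
        result = set(roots) if result is None else result & roots
    return result or set()


records = [
    {"city": "Austin", "plan": "gold"},
    {"city": "Austin", "plan": "basic"},
    {"city": "Boston", "plan": "gold"},
]
index = build_inverted_index(records)
matches = conjunctive_query(index, [("city", "Austin"), ("plan", "gold")])
```

In this cleartext simplification `matches` contains only record index 0; in the scheme described above the proxy would perform the equivalent step on encrypted polynomials.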
[0017] Homomorphic encryption techniques like that described in Boneh also have several drawbacks. Computations are highly compute-intensive and have been demonstrated to perform poorly even at small scale. Further, values in these techniques are mutable and may be changed or have encrypted values computed as the result of homomorphic computations.
[0018] A security model that does not suffer from these and the foregoing drawbacks is therefore desired, particularly for use in data analytics applications. More particularly, a model that relies on proven, strong encryption algorithms, includes simple logic operations that have been demonstrated to work efficiently at massive scale so that they can be easily distributed for parallel computing across many systems, and does not operate on the original data, but rather, a network of encrypted values that represent the original data, is desired.
SUMMARY OF THE INVENTION
[0019] Embodiments of solutions to the problems described above include improved systems and methods of performing data analytics.
[0020] In one form, a system according to the present invention comprises a database designed to support data encryption in transit, at rest, and in use. The system applies a transformation process to represent user data as a network of ciphered facts. This network of facts can be queried and analyzed using instructions that have been ciphered using the same process that ciphered the data itself. The transformed data is a complete representation of the original pre-transformed data and can be transformed back into its original cleartext format without the need to maintain cumbersome translation tables such as those required by data masking and tokenization mechanisms.
[0021] In another form, a system according to the present invention is divided into three administratively distinct subsystems that ensure no single domain breach can result in the loss of useable data. The subsystems are referred to as "client," "gateway," and "server."
[0022] The client subsystem, for example, creates and submits queries and analytical routines in user-readable format. The gateway subsystem, for example, stores schema and secret information, performs data transformation processes, and enforces access control policies. The server subsystem, for example, writes the transformed encrypted data to storage, and executes distributed read queries and analysis operations on encrypted data. In one embodiment, queries and analysis routines originate in the client subsystem, are transformed in the gateway subsystem, and are then executed by the server subsystem. The server subsystem collects the results of read queries and analyses and relays them to the gateway subsystem, where they may be transformed and filtered using data access policies before being delivered in user-readable format to the requesting client subsystem.
[0023] In yet another form, a system according to the present invention comprises client, gateway, and server subsystems. The client subsystem is the connection point between users or external applications and the system. The gateway subsystem stores and secures a database schema and secret keys, performs data transformation and encryption, and enforces data access policies. The server subsystem provides bulk data storage and distributed-computing resources to perform advanced queries and analysis upon stored, secure data.
[0024] The gateway subsystem serves as the interconnection service between the client subsystem and server subsystem. The users and applications accessing the client subsystem create queries and analysis routines and forward them to the gateway subsystem. The gateway subsystem transforms queries and analysis routines received from the client subsystem into a ciphered representation and relays the representation to the server subsystem. The server subsystem executes the queries and analysis operations and returns one or more result sets to the gateway subsystem. Upon receipt of each result set, the gateway subsystem may transform the ciphered data back into its original form, apply any fine-grained data access policies that are configured, and relay the policy-filtered results to the client subsystem and on to the requesting users and applications.
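The division of labor described above can be sketched as a small simulation. This is an illustration under stated assumptions rather than the disclosed design: HMAC-SHA-256 stands in for the ciphering step, records are reduced to sets of ciphered field-value facts, and the `Gateway` and `Server` classes and their methods are hypothetical. The reverse transformation and policy filtering are omitted.

```python
import hashlib
import hmac
from typing import Dict, List, Set, Tuple


class Gateway:
    """Holds the secret key; transforms records and queries into ciphered facts."""

    def __init__(self, secret_key: bytes) -> None:
        self._secret_key = secret_key  # never shared with client or server

    def cipher_fact(self, field_name: str, value: str) -> str:
        message = f"{field_name}={value}".encode("utf-8")
        return hmac.new(self._secret_key, message, hashlib.sha256).hexdigest()

    def transform_record(self, record: Dict[str, str]) -> Set[str]:
        return {self.cipher_fact(k, v) for k, v in record.items()}

    def transform_query(self, criteria: List[Tuple[str, str]]) -> Set[str]:
        return {self.cipher_fact(k, v) for k, v in criteria}


class Server:
    """Stores and matches ciphered facts only; has no key and no schema."""

    def __init__(self) -> None:
        self._store: Dict[int, Set[str]] = {}

    def write(self, record_id: int, ciphered_facts: Set[str]) -> None:
        self._store[record_id] = ciphered_facts

    def execute(self, ciphered_criteria: Set[str]) -> List[int]:
        # Equality-only logic: a record matches if it contains every criterion.
        return sorted(rid for rid, facts in self._store.items()
                      if ciphered_criteria <= facts)


gateway = Gateway(b"illustrative gateway key")
server = Server()
server.write(0, gateway.transform_record({"city": "Austin", "plan": "gold"}))
server.write(1, gateway.transform_record({"city": "Boston", "plan": "gold"}))

# The client writes the query in cleartext; only the gateway can cipher it.
result_ids = server.execute(gateway.transform_query([("city", "Austin")]))
```

The server answers the query by set comparison on opaque digests, and a query ciphered under a different key matches nothing.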
[0025] The three-subsystem architecture makes it possible to distribute the components necessary for a threat actor to obtain sensitive data into separate administrative domains. The gateway subsystem does not share secret keys or schema information with the server subsystem. While such an embodiment can work with one client, multiple clients each having access to a subset of the source data can also be used. In such embodiments, each client subsystem has access to a subset of the schema, but never has access to the secret keys. All operations within the server subsystem are performed using a ciphered network of facts, ciphered-query and analysis parameters, and pre-defined collections of logic instructions. As a result, users and systems within the server subsystem never have access to any cleartext data, secret keys or schema information; users and systems within the client subsystem never have access to bulk data or secret keys; and users and systems within the gateway subsystem do not have the ability to create queries or directly access bulk data. Collectively, these benefits ensure that even a complete breach of any single subsystem cannot yield any useable data.
[0026] In addition to the security benefits provided by the multiple-subsystem architecture, systems according to the present invention may provide audit reporting in both the gateway and server subsystems. For example, the gateway subsystem audit may report the original query and a non-repudiated query signature provided by the originating client. The query signature is included with the transformed query when it is relayed to the server subsystem. The server subsystem report includes the query signature and the identity of the gateway subsystem that relayed it. Reports from the two subsystems can be correlated to identify and alert security staff of discrepancies or unusual activity.
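Correlating the two audit reports can be sketched as a comparison of query signatures. The report layout, field names, and sample entries below are hypothetical, and the signatures are treated as opaque strings; how the client produces a non-repudiable signature is outside this sketch.

```python
from typing import Dict, List


def correlate(gateway_report: List[Dict[str, str]],
              server_report: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Compare query signatures seen by the gateway and by the server."""
    gateway_sigs = {entry["signature"] for entry in gateway_report}
    server_sigs = {entry["signature"] for entry in server_report}
    return {
        # Logged by the gateway but never executed by the server.
        "missing_on_server": sorted(gateway_sigs - server_sigs),
        # Executed by the server without a matching gateway record.
        "unknown_to_gateway": sorted(server_sigs - gateway_sigs),
    }


gateway_report = [
    {"signature": "sig-001", "query": "city = Austin"},
    {"signature": "sig-002", "query": "plan = gold"},
]
server_report = [
    {"signature": "sig-001", "gateway_id": "gw-east"},
    {"signature": "sig-999", "gateway_id": "gw-east"},
]
discrepancies = correlate(gateway_report, server_report)
```

A signature that appears on only one side, such as `sig-999` here, is the kind of discrepancy that could be raised to security staff.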
[0027] In another form, the system may include a gateway subsystem that possesses one or more schemas (correlations between data fields within a set of data) in a schema store and secret key information. For example, the gateway subsystem comprises a schema that provides a mapping of cleartext data field descriptors together with a set of encryption keys. The gateway subsystem further includes a transformation function, which uses a schema to create a schema-based data structure and then one-way encrypts the structure without use of a token table to create secure data. The secure data has semantic meaning only within the confines of the gateway subsystem, where the schema and keys are present, digitally stored, and/or accessible. The gateway subsystem does not need to otherwise store the cleartext data after it has been transformed.
[0028] The gateway subsystem communicates the secure data to a server subsystem that is administratively separate from the gateway subsystem. The server subsystem does not have access to the schema or secret keys. The server subsystem may be used to perform analytics in a secure manner by analyzing the secure data for patterns, which can correspond to patterns in the cleartext data. In this manner, the analytics are performed on a dataset that comprises secure data (in particular, transformed values) and need not comprise any cleartext data. Were a breach of the server subsystem and secure data to occur, the secure data would likely be useless to the attacker.
[0029] In another form, the system of the present invention allows a user to upload client data to a gateway subsystem, and either be provided with, or select, a schema. The gateway subsystem has a transformation function that takes the client data, along with the data fields, and identification values assigned to designated correlations, and generates a schema-based data structure. The transformation function then one-way encrypts the schema-based unique data structure (i.e., makes it mathematically infeasible to recover the cleartext data from the encrypted values without semantic meaning) to create secure data. The encryption may in some instances be the result of a secure hashing algorithm applied to the schema-based data structure. The values that are hashed may not be the entire original data object, but instead, some selected subset of the original data object (i.e., some combination of fields). The exact fields that are extracted and discretized prior to hashing would be defined as part of the schema. The secure data is then uploaded to a separate but secure server subsystem for storage in a secured database (e.g., a cloud storage service) and/or analysis. In such embodiments the input values may be single values, strings, or n-tuples of values, preferably discretized as part of the transformation process.
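A schema that names which fields to extract and how to discretize them before hashing might look like the following sketch. The schema layout, the discretizer functions `age_decade` and `name_key`, the sample records, and the use of HMAC-SHA-256 as the keyed secure hash are all assumptions made for illustration.

```python
import hashlib
import hmac
from typing import Callable, Dict, List, Tuple

# Each schema entry: (entry name, fields to extract, discretizer for those fields).
SchemaEntry = Tuple[str, Tuple[str, ...], Callable[..., str]]


def age_decade(age: int) -> str:
    """Discretize an age into a ten-year band, e.g. 42 -> '40-49'."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"


def name_key(first: str, last: str) -> str:
    """Normalize a two-field tuple so equivalent spellings compare equal."""
    return f"{last.strip().lower()},{first.strip().lower()}"


SCHEMA: List[SchemaEntry] = [
    ("age_band", ("age",), age_decade),
    ("full_name", ("first", "last"), name_key),
]
KEY = b"illustrative gateway key"


def transform_record(record: dict, schema: List[SchemaEntry], key: bytes) -> Dict[str, str]:
    """Extract, discretize, and one-way hash the fields named by the schema."""
    secure = {}
    for name, fields, discretize in schema:
        discrete = discretize(*(record[f] for f in fields))
        message = f"{name}:{discrete}".encode("utf-8")
        secure[name] = hmac.new(key, message, hashlib.sha256).hexdigest()
    return secure


a = transform_record({"first": " Ada ", "last": "LOVELACE", "age": 42}, SCHEMA, KEY)
b = transform_record({"first": "ada", "last": "lovelace", "age": 47}, SCHEMA, KEY)
c = transform_record({"first": "Alan", "last": "Turing", "age": 51}, SCHEMA, KEY)
```

Records `a` and `b` share both hashed values because discretization maps them to the same band and the same normalized name, while `c` differs on both; only a subset of each record, chosen by the schema, is ever hashed.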
[0030] Analysis may be conducted on the secure data (e.g., the hashed values) as they exist in the database of the server subsystem. This is accomplished by virtue of a user having an understanding of what correlations and data fields were defined in the schema, and structuring a query based on that understanding. The gateway subsystem, which has the schema, can further include a query engine that receives a query from a user, and transforms the query into a query of the secure data. This may be referred to as the "transformation" of the user's query (as opposed to transformation of the input data), and results in a query format consistent with the format of the secure data as it is stored in the secure database of the server subsystem. The original raw or cleartext input client data is not available for that query, because the schema and keys are only available on the gateway subsystem where the data and queries were transformed. The results of the analysis performed on the secure data are then returned to the gateway subsystem, which can then either "reverse transform" the results, to provide raw data correlation to the querying user, or pass the results to the querying user without decoding them. Correlation is possible because each normalized set of transformed values has a unique identifier that can be used by the gateway subsystem to retrieve the cleartext data that resulted in the transformed data in question. Because the database of the server subsystem contains only secure data, the database may be stored in a less secure cloud-hosting or other location if desired, while the gateway subsystem and client subsystem are in different security domains.
BRIEF DESCRIPTION OF THE DRAWINGS
[0031] Other features in the invention will become apparent from the attached drawings, which illustrate certain embodiments of the apparatus of this invention, wherein:
[0032] Fig. 1 is a system diagram of an exemplary embodiment according to an aspect of the invention.
[0033] Fig. 2 illustrates a process flow for transforming and storing data in a database according to one embodiment of the invention.
[0034] Fig. 3 illustrates a process flow for querying data stored in a database according to one embodiment of the invention.
[0035] Fig. 4 is a system diagram of an exemplary embodiment for querying already-stored data according to an aspect of the invention.
[0036] Fig. 5 is a system diagram of another exemplary embodiment according to an aspect of the invention.
DETAILED DESCRIPTION
[0037] In the following description, for purposes of explanation, numerous examples and specific details are set forth in order to provide a thorough understanding of the present disclosure. It will be evident, however, to one skilled in the art, that the present disclosure as expressed in the claims may include some or all of the features in these examples alone or in combination with other features described below, and may further include modifications and equivalents of the features and concepts described herein.
[0038] Referring now to Fig. 1, an exemplary embodiment of a system according to an aspect of the invention is illustrated by a system having a client subsystem 100, a gateway subsystem 101, and a server subsystem 102.
[0039] Client subsystem 100 is a subsystem that may be used directly by clients for querying and analytic operations. Through the client subsystem 100, users are capable of sending queries to the gateway subsystem 101 in order to read or write data. The client subsystem 100 may be accessed, for example, by a user on the Internet using a client application. Alternatively, the client subsystem 100 may be accessed by a user using a client console deployed at the user's premises.
[0040] The client subsystem 100 comprises a client agent that is used to establish a connection between the client subsystem and the gateway subsystem for interactions to take place. In some embodiments, once a connection is established, the client agent is able to send query requests for writing and reading to the gateway subsystem. In one embodiment, this may take place after an authentication service has been used to confirm that the user trying to establish the session has access to that gateway subsystem. The authentication service may be an external service with connectors built into the system for different functions that require authentication. In some embodiments, each service has an authentication client, which allows these services to request the individual authentication service required by each service, which would be specified in the configuration of that service. In some embodiments, the authentication service is run by a third party to make the system more secure.
[0041] Client subsystem 100 may include one or more client instances 110. Client instance 110 may be adapted for a user to receive and upload data (e.g., client data or input data) to be securely stored and made query-able by the system. Client instance 110 comprises a client data gathering agent 104, a data store 103, a client query engine 105, a user interface 106, a display (not illustrated), a communication mechanism for communicating information (not illustrated), and inputs (e.g., keyboard and mouse) (not illustrated).
[0042] Data store 103 may include a memory coupled to a bus for storing client data. This memory may also be used for storing variables or other intermediate information during execution of instructions to be executed by agents or engines. Possible implementations of this memory may be, but are not limited to, random access memory (RAM), read only memory (ROM), or both. Databases and file systems may also be used to store and retrieve data from data store 103. Common physical forms of data store 103 include, for example, a hard drive, a magnetic disk, an optical disk, a CD-ROM, a DVD, a flash memory, a USB memory card, or any other medium from which a computer can read.
[0043] Client instance 110 is in communication with client service 120 in gateway subsystem 101. Client subsystem 100 and gateway subsystem 101 each includes a network interface (not illustrated). The network interfaces may provide two-way data communication between, e.g., client instance 110 and client service 120, as well as client service 120 and server subsystem 102. The network interface sends and receives electrical, electromagnetic, or optical signals that carry digital data streams representing various types of information across a local network, an Intranet, or the Internet.
[0044] For a local network, client instance 110 may communicate with a plurality of other computer machines. Software components or services may reside on multiple different computer systems or servers across a network. The processes described here may be implemented on one or more servers, for example. A server may transmit actions or messages from one component, through the Internet, local network, and network interfaces to a component on client instance 110. The software components and processes described herein may be implemented on any computer system (including without limitation on custom hardware devices or specially programmed computers) and send and/or receive information across a network, for example. Further, the dashed lines shown on Fig. 1 indicate administrative separation of subsystems, but are not intended to specify architecture or deployment requirements, which may be adjusted depending on the needs and circumstances of the system provider and clients. An alternative embodiment is illustrated in Fig. 5, for example, which shows that the gateway subsystem 101 and the server subsystem 102 may be deployed, for example, through a common cloud provider, so long as they remain administratively separate.
[0045] Turning back to Fig. 1, through the user interface 106, a user may access and/or create a client data set using data gathering agent 104, which may have access to unsecure data having fields corresponding to unsecure data field definitions that are stored in, e.g., data store 103. In one embodiment, data gathering agent 104 is adapted to define a first client data set comprising at least a first data structure. In other embodiments, client instance 110 (through, e.g., data gathering agent 104) may encrypt the client data set, or configure (and optionally normalize) the client data set based on a schema that defines associations between different fields or categories within the data set. In some embodiments, user interface 106 may also provide a user with schema options that allow the user to customize a schema based on a client data set, or a desired query. Client instance 110 then transmits the client data set (in either a clear format, or in a decryptable format) to client service 120 in gateway subsystem 101 using, e.g., a secure tunnel.
[0046] In some embodiments, client instance 110 may have multiple data gathering agents 104. In the same or other embodiments, the client instance 110 that generates the client data set may be distinct or different from, such as, without limitation, at a different physical location, the client instance that communicates with client service 120. For example, the first client instance may be a business system that simply returns raw data, and the second client instance may be a system that receives or gathers such data, creates a client data set, and communicates it to client service 120. In further embodiments, the system may comprise a plurality of client instances 110 in communication with client service 120, as shown in Fig. 1.
[0047] In some embodiments, the gateway subsystem 101 links the client subsystem 100 and the server subsystem 102 as a means of sending information back and forth between the two subsystems. In one exemplary embodiment, the gateway subsystem 101 is deployed on a secure public cloud managed by a cloud provider. The server subsystem 102 may be deployed on the same or a different secure public cloud, and separated from the gateway subsystem 101 by a virtual firewall (e.g., see Fig. 5).
[0048] The gateway subsystem 101 may be responsible for performing a transformation function, detailed below. The gateway subsystem 101 may further comprise a group of services designed to ensure that data is ciphered securely. These services may include one or more of a policy service, a schema service, a key service, and an authentication service.
[0049] The policy service may hold all the policies used by the gateway subsystem 101. These policies are applied to data returned from the server subsystem 102 and used to filter the data, removing any data that is inaccessible or invisible to the user. These policies may be created and maintained by system administrators. Policies can be used to ensure that users cannot see any data that they are not permitted to access, while still allowing them to use this invisible data to query for other necessary information.
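As an illustration only, the filtering role of such a policy might be sketched as follows. The policy table, the role names, and the function `apply_policy` are hypothetical; the source does not specify how policies are represented.

```python
# Hypothetical policy table: role -> set of fields that role may see.
POLICIES = {
    "analyst": {"callStart", "duration"},
    "admin": {"callingParty", "calledParty", "callStart", "duration"},
}

def apply_policy(role, records):
    """Drop every field the given role is not permitted to see."""
    visible = POLICIES.get(role, set())  # unknown roles see nothing
    return [{k: v for k, v in rec.items() if k in visible} for rec in records]

# Results as they might look after being returned to the gateway.
results = [{"callingParty": "1234", "calledParty": "4321", "duration": 1}]
filtered = apply_policy("analyst", results)
```

In this sketch the analyst can still have queried on callingParty (the hidden field), but only the permitted fields survive in the returned result.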
[0050] The schema service, in one aspect, is responsible for storing schema information. These schemas are used to format queries that are sent by the client subsystem 100, as detailed further below.
[0051] The key service, in one aspect, is responsible for holding keys used to access other services and their respective configuration files in the gateway subsystem. This service may also be responsible for decrypting keys. In one embodiment, the key service does not contain the keys responsible for encrypting the data.
[0052] The gateway subsystem 101 may further contain secret information such as secret salts and keys used to, e.g., access configuration for the services and data within the server subsystem 102.
[0053] In one embodiment, gateway subsystem 101 may comprise a client service 120, transformation function 130, secure query generator 140, and schema store 150. As discussed above, gateway subsystem 101 may be administratively-isolated from client subsystem 100 and server subsystem 102.
[0054] Client service 120 receives a first client data set from client instance 110. Transformation function 130 applies a schema from, e.g., schema store 150, to the first client data set to create a second data set, referred to as a schema-based data structure. In one embodiment, the invention includes a discretization function as part of the first client data set transformation process that produces a complete semantic description of the client data set in the form of multiple hashed values.
[0055] Transformation function 130 then one-way encrypts or hashes the schema-based data structure to generate a third data set, referred to as secure data. Preferably, the hashing portion of the transformation process will be performed using a secure, one-way hashing algorithm such as SHA-2. A salt may be used in the hashing function to improve the security of the hashed result. Alternatively, transformation function 130 causes a separate hashing function to hash the schema-based data structure to create the secure data.
[0056] Because the secure data may not be conveniently converted back into clear data, transformation function 130 may also generate a unique identifier (e.g., a value) to provide for mapping from the secure data back to the client data set.
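A minimal sketch of this hashing step, under stated assumptions: HMAC-SHA-256 is used here as one possible way to combine a secret salt with a SHA-2 hash (the source does not specify the construction), the identifier is a random UUID, and the names `SECRET_SALT`, `secure_value`, and `transform` are hypothetical.

```python
import hashlib
import hmac
import json
import uuid

# Assumption: a secret salt held only by the gateway subsystem.
SECRET_SALT = b"example-salt-held-only-by-the-gateway"

def secure_value(value: str) -> str:
    """One-way transform a single value with salted (keyed) SHA-256."""
    return hmac.new(SECRET_SALT, value.encode("utf-8"), hashlib.sha256).hexdigest()

def transform(schema_based: dict):
    """Return (unique identifier, secure data) for a schema-based structure."""
    record_id = str(uuid.uuid4())  # maps secure data back to the client data set
    secure = {
        name: secure_value(json.dumps(val, sort_keys=True))
        for name, val in schema_based.items()
    }
    return record_id, secure

record_id, secure = transform({"callingParty": "1234"})
```

Because the hash is deterministic for a given salt, equal inputs yield equal secure values, which is what later allows equality matching on secure data.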
[0057] Schema store 150 stores a digital representation of the schema. In some embodiments, the schema may comprise a generalization hierarchy adapted to define identification values to portions of a client data set based on analysis parameters related to the generalization hierarchy. Where generalization hierarchies are used, the schema is agnostic as to queries run during data analytics processes. In other embodiments, a specific schema may be generated in a manner that is optimized for a specific type of analysis. For example, a specific schema may be generated based on a specific type of data set, such as, without limitation, in a format and/or in a manner that is optimized for specific analytics queries.
[0058] In one embodiment, the schema-based structure comprises a primary data structure type, a vector of first secondary-data structures, and a vector of second secondary-data structures. Each first secondary-data structure comprises an identity, type, and value. Each second secondary-data structure comprises an identity, type, and value derived from a first data subset. The first data subset comprises a type and a vector of a second data subset. Each second data subset comprises an identity, a type, and a subset value. The first data subset corresponds to the primary data structure type. It is derived from at least one first secondary data structure value. Data from the first data subset of the same first data subset structure type have identical vectors as the second data subsets. The primary data structure is capable of deriving at least one said first data subset(s).
[0059] The creation and application of the schema according to the invention may be further illustrated by additional examples. In one embodiment, the schema is a customer-provided definition of the data structures ("StructuredDef"), data descriptors ("QualityDef") and rules that describe how data descriptors are to be derived from data structures ("QualitizerDef," "QualitizedDef," and "DiscretizerDef"). These definitions may provide the structural information needed for the system of the invention to transform, persist, and query user data. Each StructuredDef, for example, describes a "StructuredType" for one or more "Structured" instances that contain user data. Each QualityDef describes a "QualityType" for one or more "Quality" instances that contain descriptor data.
[0060] In one embodiment, each user operates an independent gateway subsystem. Each gateway subsystem instance has its own schema, which is created by the user. In this example, the schema has at least one QualityDef, one QualitizerDef, one DiscretizerDef, one QualitizedDef, and one StructuredDef. Operational deployments, however, may include many instances of each definition type. A schema may be created by assembling a collection of data types that the system stores, and a collection of descriptors may be used to characterize each data type.
[0061] For example, a common data type in the telecommunications industry is the telephone call. In its simplest form, a telephone call consists of a calling party telephone number, a called party telephone number, the date and time that the call started and the duration of the call. The descriptors that could be used to characterize a telephone call include: a telephone number (the calling party), another telephone number (the called party) and one or more representations of the date and time the call started, and the date and time span over which the call was active. The telephone call example could thus be represented in a schema as follows:
StructuredDef: {
  TYPE: TelephoneCall,
  FIELDS: [
    FieldDef: {NAME: callingParty, TYPE: VarChar},
    FieldDef: {NAME: calledParty, TYPE: VarChar},
    FieldDef: {NAME: callStart, TYPE: Timestamp},
    FieldDef: {NAME: duration, TYPE: Int}
  ]
};
QualityDef: {
  TYPE: TelephoneNumber,
  FIELDS: [
    FieldDef: {NAME: number, TYPE: VarChar}
  ]
};
[0062] Once all of the StructuredDef(s) and QualityDef(s) have been defined, the rules for how Quality(s) will be derived from Structured(s) may then be defined.
[0063] Before detailing the rules themselves, though, it may be useful to describe the concepts of "Quantization" and "Discretization." In one form, Quantization is the process of deriving one or more qualitative descriptors, known as "Quality(s)," from a user data structure, known as a "Structured." Collectively, the Quality(s) derived from a Structured may comprise a usable representation of the data contained within the Structured instance. Discretization is the process that, in some embodiments, the system may use to convert one or more input field values from a Structured instance into a set of output field values, in which each output value corresponds to one field in a distinct descriptor instance. Each "Discretizer" can take one or more input fields from a Structured instance, and produce values for multiple Quality instances (i.e., each output from a Discretizer becomes a field value in a separate Quality instance). Many Discretizer implementations are known in the art, and may be used to discretize different types of values and represent them in different ways.
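To make the Discretizer idea concrete, the following sketch shows one hypothetical Discretizer that turns a single timestamp field into several coarser values, each of which becomes the field value of a separate Quality instance. The function `timestamp_discretizer`, the chosen granularities, and the Quality type name `CallStart` are assumptions for illustration; the source does not define a timestamp Discretizer.

```python
from datetime import datetime

def timestamp_discretizer(iso_timestamp: str) -> list:
    """Return coarse-to-fine discrete values for one timestamp input."""
    ts = datetime.strptime(iso_timestamp, "%Y-%m-%dT%H:%M:%SZ")
    return [
        ts.strftime("%Y"),           # year
        ts.strftime("%Y-%m"),        # month
        ts.strftime("%Y-%m-%d"),     # day
        ts.strftime("%Y-%m-%dT%H"),  # hour
    ]

values = timestamp_discretizer("2017-01-01T12:01:02Z")

# Each output becomes the single field of its own Quality instance.
qualities = [{"TYPE": "CallStart", "FIELDS": [v]} for v in values]
```

A query for "all calls on 2017-01-01" could then be answered by an equality test against the day-level Quality rather than by a range comparison.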
[0064] In this example, the rules to derive Quality(s) from a Structured instance are specified in the form of one QualitizerDef, one DiscretizerDef, and one QualitizedDef. The QualitizerDef defines a mapping between one or more inputs and the fields of a QualityDef. Each QualitizerDef contains one or more DiscretizerDef(s). The DiscretizerDef defines a mapping between the inputs defined in its parent QualitizerDef, and the parameters of a specific Discretizer function implementation. In this instance, there is one DiscretizerDef for each field in the QualityType specified in the QualitizerDef, and the mappings are performed in corresponding order (i.e., the first DiscretizerDef maps to the first field of the QualityType, the second DiscretizerDef maps to the second field of the QualityType, etc.). The QualitizedDef maps the fields of a StructuredDef to the inputs of a QualitizerDef, and associates the Quality that is output by the Quantization process with an identity that is distinct within the scope of its associated StructuredType.
[0065] For example, the TelephoneCall StructuredDef has two fields that could be used to construct TelephoneNumber Quality(s): the first field is the "calling party," and the second is the "called party." The rules to derive Quality(s) from these two fields may be defined by one QualitizerDef, one DiscretizerDef and two QualitizedDef(s), which could be specified in a schema of the invention as follows:
QualitizerDef: {
  TYPE: TelephoneNumber,
  INPUTS: [
    Input: {NAME: telephoneNumber, TYPE: VarChar}
  ],
  DISCRETIZER_DEFS: [
    DiscretizerDef: {NAME: NumericStringDiscretizer, INPUTS: [telephoneNumber]}
  ],
  QUALITY_TYPE: TelephoneNumber
}
StructuredDef: {
  TYPE: TelephoneCall,
  QUALITIZED_DEFS: [
    QualitizedDef: {TYPE: TelephoneNumber, IDENTITY: callingParty, FIELDS: [callingParty]}
  ]
}
[0066] In this example, the schema may be used in the transformation process (detailed herein) to validate data and provide the Quantization rules to derive Quality(s) from data or query parameters prior to encryption. Data and query parameters may be validated by checking their structure and values against definitions present in the schema.
[0067] In one example, the following process steps may be used to derive Quality(s):
1. The fields are first compared to those in the StructuredDef for type TelephoneCall to ensure that all required fields are present, and that each field type in the submitted data corresponds to the type specified in the FieldDef for that field in the StructuredDef.
2. If validated, the QualitizedDef(s) specified in the QUALITIZED_DEFS attribute of the corresponding StructuredDef are loaded, e.g., one at a time, and each is used to obtain the QualitizerDef specified by its TYPE attribute and the field values that correspond to the field names provided in the FIELDS attribute of the QualitizedDef.
3. The field values obtained by the QualitizedDef serve as the values for each input specified in the INPUTS attribute of the QualitizerDef. The field values and inputs are each organized as ordered items in an array. They may then be mapped by matching the field value at each index in the fields array to the input with the same index in the inputs array and placing each corresponding pair into a key-value dictionary, in which the key is the name of the input and the value is the field value.
4. Each DiscretizerDef in the DISCRETIZER_DEFS attribute of the QualitizerDef may be used to discretize one or more of the inputs to one or more outputs, each of which will be placed into a field in a distinct Quality. The Quality will be of the type specified by the QUALITY_TYPE attribute of the QualitizerDef. In this example, there must be one DiscretizerDef for each field in the QualityDef specified by the QualityType. If the QualityDef specifies more than one field for the QualityType, then the cartesian product of the fields must be created and all possible combinations of the discretized values must be mapped to a corresponding number of Quality instances.
5. Each resulting Quality that corresponds to each QualitizedDef is packaged within a Quantized data structure that also includes the identity value specified by the IDENTITY attribute of the QualitizedDef. This makes it possible to distinguish values of similar structure, type, or appearance that were derived from different fields, or using different Qualization or Discretization processes (e.g. the ability to differentiate between the callingParty and the calledParty, in the foregoing example).
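Steps 3 and 4 above can be sketched as follows. This is an illustrative reading of the index-based input mapping and of the cartesian product over discretized field values; the function names `map_inputs` and `build_qualities`, and the Quality type `Example`, are hypothetical.

```python
from itertools import product

def map_inputs(input_names, field_values):
    """Step 3: pair inputs and field values by index into a dictionary."""
    if len(input_names) != len(field_values):
        raise ValueError("inputs and field values must have the same length")
    return dict(zip(input_names, field_values))

def build_qualities(quality_type, discretized_per_field):
    """Step 4: one Quality per combination of discretized field values.

    discretized_per_field holds, for each field of the QualityType (in
    order), the list of values its Discretizer produced.
    """
    return [
        {"TYPE": quality_type, "FIELDS": list(combo)}
        for combo in product(*discretized_per_field)
    ]

inputs = map_inputs(["telephoneNumber"], ["1234"])

# A two-field QualityType whose Discretizers produced 2 and 3 values.
qualities = build_qualities("Example", [["a", "b"], ["x", "y", "z"]])
```

With two values for the first field and three for the second, six Quality instances result, one per combination.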
In one example, if a TelephoneCall record with values (callingParty: "1234", calledParty: "4321", callStart: "2017-01-01T12:01:02Z", duration: 1) is submitted to the system according to the present invention to be persisted, the following operations may be performed:
"Quantization of callingParty"
1. QualitizedDef with identity callingParty specifies QualitizerType TelephoneNumber
2. QualitizedDef with identity callingParty specifies callingParty field from StructuredType TelephoneCall should be mapped as the first and only input to QualitizerType TelephoneNumber
3. QualitizerType TelephoneNumber specifies that the first and only input should be mapped as the first and only parameter to the NumericStringDiscretizer
4. QualitizerType TelephoneNumber specifies that the output of NumericStringDiscretizer maps to the first and only field of the output Quality of type TelephoneNumber
5. QualitizedDef specifies that each Quality output by its Quantization process will be packaged within a Qualitized instance with identity callingParty
6. The final result will be the following object:
Qualitized : {
IDENTITY: callingParty,
QUALITY: {TYPE: TelephoneNumber, FIELDS: ["1234"]}
}
"Quantization of calledParty"
7. QualitizedDef with identity calledParty specifies QualitizerType TelephoneNumber
8. QualitizedDef with identity calledParty specifies calledParty field from StructuredType TelephoneCall should be mapped as the first and only input to QualitizerType TelephoneNumber
9. QualitizerType TelephoneNumber specifies that the first and only input should be mapped as the first and only parameter to the NumericStringDiscretizer
10. QualitizerType TelephoneNumber specifies that the output of NumericStringDiscretizer maps to the first and only field of the output Quality of type TelephoneNumber
11. QualitizedDef specifies that each Quality output by its Quantization process will be packaged within a Qualitized instance with identity calledParty
12. The final result will be the following object:
Qualitized: {
IDENTITY: calledParty,
QUALITY: {TYPE : TelephoneNumber, FIELDS : [ "4321"] }
}
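The worked example above can be traced in code. The sketch below is an assumption-laden illustration, not the patented implementation: the schema is held in plain dictionaries, only single-field Quality types are handled, `numeric_string_discretizer` is assumed to return the digit string unchanged (consistent with the FIELDS values shown above), and the duration value of 1 is assumed.

```python
STRUCTURED_DEF = {
    "TYPE": "TelephoneCall",
    "FIELDS": {"callingParty": str, "calledParty": str,
               "callStart": str, "duration": int},
    "QUALITIZED_DEFS": [
        {"TYPE": "TelephoneNumber", "IDENTITY": "callingParty",
         "FIELDS": ["callingParty"]},
        {"TYPE": "TelephoneNumber", "IDENTITY": "calledParty",
         "FIELDS": ["calledParty"]},
    ],
}

def numeric_string_discretizer(value):
    # Assumption: a numeric string maps to itself as its only output.
    if not value.isdigit():
        raise ValueError("expected a numeric string")
    return [value]

QUALITIZER_DEFS = {
    "TelephoneNumber": {
        "INPUTS": ["telephoneNumber"],
        "DISCRETIZERS": [(numeric_string_discretizer, ["telephoneNumber"])],
        "QUALITY_TYPE": "TelephoneNumber",
    }
}

def validate(record):
    """Check required fields and field types against the StructuredDef."""
    for name, expected in STRUCTURED_DEF["FIELDS"].items():
        if name not in record:
            raise ValueError(f"missing field: {name}")
        if not isinstance(record[name], expected):
            raise TypeError(f"field {name} has the wrong type")

def quantize(record):
    """Derive Qualitized objects for every QualitizedDef of the record."""
    validate(record)
    out = []
    for qd in STRUCTURED_DEF["QUALITIZED_DEFS"]:
        qualitizer = QUALITIZER_DEFS[qd["TYPE"]]
        values = [record[f] for f in qd["FIELDS"]]
        inputs = dict(zip(qualitizer["INPUTS"], values))  # index-wise mapping
        for func, params in qualitizer["DISCRETIZERS"]:
            for v in func(*(inputs[p] for p in params)):
                out.append({
                    "IDENTITY": qd["IDENTITY"],
                    "QUALITY": {"TYPE": qualitizer["QUALITY_TYPE"],
                                "FIELDS": [v]},
                })
    return out

call = {"callingParty": "1234", "calledParty": "4321",
        "callStart": "2017-01-01T12:01:02Z", "duration": 1}
qualitized = quantize(call)
```

The IDENTITY attribute is what keeps the two TelephoneNumber Qualities distinguishable even though they share a type and structure.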
[0068] Turning back to gateway subsystem 101, the subsystem is adapted to transmit the secure data (together with the unique identifier) to server subsystem 102. The server subsystem 102 may itself hold, or be in communication with, a hold for encrypted data stored by users. The server subsystem 102 may further perform query operations that are sent by the client subsystem 100. These operations filter the data requested by the user and return the results to the gateway subsystem 101 so that they can be transformed back into cleartext format.
[0069] Server subsystem 102 may include a secure server service 160, secure data store 170, and may preferably have a secure connection (e.g., over a network) with gateway subsystem 101, e.g., interface 180. Interface 180 may provide a secure connection between gateway subsystem 101 and server subsystem 102 so that the first client data set that is stored temporarily in client service 120 before the schema is applied by transformation function 130 (and the results one-way encrypted) is not accessible through server subsystem 102. As will be appreciated by one of ordinary skill in the art, the secret keys and secret salts are only present within the gateway subsystem 101 and never exposed in unencrypted form to the server subsystem 102. Unsecured client data is preferably only available in temporary memory storage in the gateway subsystem 101 before it is transformed and one-way encrypted. After the unsecured client data has been transformed, it may cease to exist in the gateway subsystem 101, and at that point, the system would only be storing the secure data in the secure data store 170.
[0070] Server service 160 may provide connection to and management of secure data store 170, which comprises compute cluster 171 and database 172, and which may be, in one form, a NoSQL cloud-based database. Once a schema-based data structure has been transformed to create secure data, it may be passed through interface 180 to server service 160 and then channeled to secure data store 170 for, e.g., secure storage in database 172. Secure data store 170 may be deployed, for example, on the same or a different secure cloud.
[0071] This data stored in database 172 can be acted upon by query routines. In some embodiments, query routines consist of (a) hierarchies of logic operations and (b) criteria parameters that have been transformed using, e.g., the same secret keys and secret salts that were used to transform the first client data set.
[0072] More particularly, returning now to client instance 110, client query engine 105 is adapted to create queries to be run on secure data (e.g., secure data stored in database 172). Client query engine 105 may use an object query database language or any appropriate query language known to those of skill in the art. Query types may include "read" and "write," in order to, for example, read data from storage, perform a manipulation on the data (e.g., filter, extract, join), or return data as a set of results.
[0073] A user may use the user interface 106 to access client query engine 105 to create and deliver queries to client service 120, and in particular, secure query generator 140. The queries can include search requests to locate data within database 172 that meet one or more search parameters. Each search parameter can be a value, or a range of values where the search request is to locate database entries that satisfy or are likely to satisfy the value or range of values. For example, each search parameter can be a time or time interval. A database entry satisfies a single value search parameter when the database entry has a corresponding time field that contains the single value. A database entry satisfies a range value search parameter when the database entry has a corresponding time field that contains a time value that is within the range of values specified by the search parameter, which can include the outer boundaries of the range of values. The search parameters can be any appropriate parameter evident to one of ordinary skill in the art, e.g., a unit of distance or some other unit of measurement.
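The semantics of single-value and range search parameters can be stated compactly. The sketch below operates on cleartext time values purely to illustrate when an entry "satisfies" a parameter; it treats ranges as inclusive of their outer boundaries, which the text permits but does not require, and the function name `satisfies` is hypothetical.

```python
from datetime import datetime

def satisfies(entry_time, param):
    """Return True if entry_time satisfies the search parameter.

    param is either a single datetime (exact match) or a (start, end)
    tuple, treated here as a range inclusive of both boundaries.
    """
    if isinstance(param, tuple):
        start, end = param
        return start <= entry_time <= end
    return entry_time == param

noon = datetime(2017, 1, 1, 12, 0, 0)
window = (datetime(2017, 1, 1, 12, 0, 0), datetime(2017, 1, 1, 13, 0, 0))
```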
[0074] After receiving the client query from the client instance 110, secure query generator 140 may apply a schema to the client query, thereby creating a converted query. Secure query generator 140 may then hash the converted query to create a transformed query.
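One way to picture the transformed query is as a tree of logic operations whose leaf criteria have been hashed with the same secret salt used for the stored data, so that matching reduces to equality tests on hashed tokens. The sketch below assumes a tuple-based query representation, AND/OR/EQ operators, and HMAC-SHA-256 as the salted hash; all names (`token`, `transform_query`, `matches`, `SECRET_SALT`) are hypothetical.

```python
import hashlib
import hmac

# Assumption: the same gateway-held salt used when the data was stored.
SECRET_SALT = b"example-salt-held-only-by-the-gateway"

def token(identity, quality_type, value):
    """Hash one (identity, type, value) criterion into an opaque token."""
    msg = "\x1f".join((identity, quality_type, value)).encode("utf-8")
    return hmac.new(SECRET_SALT, msg, hashlib.sha256).hexdigest()

def transform_query(node):
    """Replace every cleartext EQ criterion with its hashed token."""
    op = node[0]
    if op == "EQ":
        _, identity, quality_type, value = node
        return ("EQ", token(identity, quality_type, value))
    return (op, [transform_query(child) for child in node[1]])

def matches(node, stored_tokens):
    """Evaluate a transformed query against one record's hashed tokens."""
    op = node[0]
    if op == "EQ":
        return node[1] in stored_tokens
    results = [matches(child, stored_tokens) for child in node[1]]
    return all(results) if op == "AND" else any(results)

# Tokens stored on the server side for one TelephoneCall record.
stored = {
    token("callingParty", "TelephoneNumber", "1234"),
    token("calledParty", "TelephoneNumber", "4321"),
}
query = ("AND", [
    ("EQ", "callingParty", "TelephoneNumber", "1234"),
    ("EQ", "calledParty", "TelephoneNumber", "4321"),
])
transformed = transform_query(query)
```

Only `transformed` and `stored` would need to be visible outside the gateway in this sketch; the evaluation in `matches` never touches a cleartext value.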
[0075] Client service 120 transmits the transformed query to server subsystem 102 to be executed by compute cluster 171. Compute cluster 171 accesses the relevant secure data in secure database 172 and executes the query. The query may produce results. Compute cluster 171 returns the results to the secure query generator 140 in client service 120. Secure query generator 140 may then return the secure results directly to the client instance 110. Alternatively, secure query generator 140 uses the unique identifier(s) associated with the secure data to "reverse transform" the secure results in order to identify the cleartext values that correlate to the secure results prior to transmission to client instance 110. In one embodiment, the various descriptors that are used in the query and analysis process in the server subsystem are one-way encrypted using a hash function and hence cannot be decrypted. While those values are used for query and analysis processing in the server subsystem, they are replaced with the symmetrically encrypted complete data records with which they are associated prior to being returned as results.
[0076] In some embodiments, the client instance 110 that receives the results may be distinct from the initial client instance that uploaded the client data set. In this case, only correlation information could be made available to the distinct client instance, and not cleartext data. This arrangement may conveniently allow the viewing of analysis results by users of the distinct client instance who never had access to the raw data of the first client data set.
[0077] Turning now to Fig. 2, an exemplary method 200 for storing secure data is shown. In step 201, a user accesses a user interface of a client instance, and generates a first client data set using a data gathering agent, which may have access to unsecure data having fields corresponding to unsecure data field definitions that are stored in, e.g., a data store. Step 201 may optionally include discretization of the values in the first client data set as part of the gathering process. In one embodiment, shown in step 202, the first client data set is then optionally encrypted. In another embodiment, shown in step 203, the user optionally configures the first client data set based on a schema that defines associations between different fields or categories within the data set. The user may also customize the schema based upon the first client data set or a desired query. In step 204, the first client data set (in either a clear format or a decryptable format) is transmitted to a client service using a secure tunnel. Upon receipt, a transformation function applies a schema to the first client data set to create a schema-based data structure, as shown in step 205. In step 206, the transformation function one-way encrypts (e.g., hashes) the schema-based data structure to generate secure data. In one embodiment, the transformation function generates the secure data by applying a (preferably secure) hashing algorithm to the schema-based data structure as part of the transformation process. Alternatively, the transformation function causes a separate encrypter to hash the schema-based data structure to create the secure data. In step 207, the transformation function generates a unique identifier to provide a mapping from the secure data back to the data that resulted in the hashed values. In step 208, the secure data (together with the unique identifier) is transmitted to a server subsystem through a secure connection, and then channeled to secure storage in a database, as shown in step 209.
[0078] An exemplary method 300 for querying secure data is shown in Fig. 3. A user accesses a client query engine to create a query in step 301 to be run on secure data. The client query is transmitted to a secure query generator in step 302. In some embodiments, a non-repudiated signature is transmitted along with the queries. Both the query and the corresponding signature are then forwarded to an externally-administered audit service.
[0079] After receiving the client query, the secure query generator applies a schema to the client query as shown in step 303, thereby creating a converted query. The secure query generator then one-way encrypts the parameters of the converted query (e.g., data types, condition values, and field identities) to create a transformed query in step 304 and transmits the transformed query in step 305 to be executed by a compute cluster on relevant secure data in step 306. In some embodiments the transformed query is accompanied by the corresponding signature. A notification is sent to an audit service; the audit service verifies that the query was previously forwarded. Any discrepancies or inconsistencies between audit information reported by the two subsystems result in a security notification.
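The cross-check performed by the audit service can be sketched as a simple reconciliation between what the gateway reported forwarding and what the server reports executing. In this sketch the signature is treated as an opaque byte string used only as a lookup key; cryptographic verification of the signature itself is out of scope, and the class `AuditService` and its method names are hypothetical.

```python
class AuditService:
    """Reconciles query reports from the gateway and server subsystems."""

    def __init__(self):
        self.forwarded = set()  # signatures reported when queries were forwarded
        self.alerts = []        # security notifications raised so far

    def record_forwarded(self, signature: bytes) -> None:
        """Called when a query and its signature are forwarded for audit."""
        self.forwarded.add(signature)

    def notify_executed(self, signature: bytes) -> bool:
        """Called on execution; alert if the query was never forwarded."""
        if signature not in self.forwarded:
            self.alerts.append(signature)
            return False
        return True

audit = AuditService()
audit.record_forwarded(b"sig-query-1")
ok = audit.notify_executed(b"sig-query-1")       # previously forwarded
suspect = audit.notify_executed(b"sig-query-2")  # never forwarded
```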
[0080] If secure results are generated from the execution of the query, as shown in step 306, the secure results are returned to the secure query generator in step 307. Secure query generator may then return the secure results directly to the client instance as shown in step 309, or, alternatively, in step 308, use a unique identifier to "reverse transform" the secure results in order to identify the cleartext values that correlate to the secure results, prior to transmission to a user in step 308. The user is presented with the two-way encrypted original record that was associated with the encrypted values.
[0081] Turning now to Fig. 4, in a further embodiment, the system may be configured as a system 400 for querying already-stored data. Such an exemplary system for searching secure data according to the present invention may comprise a schema store 401, a transformation function 402, a secure query generator 403, and a secure data store 404. In some embodiments, the schema store 401 may comprise a plurality of unsecure data field definitions and an identifier. The transformation function 402 may be adapted to receive unsecure data 405 and generate a transformed data set 406 comprising a token and an identifier and to store the transformed data set in the secure data store 404. Further, the secure data store may be adapted to receive secure pattern-matching queries 407, and additionally in some embodiments, the transformed data sets. The secure query generator 403 may be adapted to receive unsecure queries 408 based on the unsecure data field definitions and convert the unsecure data field definitions into secure queries 407, and then transmit the secure queries 407 to the secure data store 404. Secure data store 404 may execute the secure queries on the secure data 406, and generate results. Secure data store 404 may then return the results 409 to secure query generator 403, which may then channel the results to the client subsystem, or, using the identifier, "reverse transform" the results and return, e.g., cleartext results 410 to the client instance.
[0082] In an alternate embodiment, the system may comprise a system for securely storing data for data analytics. The system may comprise a schema store 401, a transformation function 402, and a secure data store 404. The schema store 401 may comprise a plurality of unsecure data field definitions, with the transformation function being adapted to receive unsecure data 405 based on those unsecure data field definitions. After receiving the unsecure data 405, the transformation function may generate a one-way encrypted data set by securely hashing the unsecure data, generate an identifier for the one-way encrypted data, and then store both the one-way encrypted data and the identifier in the secure data store 404. In a preferred embodiment, the secure data store 404 may be searched for patterns among the one-way encrypted data set without the secure data store containing any unsecure data.
[0083] As will be appreciated by one of ordinary skill in the art, the secret keys and secret salts are only present within the gateway subsystem and never exposed in unencrypted form to other subsystems. And after the unsecured client data has been one-way encrypted, it may cease to exist in the gateway or server subsystems, and at that point, those subsystems would only be storing the secure data in the secure data store 404.
[0084] In another embodiment, a hashing key used for the transformation process may be stored in the gateway subsystem, and thus is never available to either the client subsystem or the server subsystem, and may preferably prevent any decoding of the secure data without gaining access to the hashing key from the gateway subsystem.
[0085] In a further embodiment, a user may choose to store encrypted raw data within the system along with the secure data. In such an instance, both data would remain securely stored, because the encryption key and the hashing keys may be stored exclusively in the gateway subsystem. Such embodiments allow querying of the transformed data, but return of the encrypted data with the transformed data as part of the query results.
[0086] Embodiments of the inventions herein described may be further illustrated by examples. In one example, the schema applied by client service 120 is comprised of a data organization schema analogous to a biological olfactory system, in which case such a schema may be referred to as the olfactic model. In the natural world, olfaction (the process of smelling) is made possible by a collection of receptor cells within the nasal cavity. Each receptor is able to bind with specific types of molecules. A scent is detected when molecules pass through the nasal cavity and bind to matching receptor cells. This binding causes the neurons that correspond to the receptor to fire. A neuron firing is analogous to setting a bit to one in a binary sequence. A unique combination of firing neurons indicates a distinct scent to the brain.
[0087] Such an olfactic model can comprise "aromatics," "arotopes," "odorants," "odotopes," and "odorizers."
[0088] An aromatic is analogous to a data record or row in a relational database. An arotope is a sub-secondary data structure (e.g., an attribute of an aromatic) comprising an arotope identity value, arotope data type, and arotope value. An odorant is another sub-secondary data structure comprising an odorant identity, odorant type, and odorant value, and represents a single descriptive characteristic derived from an aromatic. In some embodiments, one or more odorants are derived from every aromatic, and collectively, they form a complete description of the aromatic from which they were derived. Odorants can be one-way ciphered such that the ciphered values of any two odorants with the exact same odorant type and identical vectors of odotopes will be equal and can be efficiently tested for such equality. In some embodiments, the odorant may be derived from the arotope values, wherein the odorants of the same odorant type have identical vectors of odotopes.
[0089] An odotope is another sub-secondary data structure comprising an odotope identity, odotope type, and odotope value. An odotope may correspond to a category of data fields within an initial client data structure. An odorizer is another sub-secondary data structure comprising an odorizer identity value and a vector of category functions, such as, without limitation, discretizer functions adapted to create a discrete generalization hierarchy of the data fields within a first client data set. In some embodiments, the odorizers may comprise a second set of data identifiers, and the generalization hierarchy may correspond to category functions. The category functions may be customized or user-defined.
[0090] In this example, the client data set comprises at least one aromatic. An aromatic represents the combination of data type and field values in the client data set. Each record in the client data set is treated like an aromatic molecule. Two aromatics possess identical "shapes" if, and only if, they are of the same type and have identical field values.
[0091] An aromatic can be characterized by odorants, each of which is a discrete representation of one distinct characteristic of the aromatic and can be tested for equality with other odorants. For example, the aromatic designates a data record and the odorant designates a descriptive characteristic of the data record. This is analogous to the binding of molecules to receptor cells in biological olfaction, as two surface sections of an aromatic molecule must be identical in shape in order to be considered a match.
[0092] Each aromatic is stored in symmetrically-encrypted form, which must be decrypted before it can be used in any meaningful way. Encrypted query and analysis is made possible by the association of one or more odorants with each aromatic. The odorants derived from an aromatic effectively describe it as a collection of discrete values that can be compared with other odorants as criteria values using one or more logic operations. Because odorant values are only tested for equality with other odorants, odorants can be transformed using any available cipher algorithm that will consistently produce the same output value from a given input value.
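The equality-only property can be illustrated with any deterministic keyed transformation. The sketch below uses HMAC-SHA-256 from the Python standard library; this choice, the function name `cipher_odorant`, and the sample values are assumptions for illustration, since the disclosure does not name a specific algorithm.

```python
import hashlib
import hmac


def cipher_odorant(secret: bytes, odorant_type: str, value: str) -> str:
    """Deterministically cipher an odorant: same inputs always give the same output."""
    # A NUL separator keeps ("ab", "c") distinct from ("a", "bc").
    message = f"{odorant_type}\x00{value}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


secret = b"demo-secret-not-for-production"  # hypothetical key material

stored = cipher_odorant(secret, "age_decade", "40-49")     # held in the secure store
criterion = cipher_odorant(secret, "age_decade", "40-49")  # derived from a query
other = cipher_odorant(secret, "age_decade", "50-59")

# Equality is tested on ciphered values only; cleartext is never compared.
match = hmac.compare_digest(stored, criterion)
mismatch = hmac.compare_digest(stored, other)
```

Because the output depends only on the key and the input, a query criterion ciphered with the same key matches a stored odorant exactly when the underlying cleartext values are identical.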
[0093] Each encrypted aromatic is indexed using one or more odorants that have been encrypted using a one-way cipher algorithm. A reverse index is available so that a known encrypted aromatic value can be used to obtain all encrypted odorants that were derived from it. Query and analysis operations can be performed upon the aromatic by using the associated encrypted odorant values as a ciphered network of facts. The ciphered network of facts that describe and represent an aromatic serves as its proxy for query and analysis operations that must be performed while the aromatic remains encrypted. Upon completion of a query or analysis routine, one or more aromatics are returned as the result set. Throughout the entire query and analysis process, the aromatics in the database and their corresponding odorants (as well as the odorants that were provided as criteria), remain encrypted. Once transferred to an administratively separate secure system like the gateway or client subsystems described above, the final result set can be decrypted using secret key information.
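The index and reverse index described above can be sketched as a pair of mappings over opaque tokens. The class name `SecureIndex`, its methods, and the sample token strings are hypothetical; in practice the tokens would be ciphered odorant values and encrypted aromatic values rather than readable strings.

```python
from collections import defaultdict
from typing import Dict, List, Set


class SecureIndex:
    """Maps ciphered odorants to encrypted aromatics, and back again."""

    def __init__(self) -> None:
        self._by_odorant: Dict[str, Set[str]] = defaultdict(set)
        self._by_aromatic: Dict[str, Set[str]] = defaultdict(set)

    def add(self, aromatic_token: str, odorant_tokens: List[str]) -> None:
        for token in odorant_tokens:
            self._by_odorant[token].add(aromatic_token)   # forward index
            self._by_aromatic[aromatic_token].add(token)  # reverse index

    def match_all(self, criteria: List[str]) -> Set[str]:
        """Return aromatics indexed under every criterion (logical AND)."""
        if not criteria:
            return set()
        candidates = [self._by_odorant.get(token, set()) for token in criteria]
        return set.intersection(*candidates)

    def odorants_of(self, aromatic_token: str) -> Set[str]:
        """Reverse lookup: all ciphered odorants derived from one aromatic."""
        return set(self._by_aromatic.get(aromatic_token, set()))


index = SecureIndex()
index.add("A1", ["o-decade-40", "o-city-x"])
index.add("A2", ["o-decade-40", "o-city-y"])

hits = index.match_all(["o-decade-40", "o-city-x"])
facts = index.odorants_of("A2")
```

The result set contains only encrypted aromatic tokens, so decryption can be deferred until the results reach a system that holds the secret key.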
[0094] In another example, an embodiment of the invention comprises a transformation process whereby a client data set is transformed to a representation in the form of a ciphered network of facts. This representation of the client data set can be queried and analyzed using instructions and parameters that have undergone the same transformation process, without the need to decrypt the data being analyzed. The transformed client data set is a complete representation of the original pre-transformed client data set, and can be transformed back into its original cleartext format without the need to maintain cumbersome translation tables such as those required by data masking and traditional tokenization mechanisms.
[0095] Arotopes from an aromatic that are specified for an odorizer within a schema are input into an odorizer. The odorizer outputs one or more odorants. A single odorizer may output more than one odorant. Each odorant output by a single odorizer will have the same odorant type and odorizer identity, which is an identifier that is unique within the scope of an aromatic-type and uniquely identifies the collection of arotopes and the odorizer that is discretizing them into odorant(s).
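The flow from arotopes through an odorizer to odorants might look like the following sketch. It uses a simplified odorant whose value is a plain string, and the names `Odorizer`, `odorize`, and `arotope_names` are assumptions introduced for this example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Odorant:
    """Simplified odorant: tagged with the odorizer that produced it."""
    odorizer_identity: str
    odorant_type: str
    value: str


@dataclass(frozen=True)
class Odorizer:
    identity: str                                   # unique within an aromatic type
    odorant_type: str
    arotope_names: Tuple[str, ...]                  # arotopes the schema assigns to it
    category_functions: Tuple[Callable[..., str], ...]

    def odorize(self, arotopes: Dict[str, Any]) -> List[Odorant]:
        """Feed the selected arotope values to each category function."""
        inputs = [arotopes[name] for name in self.arotope_names]
        return [
            Odorant(self.identity, self.odorant_type, fn(*inputs))
            for fn in self.category_functions
        ]


age_odorizer = Odorizer(
    identity="odz-age",
    odorant_type="age",
    arotope_names=("age",),
    category_functions=(lambda a: str(a), lambda a: f"{a // 10 * 10}s"),
)

odorants = age_odorizer.odorize({"name": "Ada", "age": 42})
# Two odorants, both with odorant_type "age" and odorizer_identity "odz-age".
```

Only the arotopes named by the odorizer are read, so other fields of the record (here, the name) do not influence the odorants it emits.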
[0096] The original aromatic, along with all of the outputted odorants, is put into an odorized-aromatic object. The aromatic is then serialized. A secret salt that is stored with the set of secret salts (used in the transformation process for each aromatic type) is appended to the serialized aromatic. The resulting value is ciphered using a secure hash function to form a value called the Ate. Then, the aromatic is serialized. A symmetric cipher is applied to the serialized aromatic to form a variable-length binary value called the Ac.
[0097] The aromatic type is then serialized. A secret salt is appended to the serialized aromatic type. The resulting value is ciphered using a secure hash function to form a value called the AtO. Each odorant is then serialized. A secret salt that is stored with the set of secret salts used in the transformation process for each odorant type is appended to the serialized odorant. The resulting value is ciphered using a secure hash function to form a value called the Ot.
[0098] The odorant type is then serialized. A secret salt is appended to the serialized odorant type. The resulting value is ciphered using a secure hash function to form a value called the OtO. The odorizer identity is then serialized. A secret salt that is stored with the set of secret salts used in the transformation process for each aromatic type is appended to the serialized odorizer identity. The resulting value is ciphered using a secure hash function to form a value called the Psi. The odorant discretization order is obtained from the odorant as an array of one-byte signed values called the Rho.
[0099] Each Ot, OtO, Psi and Rho are put into an Aot.Vt object. The Ate, Ac, AtO, and all Aot.Vt(s) are put into an Aot object. The Aot object is the final result of the transformation process from query Qc object to query Qs object.
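The steps in paragraphs [0096] through [0099] can be sketched end to end as follows. This is an illustrative reading, not the disclosed implementation: JSON serialization, SHA-256, the two-entry salt dictionary, and the dictionary layout of the result are all assumptions, and `toy_cipher` is an insecure placeholder standing in for whatever symmetric cipher an embodiment would actually use.

```python
import hashlib
import json
from array import array
from typing import Any, Callable, Dict, List


def serialize(obj: Any) -> bytes:
    """Canonical serialization (assumed: JSON with sorted keys)."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def salted_hash(serialized: bytes, salt: bytes) -> str:
    """Append the secret salt, then apply a secure hash (assumed: SHA-256)."""
    return hashlib.sha256(serialized + salt).hexdigest()


def toy_cipher(data: bytes, key: bytes) -> bytes:
    """PLACEHOLDER ONLY: repeating-key XOR is not secure. Swap in a real cipher."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def transform(
    aromatic: Dict[str, Any],
    odorants: List[Dict[str, Any]],
    salts: Dict[str, bytes],
    key: bytes,
    cipher: Callable[[bytes, bytes], bytes] = toy_cipher,
) -> Dict[str, Any]:
    record = serialize(aromatic)
    vts = []
    for od in odorants:
        vts.append({
            # Assumption: the serialized odorant is its type and value.
            "Ot": salted_hash(serialize({"type": od["type"], "value": od["value"]}), salts["odorant"]),
            "OtO": salted_hash(serialize(od["type"]), salts["odorant"]),
            "Psi": salted_hash(serialize(od["odorizer"]), salts["aromatic"]),
            "Rho": array("b", od["order"]),  # one-byte signed values
        })
    return {
        "Ate": salted_hash(record, salts["aromatic"]),
        "Ac": cipher(record, key),
        "AtO": salted_hash(serialize(aromatic["type"]), salts["aromatic"]),
        "Vt": vts,
    }


aromatic = {"type": "patient", "fields": {"name": "Ada", "age": 42}}
odorants = [{"type": "age", "odorizer": "odz-age", "value": "40-49", "order": [1]}]
salts = {"aromatic": b"salt-a", "odorant": b"salt-o"}  # hypothetical secrets
key = b"demo-key"

aot = transform(aromatic, odorants, salts, key)
```

The hashed members are deterministic, so repeating the transformation with the same salts yields the same Ate, AtO, Ot, OtO, and Psi values, while only the Ac member can be reversed, and only by a holder of the symmetric key.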
[00100] The above description illustrates various embodiments of the present invention along with examples of how aspects of the present invention may be implemented. The above examples and embodiments should not be deemed to be the only embodiments, and are presented to illustrate the flexibility and advantages of the present invention as defined by the following claims. Based on the above disclosure and the following claims, other arrangements, embodiments, implementations and equivalents will be evident to those skilled in the art and may be employed without departing from the spirit and scope of the invention as defined by the claims.
Claims
1. A system for the analysis of secure data comprising:
one or more client instances in communication with a gateway subsystem,
wherein each said client instance is adapted to,
define a client data set configured based on a first schema comprising at least a first data structure,
transmit said client data set to said gateway subsystem,
define a client query based on said client data set and said first schema, and,
transmit said client query to said gateway subsystem;
said gateway subsystem, comprising,
a schema store comprising said first schema,
wherein said first schema comprises a generalization hierarchy adapted to define identification values to portions of said client data set based on analysis parameters related to said generalization hierarchy,
a transformation function adapted to
receive said client data set from said client instance and apply said first schema to said client data set to create a schema-based data structure,
a query generator adapted to
receive said client query from said client instance, and apply said first schema to said client query to create a converted query, and
a transformation function adapted to
generate secure data by applying a hashing algorithm to said schema-based data structure,
generate a transformed query by applying a hashing algorithm to said converted query, and,
transmit said secure data and/or transformed query to a cloud-based server system, said cloud-based server system being in communication with said client service; and
said cloud-based server system, comprising,
a server service adapted to receive said secure data, and
a secure data store adapted to
receive and store said secure data,
receive and execute said transformed query to provide at least a first result, said first analysis result comprising at least a first value identified within said secure data, and
transmit said at least first analysis result to said client service, wherein said client service transmits said at least first analysis result to said client instance.
2. The system of claim 1 wherein said schema-based structure comprises:
a primary data structure type, and
a vector of first secondary data structures, said vector of first secondary data structures comprising,
at least one first secondary data structure identity,
at least one first secondary data structure type,
at least one first secondary data structure value; and
a vector of second secondary data structures, said vector of second secondary data structures comprising,
at least one second secondary data structure identity,
at least one second secondary data structure type, and
at least one second secondary data structure value derived from a first data subset, said first data subset comprising,
a first data subset structure type, and
a vector of a second data subset, said vector of a second data subset comprising,
at least one second data subset identity,
at least one second data subset structure type, and
at least one second data subset value, and
wherein the first data subset corresponds to said primary data structure type, wherein said first data subset is derived from at least one first secondary data structure value, wherein data of said first data subset of the same first data subset structure type have identical vectors of said second data subsets, and said at primary data structure is capable of deriving at least one said first data subset.
3. The system of claim 1, further comprising at least a second client instance in communication with said gateway subsystem.
4. The system of claim 1, further comprising a plurality of client instances in communication with said gateway subsystem.
5. A system for securely storing and analyzing secure data, comprising:
at least one client instance comprising a data store, a client query engine, and a data gathering engine;
a gateway subsystem comprising a transformation function, a secure query generator, and a schema store comprising one or more unsecure data field definitions, wherein said transformation function comprises a transformation function adapted to receive unsecure data based on said unsecure data field definitions, generate a transformed data set by securely hashing said unsecure data, and generate an identifier for said transformed data set; and
a server subsystem comprising a searchable secure data store adapted to store said transformed data set and said identifier.
6. The system of claim 5, wherein said data gathering agent accesses unsecure data having fields corresponding to said unsecure data field definitions.
7. The system of claim 6, wherein said data gathering agent is adapted to transmit unsecure data to said transformation function based on said unsecure data field definitions.
8. The system of claim 7, wherein said data gathering agent is adapted to encrypt said unsecure data prior to transmission and said transformation function.
9. The system of claim 5, wherein said transformation function is adapted to discretize said unsecure data.
10. A system for performing data analytics on secure data comprising:
a schema store comprising a plurality of unsecure data field definitions,
a secure data store comprising secure data and identifiers,
a secure query generator adapted to,
receive a query based upon said unsecure data field definitions,
convert said query to a secure query adapted to return identifiers of secure data,
return results from said secure query,
whereby said secure query executes on said secure data and patterns identified in said secure data correspond to patterns in unsecure data stored according to said unsecure data field definitions.
11. The system of claim 10 wherein said secure data further comprises encrypted data.
12. The system of claim 10 further comprising a client query agent having access to unsecure data having fields corresponding to said unsecure data field definitions, said client query agent being adapted to transmit said query to said secure query generator and receive said results.
13. A system for searching secure data comprising:
a schema store comprising a plurality of unsecure data field definitions and an identifier;
a transformation function adapted to receive unsecure data and generate a transformed data set comprising a token and one or more identifiers;
a secure query generator adapted to
receive unsecure queries based on said unsecure data field definitions,
convert said unsecure data field definitions into said secure pattern-matching queries, and
transmit said secure pattern-matching queries to said secure data store;
a secure data store adapted to receive secure pattern-matching queries, said transformed data sets, and said one or more identifiers; and
a secure data store adapted to store said transformed data sets and said one or more identifiers.
14. A method for storing secure data, comprising the steps of:
generating a client data set using a data gathering agent in communication with a data store having unsecure data having fields corresponding to unsecure data field definitions;
configuring said client data set based on a schema that defines associations between said fields corresponding to unsecure data field definitions;
encrypting said client data set;
applying a schema to said client data set to create a schema-based data structure;
one-way encrypting said schema-based data structure to create secure data;
generating an identifier to provide a mapping from said secure data back to said client data set;
storing said secure data together with said identifier in a database.
15. The method of claim 14, wherein the one-way encrypting step includes applying a hashing algorithm.
16. A method for querying secure data comprising the steps of:
accessing a client query engine to create a query to be executed on secure data;
transmitting said query to a secure query generator;
applying a schema to said query to create a converted query;
transforming the converted query to create a transformed query;
executing said transformed query on said secure data;
returning results from the executed query to said secure query generator; and
using an identifier to reverse-transform the results to identify the cleartext values that correlate to said results.
17. The method of claim 16, further comprising the step of transmitting a non-repudiated signature with the query.
18. The method of claim 17, further comprising the step of transmitting said non-repudiated signature and query to an audit service.
19. A system for querying stored data, comprising:
at least one aromatic comprising,
a data structure type,
a vector of arotopes, said arotopes comprising an arotope identity, arotope type, and arotope value, and
a vector of odorizers, said odorizers comprising an odorizer identity and a vector of category functions, said odorizers being derived from a collection of odorants, said odorants comprising an odorant type and vector of odotopes, said odotopes comprising an odotope identity, odotope type, and odotope value,
wherein said collection of odorants correspond to said data structure type, wherein said odorants are derived from at least one arotope value, wherein odorants of the same odorant type have identical vectors of odotopes, and said at least one aromatic is capable of deriving at least one odorant.
20. The system of claim 19, wherein said aromatic comprises a first data structure, and said arotopes comprise a first set of data identifiers from said first data structure.
21. The system of claim 20, wherein said odorizers comprise at least a second set of data identifiers from said first data structure, and wherein said category functions define said generalization hierarchy of said first schema.
22. The system of claim 21, wherein said category functions are user-defined.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201662414081P | 2016-10-28 | 2016-10-28 | |
US62/414,081 | 2016-10-28 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2018080857A1 (en) | 2018-05-03 |
Family
ID=62025384
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2017/057075 WO2018080857A1 (en) | 2016-10-28 | 2017-10-18 | Systems and methods for creating, storing, and analyzing secure data |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2018080857A1 (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210160048A1 (en) * | 2019-11-27 | 2021-05-27 | Duality Technologies, Inc. | Recursive algorithms with delayed computations performed in a homomorphically encrypted space |
US20220019687A1 (en) * | 2019-06-13 | 2022-01-20 | Phennecs, LLC | Systems for and methods of data obfuscation |
US11321382B2 (en) | 2020-02-11 | 2022-05-03 | International Business Machines Corporation | Secure matching and identification of patterns |
US12197615B2 (en) | 2022-07-19 | 2025-01-14 | IronCore Labs, Inc. | Secured search for ready-made search software |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5623601A (en) * | 1994-11-18 | 1997-04-22 | Milkway Networks Corporation | Apparatus and method for providing a secure gateway for communication and data exchanges between networks |
US20130041888A1 (en) * | 2005-12-01 | 2013-02-14 | Firestar Software, Inc. | System and method for exchanging information among exchange applications |
US20130191650A1 (en) * | 2012-01-25 | 2013-07-25 | Massachusetts Institute Of Technology | Methods and apparatus for securing a database |
US20140280260A1 (en) * | 2013-03-15 | 2014-09-18 | Eric Boukobza | Method, apparatus, and computer-readable medium for data tokenization |
US20140280193A1 (en) * | 2013-03-13 | 2014-09-18 | Salesforce.Com, Inc. | Systems, methods, and apparatuses for implementing a similar command with a predictive query interface |
US8925087B1 (en) * | 2009-06-19 | 2014-12-30 | Trend Micro Incorporated | Apparatus and methods for in-the-cloud identification of spam and/or malware |
US20150244517A1 (en) * | 2012-03-26 | 2015-08-27 | Newline Software, Inc. | Computer-Implemented System And Method For Providing Secure Data Processing In A Cloud Using Discrete Homomorphic Encryption |
-
2017
- 2017-10-18 WO PCT/US2017/057075 patent/WO2018080857A1/en active Application Filing
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5623601A (en) * | 1994-11-18 | 1997-04-22 | Milkway Networks Corporation | Apparatus and method for providing a secure gateway for communication and data exchanges between networks |
US20130041888A1 (en) * | 2005-12-01 | 2013-02-14 | Firestar Software, Inc. | System and method for exchanging information among exchange applications |
US8925087B1 (en) * | 2009-06-19 | 2014-12-30 | Trend Micro Incorporated | Apparatus and methods for in-the-cloud identification of spam and/or malware |
US20130191650A1 (en) * | 2012-01-25 | 2013-07-25 | Massachusetts Institute Of Technology | Methods and apparatus for securing a database |
US20150244517A1 (en) * | 2012-03-26 | 2015-08-27 | Newline Software, Inc. | Computer-Implemented System And Method For Providing Secure Data Processing In A Cloud Using Discrete Homomorphic Encryption |
US20140280193A1 (en) * | 2013-03-13 | 2014-09-18 | Salesforce.Com, Inc. | Systems, methods, and apparatuses for implementing a similar command with a predictive query interface |
US20140280260A1 (en) * | 2013-03-15 | 2014-09-18 | Eric Boukobza | Method, apparatus, and computer-readable medium for data tokenization |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220019687A1 (en) * | 2019-06-13 | 2022-01-20 | Phennecs, LLC | Systems for and methods of data obfuscation |
US20210160048A1 (en) * | 2019-11-27 | 2021-05-27 | Duality Technologies, Inc. | Recursive algorithms with delayed computations performed in a homomorphically encrypted space |
US11616635B2 (en) * | 2019-11-27 | 2023-03-28 | Duality Technologies, Inc. | Recursive algorithms with delayed computations performed in a homomorphically encrypted space |
US11321382B2 (en) | 2020-02-11 | 2022-05-03 | International Business Machines Corporation | Secure matching and identification of patterns |
US11663263B2 (en) | 2020-02-11 | 2023-05-30 | International Business Machines Corporation | Secure matching and identification of patterns |
US11816142B2 (en) | 2020-02-11 | 2023-11-14 | International Business Machines Corporation | Secure matching and identification of patterns |
US12197615B2 (en) | 2022-07-19 | 2025-01-14 | IronCore Labs, Inc. | Secured search for ready-made search software |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 17866262; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 17866262; Country of ref document: EP; Kind code of ref document: A1 |