Background
In recent years, deep learning has received extensive attention from academia and industry because of its ability to mine useful knowledge from large-scale data. It has been applied in a wide range of fields and has produced a number of remarkable breakthroughs. However, few studies have addressed the accompanying privacy concerns, which grow more pressing as ever larger amounts of data are involved. For example, medical data may include private patient information such as diseases, family history, and DNA sequences, while financial institutions such as banks store sensitive information for many customers; once such data is leaked or analyzed by an adversary, the resulting losses can be immeasurable and may even threaten personal safety. Therefore, while these technologies benefit mankind and accelerate social development, more attention should be paid to the potential problem of privacy disclosure.
In recent years, scholars have proposed several privacy-preserving deep learning methods; however, most of them suffer from reduced efficiency or reduced accuracy. In particular, methods based on differential privacy protect data by adding noise, which degrades the accuracy and utility of the data, while methods based on homomorphic encryption generally incur high computational costs, and their efficiency can become intolerable in scenarios with large-scale data.
The Negative Database (NDB) is a new form of information representation inspired by the negative selection mechanism of the artificial immune system. An NDB stores the complement set of a database DB to achieve privacy protection, and, like a conventional database, it supports insertion, deletion, update, and selection operations. Reversing a negative database to recover the raw data has been proven to be an NP-hard problem. Furthermore, an NDB supports coarse-grained distance estimation. These characteristics make it suitable for many privacy-related areas, for example password authentication, information hiding, biometric authentication, and data mining.
Although effective privacy protection can be achieved by combining negative databases with deep learning, the degree of protection can still be improved, particularly under extreme parameter settings. The exclusive-or (XOR) operation is simple and efficient and is applied in many scenarios; if the data is XORed with a randomly generated binary string before being converted into a negative database, its privacy is protected further.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a data privacy protection method and system based on a ciphertext negative database combined with a deep learning algorithm.
The technical scheme adopted by the invention is as follows: a data privacy protection method based on ciphertext negative databases and deep learning, which adopts a data privacy protection model to protect data privacy.
The data privacy protection model is obtained through the following steps:
Step 1: preprocessing the original data and converting it into binary strings X = {x1, …, xn};
Step 2: performing exclusive-or encryption between a secret key K of specified length and the data processed in step 1 to obtain the encrypted data X' = {x1', …, xn'};
Step 3: selecting a negative database generation algorithm and generating the corresponding negative databases NDB = {NDB1, …, NDBn} from the data X' = {x1', …, xn'} encrypted in step 2;
Step 4: extracting the sketch S = {S1, …, Sn} of the negative databases from step 3, wherein Si is the sketch of NDBi;
Step 5: based on the sketch S, completing the negative-database-based activation function estimation and training the deep learning network until it converges, thereby obtaining the trained data privacy protection model.
In view of the characteristics of current negative databases and of the XOR operation, the invention provides a ciphertext-based negative database privacy protection method. It addresses the difficulty of balancing privacy and utility in differential privacy and the excessive computational cost of methods such as homomorphic encryption, offers stronger robustness, and comprehensively improves both efficiency and accuracy in privacy-preserving deep learning.
Detailed Description
To facilitate understanding and practice of the invention by those of ordinary skill in the art, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the embodiments described herein are for illustration and explanation only and are not intended to limit the invention.
Referring to fig. 1 and fig. 2, the invention provides a data privacy protection method based on ciphertext negative databases and deep learning, which adopts a data privacy protection model to protect data privacy;
In this embodiment, the data privacy protection model is obtained through the following steps:
Step 1: preprocessing the original data and converting it into binary strings X = {x1, …, xn};
Step 2: performing exclusive-or encryption between a secret key K of specified length and the data processed in step 1 to obtain the encrypted data X' = {x1', …, xn'};
In this embodiment, the original data X = {x1, …, xn} is encrypted into X' = {x1', …, xn'} using a randomly generated key K;
The key K is a binary string of length len_xor. If x = xk, the j-th bit of the i-th attribute of x is XORed with the ((i×L+j) % len_xor)-th bit of the key K, where L denotes the length of each attribute. Fig. 3 shows an example of the exclusive-or encryption.
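The XOR step above can be sketched in Python as follows. The function name and the fixed-length attribute layout are illustrative assumptions; the key-indexing rule follows the (i×L+j) % len_xor scheme described above.

```python
def xor_encrypt(record_bits, key_bits, L):
    """XOR-encrypt one record whose attributes are concatenated
    fixed-length binary strings of L bits each.

    record_bits, key_bits: strings over '0'/'1'.
    Bit j of attribute i is XORed with key bit (i*L + j) % len(key_bits).
    """
    len_xor = len(key_bits)
    out = []
    for pos, b in enumerate(record_bits):
        i, j = divmod(pos, L)                      # attribute index, bit offset
        k = key_bits[(i * L + j) % len_xor]
        out.append('1' if b != k else '0')         # bitwise XOR
    return ''.join(out)

# XOR with the same key decrypts (XOR is an involution)
cipher = xor_encrypt('10110100', '1101', L=4)
plain = xor_encrypt(cipher, '1101', L=4)           # recovers '10110100'
```

Because XOR is its own inverse, the data owner can recover the plaintext with the same key, while a party without K sees only the ciphertext.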
In this embodiment, keys of different lengths can be generated according to the characteristics of the original data so as to achieve different degrees of privacy protection. The embodiment employs the QK-hidden negative database generation algorithm, which controls the distribution of negative database records at a finer granularity through a set of parameters Q, thereby making the subsequent computation more accurate.
Step 3: selecting a negative database generation algorithm and generating the corresponding negative databases NDB = {NDB1, …, NDBn} from the data X' = {x1', …, xn'} encrypted in step 2;
In this embodiment, the corresponding negative databases NDB = {NDB1, …, NDBn} are generated from the encrypted data X' = {x1', …, xn'} using the QK-hidden algorithm, where NDBi (i = 1, …, n) is the negative database generated from the ciphertext xi'.
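The full QK-hidden algorithm is not reproduced in this description; the following simplified sketch only illustrates the general shape of K-hidden-style generation, in which each record specifies K bits (the rest are wildcards '*') and a record of type j has exactly j of its specified bits flipped relative to the hidden string, type j being drawn with probability Q[j-1]. All names and the sampling structure are illustrative assumptions, not the patented algorithm.

```python
import random

def generate_ndb(hidden, K, Q, num_records, rng=random):
    """Illustrative K-hidden-style negative database generator.

    hidden: binary string to hide; K: number of specified bits per record;
    Q: probabilities [p1, ..., pK] of generating a type-j record
       (a record whose j specified bits differ from `hidden`).
    Each returned record is a string over {'0', '1', '*'} that does
    NOT match `hidden`, because at least one specified bit is flipped.
    """
    m = len(hidden)
    ndb = []
    for _ in range(num_records):
        j = rng.choices(range(1, K + 1), weights=Q)[0]   # record type
        positions = rng.sample(range(m), K)              # specified positions
        flipped = set(rng.sample(positions, j))          # j of them flipped
        record = ['*'] * m
        for p in positions:
            bit = hidden[p]
            record[p] = ('1' if bit == '0' else '0') if p in flipped else bit
        ndb.append(''.join(record))
    return ndb
```

Since every record disagrees with the hidden string in at least one specified bit, the hidden string itself is never covered by the negative database, which is the defining property exploited here.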
Step 4: extracting the sketch S = {S1, …, Sn} of the negative databases from step 3, wherein Si is the sketch of NDBi, and uploading the sketch to a high-performance server;
In this embodiment, the sketch S = {S1, …, Sn} is extracted from the negative databases in NDB, where Si is the sketch of NDBi, and S together with the label data Y = {y1, …, yn} is uploaded to the server.
This embodiment improves efficiency by extracting the sketches of the negative databases. A sketch is a two-dimensional array that stores, for each bit position, the number of negative database records in which that bit is '0' and the number in which it is '1'; it compresses the negative database while improving security.
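The sketch described above amounts to a per-position count over the NDB records; a minimal sketch of that extraction (function name assumed) is:

```python
def extract_sketch(ndb_records, m):
    """Build the sketch of a negative database: for every bit position,
    count how many records specify '0' and how many specify '1'.
    Wildcard positions ('*') contribute to neither count."""
    sketch = [[0, 0] for _ in range(m)]   # sketch[pos] = [count of '0', count of '1']
    for rec in ndb_records:
        for pos, c in enumerate(rec):
            if c == '0':
                sketch[pos][0] += 1
            elif c == '1':
                sketch[pos][1] += 1
    return sketch

sk = extract_sketch(['0*1', '*11', '10*'], m=3)
# position 0: one '0' and one '1'; position 1: one '0' and one '1'; position 2: two '1'
```

Only these counts, not the records themselves, need to leave the client, which is why uploading the sketch both compresses the data and reduces exposure.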
Step 5: the server receives the sketch uploaded in step 4, completes the negative-database-based activation function estimation, and trains the deep learning network until it converges, obtaining the trained data privacy protection model.
Referring to fig. 4, in this embodiment, since the sketch received by the server, rather than the original private data, is input to the neural network, the activation functions can no longer be computed from the original data. Activation functions such as Sigmoid, ReLU, and tanh are originally computed in the neural network as follows:
Sigmoid(z) = 1 / (1 + e^(-z)) (1)
ReLU(z) = max(0, z) (2)
tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z)) (3)
where z denotes the linear computation in the neuron, z = w1·x1 + … + wM·xM; X = {x1, …, xM} represents the original private data, x ∈ X, M represents the number of input attributes of x, and w1, …, wM represent the weight parameters of the neural network;
If the input is x and the negative database generated from it is NDBx, the probability Pdiff[i] that the i-th bit of a record in NDBx differs from x is calculated according to formula (4);
wherein K indicates that there are K types of negative database records; a record of the j-th type has j of its determined bits opposite to the corresponding bits of the hidden string, while its remaining K−j determined bits are identical to the hidden string; pj represents the probability of generating a record of the j-th type; qi represents the probability that the i-th bit of the selected attribute differs from the hidden string at the corresponding position; and L represents the number of bits of an attribute;
The probability that the j-th bit of the i-th attribute of the hidden string x corresponding to NDBx is '0' is calculated through formula (5);
wherein Psame[j] represents the probability that the j-th bit of a record in NDBx is the same as the corresponding bit of the hidden string; n0 and n1 are, respectively, the numbers of records in NDBx whose j-th bit of the i-th attribute is '0' or '1', and are obtained from the sketch; if x = xk, then n0 = Sk[i×L+j][0] and n1 = Sk[i×L+j][1];
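Formula (5) itself is not reproduced above. One plausible reconstruction, under the assumption that each specified bit of a record independently differs from the hidden string with probability Pdiff, is a Bayesian estimate with a uniform prior over the two bit values; the invention's actual formula (5) may differ, so the following is only an illustrative sketch.

```python
import math

def prob_bit_zero(n0, n1, p_diff):
    """Estimate P(hidden bit = '0') from sketch counts.

    n0, n1: numbers of NDB records specifying '0' / '1' at this position.
    p_diff: probability that a specified record bit differs from the
    hidden string (0 < p_diff < 1). The independence model and uniform
    prior are assumptions, not the patent's formula (5).
    Log-likelihoods are used for numerical stability.
    """
    p_same = 1.0 - p_diff
    # If the hidden bit is '0': records show '0' w.p. p_same, '1' w.p. p_diff.
    ll0 = n0 * math.log(p_same) + n1 * math.log(p_diff)
    # If the hidden bit is '1': the roles are swapped.
    ll1 = n0 * math.log(p_diff) + n1 * math.log(p_same)
    mx = max(ll0, ll1)
    w0, w1 = math.exp(ll0 - mx), math.exp(ll1 - mx)
    return w0 / (w0 + w1)
```

With p_diff below 0.5, a majority of '0' counts pushes the estimate toward '0', matching the intuition that most record bits agree with the hidden string.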
The probability that the i-th attribute of x equals d is calculated through formula (6), wherein 0 ≤ d ≤ 2^L − 1, and d_bin = b1…bL denotes the binary representation of d, in the same form as the binary representation of xi;
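Formula (6) is likewise not reproduced above. A natural reconstruction, assuming the bits of an attribute are estimated independently, multiplies the per-bit probabilities; p0[j] here is a hypothetical name for the estimated probability that bit j of the attribute is '0'.

```python
def prob_attribute_equals(p0, d, L):
    """Estimate P(attribute = d) for 0 <= d <= 2**L - 1 from per-bit
    probabilities p0[j] = P(bit j = '0'), assuming bit independence
    (an illustrative assumption; the patent's formula (6) may differ)."""
    bits = format(d, f'0{L}b')            # binary representation b1...bL of d
    prob = 1.0
    for j, b in enumerate(bits):
        prob *= p0[j] if b == '0' else 1.0 - p0[j]
    return prob

# Under this model the distribution over all 2**L values sums to 1:
p0 = [0.9, 0.2, 0.6]
dist = [prob_attribute_equals(p0, d, L=3) for d in range(8)]
```

The resulting distribution over attribute values is what allows an expected value of each input, and hence of z, to be formed without ever seeing the plaintext.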
The estimate of z obtained through formula (7) is then substituted for z in formulas (1), (2), and (3) to complete the estimation of the activation functions;
The deep learning network is trained as follows: in each iteration, a batch of data of size t is selected and forward propagation is completed through formula (7); back propagation is then performed through formula (8) to calculate the gradient; finally, the weights W = {w1, …, wn} are updated through formula (9) until the parameters reach the optimum or the maximum number of iterations is reached;
where Y = {y1, …, yn} is the label data corresponding to the input X = {x1, …, xn}, Loss represents the loss function, and η is the learning rate.
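The estimation and training steps above can be combined into a minimal sketch: expected attribute values stand in for the private inputs, z is formed as in formula (7), the sigmoid of formula (1) is applied, and gradient descent proceeds as usual. The single logistic neuron and all names here are illustrative assumptions, not the invention's full network or its formulas (8) and (9).

```python
import math

def sigmoid(z):
    """Sigmoid activation, as in formula (1)."""
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(expected_x, y, lr=0.1, epochs=200):
    """Gradient-descent training of one logistic neuron on expected
    attribute values (stand-ins for the private inputs).

    expected_x: samples, each a list of E[x_i] estimates; y: 0/1 labels.
    Returns learned weights, with the bias folded in as w[-1]."""
    M = len(expected_x[0])
    w = [0.0] * (M + 1)
    for _ in range(epochs):
        for xs, label in zip(expected_x, y):
            z = sum(wi * xi for wi, xi in zip(w, xs)) + w[-1]  # formula-(7)-style estimate of z
            a = sigmoid(z)                                     # activation
            err = a - label                                    # dLoss/dz for cross-entropy loss
            for i in range(M):
                w[i] -= lr * err * xs[i]                       # weight update, as in formula (9)
            w[-1] -= lr * err                                  # bias update
    return w
```

On separable expected inputs this converges to a separating weight vector, so the server can fit the model while only ever handling sketch-derived estimates.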
In the testing stage, the user transforms the test data through steps 1 and 2 and uploads it to the server; the server makes predictions on the test data using the trained model and returns the results to the client.
It should be understood that the foregoing description of the preferred embodiments is not intended to limit the scope of the invention, which is defined by the appended claims; those skilled in the art may make substitutions or modifications without departing from the scope of the invention as set forth in the appended claims, and such substitutions and modifications shall fall within the protection scope of the invention.