
CN113938265B - Information de-identification method and device and electronic equipment - Google Patents


Info

Publication number: CN113938265B
Application number: CN202010677009.1A
Authority: CN (China)
Prior art keywords: attack, defending, determining, probability, model
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN113938265A
Inventors: 于乐, 江为强, 张峰, 李祥军, 安宝宇
Assignee: China Mobile Communications Group Co Ltd
Events: application CN202010677009.1A filed by China Mobile Communications Group Co Ltd; publication of CN113938265A; application granted; publication of CN113938265B

Classifications

    • H: Electricity
    • H04: Electric communication technique
    • H04L: Transmission of digital information, e.g. telegraphic communication
    • H04L 9/00: Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; network security protocols
    • H04L 9/002: Countermeasures against attacks on cryptographic mechanisms
    • G: Physics
    • G06: Computing; calculating or counting
    • G06F: Electric digital data processing
    • G06F 17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10: Complex mathematical operations
    • G06F 17/16: Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G06F 17/18: Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis


Abstract

The application provides an information de-identification method and device and an electronic device, relating to the technical field of information security. In the embodiments of the application, the probability that an attack by the attack side succeeds and the probability that the defending side defends successfully are determined according to a first model parameter and a second model parameter, and from these the attack capability of the attack side and the defending capability of the defending side are determined. Target model parameters are then determined from the selectable model parameters for which the defending capability is greater than the attack capability; a de-identification model is established according to the target model parameters and used to de-identify the target data. This improves the rationality of the model parameter settings of the de-identification model, so that the model not only completes the de-identification of the target data and reduces the risk of re-identification, but also preserves the usability of the target data.

Description

Information de-identification method and device and electronic equipment
[ field of technology ]
The present disclosure relates to the field of information security technologies, and in particular, to a method and an apparatus for de-identifying information, and an electronic device.
[ background Art ]
With the rapid development of networked information and computer technology, society and the information in networks are moving continuously towards information sharing and resource reciprocity. At the same time, in order to reduce the risk of personal-information leakage during sharing, information must be de-identified before it is released. Information de-identification refers to the process of removing the association between a set of identifying information and the subjects it corresponds to, so as to prevent leakage of personal information. When information is published, the common de-identification models mainly comprise the k-anonymity model, the l-diversity model and the t-closeness model. The key idea of these de-identification models is as follows: every record under a data attribute is required to belong to a group of at least k (or l, or t) records, forming an equivalent group, so that an attacker cannot associate the data with its corresponding subject.
The existing methods for establishing a de-identification model have the following problem: when the model is established, the k (or l, or t) value is chosen according to the personal, subjective judgement of the model creator. If the value is unreasonable, the resulting de-identification model either fails to hide the information, or hides it at the cost of severe distortion, greatly reducing the usability of the information.
[ invention ]
The embodiments of the present application provide an information de-identification method and device and an electronic device, so that the model parameter k (or l, or t) of the de-identification model is determined reasonably: the de-identification model can complete the information hiding while the usability of the information is preserved to the greatest extent.
In a first aspect, an embodiment of the present application provides an information de-identifying method, including: determining the probability of success of attack on the attack side according to the first model parameters, and determining the probability of success of defending on the defending side according to the second model parameters; determining the attack capacity of the attack side and the defending capacity of the defending side according to the probability of successful attack and the probability of successful defending; determining optional model parameters enabling the defending capability to be larger than the attack capability, and determining target model parameters from the optional model parameters; establishing a de-identification model according to the target model parameters; and de-identifying the target data by using the de-identifying model.
In one possible implementation manner, an attack channel is formed by simulation between a first input variable and an output variable of an attack side, and a defending channel is formed by simulation between a second input variable and an output variable of a defending side; according to the probability of success of the attack and the probability of success of the defending, determining the attack capacity of the attack side and the defending capacity of the defending side comprises the following steps: determining a first channel capacity of an attack channel according to the probability of success of the attack and the probability of success of the defending; determining the attack capacity of an attack side according to the first channel capacity; determining a second channel capacity of a defending channel according to the probability of successful attack and the probability of successful defending; and determining the defending capacity of the defending side according to the second channel capacity.
In one possible implementation manner, determining the first channel capacity of the attack channel according to the probability of success of the attack and the probability of success of the defending includes: determining a first joint probability distribution between the first input variable and the output variable of the attack channel according to the probability of success of the attack and the probability of success of the defending; and determining first mutual information between a first input variable and an output variable according to the first joint probability distribution, and determining the first channel capacity based on the first mutual information.
In one possible implementation manner, determining the second channel capacity of the defending channel according to the probability of successful attack and the probability of successful defending includes: determining a second joint probability distribution between the second input variable and the output variable of the defending channel according to the probability of successful attack and the probability of successful defending; and determining second mutual information between a second input variable and an output variable according to the second joint probability distribution, and determining the second channel capacity based on the second mutual information.
In one possible implementation manner, the second model parameter is a variable in the representation of the attack capability, and the first model parameter is a first fixed value; the first model parameter is a variable and the second model parameter is a second fixed value in the representation of the defensive capacity; determining optional model parameters that make the defensive capability greater than the attack capability, comprising: and when the first fixed value and the second fixed value take the same value, determining optional model parameters which enable the defending capability to be larger than the attack capability.
In one possible implementation, determining the size of the equivalent group in the de-identified model according to the target model parameters; and establishing a de-identification model according to the determined size of the equivalent group.
In one possible implementation manner, the de-identifying the target data by using the de-identifying model includes: grouping target data according to the size of the equivalent group in the de-identification model; and performing de-identification processing on the data corresponding to the target attribute in each group of data according to the grouping result to obtain de-identified target data.
In a second aspect, an embodiment of the present application provides an information de-identifying apparatus, including: the determining module is used for determining the probability of success of attack on the attack side according to the first model parameters and determining the probability of success of defending on the defending side according to the second model parameters; determining the attack capacity of the attack side and the defending capacity of the defending side according to the probability of successful attack and the probability of successful defending; determining optional model parameters enabling the defending capability to be larger than the attack capability, and determining target model parameters from the optional model parameters; the model building module is used for building a de-identification model according to the target model parameters; and the de-identification module is used for de-identifying the target data by utilizing the de-identification model.
In a third aspect, an embodiment of the present application provides an electronic device, including: at least one processor; and at least one memory communicatively coupled to the processor, wherein: the memory stores program instructions executable by the processor, which are called by the processor to perform the method as described above.
In a fourth aspect, embodiments of the present application provide a non-transitory computer-readable storage medium storing computer instructions that cause the computer to perform a method as described above.
In the technical scheme, the probability of success of attack on the attack side and the probability of success of defending on the defending side are determined according to the first model parameter and the second model parameter; and further determining the attack capacity of the attack side and the defending capacity of the defending side. Determining target model parameters from the selectable model parameters enabling defensive capacity to be larger than attack capacity; and establishing a de-identification model according to the target model parameters, and de-identifying the target data by using the de-identification model. According to the embodiment of the invention, the rationality of the model parameter setting of the de-identification model can be improved, so that the de-identification model can not only finish de-identification of target data and reduce the risk of re-identification, but also ensure the usability of the target data.
[ description of the drawings ]
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of one embodiment of a method for de-identifying information of the present application;
FIG. 2 is a flow chart of another embodiment of a method for de-identifying information of the present application;
FIG. 3 is a graph of average mutual information change of an attack channel in the method for de-identifying information according to the present application;
FIG. 4 is a graph of average mutual information amount variation of a guard channel in the method for de-identifying information according to the present application;
FIG. 5 is a comparison graph of attack capability and defending capability in the method for de-identifying information according to the present application;
FIG. 6 is a schematic structural view of an embodiment of the information de-identification device of the present application;
fig. 7 is a schematic structural diagram of an embodiment of an electronic device of the present application.
[ detailed description ] of the invention
For a better understanding of the technical solutions of the present application, embodiments of the present application are described in detail below with reference to the accompanying drawings.
It should be understood that the described embodiments are merely some, but not all, of the embodiments of the present application. All other embodiments, based on the embodiments herein, which would be apparent to one of ordinary skill in the art without making any inventive effort, are intended to be within the scope of the present application.
The terminology used in the embodiments of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
When data is published, the common de-identification models are the k-anonymity model, the l-diversity model, the t-closeness model, and so on. The main idea of these de-identification models is: every record under a data attribute is required to belong to a group of at least N records, forming an equivalent group, so that an attacker cannot associate the data with its corresponding subject.

For the k-anonymity model, each equivalent group under a data attribute must contain at least k records; for the l-diversity model, at least l records; and for the t-closeness model, at least t records. Here k, l and t are the model parameters of the respective de-identification models, and the size of the equivalent group in each de-identification model is determined by its model parameter.
FIG. 1 is a flowchart illustrating an embodiment of a method for de-identifying information according to the present application, as shown in FIG. 1, where the method for de-identifying information may include:
step 101, determining the probability of success of attack on the attack side according to the first model parameters, and determining the probability of success of defending on the defending side according to the second model parameters.
In this embodiment, the first model parameter is a model parameter of the attack side de-identification model, and the value of the first model parameter represents the size of an equivalent group in the attack side de-identification model. The second model parameter is a model parameter of the defending side de-identification model, and the value of the second model parameter represents the size of an equivalent group in the defending side de-identification model.
According to the size of the equivalent group in the attack side de-identification model, the probability of success of attack of the attack side and the probability of failure of the attack can be determined; the probability of defending success of the defending side and the probability of defending failure can be determined according to the size of the equivalent group in the defending side de-identification model.
In one specific implementation, if the first model parameter is K1, then the size of the equivalent group in the attack-side de-identification model is K1; the probability of a successful attack on the attack side is 1/K1 and, correspondingly, the probability of attack failure is 1-1/K1. If the second model parameter is K2, then the size of the equivalent group in the defending-side de-identification model is K2; the probability of successful defending on the defending side is 1-1/K2 and, correspondingly, the probability of defending failure is 1/K2.
The above K1 and K2 are only an exemplary representation and are not meant to limit the present embodiments. In practice, the first model parameter and the second model parameter may be set according to the specific de-identification model.
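As an illustrative sketch of the parameterization just described (the function names are hypothetical; the 1/K1 and 1-1/K2 expressions are the exemplary forms above), the two probabilities follow directly from the equivalent-group sizes:

```python
def attack_success_probability(k1: int) -> float:
    # Within an equivalent group of K1 records, a re-identification
    # guess singles out the correct record with probability 1/K1.
    return 1.0 / k1


def defend_success_probability(k2: int) -> float:
    # The defending side succeeds whenever the attacker's guess misses:
    # probability 1 - 1/K2 for an equivalent group of K2 records.
    return 1.0 - 1.0 / k2
```

Larger equivalent groups thus weaken the attack and strengthen the defence, at the cost of coarser published data.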
Step 102, determining the attack capacity of the attack side and the defending capacity of the defending side according to the probability of successful attack and the probability of successful defending.
First, an attack channel is simulated between a first input variable and an output variable on an attack side, and a defending channel is simulated between a second input variable and an output variable on a defending side.
In this embodiment, the attack side refers to the party that initiates the re-identification attack. When an attack is performed on the attack side, the attack procedure can be regarded as a communication procedure inside one channel. When the output information of the channel is consistent with the input information, the channel communication is considered to be successful, namely the attack side is considered to be successful in one attack. Correspondingly, the defending side refers to the party for defending the re-identification attack, and the defending process can also be regarded as a communication process inside a channel, and the channel communication is successful, namely the defending is successful.
Based on the above, the present embodiment analyzes the attack side and the defending side by constructing the attack channel and the defending channel.
Specifically, an attack event on the attack side is used as a first input variable, and a defending event on the defending side is used as a second input variable. Meanwhile, an output variable is constructed according to the corresponding relation between the first input variable and the second input variable.
Furthermore, since any two random variables can form a channel, in combination with the above description of the embodiment, an attack channel is formed by using the simulation of the first input variable and the output variable, and a defending channel is formed by using the simulation of the second input variable and the output variable.
Then, according to the probability of successful attack and the probability of successful defending, determining the first channel capacity of the attack channel; and determining the attack capacity of the attack side according to the first channel capacity. Determining a second channel capacity of the defending channel according to the probability of successful attack and the probability of successful defending; and determining the defending capacity of the defending side according to the second channel capacity.
For the attack side, the attack capability refers to the capability of successfully completing the re-identification attack. The first channel capacity represents the maximum information rate of error-free transmissions within the attack channel, i.e. the maximum capacity to achieve successful communication within the attack channel. Therefore, the first channel capacity is proportional to the attack capability of the attack side, and thus, the value of the first channel capacity is used to represent the attack capability of the attack side in this embodiment.
Similarly, for the defending side, the defending capability refers to the ability to successfully defend against a re-identification attack. The second channel capacity represents the maximum information rate of error-free transmission in the defending channel, i.e. the maximum extent to which successful communication can be achieved within the defending channel. The second channel capacity is therefore proportional to the defending capability of the defending side, and thus, in this embodiment, the value of the second channel capacity is used to represent the defending capability of the defending side.
Specifically, a first joint probability distribution between a first input variable and an output variable of an attack channel is determined according to the probability of success of the attack and the probability of success of the defending. First mutual information between the first input variable and the output variable is determined based on the first joint probability distribution. A first channel capacity is determined based on the first mutual information. The size of the first channel capacity may be used to represent the size of the attack-side attack capability.
And determining second joint probability distribution between a second input variable and an output variable of the defending channel according to the probability of successful attack and the probability of successful defending. And determining second mutual information between the second input variable and the output variable according to the second joint probability distribution. A second channel capacity is determined based on the second mutual information. The size of the second channel capacity may be used to represent the size of the defending-side defending capability.
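The mutual-information computation used for both channels can be sketched as follows; `mutual_information` is a hypothetical helper that takes a joint distribution as a mapping from (input, output) pairs to probabilities and returns the mutual information in bits:

```python
import math


def mutual_information(joint):
    """I(U;V) in bits, from a joint distribution {(u, v): probability}."""
    pu, pv = {}, {}
    for (u, v), p in joint.items():
        pu[u] = pu.get(u, 0.0) + p   # marginal of the input variable
        pv[v] = pv.get(v, 0.0) + p   # marginal of the output variable
    return sum(p * math.log2(p / (pu[u] * pv[v]))
               for (u, v), p in joint.items() if p > 0)
```

For a noiseless binary channel ({(0, 0): 0.5, (1, 1): 0.5}) this returns 1 bit; for independent input and output it returns 0.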
Step 103, determining optional model parameters enabling defending capability to be larger than attack capability, and determining target model parameters from the optional model parameters.
In this embodiment, for the attack side, the first model parameters are determined, and the attack ability is determined by the second model parameters of the defending side. Accordingly, the second model parameters are determined for the defending side, and the defending ability is determined by the first model parameters of the attack side. That is, in the attack-side attack capability representation, the second model parameter is a variable, and the first model parameter is a first fixed value. Similarly, in the defending capability representation of the defending side, the first model parameter is a variable, and the second model parameter is a second fixed value.
Based on the above, when determining the target model parameters, the first fixed value and the second fixed value are first set to the same value. The attack capability of the attack side can then be expressed as a first variation curve with the second model parameter as its variable, and the defending capability of the defending side as a second variation curve with the first model parameter as its variable. The selectable model parameters for which the defending capability is greater than the attack capability are determined from the first and second variation curves, and the smallest selectable model parameter is chosen as the target model parameter of this embodiment.
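A minimal sketch of this selection step, with toy capability curves standing in for the channel-capacity computations of step 102 (all names hypothetical):

```python
def smallest_defensible_parameter(attack_capability, defend_capability, candidates):
    """Return the smallest candidate parameter whose defending capability
    strictly exceeds the attack capability, or None if there is none."""
    for k in sorted(candidates):
        if defend_capability(k) > attack_capability(k):
            return k
    return None
```

With attack capability 1/k and defending capability 1 - 1/k, for example, the smallest defensible parameter among k = 2..9 is k = 3.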
And 104, building a de-identified model according to the target model parameters.
And determining the size of the equivalent group in the de-identification model according to the target model parameters, and establishing the de-identification model.
Specifically, for the k-anonymity model, if the target model parameter is k, the size of the equivalent group in the k-anonymity model is determined to be k; for the l-diversity model, if the target model parameter is l, the size of the equivalent group is determined to be l; and for the t-closeness model, if the target model parameter is t, the size of the equivalent group is determined to be t. Other de-identification models may of course be included; the method is the same as above and is not repeated here.
And 105, de-identifying the target data by using the de-identification model.
First, the target data is grouped according to the size of the equivalent group in the de-identified model.
Specifically, under each data attribute in the target data, records with identical or similar values are placed in the same equivalent group. The number of data records in each equivalent group is determined by the size of the equivalent group.
And then, de-labeling the data corresponding to the target attribute in each group of data according to the grouping result to obtain de-labeled target data.
In this embodiment, an optional manner is to generalize the data according to the value range of the data corresponding to the target attribute in each equivalent group, so as to obtain the de-identified target data.
For example, when the data corresponding to the age attribute in an equivalent group of size 4 are 14, 11, 10 and 15, generalization replaces each of them with the interval [10-15], so each of the four de-identified records reads [10-15].
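The generalization of this example can be sketched as follows (the helper name is hypothetical):

```python
def generalize(values):
    """Replace every value in an equivalent group by the group's range,
    e.g. ages 14, 11, 10, 15 all become the interval string [10-15]."""
    lo, hi = min(values), max(values)
    return [f"[{lo}-{hi}]"] * len(values)
```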
Alternatively, the data is masked according to the data corresponding to the target attribute in each equivalent group, so as to obtain de-identified target data.
For example, when the data corresponding to the identity-card-number attribute are 145864199602270020, 115248199805260247, 105428189506120451 and 155856200612030020, masking the middle digits yields the de-identified target data 145864********0020, 115248********0247, 105428********0451 and 155856********0020, respectively.
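A sketch of such masking, assuming a rule that keeps the first six and last four digits and masks everything in between (the helper name and the exact head/tail split are assumptions for illustration):

```python
def mask_id(number: str, keep_head: int = 6, keep_tail: int = 4) -> str:
    """Mask the middle of an identifier, keeping the head and tail digits."""
    hidden = len(number) - keep_head - keep_tail
    return number[:keep_head] + "*" * hidden + number[-keep_tail:]
```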
In this embodiment, according to the first model parameter and the second model parameter, the probability of success of attack on the attack side and the probability of success of defending on the defending side are determined, so as to construct an attack channel and a defending channel. And determining the attack capacity of the attack side according to the first channel capacity of the attack channel. And determining the defending capacity of the defending side according to the second channel capacity of the defending channel. The target model parameters are determined from among the selectable model parameters that make defensive power greater than attack power. And establishing a de-identification model according to the target model parameters, and de-identifying the target data. The rationality of the model parameter setting of the de-identification model is improved, so that the de-identification model can complete de-identification of target data, plays a role in hiding the data, and can ensure the usability of the target data.
In another embodiment of the present application, the establishment of the attack channel and the defense channel is further described.
First, a first input variable and a second input variable are determined.
The attack event on the attack side is taken as the variable X, i.e. the first input variable: a successful attack is denoted X=1 and a failed attack X=0. The defending event on the defending side is taken as the variable Y, i.e. the second input variable: a successful defence is denoted Y=1 and a failed defence Y=0.
Then, a variable Z, i.e., an output variable, is constructed according to the correspondence between the first input variable and the second input variable. And forming an attack channel and a defending channel by using the output variable simulation.
Specifically, with the first input variable as the input end and the output variable as the output end, an attack channel is formed by simulation. When the attack on the attack side succeeds, the attack channel is considered to have transferred the input-end information to the output end successfully, i.e. the event {Z=0, X=0} or the event {Z=1, X=1} occurs. At this time, the defending-side defence is considered to have failed, i.e. Y=0.

Similarly, with the second input variable as the input end and the output variable as the output end, a defending channel can be formed by simulation. When the defending side defends successfully, the defending channel is considered to have transferred the input-end information to the output end successfully, i.e. the event {Z=0, Y=0} or the event {Z=1, Y=1} occurs. At this time, the attack-side attack is considered to have failed, i.e. X=0.
In yet another embodiment of the present application, a specific method for determining an attack capability of an attack side and a defending capability of a defending side is provided.
FIG. 2 is a flow chart of another embodiment of the method for de-identifying information of the present application. As shown in the figure, in this embodiment, the steps of determining the attack capability of the attack side and the defending capability of the defending side are as follows:
step 201, determining a first joint probability distribution and a second joint probability distribution according to the probability of attack success and the probability of defending success.
And determining a first joint probability distribution between a first input variable and an output variable of the attack channel and a second joint probability distribution between a second input variable and an output variable of the defending channel according to the probability of attack success and the probability of defending success.
After the attacker and the defender have confronted each other for a sufficiently long time, the probability distributions and the joint probability distribution of the first input variable and the second input variable can be obtained as follows:
P_S(attack success) = P_S(X=1) = p
P_S(attack failure) = P_S(X=0) = 1 - p
P_S(defending success) = P_S(Y=1) = q
P_S(defending failure) = P_S(Y=0) = 1 - q
P_S(attack success, defending success) = P_S(X=1, Y=1) = a
P_S(attack success, defending failure) = P_S(X=1, Y=0) = b
P_S(attack failure, defending success) = P_S(X=0, Y=1) = c
P_S(attack failure, defending failure) = P_S(X=0, Y=0) = d
Note that a+b+c+d=1.
For the attack side, the 2×2 transition probability matrix of the attack channel is A = [a(x, z)] = [P_S(z|x)] (x, z = 0 or 1), where P_S(z|x) = P_S(X=x, Z=z) / P_S(X=x). The transition matrix of the attack channel, composed of the first input variable and the output variable, is therefore:
A = [ d/(c+d)  c/(c+d) ]
    [ a/(a+b)  b/(a+b) ]
(rows indexed by x = 0, 1; columns by z = 0, 1).
then, a first joint probability distribution (X, Z) between the first input variable and the output variable of the attack channel is:
P_S(X=0, Z=0) = P_S(X=0, Y=0) = d
P_S(X=0, Z=1) = P_S(X=0, Y=1) = c
P_S(X=1, Z=0) = P_S(X=1, Y=1) = a
P_S(X=1, Z=1) = P_S(X=1, Y=0) = b
For the defending side, the 2×2 transition probability matrix of the defending channel is B = [b(y, z)] = [P_S(z|y)] (y, z = 0 or 1), where P_S(z|y) = P_S(Y=y, Z=z) / P_S(Y=y). The transition matrix of the defending channel, composed of the second input variable and the output variable, is therefore:
B = [ d/(b+d)  b/(b+d) ]
    [ a/(a+c)  c/(a+c) ]
(rows indexed by y = 0, 1; columns by z = 0, 1).
then, a second joint probability distribution (Y, Z) between the second input variable and the output variable of the guarded channel is:
P_S(Y=0, Z=0) = P_S(X=0, Y=0) = d
P_S(Y=0, Z=1) = P_S(X=1, Y=0) = b
P_S(Y=1, Z=0) = P_S(X=1, Y=1) = a
P_S(Y=1, Z=1) = P_S(X=0, Y=1) = c
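The mapping from the joint distribution of (X, Y) to the two channel distributions above is mechanical, so it can be checked with a small script. The following is a minimal Python sketch (the function name `joint_distributions` and the dict-based representation are illustrative choices, not from this application); it assumes Z = (X + Y) mod 2, as the correspondences above imply.

```python
def joint_distributions(a, b, c, d):
    """Given P(X=1,Y=1)=a, P(X=1,Y=0)=b, P(X=0,Y=1)=c, P(X=0,Y=0)=d,
    derive the joint distributions (X, Z) and (Y, Z) with Z = (X + Y) mod 2."""
    assert abs(a + b + c + d - 1.0) < 1e-9, "probabilities must sum to 1"
    p_xy = {(1, 1): a, (1, 0): b, (0, 1): c, (0, 0): d}
    p_xz, p_yz = {}, {}
    for (x, y), pr in p_xy.items():
        z = (x + y) % 2                      # output variable construction
        p_xz[(x, z)] = p_xz.get((x, z), 0.0) + pr   # attack channel (X, Z)
        p_yz[(y, z)] = p_yz.get((y, z), 0.0) + pr   # defending channel (Y, Z)
    return p_xz, p_yz
```

Running it on any valid (a, b, c, d) reproduces the eight equalities listed above, e.g. P_S(X=1, Z=0) = a and P_S(Y=0, Z=1) = b.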
step 202, determining first mutual information and second mutual information according to the first joint probability distribution and the second joint probability distribution.
And respectively determining first mutual information between the first input variable and the output variable and second mutual information between the second input variable and the output variable according to the first joint probability distribution and the second joint probability distribution.
For the attack side, according to the first joint probability distribution, the first mutual information between the first input variable and the output variable is:
I(X; Z) = Σ_{x,z} P_S(x, z) · log2[ P_S(x, z) / (P_S(x) · P_S(z)) ]
where the marginals of the output variable are P_S(Z=0) = a + d and P_S(Z=1) = b + c.
For the defending side, according to the second joint probability distribution, the second mutual information between the second input variable and the output variable is:
I(Y; Z) = Σ_{y,z} P_S(y, z) · log2[ P_S(y, z) / (P_S(y) · P_S(z)) ]
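The mutual information used here is the standard discrete definition, so it can be computed by one short helper. A minimal Python sketch (the helper name `mutual_information` is an illustrative choice) that takes a joint distribution expressed as a `{(u, v): probability}` dict:

```python
import math

def mutual_information(p_joint):
    """I(U;V) = sum over (u,v) of p(u,v) * log2( p(u,v) / (p(u) * p(v)) )."""
    p_u, p_v = {}, {}
    for (u, v), pr in p_joint.items():       # accumulate the marginals
        p_u[u] = p_u.get(u, 0.0) + pr
        p_v[v] = p_v.get(v, 0.0) + pr
    return sum(pr * math.log2(pr / (p_u[u] * p_v[v]))
               for (u, v), pr in p_joint.items() if pr > 0)
```

For an independent joint distribution this returns 0 bits, and for two perfectly correlated fair binary variables it returns 1 bit, as expected.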
step 203, determining the attack capability and the defending capability according to the first mutual information and the second mutual information.
Finally, optionally, according to the first mutual information of the attack side, determining the first channel capacity of the attack side, and further determining the attack capacity of the attack side according to the first channel capacity. And determining a second channel capacity of the defending side according to the second mutual information of the defending side, and further determining defending capacity of the defending side according to the second channel capacity.
For the attack side, the first channel capacity is the maximum of the first mutual information, namely:
first channel capacity C = max I(X; Z)
The attack capability of the attack side can be determined from the above equation.
For the defending side, the second channel capacity is the maximum of the second mutual information, namely:
second channel capacity F = max I(Y; Z)
The defending capability of the defending side can be determined from the above equation.
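Under the channel-capacity reading above (capacity as the maximum of the mutual information over input distributions), a simple grid search over the binary input distribution gives a numeric estimate. This is an illustrative sketch, not a procedure stated in this application; `trans[x][z]` is assumed to hold P_S(z|x):

```python
import math

def mutual_information(p_joint):
    """Discrete mutual information in bits from a {(u, v): prob} dict."""
    p_u, p_v = {}, {}
    for (u, v), pr in p_joint.items():
        p_u[u] = p_u.get(u, 0.0) + pr
        p_v[v] = p_v.get(v, 0.0) + pr
    return sum(pr * math.log2(pr / (p_u[u] * p_v[v]))
               for (u, v), pr in p_joint.items() if pr > 0)

def channel_capacity(trans, steps=1000):
    """C = max over P(X) of I(X; Z) for a binary channel, by grid search
    over the input distribution; trans[x][z] = P(Z=z | X=x)."""
    best = 0.0
    for i in range(steps + 1):
        p1 = i / steps                        # candidate P(X=1)
        joint = {(x, z): (p1 if x else 1 - p1) * trans[x][z]
                 for x in (0, 1) for z in (0, 1)}
        best = max(best, mutual_information(joint))
    return best
```

A noiseless binary channel gives a capacity of 1 bit, and a completely noisy one gives 0, which matches the intuition that a stronger channel corresponds to a stronger attack or defense.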
In this embodiment, when determining the attack capability and the defending capability, the attack capability may also be determined by an average value of the first mutual information on the attack side; the defensive ability is determined by the average value of the second mutual information of the defensive side.
In another embodiment of the present application, a specific method for determining selectable model parameters and determining target model parameters from the selectable model parameters is provided.
From the above embodiments, it can be known that C is the first channel capacity of the attack channel, i.e. the attack capability of the attack side; f is the second channel capacity of the defending channel, i.e. defending capacity of the defending side. When C is less than F, the attack capacity is less than the defending capacity; when C is more than F, the attack capacity is more than the defending capacity; when c=f, the attack capability is equivalent to the defending capability.
In a particular de-identified model, the values of p, q, a, b, c, d in the above embodiments may be determined by the model parameters of the de-identified model. Therefore, when the model parameters of the de-identified model are determined, the attack capacity of the attack side and the defending capacity of the defending side under the current model parameters can be calculated.
Therefore, the model parameters with defending capability larger than attack capability can realize effective de-identification of the de-identified model, and the security is higher, so that the de-identified model can be used as the optional model parameters.
In view of protecting the authenticity of data in the de-identified model, the selectable model parameter with the smallest value may be taken as the target model parameter of the present embodiment. This keeps the size of the equivalence group in the de-identification model as small as possible; when the data in the equivalence group are then de-identified (for example, generalized), the generalization range of the data is correspondingly reduced, so the authenticity and usability of the data are protected to the greatest extent.
Of course, when determining the target model parameters, the larger one of the selectable model parameters can be used as the target model parameter according to actual requirements, so that the defending capability is far greater than the attack capability, and the safety of the de-identified data is protected to the greatest extent.
In this embodiment, the model parameters that make the defending capability greater than the attack capability are determined as optional model parameters, so that the de-identification model is guaranteed to be capable of completing effective de-identification of the target data, and the capability of resisting the re-identification attack is improved. Meanwhile, the smallest one of the selectable model parameters is used as the target model parameter, so that the size of the equivalent group is as small as possible, and the authenticity of the data can be ensured to the greatest extent when the data in the equivalent group is de-identified.
In another embodiment of the present application, a specific implementation process for implementing information de-identification by using the information de-identification method of the present application is provided.
This embodiment takes the k-anonymity model as an example, and a specific implementation procedure is given.
For each data in an equivalence group, the probability of re-identification, i.e., the probability of success of an attack on the attack side, is equal to 1 divided by the size of its equivalence group. Since in the k-anonymity model the model parameter k represents the size of its equivalent group, the probability of attack success = 1/k. The larger the model parameter k is, the larger the size of the equivalent group is, and accordingly, the smaller the probability of success of attack on the attack side is.
In the present embodiment, K_1 is the first model parameter of the attack side and K_2 is the second model parameter of the defending side. For the k-anonymity model, the individual event probabilities are as follows:
P_S(attack success) = P_S(X=1) = 1/K_1
P_S(attack failure) = P_S(X=0) = 1 - 1/K_1
P_S(defending success) = P_S(Y=1) = 1 - 1/K_2
P_S(defending failure) = P_S(Y=0) = 1/K_2
P_S(attack success, defending success) = P_S(X=1, Y=1) = (1/K_1)(1 - 1/K_2) = a
P_S(attack success, defending failure) = P_S(X=1, Y=0) = 1/(K_1·K_2) = b
P_S(attack failure, defending success) = P_S(X=0, Y=1) = (1 - 1/K_1)(1 - 1/K_2) = c
P_S(attack failure, defending failure) = P_S(X=0, Y=0) = (1/K_2)(1 - 1/K_1) = d
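These eight probabilities follow from K_1 and K_2 alone, assuming (as the products above imply) that the attack event and the defending event are independent. A minimal sketch, with the function name `k_anonymity_joint` chosen purely for illustration:

```python
def k_anonymity_joint(k1, k2):
    """Joint event probabilities (a, b, c, d) for the k-anonymity example,
    assuming independent attack/defense events with P(X=1) = 1/k1 and
    P(Y=1) = 1 - 1/k2, as in the listing above."""
    p, q = 1 / k1, 1 - 1 / k2
    a = p * q            # attack success, defending success
    b = p * (1 - q)      # attack success, defending failure
    c = (1 - p) * q      # attack failure, defending success
    d = (1 - p) * (1 - q)  # attack failure, defending failure
    return a, b, c, d
```

For any K_1, K_2 the four values sum to 1, consistent with the constraint a + b + c + d = 1 noted earlier.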
A random variable Z is constructed as the output variable. The attack event on the attack side is taken as the first input variable and simulated with the output variable to form the attack channel; the defending event on the defending side is taken as the second input variable and simulated with the output variable to form the defending channel.
When constructing the random variable Z, Z should be made to satisfy the conditions in Table 1 below:

TABLE 1

X  Y  Z
0  1  1
0  0  0
1  1  0
1  0  1
That is, when the event {Z=0, X=0} or the event {Z=1, X=1} occurs, the information transfer in the attack channel is considered successful; at this time the defending side fails, and the second input variable Y=0. When the event {Z=0, Y=0} or the event {Z=1, Y=1} occurs, the information transfer in the defending channel is considered successful; at this time the attack on the attack side fails, and the first input variable X=0.
In this embodiment, the random variable Z is constructed as Z = (X + Y) mod 2, i.e., the sum of the value of the first input variable X and the value of the second input variable Y is taken modulo 2 to obtain the random variable Z. The random variable Z then satisfies the conditions in Table 1.
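The construction can be verified directly: with Z = (X + Y) mod 2, every row of Table 1 is reproduced. A tiny, purely illustrative check:

```python
# Z = (X + Y) mod 2 reproduces every row of Table 1.
table1 = {(0, 1): 1, (0, 0): 0, (1, 1): 0, (1, 0): 1}  # (X, Y) -> Z
for (x, y), z in table1.items():
    assert (x + y) % 2 == z, f"mismatch at X={x}, Y={y}"
```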
For the attack channel, its transfer matrix, with entries P_S(z|x) = P_S(X=x, Z=z) / P_S(X=x) computed from the joint probabilities a, b, c, d above, is:
A = [ 1/K_2      1 - 1/K_2 ]
    [ 1 - 1/K_2  1/K_2     ]
(rows indexed by x = 0, 1; columns by z = 0, 1).
the first joint probability distribution (X, Z) between the first input variable and the output variable of the attack channel is:
P_S(X=0, Z=0) = P_S(X=0, Y=0) = d
P_S(X=0, Z=1) = P_S(X=0, Y=1) = c
P_S(X=1, Z=0) = P_S(X=1, Y=1) = a
P_S(X=1, Z=1) = P_S(X=1, Y=0) = b
Therefore, the first mutual information between the first input variable and the output variable is:
I(X; Z) = Σ_{x,z} P_S(x, z) · log2[ P_S(x, z) / (P_S(x) · P_S(z)) ]
with the joint probabilities a, b, c, d as given above.
For the defending channel, its transfer matrix, with entries P_S(z|y) = P_S(Y=y, Z=z) / P_S(Y=y), is:
B = [ 1 - 1/K_1  1/K_1     ]
    [ 1/K_1     1 - 1/K_1  ]
(rows indexed by y = 0, 1; columns by z = 0, 1).
the second joint probability distribution (Y, Z) between the second input variable and the output variable of the gatekeeper channel is:
P_S(Y=0, Z=0) = P_S(X=0, Y=0) = d
P_S(Y=0, Z=1) = P_S(X=1, Y=0) = b
P_S(Y=1, Z=0) = P_S(X=1, Y=1) = a
P_S(Y=1, Z=1) = P_S(X=0, Y=1) = c
Therefore, the second mutual information between the second input variable and the output variable is:
I(Y; Z) = Σ_{y,z} P_S(y, z) · log2[ P_S(y, z) / (P_S(y) · P_S(z)) ]
the first mutual information and the second mutual information represent attack capability and defending capability respectively, and in an actual process, the attack capability and defending capability are determined by the value of a model parameter k of the k-anonymity model.
Fig. 3 is a graph of average mutual information amount variation of an attack channel in the information de-identification method of the present application.
As shown in fig. 3, for the attack side, the attack capability is determined by the value of the second model parameter K_2 on the defending side; that is, when the first model parameter K_1 on the attack side is fixed, the attack capability varies with the value of K_2.
Fig. 4 is a graph of average mutual information amount change of a guard channel in the method for de-identifying information according to the present application.
As shown in fig. 4, for the defending side, the defending capability is determined by the value of the first model parameter K_1 on the attack side; that is, when the second model parameter K_2 on the defending side is fixed, the defending capability varies with the value of K_1.
Thus, when the first model parameter K_1 on the attack side and the second model parameter K_2 on the defending side take the same value, the model parameters that make the defending capability greater than the attack capability can be determined as the selectable model parameters by combining the attack-capability curve of the attack side with the defending-capability curve of the defending side.
Fig. 5 is a graph showing the combination of attack and defending capabilities in the method for de-identifying information according to the present application.
As shown in FIG. 5, the two curves are: the change curve of the first mutual information on the attack side (i.e., the attack capability) when the second model parameter K_2 = 50; and the change curve of the second mutual information on the defending side (i.e., the defending capability) when the first model parameter K_1 = 50.
As can be seen from fig. 5, in this example, when the model parameter takes a value of 8, the attack capability and the defending capability are equivalent. Thus, model parameters greater than 8 are taken as optional model parameters. At this time, the defending capability is greater than the attacking capability.
In this embodiment, in order to ensure the authenticity of the de-identified data to the greatest extent, the smallest one of the selectable model parameters is taken as the target model parameter, i.e., 9 is taken as the target model parameter.
When the model parameter of the k-anonymity model takes a value of 9, the size of the equivalent group is 9, i.e. for each data attribute, at least 9 records are included in one equivalent group. Thus, at least 9 data that are the same or similar in each data attribute are grouped as a group according to the size of the equivalent group.
After the grouping is completed, the data corresponding to each data attribute in each equivalent group is subjected to de-identification processing, and a specific processing mode may include: masking, generalizing, deleting, etc., to obtain de-identified target data.
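As a concrete illustration of the grouping-plus-generalization step, the toy sketch below sorts records by one quasi-identifier, cuts them into equivalence groups of at least k = 9 records, and generalizes the attribute to the group's value range. The function name, record layout, and the merge-the-short-tail rule are all illustrative assumptions, not the application's specified procedure:

```python
def k_anonymize(records, key, k):
    """Toy k-anonymity pass: sort by `key`, form equivalence groups of
    size >= k, and generalize `key` to each group's (lo, hi) range."""
    records = sorted(records, key=lambda r: r[key])
    groups = [records[i:i + k] for i in range(0, len(records), k)]
    if len(groups) > 1 and len(groups[-1]) < k:
        groups[-2].extend(groups.pop())      # merge an undersized tail group
    anonymized = []
    for g in groups:
        lo, hi = g[0][key], g[-1][key]       # generalization range
        for r in g:
            anonymized.append({**r, key: (lo, hi)})
    return anonymized
```

With 20 records and k = 9, for example, this yields two equivalence groups of 9 and 11 records, so every generalized value is shared by at least 9 records, matching the requirement that each equivalence group contain at least 9 records.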
Fig. 6 is a schematic structural diagram of an embodiment of an information de-identifying apparatus in the present application, where the information de-identifying apparatus in the present embodiment may be used as an information de-identifying device to implement the information de-identifying method provided in the embodiment of the present application. As shown in fig. 6, the information de-identification apparatus may include: a determining module 51, a model building module 52, a de-identification module 53.
A determining module 51, configured to determine the probability of success of attack on the attack side according to the first model parameter, and determine the probability of success of defending on the defending side according to the second model parameter; determine the attack capability of the attack side and the defending capability of the defending side according to the probability of successful attack and the probability of successful defending; and determine selectable model parameters that make the defending capability greater than the attack capability, and determine target model parameters from the selectable model parameters.
In specific implementation, an attack channel is simulated between a first input variable and an output variable of the attack side, and a defending channel is simulated between a second input variable and an output variable of the defending side.
The determining module 51 determines a first joint probability distribution between a first input variable and an output variable of the attack channel according to the probability of success of the attack and the probability of success of the defending, further determines first mutual information between the first input variable and the output variable, and determines a first channel capacity based on the first mutual information. And determining second joint probability distribution between a second input variable and an output variable of the defending channel according to the probability of successful attack and the probability of successful defending, further determining second mutual information between the second input variable and the output variable, and determining second channel capacity based on the second mutual information. The first mutual information and the second mutual information represent an attack capability of an attack side and a defending capability of a defending side respectively.
The determining module 51 is further configured to determine a model parameter that makes the defending capability greater than the attacking capability as an optional model parameter, and determine a target model parameter from the optional model parameter.
The model building module 52 is configured to build a de-identified model according to the target model parameters.
Specifically, the model establishment module 52 establishes the de-identified model by determining the size of the equivalent group in the de-identified model based on the target model parameters.
The de-identification module 53 is configured to de-identify the target data using the de-identification model. Specifically, it groups the target data according to the size of the equivalent group in the de-identification model, and then, according to the grouping result, performs de-identification processing on the data corresponding to each attribute in each group of data to obtain the de-identified target data.
In the information de-identification device, the determining module 51 determines mutual information between the first input variable and the output variable of the attack channel according to the probability of success of attack on the attack side and the probability of success of defending on the defending side, so as to obtain the attack capability of the attack side; according to the probability of successful attack on the attack side and the probability of successful defending on the defending side, mutual information between a second input variable and an output variable of the defending channel is determined, and then defending capacity of the defending side is obtained. And taking the model parameters with defending capability larger than attack capability as optional model parameters, and determining target model parameters from the optional model parameters. The model building module 52 builds a de-identification model according to the target model parameters determined by the determining module, and the de-identification module 53 performs de-identification processing on the target data according to the obtained de-identification model. On one hand, the defending capability is larger than the attack capability, so that the security of the de-identified data can be ensured, and the risk of re-identification is reduced; on the other hand, on the basis of ensuring the safety, the equivalent set of the de-identified model is made as small as possible, and the authenticity and the usability of the data can be ensured to the greatest extent.
FIG. 7 is a schematic structural diagram of one embodiment of an electronic device of the present application. As shown in FIG. 7, the electronic device may include at least one processor; and at least one memory communicatively coupled to the processor, wherein the memory stores program instructions executable by the processor, and the processor invokes the program instructions to perform the information de-identification method provided by the embodiments of the present application.
The electronic device may be an information de-identifying device, and the specific form of the electronic device is not limited in this embodiment.
Fig. 7 illustrates a block diagram of an exemplary electronic device suitable for use in implementing embodiments of the present application. The electronic device shown in fig. 7 is only an example, and should not be construed as limiting the functionality and scope of use of the embodiments herein.
As shown in fig. 7, the electronic device is in the form of a general purpose computing device. Components of an electronic device may include, but are not limited to: one or more processors 410, a memory 430, and a communication bus 440 that connects the various system components (including the memory 430 and the processing unit 410).
The communication bus 440 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include industry Standard architecture (Industry Standard Architecture; hereinafter ISA) bus, micro channel architecture (Micro Channel Architecture; hereinafter MAC) bus, enhanced ISA bus, video electronics standards Association (Video Electronics Standards Association; hereinafter VESA) local bus, and peripheral component interconnect (Peripheral Component Interconnection; hereinafter PCI) bus.
Electronic devices typically include a variety of computer system readable media. Such media can be any available media that can be accessed by the electronic device and includes both volatile and nonvolatile media, removable and non-removable media.
Memory 430 may include computer system readable media in the form of volatile memory, such as random access memory (Random Access Memory; hereinafter: RAM) and/or cache memory. The electronic device may further include other removable/non-removable, volatile/nonvolatile computer system storage media. Although not shown in fig. 7, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable non-volatile optical disk (e.g., a compact disk read only memory (Compact Disc Read Only Memory; hereinafter CD-ROM), digital versatile read only optical disk (Digital Video Disc Read Only Memory; hereinafter DVD-ROM), or other optical media) may be provided. In such cases, each drive may be coupled to communication bus 440 by one or more data medium interfaces. Memory 430 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of the embodiments of the present application.
A program/utility having a set (at least one) of program modules may be stored in the memory 430, such program modules including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules generally perform the functions and/or methods in the embodiments described herein.
The electronic device may also communicate with one or more external devices (e.g., keyboard, pointing device, display, etc.), with one or more devices that enable a user to interact with the electronic device, and/or with any device (e.g., network card, modem, etc.) that enables the electronic device to communicate with one or more other computing devices. Such communication may occur through communication interface 420. Moreover, the electronic device may also communicate with one or more networks (e.g., local area network (Local Area Network; hereinafter: LAN), wide area network (Wide Area Network; hereinafter: WAN) and/or a public network, such as the Internet) via a network adapter (not shown in FIG. 7) that may communicate with other modules of the electronic device via the communication bus 440. It should be appreciated that although not shown in fig. 7, other hardware and/or software modules may be used in connection with an electronic device, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, disk arrays (Redundant Arrays of Independent Drives; hereinafter RAID) systems, tape drives, data backup storage systems, and the like.
The processor 410 executes various functional applications and data processing by running programs stored in the memory 430, for example, to implement the information de-identification method provided in the embodiments of the present application.
The embodiment of the application also provides a non-transitory computer readable storage medium, which stores computer instructions that cause the computer to execute the information de-identification method provided by the embodiment of the application.
The non-transitory computer readable storage media described above may employ any combination of one or more computer readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an erasable programmable Read-Only Memory (Erasable Programmable Read Only Memory; EPROM) or flash Memory, an optical fiber, a portable compact disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present application, the meaning of "plurality" is at least two, such as two, three, etc., unless explicitly defined otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and additional implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the embodiments of the present application.
It should be noted that, the terminal according to the embodiments of the present application may include, but is not limited to, a personal Computer (Personal Computer; hereinafter referred to as a PC), a personal digital assistant (Personal Digital Assistant; hereinafter referred to as a PDA), a wireless handheld device, a Tablet Computer (Tablet Computer), a mobile phone, an MP3 player, an MP4 player, and the like.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in hardware plus software functional units.
The foregoing description of the preferred embodiments of the present invention is not intended to limit the invention to the precise form disclosed, and any modifications, equivalents, improvements and alternatives falling within the spirit and principles of the present invention are intended to be included within the scope of the present invention.

Claims (9)

1. A method for de-identifying information, comprising:
determining the probability of success of attack on the attack side according to the first model parameters, and determining the probability of success of defending on the defending side according to the second model parameters;
determining the attack capacity of the attack side and the defending capacity of the defending side according to the probability of successful attack and the probability of successful defending;
determining optional model parameters enabling the defending capability to be larger than the attack capability, and determining target model parameters from the optional model parameters;
establishing a de-identification model according to the target model parameters;
de-labeling the target data by using the de-labeling model;
establishing a de-identification model according to the target model parameters, wherein the de-identification model comprises the following steps:
determining the size of an equivalent group in the de-identified model according to the target model parameters;
and establishing a de-identification model according to the determined size of the equivalent group.
2. The method of claim 1, wherein an attack channel is simulated between a first input variable and an output variable on the attack side, and a defense channel is simulated between a second input variable and an output variable on the defense side;
According to the probability of success of the attack and the probability of success of the defending, determining the attack capacity of the attack side and the defending capacity of the defending side comprises the following steps:
determining a first channel capacity of an attack channel according to the probability of success of the attack and the probability of success of the defending; determining the attack capacity of an attack side according to the first channel capacity;
determining a second channel capacity of a defending channel according to the probability of successful attack and the probability of successful defending; and determining the defending capacity of the defending side according to the second channel capacity.
3. The method of claim 2, wherein determining the first channel capacity of the attack channel according to the probability of attack success and the probability of defending success comprises:
determining a first joint probability distribution between the first input variable and the output variable of the attack channel according to the probability of attack success and the probability of defending success;
and determining first mutual information between the first input variable and the output variable according to the first joint probability distribution, and determining the first channel capacity based on the first mutual information.
4. The method of claim 2, wherein determining the second channel capacity of the defending channel according to the probability of attack success and the probability of defending success comprises:
determining a second joint probability distribution between the second input variable and the output variable of the defending channel according to the probability of attack success and the probability of defending success;
and determining second mutual information between the second input variable and the output variable according to the second joint probability distribution, and determining the second channel capacity based on the second mutual information.
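Claims 3 and 4 both derive a channel capacity from the mutual information of a joint input/output distribution. As a minimal sketch of that computation (the joint-distribution values below are illustrative only, not taken from the patent), the mutual information I(X;Y) between a channel's input and output can be evaluated directly from the joint table:

```python
import numpy as np

def mutual_information(joint):
    """Mutual information I(X;Y) in bits from a joint distribution p(x, y).

    `joint` is a 2-D array whose entries sum to 1; rows index the input
    variable, columns the output variable.
    """
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal of the input
    py = joint.sum(axis=0, keepdims=True)   # marginal of the output
    mask = joint > 0                        # treat 0 * log 0 as 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (px @ py)[mask])))

# Example: a binary channel whose joint distribution reflects a high
# probability of the output matching the input (illustrative values).
joint = np.array([[0.45, 0.05],
                  [0.05, 0.45]])
print(round(mutual_information(joint), 3))  # 0.531
```

The channel capacity of claims 3 and 4 would then be the maximum of this quantity over input distributions; for a fixed joint distribution the mutual information itself already provides the comparison used in claim 1.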
5. The method of claim 1, wherein, in the representation of the attack capability, the second model parameter is a variable and the first model parameter is a first fixed value; and in the representation of the defending capability, the first model parameter is a variable and the second model parameter is a second fixed value;
wherein determining the optional model parameters for which the defending capability is greater than the attack capability comprises:
determining, with the first fixed value and the second fixed value set to the same value, the optional model parameters for which the defending capability is greater than the attack capability.
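The selection step of claims 1 and 5 reduces to filtering candidate parameters by comparing the two capability curves at a common fixed value. A toy sketch, in which both capability functions are caller-supplied assumptions (e.g. the channel-capacity estimates of claims 3-4) and the 1/k attacker baseline is purely illustrative:

```python
def select_optional_parameters(candidates, attack_capability, defending_capability):
    """Keep only the candidate model parameters for which the defending
    capability exceeds the attack capability.

    Both capability functions are assumptions supplied by the caller;
    the patent does not fix their closed form.
    """
    return [p for p in candidates
            if defending_capability(p) > attack_capability(p)]

# Illustrative toy capabilities: a larger equivalence-group size k weakens
# the attacker (1/k baseline) while the defender's capability stays flat.
optional = select_optional_parameters(
    candidates=[2, 3, 5, 10],
    attack_capability=lambda k: 1.0 / k,
    defending_capability=lambda k: 0.3,
)
print(optional)  # [5, 10] — the k values with 0.3 > 1/k
```

Any of the surviving candidates can then serve as the target model parameter of claim 1; picking the smallest keeps data utility highest.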
6. The method of claim 1, wherein de-identifying the target data using the de-identification model comprises:
grouping the target data according to the size of the equivalence group in the de-identification model;
and de-identifying, according to the grouping result, the data corresponding to the target attribute in each group of data.
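Claim 6's grouping step can be sketched as a simple k-anonymity-style pass: records are ordered on a quasi-identifier, partitioned into equivalence groups of the size chosen by the model, and the target attribute is de-identified within each group. The field names and the choice of suppression with "*" below are illustrative assumptions; the claim leaves the per-group transformation open.

```python
def de_identify(records, quasi_id, target_attr, group_size):
    """Group `records` into equivalence groups of `group_size` (ordered
    by the quasi-identifier) and suppress the target attribute in each
    group. Suppression is one illustrative de-identification;
    generalization would fit the same loop.
    """
    ordered = sorted(records, key=lambda r: r[quasi_id])
    result = []
    for start in range(0, len(ordered), group_size):
        for record in ordered[start:start + group_size]:
            masked = dict(record)            # do not mutate the input
            masked[target_attr] = "*"        # de-identify the target attribute
            masked["group"] = start // group_size
            result.append(masked)
    return result

rows = [{"age": 34, "zip": "100#", "disease": "flu"},
        {"age": 29, "zip": "100#", "disease": "cold"},
        {"age": 41, "zip": "101#", "disease": "flu"},
        {"age": 38, "zip": "101#", "disease": "asthma"}]
for row in de_identify(rows, quasi_id="age", target_attr="disease", group_size=2):
    print(row)
```

Each output record carries its equivalence-group label, so any record is indistinguishable from at least `group_size - 1` others with respect to the suppressed attribute.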
7. An information de-identification device, comprising:
a determining module, configured to determine the probability of attack success of the attack side according to first model parameters and determine the probability of defending success of the defending side according to second model parameters; determine the attack capability of the attack side and the defending capability of the defending side according to the probability of attack success and the probability of defending success; and determine optional model parameters for which the defending capability is greater than the attack capability, and determine target model parameters from the optional model parameters;
a model building module, configured to establish a de-identification model according to the target model parameters; and
a de-identification module, configured to de-identify the target data using the de-identification model;
wherein establishing the de-identification model according to the target model parameters comprises:
determining the size of an equivalence group in the de-identification model according to the target model parameters;
and establishing the de-identification model according to the determined size of the equivalence group.
8. An electronic device, comprising:
at least one processor; and
at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, and the processor invokes the program instructions to perform the method of any one of claims 1 to 6.
9. A non-transitory computer readable storage medium storing computer instructions that cause the computer to perform the method of any one of claims 1 to 6.
CN202010677009.1A 2020-07-14 2020-07-14 Information de-identification method and device and electronic equipment Active CN113938265B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010677009.1A CN113938265B (en) 2020-07-14 2020-07-14 Information de-identification method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN113938265A CN113938265A (en) 2022-01-14
CN113938265B true CN113938265B (en) 2024-04-12

Family

ID=79274108

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010677009.1A Active CN113938265B (en) 2020-07-14 2020-07-14 Information de-identification method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN113938265B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2679800A1 (en) * 2008-09-22 2010-03-22 University Of Ottawa Re-identification risk in de-identified databases containing personal information
CN106940777A (en) * 2017-02-16 2017-07-11 湖南宸瀚信息科技有限责任公司 A kind of identity information method for secret protection measured based on sensitive information
CN108040070A (en) * 2017-12-29 2018-05-15 北京奇虎科技有限公司 A kind of network security test platform and method
CN109067750A (en) * 2018-08-14 2018-12-21 中国科学院信息工程研究所 A kind of location privacy protection method and device based on anonymity

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Delaney, A.M.; Brophy, E.; Ward, T.E. Synthesis of Realistic ECG using Generative Adversarial Networks. arXiv. 2019, full text. *
Xie Anming; Jin Tao; Zhou Tao. Personal information de-identification framework and standardization (个人信息去标识化框架及标准化). Big Data (《大数据》). 2017, full text. *

Similar Documents

Publication Publication Date Title
CN110297912A (en) Cheat recognition methods, device, equipment and computer readable storage medium
CN113472538B (en) Method, device, equipment and medium for detecting privacy of result of multi-party security calculation
US10165003B2 (en) Identifying an imposter account in a social network
CN111475838B (en) Deep neural network-based graph data anonymizing method, device and storage medium
Feng et al. User-level differential privacy against attribute inference attack of speech emotion recognition in federated learning
CN110502697B (en) Target user identification method and device and electronic equipment
CN112329052B (en) Model privacy protection method and device
Liu et al. Differentially private low-rank adaptation of large language model using federated learning
CN106909852B (en) Smart contract encryption method and device based on triple MD5 encryption algorithm
CN114818000A (en) Privacy protection set confusion intersection method, system and related equipment
CN113938265B (en) Information de-identification method and device and electronic equipment
Yampolskiy AI‐Complete CAPTCHAs as Zero Knowledge Proofs of Access to an Artificially Intelligent System
CN118709195A (en) Method and device for applying large language model
CN104361123B (en) A kind of personal behavior data anonymous method and system
CN112149834A (en) Model training method, device, equipment and medium
CN105718767A (en) Information processing method and device based on risk identification
Jeon et al. A study on the policy measures for the prevention of industrial secret leakage in the metaverse
CN114584285A (en) Secure multiparty processing method and related device
CN116032458A (en) User privacy data processing method and device, storage medium and electronic equipment
CN115022067A (en) Game-based network security defense method and device under asymmetric information
CN111523681A (en) Global feature importance representation method and device, electronic equipment and storage medium
CN115048676B (en) Safe intelligent verification method in privacy computing application and related device
CN107958142B (en) User account generation method and device
CN111817847A (en) Bypass defense method, device, electronic device and readable medium based on SIKE algorithm
CN117459323B (en) Threat modeling method and device for intelligent evolution Internet of things equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant