
CN111163056A - Data confidentiality method and system for MapReduce computation - Google Patents

Info

Publication number: CN111163056A
Application number: CN201911244325.3A
Authority: CN (China)
Other versions: CN111163056B (granted publication)
Other languages: Chinese (zh)
Inventors: 王永智, 张小宇
Assignee (current and original): Xidian University
Application filed by: Xidian University
Legal status: Granted; active
Prior art keywords: key, reduce, task, stage, value

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00: Network architectures or network communication protocols for network security
    • H04L63/04: ... for providing a confidential data exchange among entities communicating through data packet networks
    • H04L63/0428: ... wherein the data content is protected, e.g. by encrypting or encapsulating the payload
    • H04L63/14: ... for detecting or protecting against malicious traffic
    • H04L63/1408: ... by monitoring network traffic
    • H04L63/1416: Event detection, e.g. attack signature detection
    • H04L63/1441: Countermeasures against malicious traffic

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Storage Device Security (AREA)

Abstract

The invention belongs to the technical field of cloud computing data confidentiality, and discloses a data confidentiality method and system for MapReduce computation, in which two shuffling schemes, Full Shuffle and Safe Shuffle, are proposed under the standard MapReduce computation framework. Input data is preprocessed into key-value pair form; the key-value pairs within each Map task of the Map sub-stage are merged by key; the merged key-value pairs are distributed to the reduce tasks of the Reducing stage layer; the data received by each Reduce task is processed in the Reduce sub-stage; and the output of every reduce task is made equal in size. The method and system protect data and privacy under the MapReduce framework in a remote execution environment, preventing a malicious observer from obtaining the application's private data through side-channel attacks.

Description

Data confidentiality method and system for MapReduce computation
Technical Field
The invention belongs to the technical field of cloud computing data confidentiality, and particularly relates to a data confidentiality method and system for MapReduce computation.
Background
Currently, the closest prior art is as follows. MapReduce is a parallel programming model for parallel computation over large-scale data sets. It combines characteristics of functional and vector programming languages and provides data partitioning, computation task scheduling, system optimization, and error detection and recovery, which makes it well suited to applications such as log analysis, machine learning, and distributed sorting. A MapReduce job is the unit of work that a user wishes to perform: it comprises the input data, a MapReduce program, and configuration information. MapReduce runs a job by dividing it into tasks, which fall into two kinds: map tasks and reduce tasks. The standard multi-reduce MapReduce data flow consists of stages such as input splitting, Map, and Reduce. Each map task in MapReduce can be subdivided into four phases: record read (data splitting), map, combine (data aggregation; this phase may be omitted), and partition (data distribution). Each reduce task in Hadoop can likewise be subdivided into four phases: shuffle, sort, reduce, and output format.
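The phase structure above can be sketched as a toy word-count job. This is a minimal illustrative model, not the Hadoop API: all function names are ours, and the hash-based partition stands in for Hadoop's default partitioner.

```python
from collections import defaultdict

def map_task(lines):
    """Map phase: emit one intermediate <key, 1> pair per word."""
    return [(word, 1) for line in lines for word in line.split()]

def combine(pairs):
    """Combine phase: pre-aggregate pairs sharing a key inside one map task."""
    merged = defaultdict(int)
    for k, v in pairs:
        merged[k] += v
    return list(merged.items())

def partition(key, num_reducers):
    """Partition phase: route a key to one of R reduce tasks (hash split)."""
    return hash(key) % num_reducers

def reduce_task(groups):
    """Reduce phase: final aggregation of the values grouped under each key."""
    return {k: sum(vs) for k, vs in groups.items()}

def run_job(splits, num_reducers):
    """Driver wiring the phases together: one map task per input split,
    one reduce task per partition bucket."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:
        for k, v in combine(map_task(split)):
            buckets[partition(k, num_reducers)][k].append(v)
    out = {}
    for bucket in buckets:
        out.update(reduce_task(bucket))
    return out
```

For example, `run_job([["a b a"], ["b c"]], 2)` counts words across two splits and two reducers.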
Hadoop is an implementation of the MapReduce framework. It is a software platform for developing and running large-scale data processing: an open-source framework implemented by Apache in the Java language, it performs distributed computation over massive data using clusters composed of large numbers of computers. Hadoop offers high efficiency, low cost, strong scalability, and reliability. The core of the Hadoop framework consists of HDFS and MapReduce: HDFS provides storage for massive data, while MapReduce provides the computation over it.
SGX, short for Intel Software Guard Extensions, is a set of x86-64 ISA extensions that can set up a protected execution environment (called an enclave) without requiring any trust beyond the processor and the code the user places in the enclave. Once software and data are inside an enclave, even the operating system or VMM (hypervisor) cannot affect them. The security boundary of an enclave contains only the CPU and the enclave itself. Enclaves are protected by the processor, which controls access to enclave memory: an instruction attempting to read or write the memory of a running enclave from outside it will fail. Enclave cache lines are encrypted and integrity-protected before being written to memory (RAM). Enclave code may be invoked from untrusted code through a call-gate mechanism, similar to the one in the Intel x86 architecture, which transfers control to a user-defined entry point inside the enclave. SGX also supports remote attestation, which enables a remote system to cryptographically verify that particular software has been loaded inside an enclave and to establish a shared secret for an end-to-end encrypted channel.
Cloud computing is the development of grid computing, distributed processing, and parallel processing; it can be regarded as the commercial realization of these computer science concepts, backed by server clusters that provide very-large-scale computing resources. As a business service model built on network computing, a cloud computing user can acquire storage space, computing capacity, software services, and so on according to need. Computing tasks are distributed over a resource pool formed by a large number of computers, so the user's computing capacity is not limited by their own resources, and heavy computing tasks can be outsourced to the cloud.
Although cloud computing has many advantages, such as virtualization, on-demand service, and high scalability, a user who places applications and data on a cloud server inevitably incurs certain risks; in particular, relying on a cloud computing provider to process sensitive data can be expected to bring a risk of privacy disclosure. The trustworthiness of the cloud service provider therefore seriously affects users' effective use of cloud services.
Using public cloud infrastructure to store and process large data sets raises new security issues. Current solutions propose encrypting all data and accessing it in the clear only inside secure hardware. For example, the VC3 system developed by Microsoft relies on SGX to protect the execution of local map and reduce tasks, and can leverage the popular Hadoop framework to ensure integrity and confidentiality; all of its data is encrypted with AES-GCM.
Even in systems such as VC3 that are protected by a secure environment, the distributed processing of large amounts of data still involves intensive encrypted communication between different processing and network storage units, and these communication patterns may leak sensitive information. Protecting only the individual elements of the distributed computation (e.g., the map and reduce units) inevitably leaves an attacker several leakage paths to important information: the data volumes of the map and reduce jobs are visible to the cloud provider (and, to a lesser extent, to other users), and by observing and correlating the series of intermediate key-value pairs exchanged between each map and each reduce, sensitive information can be learned from the data sizes alone.
In view of the above problems, Microsoft's research paper "Observing and Preventing Leakage in MapReduce" proposed two schemes. The SHUFFLE-IN-THE-MIDDLE scheme prevents intermediate traffic analysis of a job by securely shuffling all key-value pairs generated by all map tasks before delivering them to the reduce tasks. However, an attacker can still observe the number of records generated by each map task, the number of records received by each reduce task, and the correlations among them, and the scheme's efficiency is low. The SHUFFLE & BALANCE scheme divides preprocessing into an offline stage and an online stage. The offline stage randomizes the order of input records so that all map tasks produce the same key-value pair distribution. The online stage samples the input data, collects statistics on the key-value pairs generated by the maps to balance the reduces, and probabilistically estimates an upper bound on the number of key-value pairs each mapper sends to each reduce, so that the intermediate traffic from each map task is uniformly distributed across the reduce tasks, satisfying a stronger security definition. In practice, however, the offline stage randomizes the record order so that the maximum key distribution values of two equal-size input data sets are equal at run time. This record randomization is time-consuming, making the offline stage inefficient, and the security is still insufficient; for example, the scheme may reveal information about the data set (the maximum key counts).
In summary, the problems of the prior art are as follows:
1. Existing schemes impose strict requirements on the records of the experimental data set, which limits their applicability.
2. Compared with the standard MapReduce framework, existing schemes have high performance overhead: the SHUFFLE-IN-THE-MIDDLE scheme incurs 191 to 205 percent overhead, and the SHUFFLE & BALANCE scheme incurs 95 to 101 percent overhead in its online stage.
3. Existing schemes support only the default partition function, not user-defined shuffle functions.
4. Existing schemes presuppose accurate probabilistic estimation of the input data; once the estimate is inaccurate, the scheme fails and incurs unnecessary performance overhead.
The difficulty of solving these technical problems is as follows: on the premise of guaranteeing the safe and reliable operation of MapReduce, leakage of sensitive data set information must be avoided to the greatest extent, the applicability of the scheme must be improved, and good performance must still be achieved. Balancing security against performance to reach the best overall effect is the key difficulty and challenge in the design.
The significance of solving these technical problems is as follows: by addressing the problems above, the privacy of intermediate data streams can be better protected within the original MapReduce framework, side-channel attacks based on intermediate data streams in MapReduce on an untrusted public cloud can be resisted, and, on that premise, the performance overhead is optimized so that the new method does not undermine the efficiency of distributed computing in a cloud environment.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a data confidentiality method and system for MapReduce computation.
The invention is realized as follows: a data confidentiality method for MapReduce computation comprises the following steps:
In the first step, two novel data shuffling schemes are proposed under the standard MapReduce computation framework: Full Shuffle and Safe Shuffle. Three novel sub-stages are defined at the Mapping stage layer: a Map sub-stage, a Combine sub-stage, and a Partition sub-stage; a Reduce sub-stage is defined at the Reducing stage layer.
In the second step, in the Map sub-stage of the Mapping stage layer of MapReduce, the system arranges the input data into key-value pair form for preprocessing.
In the third step, the Combine sub-stage merges the key-value pairs within each Map task of the Map sub-stage by key and, using the number of key types of the data set known in advance, adds dummy key-value pairs so that the output of every Map task is equal in size; alternatively, obfuscation key-value pairs are additionally added so that the number of key types of the real data set is hidden.
In the fourth step, the Partition function of the Partition sub-stage distributes the merged key-value pairs to each Reduce task of the Reducing stage layer so that the input of every Reduce task of the Reduce sub-stage is equal in size.
In the fifth step, the data received by each Reduce task is processed in the Reduce sub-stage, and all dummy key-value pairs, obfuscation key-value pairs, and key-value pairs that do not belong to the given Reduce task are discarded.
In the sixth step, at the final output-to-DFS stage of each Reduce task (the cleanup function), obfuscation key-value pairs are added again according to the number of key types of the data set, so that the output of every Reduce task is equal in size.
Further, the second step specifically includes: in the Map sub-stage of the Mapping stage layer of MapReduce, the map main function arranges the input data into (t.key, t.value) key-value pair format for preprocessing. The input data is uniformly distributed to the Map tasks, i.e., each Map task receives input of size |D|/M; the input is decrypted and then read line by line into intermediate key-value pairs <K, V> by Map_function. Here |D| denotes the size of the encrypted input data set, M denotes the number of Map tasks, and Map_function is the map main function in each Map task.
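The even |D|/M split can be sketched as follows; this is an illustrative model of the distribution rule only (the function name is ours, and real Hadoop splits by byte ranges rather than record round-robin).

```python
def distribute_splits(encrypted_records, num_maps):
    """Deal |D| encrypted records out so that each of the M map tasks
    reads roughly |D|/M records (round-robin sketch of the even split)."""
    splits = [[] for _ in range(num_maps)]
    for i, record in enumerate(encrypted_records):
        splits[i % num_maps].append(record)
    return splits
```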
Further, the third step specifically includes: a Combiner_function is deployed in each Map task. When Map_function produces an output encrypted key-value pair {|t|}, the Combiner_function first decrypts it and then merges the key-value pairs sharing the same key using Reduce_function, where Reduce_function is the reduce main function in each Reduce task and Combiner_function is the main function of the Combine sub-stage. As a result, a merged key-value pair group is generated in each map task in which the key of every pair is unique; its key-value pair format is written (K_authentic, V_authentic), and this group is called the trusted key-value pair group.
A map task map_i ∈ {map_1, …, map_M}, with task subscript m ∈ U(1, M), generates a trusted key-value pair group T_m = {<K_1, V_1>, …, <K_|Tm|, V_|Tm|>}. When |T_m| < K, map_i generates D_m = K - |T_m| dummy key-value pairs <K_d1, V_d1>, …, <K_dDm, V_dDm>, where K is the number of key types in the original data set and |T_m| is the number of pairs in the trusted key-value pair group. Each generated dummy pair has the format (K_dummy, V_dummy), where K_dummy and V_dummy are preset marked dummy values. Thus, in the Partition sub-stage, each map task holds |T_m| trusted pairs and D_m dummy pairs, <K_1, V_1>, …, <K_K, V_K>, which are encrypted and distributed to the Reduce tasks of the Reduce sub-stage. A certain number of obfuscation key-value pairs <K_f1, V_f1>, …, <K_fK', V_fK'> may additionally be added in order to hide the number K of key types of the real data set, where the number of obfuscation pairs is denoted K' and their key-value format likewise uses specially marked dummy values (K_fake, V_fake). In that case, in the Partition sub-stage each map task holds |T_m| trusted pairs, D_m dummy pairs, and K' obfuscation pairs, <K_1, V_1>, …, <K_{K+K'}, V_{K+K'}>, which are encrypted and distributed to the Reduce tasks of the Reduce sub-stage.
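The padding rule above (D_m = K - |T_m| dummy pairs, plus K' obfuscation pairs under Safe Shuffle) can be sketched as follows. This is a minimal model: the marker strings and the function name are our placeholders for the preset marked dummy values (K_dummy, V_dummy) and (K_fake, V_fake).

```python
K_DUMMY = "__dummy__"   # placeholder for the preset marked dummy key
K_FAKE = "__fake__"     # placeholder for the preset marked obfuscation key

def pad_map_output(trusted_pairs, k_total, k_extra=0):
    """Pad one map task's merged output so every map task emits the same
    number of pairs: K pairs for Full Shuffle, K + K' for Safe Shuffle.

    trusted_pairs: merged <key, value> pairs, one per distinct real key
    k_total:       K, the number of key types in the whole data set
    k_extra:       K', extra obfuscation pairs hiding the true value of K
    """
    assert len(trusted_pairs) <= k_total
    dummies = [(K_DUMMY, None)] * (k_total - len(trusted_pairs))
    fakes = [(K_FAKE, None)] * k_extra
    return list(trusted_pairs) + dummies + fakes
```

With this rule, two map tasks with very different real key counts still emit identically sized (encrypted) outputs, which is what removes the per-map volume signal.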
Further, the fourth step specifically includes: the Partition function of the Partition sub-stage distributes the merged key-value pairs to each Reduce task of the Reduce sub-stage so that the input of every Reduce task is equal in size. In the Partition function, each encrypted key-value pair {|t|} = <t.key, t.value> is first decrypted, and the pair t is then fully distributed through the full distribution function Full_Partition_function. Full_Partition_function is executed R times for each pair t, generating in turn the distribution rule Full_Partition_function(t.key) = r (r = 1, 2, …, R). Let the Reduce tasks of the Reduce sub-stage be reduce_j ∈ {reduce_1, …, reduce_R}. Under the Full Shuffle scheme, each task reduce_j will receive K = D_m + |T_m| encrypted key-value pairs, comprising the trusted key-value pair group (K_authentic, V_authentic) and dummy pairs (K_dummy, V_dummy); under the Safe Shuffle scheme, it will receive K + K' encrypted key-value pairs, comprising the trusted key-value pair group (K_authentic, V_authentic), dummy pairs (K_dummy, V_dummy), and obfuscation pairs (K_fake, V_fake).
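The full distribution rule can be sketched as below: every (already padded) pair is replicated to all R reducers, so each reducer's input size is independent of the key distribution. The function name mirrors Full_Partition_function but the code is our illustrative model, not the patented implementation.

```python
def full_partition(pairs, num_reducers):
    """Full-distribution sketch: send every padded key-value pair to every
    one of the R reduce tasks, so all reducers receive identical input."""
    buckets = [[] for _ in range(num_reducers)]
    for pair in pairs:
        for r in range(num_reducers):   # replicate the pair to reducer r
            buckets[r].append(pair)
    return buckets
```

The cost of this equality is R-fold replication of the intermediate traffic, which is the performance price the scheme pays for hiding the map-to-reduce flow.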
Further, the fifth step specifically includes: each reduce_j task of the Reduce sub-stage receives the encrypted key-value pairs {|t_i|} transmitted by each map_i task via the Partition function of the Partition sub-stage, decrypts them, and examines each resulting pair t = (t.key, t.value). If t.key = K_dummy or K_fake, the task reduce_j discards the pair. Each task reduce_j then processes the trusted pairs by invoking the user-defined distribution function User_Partition_function referenced in the main function Reduce_function: if User_Partition_function(t.key) = j, the pair is retained; otherwise it is discarded. For reduce_j, whether a pair is kept or cut is thus determined by comparing this value over r ∈ U(1, R) against the current reduce task index j. The retained pairs are then processed in order by the reduce main function Reduce_function, which groups and merges the pairs sharing the same key; as a result, each reduce_j task produces a group of key-value pairs with unique keys, called the trusted key-value pair group T_m and written (K_authentic, V_authentic).
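The reduce-side filtering can be sketched as follows. The hash-based user_partition is our stand-in for an arbitrary user-defined User_Partition_function, and the marker strings match the placeholders used earlier; none of this is the patented code itself.

```python
def user_partition(key, num_reducers):
    """Stand-in for the user-defined User_Partition_function."""
    return hash(key) % num_reducers

def reduce_filter(received, j, num_reducers):
    """Inside reduce task j: drop dummy and obfuscation pairs, then drop
    trusted pairs that the user partition function assigns elsewhere."""
    kept = []
    for key, value in received:
        if key in ("__dummy__", "__fake__"):
            continue                       # padding pairs are discarded
        if user_partition(key, num_reducers) == j:
            kept.append((key, value))      # pair really belongs to reducer j
    return kept
```

Because every reducer receives the full padded stream and filters locally, an outside observer sees identical traffic to every reducer while each real pair is processed by exactly one of them.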
Further, the sixth step specifically includes: for the Full Shuffle scheme, each reduce_j task generates a trusted key-value pair group T_m; when its size |T_m| < K, reduce_j generates D_r = K - |T_m| dummy key-value pairs <K_d1, V_d1>, …, <K_dDr, V_dDr>, where the task subscript is r ∈ U(1, R), K is the number of key types in the original data set, and |T_m| is the number of pairs in the trusted key-value pair group. Each generated dummy pair has the format (K_dummy, V_dummy), where K_dummy and V_dummy are preset marked dummy values. Finally, at the store-to-DFS stage, each reduce task holds |T_m| trusted pairs and D_r dummy pairs, i.e. <K_1, V_1>, …, <K_K, V_K>, which are encrypted and stored in the DFS.
For the Safe Shuffle scheme, each reduce_j task generates a trusted key-value pair group T_m; when |T_m| < K + K', reduce_j generates D_r = K + K' - |T_m| dummy key-value pairs <K_d1, V_d1>, …, <K_dDr, V_dDr>, with the same notation as above and dummy pairs again in the format (K_dummy, V_dummy). Finally, at the store-to-DFS stage, each reduce task holds |T_m| trusted pairs and D_r dummy pairs, <K_1, V_1>, …, <K_{K+K'}, V_{K+K'}>, which are encrypted and stored in the DFS.
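The cleanup-stage padding for both schemes reduces to one rule: pad each reducer's trusted group back up to K (Full Shuffle) or K + K' (Safe Shuffle) pairs before writing to the DFS. A minimal sketch, reusing our placeholder dummy marker:

```python
def cleanup_pad_output(trusted_pairs, k_total, k_extra=0):
    """Before a reduce task stores its result in the DFS, pad the trusted
    group to K (Full Shuffle, k_extra=0) or K + K' (Safe Shuffle) pairs,
    so every reduce task stores an equal-size encrypted output."""
    target = k_total + k_extra
    assert len(trusted_pairs) <= target
    padding = [("__dummy__", None)] * (target - len(trusted_pairs))
    return list(trusted_pairs) + padding
```

This final padding closes the last volume side channel: the stored outputs of all R reduce tasks are indistinguishable by size.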
Another object of the present invention is to provide a data confidentiality system for MapReduce computation implementing the above method, comprising:
the Map sub-stage processing module, used by the system to preprocess input data into key-value pair form in the Map sub-stage of the Mapping stage layer of MapReduce;
the Combine sub-stage processing module, used to merge the key-value pairs within each Map task of the Map sub-stage by key and, using the number of key types of the data set known in advance, to add dummy key-value pairs so that the output of every Map task is equal in size, or additionally to add obfuscation key-value pairs so that the number of key types of the real data set is hidden;
the Partition sub-stage processing module, used by the Partition function of the Partition sub-stage to distribute the merged key-value pairs to each Reduce task of the Reducing stage layer so that the input of every Reduce task is equal in size;
the Reduce sub-stage processing module, used to process the data received by each Reduce task of the Reduce sub-stage and to discard all dummy key-value pairs, obfuscation key-value pairs, and key-value pairs not belonging to the given Reduce task;
and the Reduce sub-stage end processing module, used at the final output-to-DFS stage of each Reduce task (the cleanup function) to add obfuscation key-value pairs again according to the number of key types of the data set, so that the output of every Reduce task is equal in size.
Another object of the present invention is to provide an information data processing terminal implementing the data confidentiality method for MapReduce computation.
Another object of the present invention is to provide a computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the data confidentiality method for MapReduce computation.
The invention further aims to provide a cloud computing data confidentiality system applying the data confidentiality method for MapReduce computation.
In summary, the advantages and positive effects of the invention are as follows. The invention proposes two novel data shuffling schemes under the standard MapReduce computation framework, Full Shuffle and Safe Shuffle, and defines three novel sub-stages at the Mapping stage layer (a Map sub-stage, a Combine sub-stage, and a Partition sub-stage) and a Reduce sub-stage at the Reducing stage layer. In the Map sub-stage of the Mapping stage layer of MapReduce, the system preprocesses input data into key-value pair form. The Combine sub-stage merges the key-value pairs within each Map task by key and, using the number of key types of the data set known in advance, adds dummy key-value pairs so that the output of every Map task is equal in size, or additionally adds obfuscation key-value pairs so that the number of key types of the real data set is hidden. The Partition function of the Partition sub-stage distributes the merged key-value pairs to each Reduce task of the Reducing stage layer so that the input of every Reduce task is equal in size. The data received by each Reduce task is processed in the Reduce sub-stage, and all dummy key-value pairs, obfuscation key-value pairs, and key-value pairs not belonging to the given Reduce task are discarded. At the final output-to-DFS stage of each Reduce task (the cleanup function), obfuscation key-value pairs are added again according to the number of key types of the data set, so that the output of every Reduce task is equal in size. The method and system thereby keep the user's data confidential throughout the execution of a job on a cloud computing platform.
Compared with the prior art, the invention has the following beneficial effects:
(1) In the data confidentiality method and system for MapReduce computation of the invention, a dedicated full distribution function is written into the partition function. This function fully distributes the data in each map task to all reduce tasks, so that the outputs of all map tasks are equal in number and size, and the counts are obscured by the inclusion of dummy data. The statistical distribution of output volume therefore bears no clear relation to the input data, and an attacker cannot infer the data by tracking the traffic from each map task to the reduce tasks; that is, by observation the attacker cannot distinguish which output corresponds to which input of the same size, achieving indistinguishability of map outputs.
(2) In the data confidentiality method and system for MapReduce computation of the invention, the volumes of data received, output, and stored by every reduce task are equal, so accumulated statistics over these volumes carry no inferential value. This prevents an attacker from using repeated runs over different data to trace the correspondence between map tasks and reduce tasks, achieving indistinguishability of reduce inputs.
The method and system protect data and privacy under the MapReduce framework in a remote execution environment, preventing a malicious observer from obtaining the application's private data through side-channel attacks.
Drawings
FIG. 1 is a flowchart of the data confidentiality method for MapReduce computation provided by an embodiment of the present invention.
FIG. 2 is a schematic structural diagram of the data confidentiality system for MapReduce computation provided by an embodiment of the present invention;
in the figure: 1. a Map sub-stage processing module; 2. a combination sub-phase processing module; 3. a Partition sub-stage processing module; 4. a Reduce sub-stage processing module; 5. reduce sub-phase end processing module.
FIG. 3 is a diagram of the original MapReduce framework provided by an embodiment of the invention.
FIG. 4 is a diagram of the first scheme, Full Shuffle, provided by an embodiment of the present invention.
FIG. 5 is a diagram of the second scheme, Safe Shuffle, provided by an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Aiming at the problems in the prior art, the present invention provides a data confidentiality method and system for MapReduce computation; the invention is described in detail below with reference to the accompanying drawings.
As shown in FIG. 1, the data confidentiality method for MapReduce computation provided by the embodiment of the present invention includes the following steps:
S101, a sub-stage definition step: two novel data shuffling schemes are proposed under the standard MapReduce computation framework, Full Shuffle and Safe Shuffle; three novel sub-stages are defined at the Mapping stage layer (a Map sub-stage, a Combine sub-stage, and a Partition sub-stage), and a Reduce sub-stage is defined at the Reducing stage layer;
S102, a Map sub-stage processing step: in the Map sub-stage of the Mapping stage layer of MapReduce, the system preprocesses the input data into key-value pair form;
S103, a Combine sub-stage processing step: the Combine sub-stage merges the key-value pairs within each Map task of the Map sub-stage by key and, using the number of key types of the data set known in advance, adds dummy key-value pairs so that the output of every Map task is equal in size, or additionally adds obfuscation key-value pairs so that the number of key types of the real data set is hidden;
S104, a Partition sub-stage processing step: the Partition function of the Partition sub-stage distributes the merged key-value pairs to each Reduce task of the Reducing stage layer so that the input of every Reduce task is equal in size;
S105, a Reduce sub-stage processing step: the data received by each Reduce task is processed in the Reduce sub-stage, and all dummy key-value pairs, obfuscation key-value pairs, and key-value pairs not belonging to the given Reduce task are discarded;
S106, a Reduce sub-stage end processing step: at the final output-to-DFS stage of each Reduce task (the cleanup function), obfuscation key-value pairs are added again according to the number of key types of the data set, so that the output of every Reduce task is equal in size.
As shown in fig. 2, the data security system for MapReduce computation provided by the embodiment of the present invention includes:
the Map sub-stage processing module 1, used for the system to arrange the input data into key-value pairs for preprocessing in the Map sub-stage of the Mapping stage level of MapReduce;
the Combine sub-stage processing module 2, used for merging the key-value pairs in each map task of the Map sub-stage by key and adding dummy key-value pairs according to the known number of key types in the data set so that every map task outputs the same amount of data, or additionally adding obfuscation key-value pairs so that the number of key types in the real data set is hidden;
the Partition sub-stage processing module 3, used for the partition function of the Partition sub-stage to distribute the merged key-value pairs to each reduce task of the Reducing stage level so that every reduce task of the Reduce sub-stage receives input data of equal size;
the Reduce sub-stage processing module 4, used for processing the data received by each reduce task of the Reduce sub-stage and discarding all dummy key-value pairs, obfuscation key-value pairs and key-value pairs not belonging to the merge of the given reduce task;
and the Reduce sub-stage end processing module 5, used for, at the final output-to-DFS stage (cleanUp function) of each reduce task of the Reduce sub-stage, adding obfuscation key-value pairs again according to the number of key types in the data set so that every reduce task outputs data of equal size.
The technical solution of the present invention is further described below with reference to the accompanying drawings.
In specific implementation, before the MapReduce computation data can be kept confidential, the MapReduce data must be encrypted. Specifically, data encryption is built on the premise that the MapReduce framework runs in a secure execution environment. The secure execution environment may be implemented using Trusted Execution Environment (TEE) technology, such as Intel SGX. MapReduce generally decomposes a job into tasks, divided into map tasks and reduce tasks, which are executed by the nodes of a cluster. In the present invention, each task is deployed in the trusted execution environment for execution, so that the task is kept confidential at runtime, but the data still needs to be protected while being transmitted between different tasks. The present invention therefore transmits the data between tasks in encrypted form.
Because the secure execution environment contains only the code with which each task processes data in MapReduce, such as the map tasks and reduce tasks of standard MapReduce, while the rest of the Hadoop distributed infrastructure need not be trusted, an attacker cannot directly obtain the plaintext of the encrypted data at runtime.
Although encrypting the data ensures that an attacker cannot directly obtain or modify its plaintext at runtime, a malicious observer may still, after the above processing, record the exchange of encrypted data, such as the data exchanged between the nodes of the MapReduce system (network traffic analysis) or between each node and storage (storage traffic analysis), where the exchanged volume is measured in bytes, pages, packets or records. On the basis of prior statistical knowledge, an observer can derive the statistical distribution of the input data, and thus obtain sensitive information by observing and analyzing the traffic between the map tasks and the reduce tasks, leaking privacy.
The data confidentiality method of the present invention keeps data secret from the aspects of the indistinguishability of map task input and output and the indistinguishability of reduce task input and output.
Indistinguishability of map task input: in the preprocessing stage of the job, the original data set is divided evenly into M sample data sets of identical size, which are sent to M different map tasks for processing; the input data volume of every map task is thus the same, preventing an attacker from observing the correspondence between the storage unit DFS and the individual map tasks.
Indistinguishability of map task output: a specific full distribution function is written into the partition function; this distribution spreads the data in the map tasks evenly over the reduce tasks, and the quantity and size of the data output by all map tasks are equal, so that no statistical relation between the data volume and the input data remains. An attacker therefore cannot infer the data by tracking the traffic from each map task to the reduce tasks; that is, by observing inputs of the same size the attacker cannot distinguish their corresponding outputs.
Indistinguishability of reduce task input: the data volume received by every reduce task is equal and carries no statistical value for inference, preventing an attacker from repeatedly running different data to track the correspondence between map tasks and reduce tasks.
Indistinguishability of reduce task output: a storage function with a specific rule is written into the cleanUp function of each reduce task; this rule unifies the output quantity and size of every reduce task with those of every map task, so that an attacker cannot infer inherent characteristics of the data by tracking the traffic from each reduce task to the DFS.
Specifically, in order to achieve data confidentiality in terms of the indistinguishability of map task input and output and of reduce task input and output, the present invention is realized through the following two embodiments.
Example 1
As shown in fig. 4, compared with the standard MapReduce process (the standard MapReduce framework, as shown in fig. 3), this embodiment mainly defines three new sub-stages at the Mapping stage level, the Map sub-stage, the Combine sub-stage and the Partition sub-stage, and defines the Reduce sub-stage at the Reducing stage level. The scheme satisfies the indistinguishability of map task input and output, and the data volumes received and stored by the individual reduce tasks, satisfying the indistinguishability of reduce-side input and output, are equal in size.
Let D be the data set input by the user when submitting a job, |D| the size of the input data, |.| the size after encryption processing, M the number of map tasks, and R the number of reduce tasks; Map_function is the map main function in each map task, Reduce_function is the reduce main function in each reduce task, User_Partition_function is the user-defined distribution function, Full_Partition_function is the full distribution function, and Combiner_function is the main function of the Combine sub-stage. The processing of the Map sub-stage, the Combine sub-stage, the Partition sub-stage and the Reduce sub-stage is described below in turn.
Map sub-stage: the map main function Map_function preprocesses the input data into key-value pairs of the format t = (t.key, t.value). The input data are distributed evenly to the map tasks, i.e., each map task receives an input of size |D|/M; the input data are first decrypted and then read line by line through Map_function into intermediate key-value pairs <K, V>.
Combine sub-stage: Combiner_function is deployed in each map task. When Map_function produces output encrypted key-value pairs |t|, Combiner_function first decrypts them and then uses Reduce_function to merge key-value pairs having the same key. As a result, a merged group of key-value pairs is generated in each map task, in which every key is unique; with pairs written in the format (K_authentic, V_authentic), this group is called the trusted key-value pair group. Suppose a map task map_i produces the trusted group T_m = {<K_1, V_1>, …, <K_|Tm|, V_|Tm|>}. When |T_m| < K, map_i generates D_m = K - |T_m| dummy key-value pairs <K_d1, V_d1>, …, <K_dDm, V_dDm>, where map_i ∈ {map_1, …, map_M}, the task subscript m ∈ U(1, M), K is the number of key types in the original data set, and |T_m| is the number of pairs in the trusted group. The generated dummy pairs have the format (K_dummy, V_dummy), where K_dummy and V_dummy are preset marked dummy values. Finally, entering the Partition sub-stage, each map task holds |T_m| trusted pairs and D_m dummy pairs, i.e. <K_1, V_1>, …, <K_K, V_K>, which are encrypted and distributed to every reduce task of the Reduce sub-stage; this scheme is called Full Shuffle.
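As a minimal sketch, the dummy-padding rule D_m = K - |T_m| of the Combine sub-stage can be written as follows; pad_with_dummies and the dummy marker are illustrative names, not identifiers from the patent.

```python
DUMMY = ("<dummy>", None)   # preset marked dummy pair (K_dummy, V_dummy)

def pad_with_dummies(trusted_pairs, k_types):
    """Append D_m = K - |T_m| dummy pairs so every map task emits exactly
    k_types key-value pairs, whatever its real key diversity."""
    d_m = k_types - len(trusted_pairs)
    return list(trusted_pairs) + [DUMMY] * d_m

# two map tasks with different real outputs now emit equal-sized groups
out1 = pad_with_dummies([("a", 2), ("b", 1)], k_types=4)
out2 = pad_with_dummies([("c", 5)], k_types=4)
assert len(out1) == len(out2) == 4
```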
Partition sub-stage: the partition function distributes the merged key-value pairs to each reduce task of the Reduce sub-stage so that every reduce task receives input data of equal size. In the partition function of the Partition sub-stage, each encrypted key-value pair |t| = <t.key, t.value> is first decrypted, and the pair t is then fully distributed through the full distribution function Full_Partition_function; for each pair t, Full_Partition_function is executed R times, generating in turn the distribution rule Full_Partition_function(t.key) = r (r = 1, 2, …, R). Let the reduce tasks of the Reduce sub-stage be reduce_j ∈ {reduce_1, …, reduce_R}. Under this scheme, each task reduce_j receives K = D_m + |T_m| encrypted key-value pairs, comprising trusted pairs of the group (K_authentic, V_authentic) and dummy pairs (K_dummy, V_dummy).
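The full distribution rule, under which Full_Partition_function is executed R times per pair so that every reduce task receives an identical copy of the traffic, can be sketched as follows (an illustrative Python approximation; full_partition_function is an assumed name):

```python
def full_partition_function(pairs, R):
    """Assign every pair to each reducer r = 1..R in turn (executed R times
    per pair), so all reducers receive identical traffic."""
    buckets = {r: [] for r in range(1, R + 1)}
    for t in pairs:
        for r in range(1, R + 1):   # one assignment per reducer, per pair
            buckets[r].append(t)
    return buckets

buckets = full_partition_function([("a", 2), ("b", 1), ("<dummy>", None)], R=3)
assert all(len(b) == 3 for b in buckets.values())
```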
Reduce sub-stage: via the partition function of the Partition sub-stage, each reduce_j task receives the encrypted key-value pair groups {|t|_i} sent by the map_i tasks. The pairs |t| are first decrypted, and every resulting pair t = (t.key, t.value) is examined: if t.key = K_dummy, task reduce_j discards the pair. In the main function Reduce_function of each task reduce_j, the trusted pairs are processed by invoking the user-defined distribution function User_Partition_function: if User_Partition_function(t.key) = j, the pair is kept; otherwise it is discarded. For reduce_j, the pair is kept or dropped by comparing the value r = User_Partition_function(t.key) ∈ U(1, R) with the index j of the current reduce task. The retained pairs are then processed in turn by the reduce main function Reduce_function; pairs sharing the same key are grouped and merged, and as a result each reduce_j task generates a set of key-value pairs with unique keys, likewise called the trusted key-value pair group T_m and marked (K_authentic, V_authentic).
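A small sketch of the reduce-side filtering (illustrative only; the modular user_partition stand-in is an assumption, since the patent leaves User_Partition_function user-defined):

```python
def user_partition(key, R):
    """Assumed stand-in for the user-defined User_Partition_function."""
    return sum(map(ord, key)) % R + 1          # deterministic r in 1..R

def reduce_filter(j, received, R, dummy_key="<dummy>"):
    """Keep only trusted pairs whose partition value equals this task's
    index j; dummy pairs and pairs owned by other reduce tasks are dropped."""
    return [(k, v) for k, v in received
            if k != dummy_key and user_partition(k, R) == j]

received = [("a", 2), ("<dummy>", None), ("b", 1)]
kept = {j: reduce_filter(j, received, R=2) for j in (1, 2)}
# every real key survives in exactly one reduce task
assert sorted(k for j in kept for k, _ in kept[j]) == ["a", "b"]
```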
End of Reduce sub-stage: under this scheme, for the trusted group T_m generated by each reduce_j task, when |T_m| < K, reduce_j generates D_r = K - |T_m| dummy key-value pairs <K_d1, V_d1>, …, <K_dDr, V_dDr>, where the task subscript r ∈ U(1, R), K is the number of key types in the original data set, and |T_m| is the number of pairs in the trusted group. The generated dummy pairs have the format (K_dummy, V_dummy), where K_dummy and V_dummy are preset marked dummy values. Finally, in the store-to-DFS stage, each reduce task holds |T_m| trusted pairs and D_r dummy pairs, i.e. <K_1, V_1>, …, <K_K, V_K>, which are encrypted and stored into the DFS.
Specifically, this embodiment can in practice be implemented as a standard job (one MapReduce task is called a job). The embodiment secures the data through uniform transmission: when two different data sets of the same size, each with K known key types, run under this scheme, the traffic of the map task input and output and of the reduce task input and output observed by a monitor is identical.
Example 2
This embodiment realizes data confidentiality by adding fake data, and additionally protects the number K of key types. After MapReduce is rewritten as described, it satisfies the indistinguishability of map task input and output, and every reduce task satisfying the indistinguishability of reduce-side input and output receives data of equal size.
As shown in fig. 5, compared with the standard MapReduce process (the standard MapReduce framework, as shown in fig. 3), this embodiment mainly defines three new sub-stages at the Mapping stage level, the Map sub-stage, the Combine sub-stage and the Partition sub-stage, and defines the Reduce sub-stage at the Reducing stage level. The scheme satisfies the indistinguishability of map task input and output, and the data volumes received and stored by the individual reduce tasks, satisfying the indistinguishability of reduce-side input and output, are equal in size.
Let D be the data set input by the user when submitting a job, |D| the size of the input data, |.| the size after encryption processing, M the number of map tasks, and R the number of reduce tasks; Map_function is the map main function in each map task, Reduce_function is the reduce main function in each reduce task, User_Partition_function is the user-defined distribution function, Full_Partition_function is the full distribution function, and Combiner_function is the main function of the Combine sub-stage. The processing of the Map sub-stage, the Combine sub-stage, the Partition sub-stage and the Reduce sub-stage is described below in turn.
Map sub-stage: the map main function Map_function preprocesses the input data into key-value pairs of the format t = (t.key, t.value). The input data are distributed evenly to the map tasks, i.e., each map task receives an input of size |D|/M; the input data are first decrypted and then read line by line through Map_function into intermediate key-value pairs <K, V>.
Combine sub-stage: Combiner_function is deployed in each map task. When Map_function produces output encrypted key-value pairs |t|, Combiner_function first decrypts them and then uses Reduce_function to merge key-value pairs having the same key. As a result, a merged group of key-value pairs is generated in each map task, in which every key is unique; with pairs written in the format (K_authentic, V_authentic), this group is called the trusted key-value pair group. In particular, for safety, this scheme adds a certain number of obfuscation key-value pairs <K_f1, V_f1>, …, <K_fK', V_fK'> on top of the Full Shuffle scheme so as to hide the number K of key types of the real data set; the number of obfuscation pairs is denoted K', and their format likewise uses dummy values with a special mark, (K_fake, V_fake). Finally, entering the Partition sub-stage, each map task holds |T_m| trusted pairs, D_m dummy pairs and K' obfuscation pairs, i.e. <K_1, V_1>, …, <K_{K+K'}, V_{K+K'}>, which are encrypted and distributed to every reduce task of the Reduce sub-stage; this scheme is called Safe Shuffle.
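The Safe Shuffle padding, which first pads to K pairs with dummy pairs and then appends K' obfuscation pairs, might look as follows (a sketch under assumed names; K' is a fixed constant as in the scheme):

```python
DUMMY = ("<dummy>", None)   # (K_dummy, V_dummy), preset marked dummy value
FAKE = ("<fake>", None)     # (K_fake, V_fake), obfuscation pair, special mark

def safe_shuffle_pad(trusted_pairs, k_types, k_prime):
    """Pad with dummies up to k_types pairs, then append k_prime obfuscation
    pairs, hiding both per-task skew and the true number of key types."""
    padded = list(trusted_pairs) + [DUMMY] * (k_types - len(trusted_pairs))
    return padded + [FAKE] * k_prime

out = safe_shuffle_pad([("a", 2)], k_types=4, k_prime=3)
assert len(out) == 4 + 3    # every map task emits K + K' pairs
```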
Partition sub-stage: the partition function distributes the merged key-value pairs to each reduce task of the Reduce sub-stage so that every reduce task receives input data of equal size. In the partition function of the Partition sub-stage, each encrypted key-value pair |t| = <t.key, t.value> is first decrypted, and the pair t is then fully distributed through the full distribution function Full_Partition_function; for each pair t, Full_Partition_function is executed R times, generating in turn the distribution rule Full_Partition_function(t.key) = r (r = 1, 2, …, R). Let the reduce tasks of the Reduce sub-stage be reduce_j ∈ {reduce_1, …, reduce_R}. Under this scheme, each task reduce_j receives K + K' (K = D_m + |T_m|) encrypted key-value pairs, comprising trusted pairs of the group (K_authentic, V_authentic), dummy pairs (K_dummy, V_dummy) and obfuscation pairs (K_fake, V_fake).
Reduce sub-stage: via the partition function of the Partition sub-stage, each reduce_j task receives the encrypted key-value pair groups {|t|_i} sent by the map_i tasks. The pairs |t| are first decrypted, and every resulting pair t = (t.key, t.value) is examined: if t.key = K_dummy or t.key = K_fake, task reduce_j discards the pair. In the main function Reduce_function of each task reduce_j, the trusted pairs are processed by invoking the user-defined distribution function User_Partition_function: if User_Partition_function(t.key) = j, the pair is kept; otherwise it is discarded. For reduce_j, the pair is kept or dropped by comparing the value r = User_Partition_function(t.key) ∈ U(1, R) with the index j of the current reduce task. The retained pairs are then processed in turn by the reduce main function Reduce_function; pairs sharing the same key are grouped and merged, and as a result each reduce_j task generates a set of key-value pairs with unique keys, likewise called the trusted key-value pair group T_m and marked (K_authentic, V_authentic).
End of Reduce sub-stage: under this scheme, when |T_m| < K, reduce_j generates D_r = K + K' - |T_m| dummy key-value pairs <K_d1, V_d1>, …, <K_dDr, V_dDr>, where the task subscript r ∈ U(1, R), K is the number of key types in the original data set, and |T_m| is the number of pairs in the trusted group. The generated dummy pairs have the format (K_dummy, V_dummy), where K_dummy and V_dummy are preset marked dummy values. Finally, in the store-to-DFS stage, each reduce task holds |T_m| trusted pairs and D_r dummy pairs, i.e. <K_1, V_1>, …, <K_{K+K'}, V_{K+K'}>, which are encrypted and stored into the DFS.
This embodiment is likewise implemented as a standard job. The job retains a complete MapReduce process; the map tasks of job 2 perform only a simple copy, and the scheme realizes data confidentiality by adding fake data during the transmission between the map tasks and the reduce tasks, so that differing key counts are protected from leaking. When two different data sets of the same size run under this scheme, the input and output traffic of the map tasks and of the reduce tasks remains exactly equal. In this embodiment the number of added obfuscation pairs is a constant K', and in the customized partition function the obfuscation data and the original real data can be distributed evenly, so that the traffic on every path from the map tasks to the reduce tasks is equal.
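The traffic-equality property claimed above can be checked with a small sketch (illustrative Python; shuffle_traffic is an assumed name): two same-size splits with very different key distributions produce identical per-reducer traffic under Safe Shuffle.

```python
from collections import Counter

K_TYPES, K_PRIME = 4, 2
DUMMY, FAKE = ("<dummy>", None), ("<fake>", None)

def shuffle_traffic(split):
    """Number of pairs a map task sends to EACH reduce task under Safe Shuffle
    (the full distribution copies all pairs to every reducer)."""
    merged = Counter(split)
    pairs = list(merged.items())
    pairs += [DUMMY] * (K_TYPES - len(pairs)) + [FAKE] * K_PRIME
    return len(pairs)

t1 = shuffle_traffic(["a", "a", "a", "a"])   # heavily skewed: 1 real key
t2 = shuffle_traffic(["a", "b", "c", "d"])   # uniform: 4 real keys
assert t1 == t2 == K_TYPES + K_PRIME         # observer sees identical traffic
```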
The technical effects of the present invention will be described in detail with reference to experiments.
Experimental results show that the scheme of the present invention incurs limited performance overhead and sometimes even runs faster than standard MapReduce: over all experimental tasks, the Full Shuffle scheme is 2.5% faster than standard MapReduce, and when the number of added obfuscation key-value pairs is 8, 16 and 32, the Safe Shuffle scheme is respectively 0.8% faster, 4.9% slower and 11.6% slower than standard MapReduce. Compared with the Shuffle-in-the-Middle scheme and the Shuffle & Balance scheme, performance improves by 212.7% and 107.5% respectively.
It should be noted that the embodiments of the present invention can be realized by hardware, software, or a combination of software and hardware. The hardware portion may be implemented using dedicated logic; the software portion may be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or specially designed hardware. Those skilled in the art will appreciate that the apparatus and methods described above may be implemented using computer-executable instructions and/or embodied in processor control code, such code being provided on a carrier medium such as a disk, CD- or DVD-ROM, a programmable memory such as read-only memory (firmware), or a data carrier such as an optical or electronic signal carrier. The apparatus and its modules of the present invention may be implemented by hardware circuits such as very-large-scale integrated circuits or gate arrays, semiconductors such as logic chips and transistors, or programmable hardware devices such as field-programmable gate arrays and programmable logic devices, or by software executed by various types of processors, or by a combination of the above hardware circuits and software, e.g., firmware.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (10)

1. A data confidentiality method for MapReduce computation, characterized in that the data confidentiality method for MapReduce computation comprises the following steps:
in a first step, two novel data shuffling schemes, Full Shuffle and Safe Shuffle, are proposed under the standard MapReduce computation framework; three novel sub-stages are defined at the Mapping stage level, namely a Map sub-stage, a Combine sub-stage and a Partition sub-stage, and a Reduce sub-stage is defined at the Reducing stage level;
in a second step, in the Map sub-stage of the Mapping stage level of MapReduce, the system arranges the input data into key-value pairs for preprocessing;
in a third step, the Combine stage merges the key-value pairs in each map task of the Map sub-stage by key, and adds dummy key-value pairs according to the known number of key types in the data set so that every map task outputs the same amount of data, or additionally adds obfuscation key-value pairs so that the number of key types in the real data set is hidden;
in a fourth step, the partition function of the Partition sub-stage distributes the merged key-value pairs to each reduce task of the Reducing stage level, so that every reduce task of the Reduce sub-stage receives input data of equal size;
in a fifth step, the data received by each reduce task of the Reduce sub-stage is processed, and all dummy key-value pairs, obfuscation key-value pairs and key-value pairs not belonging to the merge of the given reduce task are discarded;
in a sixth step, at the final output-to-DFS stage (cleanUp function) of each reduce task of the Reduce sub-stage, obfuscation key-value pairs are added again according to the number of key types in the data set, so that every reduce task outputs data of equal size.
2. The data confidentiality method for MapReduce computation of claim 1, characterized in that the second step specifically comprises: in the Map sub-stage of the Mapping stage level of MapReduce, the map main function preprocesses the input data into key-value pairs of the format t = (t.key, t.value); the input data are distributed evenly to the map tasks, i.e., each map task receives an input of size |D|/M; the input data are first decrypted and then read line by line through Map_function into intermediate key-value pairs <K, V>; where D denotes the input data set, |D| the size of the input data, |.| the size after encryption processing, M the number of map tasks, and Map_function the map main function in each map task.
3. The data confidentiality method for MapReduce computation of claim 1, characterized in that the third step specifically comprises: Combiner_function is deployed in each map task; when Map_function generates output encrypted key-value pairs |t|, Combiner_function first decrypts them and then uses Reduce_function to merge key-value pairs having the same key, where Reduce_function is the reduce main function in each reduce task and Combiner_function is the main function of the Combine sub-stage; as a result, a merged group of key-value pairs is generated in each map task, in which every key is unique, and this group, with pairs of the format (K_authentic, V_authentic), is called the trusted key-value pair group;
a map task map_i produces the trusted group T_m = {<K_1, V_1>, …, <K_|Tm|, V_|Tm|>}; when |T_m| < K, map_i generates D_m = K - |T_m| dummy key-value pairs <K_d1, V_d1>, …, <K_dDm, V_dDm>, where map_i ∈ {map_1, …, map_M}, the task subscript m ∈ U(1, M), K is the number of key types in the original data set, and |T_m| is the number of pairs in the trusted group; the generated dummy pairs have the format (K_dummy, V_dummy), where K_dummy and V_dummy are preset marked dummy values; finally, in the Partition sub-stage, each map task holds |T_m| trusted pairs and D_m dummy pairs, <K_1, V_1>, …, <K_K, V_K>, which are encrypted and distributed to every reduce task of the Reduce sub-stage; a certain number of obfuscation key-value pairs <K_f1, V_f1>, …, <K_fK', V_fK'> is added so as to hide the number K of key types of the real data set, where the number of obfuscation pairs is denoted K' and their format likewise uses dummy values with a special mark, (K_fake, V_fake); finally, in the Partition sub-stage, each map task holds |T_m| trusted pairs, D_m dummy pairs and K' obfuscation pairs, <K_1, V_1>, …, <K_{K+K'}, V_{K+K'}>, which are encrypted and distributed to every reduce task of the Reduce sub-stage.
4. The data confidentiality method for MapReduce computation of claim 1, characterized in that the fourth step specifically comprises: the partition function of the Partition sub-stage distributes the merged key-value pairs to each reduce task of the Reduce sub-stage so that every reduce task receives input data of equal size; in the partition function of the Partition sub-stage, each encrypted key-value pair |t| = <t.key, t.value> is first decrypted, and the pair t is then fully distributed through the full distribution function Full_Partition_function; for each pair t, Full_Partition_function is executed R times, generating in turn the distribution rule Full_Partition_function(t.key) = r (r = 1, 2, …, R); let the reduce tasks of the Reduce sub-stage be reduce_j ∈ {reduce_1, …, reduce_R}; for the Full Shuffle scheme, each task reduce_j receives K = D_m + |T_m| encrypted key-value pairs, including trusted pairs of the group (K_authentic, V_authentic) and dummy pairs (K_dummy, V_dummy); for the Safe Shuffle scheme, it receives K + K' encrypted key-value pairs, including trusted pairs of the group (K_authentic, V_authentic), dummy pairs (K_dummy, V_dummy) and obfuscation pairs (K_fake, V_fake).
5. The data confidentiality method for MapReduce computation of claim 1, characterized in that the fifth step specifically comprises: via the partition function of the Partition sub-stage, each reduce_j task of the Reduce sub-stage receives the encrypted key-value pair groups {|t|_i} sent by the map_i tasks; the pairs |t| are first decrypted, and every resulting pair t = (t.key, t.value) is examined; if t.key = K_dummy or t.key = K_fake, task reduce_j discards these pairs; in the main function Reduce_function of each task reduce_j, the trusted pairs are processed by invoking the user-defined distribution function User_Partition_function; if User_Partition_function(t.key) = j, the pair is kept, otherwise it is discarded; where for the subscript r of task reduce_j, the function value User_Partition_function(t.key) = r ∈ U(1, R), and the pair is kept or dropped by comparing r with the index j of the current reduce task; the retained pairs are then processed in turn by the reduce main function Reduce_function, pairs with the same key are grouped and merged, and as a result each reduce_j task generates a set of key-value pairs with unique keys, called the trusted key-value pair group T_m and marked (K_authentic, V_authentic).
6. The data confidentiality method for MapReduce computation of claim 1, characterized in that the sixth step specifically comprises: for the Full Shuffle scheme, for the trusted group T_m generated by each reduce_j task, when |T_m| < K, reduce_j generates D_r = K - |T_m| dummy key-value pairs <K_d1, V_d1>, …, <K_dDr, V_dDr>, where the task subscript r ∈ U(1, R), K is the number of key types in the original data set, and |T_m| is the number of pairs in the trusted group; the generated dummy pairs have the format (K_dummy, V_dummy), where K_dummy and V_dummy are preset marked dummy values; finally, in the store-to-DFS stage, each reduce task holds |T_m| trusted pairs and D_r dummy pairs, i.e. <K_1, V_1>, …, <K_K, V_K>, which are encrypted and stored into the DFS;
for the Safe Shuffle scheme, for the trusted group T_m generated by each reduce_j task, when |T_m| < K, reduce_j generates D_r = K + K' - |T_m| dummy key-value pairs <K_d1, V_d1>, …, <K_dDr, V_dDr>, where the task subscript r ∈ U(1, R), K is the number of key types in the original data set, and |T_m| is the number of pairs in the trusted group; the generated dummy pairs have the format (K_dummy, V_dummy), where K_dummy and V_dummy are preset marked dummy values; finally, in the store-to-DFS stage, each reduce task holds |T_m| trusted pairs and D_r dummy pairs, <K_1, V_1>, …, <K_{K+K'}, V_{K+K'}>, which are encrypted and stored into the DFS.
7. A data confidentiality system for MapReduce computation implementing the data confidentiality method for MapReduce computation of any one of claims 1 to 6, characterized in that the data confidentiality system for MapReduce computation comprises:
a Map sub-stage processing module, used for the system to arrange the input data into key-value pairs for preprocessing in the Map sub-stage of the Mapping stage level of MapReduce;
a Combine sub-stage processing module, used for merging the key-value pairs in each map task of the Map sub-stage by key and adding dummy key-value pairs according to the known number of key types in the data set so that every map task outputs the same amount of data, or additionally adding obfuscation key-value pairs so that the number of key types in the real data set is hidden;
a Partition sub-stage processing module, used for the partition function of the Partition sub-stage to distribute the merged key-value pairs to each reduce task of the Reducing stage level so that every reduce task of the Reduce sub-stage receives input data of equal size;
a Reduce sub-stage processing module, used for processing the data received by each reduce task of the Reduce sub-stage and discarding all dummy key-value pairs, obfuscation key-value pairs and key-value pairs not belonging to the merge of the given reduce task;
a Reduce sub-stage end processing module, used for, at the final output-to-DFS stage (cleanUp function) of each reduce task of the Reduce sub-stage, adding obfuscation key-value pairs again according to the number of key types in the data set so that every reduce task outputs data of equal size.
the Reduce sub-stage is used to store the final output of each reduce task in the Reduce sub-stage in the DFS stage (cleanUp function). According to the number of types of key values in the data set, the obfuscated key-value pairs are added again, so that the output of each reduce task is The data size is equal. 8.一种实现权利要求1~6任意一项所述针对MapReduce计算的数据保密方法的信息数据处理终端。8. An information data processing terminal that implements the data security method for MapReduce computing according to any one of claims 1 to 6. 9.一种计算机可读存储介质,包括指令,当其在计算机上运行时,使得计算机执行如权利要求1~6任意一项所述的针对MapReduce计算的数据保密方法。9. A computer-readable storage medium, comprising instructions that, when executed on a computer, cause the computer to execute the data security method for MapReduce computing as claimed in any one of claims 1 to 6. 10.一种应用权利要求1~6任意一项所述针对MapReduce计算的数据保密方法的云计算数据保密系统。10. A cloud computing data security system applying the data security method for MapReduce computing according to any one of claims 1 to 6.
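The Full Shuffle flow described in the claims above — replicate every map-side pair to all R reduce tasks, filter on the reduce side with a user-defined partition function, then pad each reducer's output to a fixed size — can be sketched as follows. This is a minimal illustrative sketch, not the patent's implementation: the names full_partition, user_partition and the dummy-key tag are assumptions, plain word-count integers stand in for encrypted values, and encryption and the Safe Shuffle obfuscated pairs are omitted.

```python
# Illustrative Full Shuffle sketch: every pair goes to all R reducers,
# each reducer keeps only its own keys and pads its output with tagged
# dummy pairs so every reducer emits exactly K pairs.
from collections import defaultdict

R = 3                    # number of reduce tasks
K = 4                    # number of distinct keys in the original data set (assumed known)
K_DUMMY = "__dummy__"    # tagged dummy key, discarded or ignored downstream

def full_partition(pair):
    """Full allocation: send a copy of the pair to every reduce task 0..R-1."""
    return [(r, pair) for r in range(R)]

def user_partition(key):
    """User-defined allocation function mapping a real key to one reducer."""
    return hash(key) % R

def reduce_task(j, received):
    """Keep trusted pairs assigned to reducer j, merge by key, pad to K pairs."""
    groups = defaultdict(list)
    for key, value in received:
        if key == K_DUMMY:            # discard dummy pairs
            continue
        if user_partition(key) != j:  # pair belongs to another reducer
            continue
        groups[key].append(value)
    merged = [(k, sum(vs)) for k, vs in groups.items()]  # word-count style merge
    # pad with dummy pairs so every reducer outputs exactly K pairs
    merged += [(K_DUMMY, 0)] * (K - len(merged))
    return merged

# toy map output (word-count style key-value pairs)
map_output = [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("d", 1)]

# shuffle: full allocation of every pair to all R reducers
inbox = defaultdict(list)
for pair in map_output:
    for r, p in full_partition(pair):
        inbox[r].append(p)

outputs = [reduce_task(j, inbox[j]) for j in range(R)]
# every reducer received the same number of pairs and emits exactly K pairs,
# so an observer of traffic and output sizes learns nothing about the key skew
assert all(len(inbox[j]) == len(map_output) for j in range(R))
assert all(len(out) == K for out in outputs)
```

The design point this illustrates: input and output sizes are identical across reduce tasks regardless of how the real keys are distributed, which is what hides the data distribution from a network- or storage-level observer.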
CN201911244325.3A 2019-12-06 2019-12-06 A data security method and system for MapReduce computing Active CN111163056B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911244325.3A CN111163056B (en) 2019-12-06 2019-12-06 A data security method and system for MapReduce computing


Publications (2)

Publication Number Publication Date
CN111163056A true CN111163056A (en) 2020-05-15
CN111163056B CN111163056B (en) 2021-08-31

Family

ID=70555723

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911244325.3A Active CN111163056B (en) 2019-12-06 2019-12-06 A data security method and system for MapReduce computing

Country Status (1)

Country Link
CN (1) CN111163056B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111865909A (en) * 2020-06-08 2020-10-30 西安电子科技大学 SGX side channel attack defense method, system, medium, program and application
CN113934723A (en) * 2020-07-13 2022-01-14 华为技术有限公司 Key value pair processing method and data processing device
CN115794740A (en) * 2022-11-23 2023-03-14 中国农业银行股份有限公司 Data analysis method, system and server
CN119697173A (en) * 2024-12-10 2025-03-25 中国工商银行股份有限公司 Data transmission method, device, equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105117286A (en) * 2015-09-22 2015-12-02 北京大学 Task scheduling and pipelining executing method in MapReduce
CN107025409A (en) * 2017-06-27 2017-08-08 中经汇通电子商务有限公司 A kind of data safety storaging platform
CN108595268A (en) * 2018-04-24 2018-09-28 咪咕文化科技有限公司 Data distribution method and device based on MapReduce and computer-readable storage medium
CN109145624A (en) * 2018-08-29 2019-01-04 广东工业大学 A kind of more chaos text encryption algorithms based on Hadoop platform
CN109684856A (en) * 2018-12-18 2019-04-26 西安电子科技大学 A kind of data encryption method and system for MapReduce calculating
US20190266195A1 (en) * 2010-12-28 2019-08-29 Microsoft Technology Licensing, Llc Filtering queried data on data stores


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YONGZHI WANG等: ""MtMR: Ensuring MapReduce Computation Integrity with Merkle Tree-Based Verifications"", 《IEEE TRANSACTIONS ON BIG DATA》 *


Also Published As

Publication number Publication date
CN111163056B (en) 2021-08-31

Similar Documents

Publication Publication Date Title
US12292994B2 (en) Encrypting data records and processing encrypted records without exposing plaintext
US10171432B2 (en) Systems to implement security in computer systems
CN112005237B (en) Secure collaboration between processors and processing accelerators in a secure enclave
US10148442B2 (en) End-to-end security for hardware running verified software
US20230319023A1 (en) Network bound encryption for orchestrating workloads with sensitive data
CN111163056B (en) A data security method and system for MapReduce computing
Akram et al. Performance analysis of scientific computing workloads on general purpose tees
Park et al. Secure hadoop with encrypted HDFS
US11755753B2 (en) Mechanism to enable secure memory sharing between enclaves and I/O adapters
Wang et al. Toward scalable fully homomorphic encryption through light trusted computing assistance
US20220171883A1 (en) Efficient launching of trusted execution environments
GB2532415A (en) Processing a guest event in a hypervisor-controlled system
WO2020258727A1 (en) Data encryption method, apparatus and device, and medium
US20250335576A1 (en) Efficient launching of trusted execution environment
Wu et al. Exploring dynamic task loading in SGX-based distributed computing
Ashalatha et al. Network virtualization system for security in cloud computing
CN109684856B (en) A data security method and system for MapReduce computing
CN103885725A (en) Virtual machine access control system and method based on cloud computing environment
Yao et al. CryptVMI: A flexible and encrypted virtual machine introspection system in the cloud
CN114528545B (en) A data protection method, apparatus, device, and storage medium
Sathya Narayana et al. Trusted model for virtual machine security in cloud computing
Chu et al. Secure cryptography infrastructures in the cloud
Fan et al. SECCEG: A secure and efficient cryptographic co-processor based on embedded GPU system
Rashid et al. Randomly encrypted key generation algorithm against side channel attack in cloud computing
Cheng et al. Building your private cloud storage on public cloud service using embedded GPUs

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant