Disclosure of Invention
To solve the technical problems described in the background art, a data security management method applied to a big data management platform is provided.
In order to achieve the above purpose, the invention adopts the following technical scheme:
The data security management method applied to the big data management platform comprises the following steps:
Acquiring a real-time dynamic big data set, segmenting the data set, and adding the segments to idle data processing queues;
Establishing a data threat identification model, detecting abnormal data risk threats on the platform in real time, and taking automatic response measures based on the threat type;
Applying a differential privacy technique to queries on stored data, and adjusting the amount of noise added at query time in real time according to the data risk threat situation;
Screening the dynamic data for sensitive data and backing it up to a cache based on the abnormal data risk threat situation;
Detecting the current processing speed and the backlog of pending tasks, and automatically adjusting resource allocation in combination with the abnormal data threat risk.
Preferably, acquiring the real-time dynamic big data set, segmenting the data set, and adding the segments to idle data processing queues specifically includes:
Acquiring the dynamic data set within a set time window, setting the size of the data processing segmentation window based on the remaining available resources of the server, and segmenting the dynamic data set according to the segmentation window size;
Acquiring the load of every data processing thread and the task backlog of its data processing task queue, and calculating an idle evaluation value for each data processing thread by combining the thread's set initial load weight with the ratio of its task backlog to its processing rate;
Sorting the idle evaluation values of all processing threads in reverse order, and adding the segmented dynamic data to the task waiting queues of the data processing threads in turn according to the sorted result.
Preferably, establishing the data threat identification model, detecting abnormal data risk threats on the platform in real time, and taking automatic response measures based on the threat type specifically includes:
Acquiring historical processing data of the system platform as training data, classifying the training data according to the set data classifications, and batch-labeling the classified data with the type to which it belongs;
Preprocessing the batch-labeled data and extracting features from it, establishing a data classification feature vector, selecting a convolutional neural network, training a data classification recognition model on the feature-extracted training data, and deploying the trained model into the system;
Setting a threat identification division for the data classifications, and establishing a corresponding threat identification model for each set data classification, which specifically includes the following steps:
Setting the risk threat classifications corresponding to the current data classification based on the historical risk investigation and handling results of that data classification;
Retrieving, from the training data corresponding to the current data classification, the data involved in the risk investigation results, marking that data as risk source data, and labeling all risk source data according to the set threat classifications;
Preprocessing the risk source data and extracting features from it, establishing a risk data threat classification feature vector, selecting a convolutional neural network, and training a special risk threat recognition model for the current data classification on the feature-extracted risk source data;
Combining the special risk threat recognition models corresponding to all data classifications and deploying the combined model into the system;
Setting corresponding response strategies based on how risk threats were handled historically, and establishing a risk threat-response strategy association;
Acquiring a data processing thread, performing data type recognition on the real-time data being processed in that thread through the data classification recognition model, selecting the corresponding special risk threat recognition model based on the data type recognition result to perform risk threat recognition, and performing automatic response strategy selection and data risk threat handling according to the risk threat recognition result and the risk threat-response strategy association.
Preferably, applying the differential privacy technique to queries on stored data, and adjusting the amount of noise in real time according to the data risk threat situation specifically includes:
Acquiring all table structures of the database, selecting the privacy statistics fields, and adjusting the set privacy budget in real time according to risk, using a Sigmoid function of the identification hit rate of the risk threat identification model within the currently set segmentation window, with the following expression:
;
where ε and ε' are the set privacy budget and the adjusted privacy budget, respectively, and f is the hit rate of the risk threat identification model;
Setting an update monitoring time window for the statistics field, obtaining the maximum insertion batch size of the privacy statistics field within the latest time window, and calculating the noise scale parameter in combination with the adjusted privacy budget ε';
Selecting a random value obeying the Laplace distribution, and generating Laplace noise from the noise scale and the selected random value, with the following expression:
;
where N is the generated Laplace noise, sgn(U) is the sign of the random value, U is the random value, ΔW is the set insertion batch size of the privacy statistics field, and λ is the noise scale parameter;
Applying the generated noise to queries on the privacy statistics field to increase the privacy of statistical data queries.
Preferably, screening the dynamic data for sensitive data and backing it up to a cache based on the abnormal data risk threat situation specifically includes:
Acquiring the data type identification result of the real-time data being processed by a data processing thread, setting the sensitive data classifications and the sensitive operation keywords, retrieving the data within the sensitive data classifications that matches the sensitive operation keywords, marking each whole data item containing a matched keyword as sensitive data, and calculating the proportion of sensitive data in the total processed data;
Acquiring the identification hit rate of the risk threat identification model within the currently set segmentation window, and calculating a sensitive data risk evaluation value as the weighted sum of the sensitive data proportion and the risk threat identification hit rate;
Setting a sensitive data risk evaluation value threshold, and storing all marked sensitive data into a cache for backup when the risk evaluation value is greater than or equal to the threshold;
If the processing thread is not abnormally interrupted while processing the current task, clearing the cache backup after the task finishes; if an abnormal interruption occurs, recording the breakpoint, restoring the cached backup data, and processing it again.
Preferably, detecting the current processing speed and the backlog of pending tasks, and automatically adjusting resource allocation in combination with the abnormal data threat risk specifically includes:
Detecting the processing speed and the backlog of pending tasks of all current data processing threads, and recalculating the idle evaluation values of all current data processing threads;
Acquiring the sensitive data risk evaluation values of all current data processing threads, taking the ratio of the idle evaluation value to the sensitive data risk evaluation value as the resource allocation tendency coefficient, calculating the average of the resource allocation tendency coefficients of all data processing threads, and taking that average as the thread processing resource allocation reference line;
Adjusting the computing resources and storage resources of each processing thread proportionally, based on the deviation of that thread's resource allocation tendency coefficient from the thread processing resource allocation reference line.
Further, a data security management system applied to a big data management platform is provided for implementing the above data security management method applied to the big data management platform, including:
The data acquisition and processing distribution module, which acquires the dynamic data set within the set time window, segments the dynamic data set according to a segmentation window size set from the remaining available resources of the server, calculates the idle evaluation values of the data processing threads, and adds the segments to the task waiting queues of the data processing threads in turn according to the calculation result;
The data type and risk threat identification module, which uses the historical processing data of the system platform as training data to train the data classification recognition model, trains a special risk threat recognition model for each data classification based on the historical risk investigation and handling results of that data classification, establishes the risk threat-response strategy association, performs type recognition and risk threat recognition on the data currently being processed, and performs automatic response strategy selection and data risk threat handling;
The database query security module, which adjusts the set privacy budget in real time according to the identification hit rate of the risk threat identification model within the currently set segmentation window, calculates the noise scale parameter from the insertion batch size of the privacy statistics field and the adjusted privacy budget, then selects a random value, generates Laplace noise from the noise scale and the selected random value, and applies the generated noise to queries on the privacy statistics field;
The automatic sensitive data backup module, which calculates the sensitive data risk evaluation value as the weighted sum of the proportion of sensitive data in the total processed data and the identification hit rate of the risk threat identification model within the currently set segmentation window, and stores all marked sensitive data into a cache for backup when the risk evaluation value is greater than or equal to the sensitive data risk evaluation value threshold;
The system resource allocation adaptive adjustment module, which takes the average, over all data processing threads, of the ratio of the idle evaluation value to the sensitive data risk evaluation value as the thread processing resource allocation reference line, and adjusts the computing resources and storage resources of each processing thread proportionally based on the deviation of that thread's resource allocation tendency coefficient from the reference line.
Compared with the prior art, the invention has the beneficial effects that:
The invention acquires a real-time dynamic big data set, segments it, and adds the segments to idle data processing queues; establishes a data threat identification model, detects abnormal data risk threats on the platform in real time, and takes automatic countermeasures based on the threat type; applies a differential privacy technique to queries on stored data and adjusts the amount of noise at query time in real time according to the data risk threat situation; screens the dynamic data for sensitive data and backs it up to a cache based on the abnormal data risk threat situation; and detects the current processing speed and the backlog of pending tasks, automatically adjusting resource allocation in combination with the abnormal data threat risk.
The resulting data security management method covers the complete data processing life cycle, namely data processing distribution, data threat identification, secure database querying, automatic sensitive data backup, and adaptive resource allocation adjustment. It supports efficient protection and management of massive data and safeguards the security and privacy of data during transmission, storage, and processing.
Detailed Description
The following description is presented to enable one of ordinary skill in the art to make and use the invention. The preferred embodiments in the following description are by way of example only and other obvious variations will occur to those skilled in the art.
Referring to fig. 1, the data security management method applied to a big data management platform includes:
Acquiring a real-time dynamic big data set, segmenting the data set, and adding the segments to idle data processing queues;
Establishing a data threat identification model, detecting abnormal data risk threats on the platform in real time, and taking automatic response measures based on the threat type;
Applying a differential privacy technique to queries on stored data, and adjusting the amount of noise added at query time in real time according to the data risk threat situation;
Screening the dynamic data for sensitive data and backing it up to a cache based on the abnormal data risk threat situation;
Detecting the current processing speed and the backlog of pending tasks, and automatically adjusting resource allocation in combination with the abnormal data threat risk.
Referring to fig. 2, acquiring the real-time dynamic big data set, segmenting the data set, and adding the segments to idle data processing queues specifically includes:
Acquiring the dynamic data set within a set time window, setting the size of the data processing segmentation window based on the remaining available resources of the server, and segmenting the dynamic data set according to the segmentation window size;
During segmentation, each data segment must be no larger than the set segmentation window size, so as to ensure the integrity of the segmented data.
Acquiring the load of every data processing thread and the task backlog of its data processing task queue, setting an initial load weight for each thread, and taking the product of the load weight and the ratio of the thread's task backlog to its processing speed as the idle evaluation value of that data processing thread;
Sorting the idle evaluation values of all processing threads in reverse order, and adding the segmented dynamic data to the task waiting queues of the data processing threads in turn according to the sorted result.
This process distributes processing tasks in order according to each thread's remaining processing capacity, which helps avoid some threads being overloaded while others sit idle, and improves system resource utilization and load balance during data processing.
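The segmentation and dispatch steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the `Worker` class, its field names, and the round-robin hand-out over the sorted threads are hypothetical, and "reverse order" is read here as descending order of the idle evaluation value (exposed as a parameter, since the text does not fix the direction further).

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class Worker:
    """Hypothetical stand-in for one data processing thread."""
    name: str
    load_weight: float   # set initial load weight
    backlog: int         # tasks waiting in this thread's queue
    rate: float          # tasks processed per unit time
    queue: List[list] = field(default_factory=list)

    def idle_value(self) -> float:
        # Idle evaluation value = (task backlog / processing rate) * load weight.
        if self.rate <= 0:
            raise ValueError("processing rate must be positive")
        return self.backlog / self.rate * self.load_weight


def split(records: Sequence, window: int) -> List[list]:
    """Cut records into segments no larger than the segmentation window."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [list(records[i:i + window]) for i in range(0, len(records), window)]


def dispatch(records: Sequence, window: int, workers: List[Worker],
             descending: bool = True) -> List[Worker]:
    """Sort workers by idle evaluation value, then hand out segments in turn."""
    if not workers:
        raise ValueError("at least one worker is required")
    # Assumption: "reverse order" means descending order of the idle value.
    ranked = sorted(workers, key=lambda w: w.idle_value(), reverse=descending)
    for i, segment in enumerate(split(records, window)):
        ranked[i % len(ranked)].queue.append(segment)
    return ranked
```

With three workers whose idle evaluation values are 2.0, 0.5, and 3.0 and ten records cut with a window of 4, the three segments go to the workers in the order of the sorted values.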
Referring to fig. 3, establishing the data threat identification model, detecting abnormal data risk threats on the platform in real time, and taking automatic response measures based on the threat type specifically includes:
Acquiring historical processing data of the system platform as training data, classifying the training data according to the set data classifications, and batch-labeling the classified data with the type to which it belongs;
Preprocessing the batch-labeled data and extracting features from it, establishing a data classification feature vector, selecting a convolutional neural network, training a data classification recognition model on the feature-extracted training data, and deploying the trained model into the system;
Establishing the data classification feature vector involves multi-dimensional feature extraction, including data generation frequency domain features, data length features, and data regularity features;
Setting a threat identification division for the data classifications, and establishing a corresponding threat identification model for each set data classification, which specifically includes the following steps:
Setting the risk threat classifications corresponding to the current data classification based on the historical risk investigation and handling results of that data classification;
Retrieving, from the training data corresponding to the current data classification, the data involved in the risk investigation results, marking that data as risk source data, and labeling all risk source data according to the set threat classifications;
Preprocessing the risk source data and extracting features from it, establishing a risk data threat classification feature vector, selecting a convolutional neural network, and training a special risk threat recognition model for the current data classification on the feature-extracted risk source data;
Establishing the risk data threat classification feature vector involves multi-dimensional feature extraction, including data generation frequency domain features, risk threat data keyword similarity features, data length features, and data regularity features;
Combining the special risk threat recognition models corresponding to all data classifications and deploying the combined model into the system;
Setting corresponding response strategies based on how risk threats were handled historically, and establishing a risk threat-response strategy association;
Acquiring a data processing thread, performing data type recognition on the real-time data being processed in that thread through the data classification recognition model, selecting the corresponding special risk threat recognition model based on the data type recognition result to perform risk threat recognition, and performing automatic response strategy selection and data risk threat handling according to the risk threat recognition result and the risk threat-response strategy association, thereby realizing an automated flow of classification recognition, risk threat recognition, and response handling for the data.
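The classification, threat recognition, and response flow described above amounts to a two-stage dispatch. The sketch below shows only that control flow; the trained convolutional models are replaced by plain rule-based stand-ins, and every name in it (`make_pipeline`, `classify`, `sql_threats`, the `"kind"` and `"payload"` keys) is a hypothetical chosen for illustration.

```python
from typing import Callable, Dict, Optional, Tuple

Record = Dict[str, str]
Result = Tuple[str, Optional[str], Optional[str]]


def make_pipeline(classify: Callable[[Record], str],
                  threat_models: Dict[str, Callable[[Record], str]],
                  responses: Dict[str, Callable[[Record], str]]
                  ) -> Callable[[Record], Result]:
    """Chain type recognition, per-type threat recognition, and response lookup."""
    def handle(record: Record) -> Result:
        data_type = classify(record)              # stage 1: data classification
        model = threat_models.get(data_type)      # pick the model for that type
        if model is None:
            return data_type, None, None          # no model registered for type
        threat = model(record)                    # stage 2: threat recognition
        action = responses.get(threat)            # threat-response association
        outcome = action(record) if action else None
        return data_type, threat, outcome
    return handle


# Rule-based stand-ins for the trained models (assumptions, not CNNs).
def classify(record: Record) -> str:
    return record.get("kind", "unknown")


def sql_threats(record: Record) -> str:
    return "injection" if "drop table" in record["payload"].lower() else "none"


pipeline = make_pipeline(
    classify,
    threat_models={"sql": sql_threats},
    responses={"injection": lambda record: "blocked"},
)
```

Keeping the threat models and the response strategies in separate lookup tables mirrors the text's separation between the per-classification models and the risk threat-response strategy association, so either table can be extended without touching the other.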
Referring to fig. 4, applying the differential privacy technique to queries on stored data, and adjusting the amount of noise at query time in real time according to the data risk threat situation specifically includes:
Acquiring all table structures of the database, selecting the privacy statistics fields, and adjusting the set privacy budget in real time according to risk, using a Sigmoid function of the identification hit rate of the risk threat identification model within the currently set segmentation window, with the following expression:
;
where ε and ε' are the set privacy budget and the adjusted privacy budget, respectively, and f is the hit rate of the risk threat identification model;
A higher risk threat identification hit rate indicates a higher system security risk; in that case, reducing the privacy budget increases the noise scale applied at query time, so that statistical queries are better protected.
Setting an update monitoring time window for the statistics field, obtaining the maximum insertion batch size of the privacy statistics field within the latest time window, and calculating the noise scale parameter in combination with the adjusted privacy budget ε';
Selecting a random value obeying the Laplace distribution, and generating Laplace noise from the noise scale and the selected random value, with the following expression:
;
where N is the generated Laplace noise, sgn(U) is the sign of the random value, U is the random value, ΔW is the set insertion batch size of the privacy statistics field, and λ is the noise scale parameter;
Applying the generated noise to queries on the privacy statistics field to increase the privacy of statistical data queries; specifically, the calculated noise value is added to the true value of the statistical query to obtain an approximate statistical value.
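The expressions for the adjusted budget and the noise are not reproduced in this text, so the sketch below fills them with explicitly assumed forms that agree with the symbol definitions and with the stated behavior (a higher hit rate f gives a smaller ε' and therefore more noise). The assumptions are: ε' = 2ε / (1 + e^(k·f)) with a hypothetical steepness k, λ = ΔW / ε', and standard inverse-transform Laplace sampling N = -λ · sgn(U) · ln(1 - 2|U|) with U drawn uniformly from (-0.5, 0.5). None of these forms should be read as the document's own formulas.

```python
import math
import random


def adjust_budget(epsilon: float, hit_rate: float, steepness: float = 4.0) -> float:
    """Assumed Sigmoid-style adjustment: eps' = 2*eps / (1 + exp(k*f)).

    Equals eps when the hit rate f is 0 and shrinks as f grows.
    """
    return 2.0 * epsilon / (1.0 + math.exp(steepness * hit_rate))


def noise_scale(batch_size: float, epsilon_adj: float) -> float:
    """Assumed noise scale: lambda = deltaW / eps'."""
    if epsilon_adj <= 0:
        raise ValueError("adjusted privacy budget must be positive")
    return batch_size / epsilon_adj


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Inverse-transform sampling: N = -lambda * sgn(U) * ln(1 - 2|U|)."""
    u = rng.random() - 0.5
    while abs(u) >= 0.5:            # avoid log(0) at the interval boundary
        u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))


def noisy_query(true_value: float, epsilon: float, hit_rate: float,
                batch_size: float, rng: random.Random) -> float:
    """Return the true statistic plus Laplace noise (approximate statistic)."""
    scale = noise_scale(batch_size, adjust_budget(epsilon, hit_rate))
    return true_value + laplace_noise(scale, rng)
```

Passing a seeded `random.Random` makes the sketch reproducible for testing; a deployment would need a cryptographically appropriate noise source, which is outside the scope of this illustration.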
Referring to fig. 5, screening the dynamic data for sensitive data and backing it up to a cache based on the abnormal data risk threat situation specifically includes:
Acquiring the data type identification result of the real-time data being processed by a data processing thread, setting the sensitive data classifications and the sensitive operation keywords, retrieving the data within the sensitive data classifications that matches the sensitive operation keywords, marking each whole data item containing a matched keyword as sensitive data, and calculating the proportion of sensitive data in the total processed data;
Acquiring the identification hit rate of the risk threat identification model within the currently set segmentation window, and calculating a sensitive data risk evaluation value as the weighted sum of the sensitive data proportion and the risk threat identification hit rate;
Setting a sensitive data risk evaluation value threshold, and storing all marked sensitive data into a cache for backup when the risk evaluation value is greater than or equal to the threshold;
The sensitive data risk evaluation value threshold is set by a person skilled in the art according to the risk threat situation and the specific security requirements of the real-time data processing, and is not described in detail here.
When a thread processes data, an uncaught exception terminates the thread, which is one of the most common causes of thread interruption. If the data formats that the system can process are leaked, an outside party can attack the system by submitting large amounts of data in non-compliant formats together with risk data. The data processing threads are then interrupted frequently, and some important sensitive data operations may be lost in the process.
If the processing thread is not abnormally interrupted while processing the current task, the cache backup is cleared after the task finishes; if an abnormal interruption occurs, the breakpoint is recorded, and the cached backup data is restored and processed again.
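The backup decision and the clear-or-recover behavior above can be sketched as follows. This is an illustration under assumptions: the equal weights in `risk_value`, the string records, and the convention of returning the breakpoint index while leaving the cache populated for a later recovery pass are all hypothetical choices, since the text leaves the weights and the recovery mechanics to the implementer.

```python
from typing import Callable, List, Optional, Sequence


def risk_value(sensitive_ratio: float, hit_rate: float,
               w_ratio: float = 0.5, w_hit: float = 0.5) -> float:
    """Weighted sum of the sensitive data proportion and the hit rate.

    The weights are placeholders; the text does not specify them.
    """
    return w_ratio * sensitive_ratio + w_hit * hit_rate


def process_with_backup(records: Sequence[str],
                        is_sensitive: Callable[[str], bool],
                        hit_rate: float,
                        threshold: float,
                        handle: Callable[[str], None],
                        cache: List[str]) -> Optional[int]:
    """Process records; return None on success or the breakpoint index."""
    flagged = [r for r in records if is_sensitive(r)]
    ratio = len(flagged) / len(records) if records else 0.0
    if risk_value(ratio, hit_rate) >= threshold:
        cache.extend(flagged)        # back up all marked sensitive data
    for index, record in enumerate(records):
        try:
            handle(record)
        except Exception:
            # Abnormal interruption: record the breakpoint and keep the
            # cache so the backed-up data can be restored and reprocessed.
            return index
    cache.clear()                    # task finished normally: drop the backup
    return None
```

A caller that receives a breakpoint index can replay the cached records, or resume from that index, before clearing the cache itself.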
Referring to fig. 6, detecting the current processing speed and the backlog of pending tasks, and automatically adjusting resource allocation in combination with the abnormal data threat risk specifically includes:
Detecting the processing speed and the backlog of pending tasks of all current data processing threads, and recalculating the idle evaluation values of all current data processing threads;
Acquiring the sensitive data risk evaluation values of all current data processing threads, taking the ratio of the idle evaluation value to the sensitive data risk evaluation value as the resource allocation tendency coefficient, calculating the average of the resource allocation tendency coefficients of all data processing threads, and taking that average as the thread processing resource allocation reference line;
When a thread's idle evaluation value increases and its sensitive data risk evaluation value decreases, the thread has ample remaining resources and does not need more resources for processing risk data, so part of its idle resources can be reclaimed. Conversely, when a thread's idle evaluation value decreases and its sensitive data risk evaluation value increases, the thread is short of remaining resources and needs more system resources to maintain the processing efficiency of risk data, so more resources should be allocated to it.
Based on the deviation of each processing thread's resource allocation tendency coefficient from the thread processing resource allocation reference line, the computing resources and storage resources of each processing thread are adjusted proportionally, achieving a dynamically balanced allocation of data processing resources.
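One way to turn the reference line and the deviations into concrete adjustments is sketched below. The text fixes the coefficient (idle evaluation value divided by sensitive data risk evaluation value) and the reference line (the mean coefficient), but not the adjustment rule itself; the multiplicative factor `1 + gain * (baseline - coefficient) / baseline`, the `gain` and `min_factor` parameters, and the floor on the risk value are assumptions made for illustration.

```python
from statistics import mean
from typing import Dict, Tuple


def tendency(idle_value: float, risk: float, floor: float = 1e-6) -> float:
    """Resource allocation tendency coefficient = idle value / risk value.

    The floor guards against a zero risk value (an assumption).
    """
    return idle_value / max(risk, floor)


def rebalance(threads: Dict[str, Tuple[float, float]],
              resources: Dict[str, float],
              gain: float = 0.5,
              min_factor: float = 0.1) -> Dict[str, float]:
    """Scale each thread's resources by its deviation from the reference line.

    threads maps a thread name to (idle evaluation value, risk evaluation value).
    A coefficient above the reference line shrinks the allocation (resources
    are reclaimed); one below it grows the allocation.
    """
    coefs = {name: tendency(idle, risk) for name, (idle, risk) in threads.items()}
    baseline = mean(coefs.values())   # thread processing resource reference line
    if baseline == 0:
        return dict(resources)        # nothing to compare against
    adjusted = {}
    for name, coef in coefs.items():
        factor = max(min_factor, 1.0 + gain * (baseline - coef) / baseline)
        adjusted[name] = resources[name] * factor
    return adjusted
```

For coefficients of 4.0, 2.0, and 3.0 the reference line is 3.0, so with the default gain the first thread's allocation shrinks by one sixth, the second grows by one sixth, and the third is unchanged. The same factor would be applied to computing and storage resources alike.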
Further, referring to fig. 7, based on the same inventive concept as the data security management method applied to the big data management platform, the present disclosure further proposes a data security management system applied to the big data management platform, including:
The data acquisition and processing distribution module, which acquires the dynamic data set within the set time window, segments the dynamic data set according to a segmentation window size set from the remaining available resources of the server, calculates the idle evaluation values of the data processing threads, and adds the segments to the task waiting queues of the data processing threads in turn according to the calculation result;
The data type and risk threat identification module, which uses the historical processing data of the system platform as training data to train the data classification recognition model, trains a special risk threat recognition model for each data classification based on the historical risk investigation and handling results of that data classification, establishes the risk threat-response strategy association, performs type recognition and risk threat recognition on the data currently being processed, and performs automatic response strategy selection and data risk threat handling;
The database query security module, which adjusts the set privacy budget in real time according to the identification hit rate of the risk threat identification model within the currently set segmentation window, calculates the noise scale parameter from the insertion batch size of the privacy statistics field and the adjusted privacy budget, then selects a random value, generates Laplace noise from the noise scale and the selected random value, and applies the generated noise to queries on the privacy statistics field;
The automatic sensitive data backup module, which calculates the sensitive data risk evaluation value as the weighted sum of the proportion of sensitive data in the total processed data and the identification hit rate of the risk threat identification model within the currently set segmentation window, and stores all marked sensitive data into a cache for backup when the risk evaluation value is greater than or equal to the sensitive data risk evaluation value threshold;
The system resource allocation adaptive adjustment module, which takes the average, over all data processing threads, of the ratio of the idle evaluation value to the sensitive data risk evaluation value as the thread processing resource allocation reference line, and adjusts the computing resources and storage resources of each processing thread proportionally based on the deviation of that thread's resource allocation tendency coefficient from the reference line.
Still further, the present solution also proposes a storage medium for the data security management method applied to the big data management platform, on which a computer readable program is stored; when the computer readable program is called, the above data security management method applied to the big data management platform is executed.
It is understood that the storage medium may be a magnetic medium such as a floppy disk, a hard disk, or a magnetic tape, an optical medium such as a DVD, or a semiconductor medium such as a solid state disk (SSD).
The foregoing has shown and described the basic principles, principal features and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, that the above embodiments and descriptions merely illustrate the principles of the present invention, and that various changes and modifications may be made therein without departing from the spirit and scope of the invention. The scope of the invention is defined by the appended claims and equivalents thereof.