CN119293670A - Data security management method and system applied to big data management platform

Info

Publication number: CN119293670A
Application number: CN202411806548.5A
Authority: CN
Inventors: 相苗苗; 任菲菲; 耿梓嫣
Original assignee: Nanjing Tongliyu Technology Co ltd
Current assignee: Nanjing Tongliyu Technology Co ltd
Priority date: 2024-12-10
Filing date: 2024-12-10
Publication date: 2025-01-10
Anticipated expiration: 2044-12-10
Also published as: CN119293670B

Abstract

The present invention discloses a data security management method and system applied to a big data management platform, which relates to the field of data security management, including: obtaining a real-time dynamic data set of big data, dividing the data set and adding it to an idle data processing queue; establishing a data threat identification model, detecting abnormal data risk threats of the platform in real time, and taking automated response measures based on the threat type; using differential privacy technology to query the data that has been stored in the database, and adjusting the noise amount during the query in real time according to the data risk threat situation; performing sensitive data screening and cache backup on dynamic data based on the abnormal data risk threat situation; detecting the current processing speed and the accumulation of tasks to be processed, and automatically adjusting resource allocation in combination with the abnormal data threat risk.

Description

Data security management method and system applied to big data management platform

Technical Field

The invention relates to the field of data security management, in particular to a data security management method and system applied to a big data management platform.

Background

With the rapid development of big data technology, data security has become increasingly important. Traditional data security management methods struggle to cope with the complex security threats of big data environments, such as data leakage, data tampering, and risky access. The prior art relies primarily on encryption and access control, but the performance and adaptability of these methods are challenged in big data environments. In addition, traditional data security management methods generally address the data security problem of a single link in the data processing flow and lack data security management across the whole flow.

Therefore, a problem to be solved is how to realize a data security management method that covers the complete data processing life cycle: data processing distribution, data threat identification, secure database query, automatic sensitive data backup, and adaptive adjustment of resource allocation.

Disclosure of Invention

To solve the above technical problems, a data security management method applied to a big data management platform is provided; the technical scheme addresses the problems set out in the Background.

In order to achieve the above purpose, the invention adopts the following technical scheme:

The data security management method applied to the big data management platform comprises the following steps:

Acquiring a big data real-time dynamic data set, dividing the data set, and adding the divided data set into an idle data processing queue;

Establishing a data threat identification model, detecting abnormal data risk threats of the platform in real time, and taking automatic treatment measures based on threat types;

Applying a differential privacy technology to queries on stored data, and adjusting the amount of noise added during a query in real time according to the data risk threat situation;

Sensitive data screening and cache backup are carried out on the dynamic data based on abnormal data risk threat conditions;

Detecting the current processing speed and the backlog of tasks to be processed, and automatically adjusting resource allocation in combination with the abnormal data threat risk.

Preferably, the acquiring the big data real-time dynamic data set, dividing the data set, and adding the divided data set into an idle data processing queue specifically includes:

Acquiring a dynamic data set in a set time window, setting the size of a data processing segmentation window based on the residual available resources of a server, and segmenting the dynamic data set according to the size of the segmentation window;

Acquiring the load conditions of all data processing threads and the task backlog of the corresponding data processing task queues, and calculating the idle evaluation value of each data processing thread by combining the set initial load weight of the thread with the ratio of its task backlog to its processing rate;

After sorting the idle evaluation values of all processing threads in reverse order, selecting the segmented dynamic data and adding it sequentially to the task waiting queues of the data processing threads according to the sorting result.

Preferably, the establishing a data threat identification model, detecting the abnormal data risk threat of the platform in real time, and taking automatic treatment measures based on the threat type specifically includes:

Acquiring historical processing data of the system platform as training data, classifying the training data according to the set data classifications, and batch-labeling the classified data with its type;

Preprocessing and feature extraction are carried out on the data subjected to batch marking, a data classification feature vector is established, a convolutional neural network is selected, a data classification recognition model is trained based on training data subjected to feature extraction, and the trained model is deployed into a system;

Setting the threat identification division for each data classification, and establishing a corresponding threat identification model for each set data classification, specifically comprising:

setting a risk threat classification corresponding to the data classification based on a historical risk classification processing result of the current data classification;

retrieving data specifically related to risk investigation results in training data corresponding to the current data classification, marking the data as risk source data, and attributing and marking all the risk source data based on set threat classification;

Preprocessing and feature extraction are carried out on risk source data, a risk data threat classification feature vector is established, a convolutional neural network is selected, and a special risk threat recognition model of current data classification is trained based on the feature extracted risk source data;

Combining special risk threat identification models corresponding to all data classifications and then deploying the combined special risk threat identification models into a system;

setting a corresponding response strategy based on a historical risk threat processing mode, and establishing a risk threat-response strategy association relation;

Acquiring a data processing thread, carrying out data type recognition on real-time data being processed in the data processing thread through a data classification recognition model, selecting a corresponding special risk threat recognition model to carry out risk threat recognition based on a data type recognition result, and carrying out automatic response strategy selection and data risk threat processing according to the risk threat recognition result and a risk threat-response strategy association relationship.

Preferably, the querying the data in storage by adopting a differential privacy technology, and adjusting the noise amount in real time according to the threat situation of the data risk specifically includes:

Acquiring all table structures of a database, selecting a privacy statistical field, and performing risk real-time adjustment on a set privacy budget by using a Sigmoid function based on the identification hit rate of a risk threat identification model in a currently set segmentation window, wherein the specific expression is as follows:

[formula for the adjusted privacy budget ε' in terms of ε and f not reproduced in the source text];

wherein ε and ε' are the set privacy budget and the adjusted privacy budget, respectively, and f is the identification hit rate of the risk threat identification model;

Setting the update monitoring window time of the statistics field, obtaining the maximum insertion batch size of the privacy statistics field within the latest time window, and calculating the noise scale parameter in combination with the adjusted privacy budget ε';

Selecting a random value obeying Laplace distribution, generating Laplace noise according to the noise scale and the selected random value, wherein the specific expression is as follows:

[formula for the Laplace noise N not reproduced in the source text];

wherein N is the generated Laplace noise, sgn(U) is the sign of the random value, U is the random value, ΔW is the set insertion batch size of the privacy statistics field, and λ is the noise scale parameter;

and applying the generated noise to the query process of the privacy statistics field to increase the privacy of the statistics data query.

Preferably, the sensitive data screening and cache backup of the dynamic data based on the abnormal data risk threat situation specifically includes:

Acquiring a data type identification result of real-time data being processed by a data processing thread, setting sensitive data classification and sensitive operation keywords, retrieving data matched with the sensitive operation keywords in the sensitive data classification, marking the whole data corresponding to the keywords as sensitive data, and calculating the proportion of the sensitive data to the total processed data;

Acquiring the identification hit rate of the risk threat identification model within the currently set segmentation window, computing a weighted sum of the sensitive data proportion and the risk threat hit rate, and taking the result as the sensitive data risk evaluation value;

Setting a sensitive data risk evaluation value threshold, and storing all marked sensitive data into a cache for backup when the risk evaluation value is greater than or equal to the sensitive data risk evaluation value threshold;

If the processing thread is not abnormally interrupted during the current task, the cache backup is cleared after the task finishes; if an abnormal interruption occurs, the breakpoint is recorded, and the cached backup data is restored and reprocessed.

Preferably, the detecting the current processing speed and the accumulation of tasks to be processed, and automatically adjusting the resource allocation in combination with the threat risk of the abnormal data specifically includes:

detecting the processing speed of all current data processing threads and accumulation of tasks to be processed, and recalculating idle evaluation values of all current data processing threads;

Acquiring sensitive data risk assessment values of all current data processing threads, taking the ratio of the idle assessment value to the sensitive data risk assessment value as a resource allocation tendency coefficient, calculating the average value of the resource allocation tendency coefficients of all the data processing threads, and marking the average value as a thread processing resource allocation reference line;

the computing resources and the storage resources of each processing thread are proportionally adjusted based on the deviation of the resource allocation tendency coefficient of each processing thread and the thread processing resource allocation reference line.

Further, a data security management system applied to a big data management platform is provided, for implementing the data security management method applied to the big data management platform, including:

the data acquisition and processing distribution module, which acquires the dynamic data set within a set time window, divides the dynamic data set according to a segmentation window size set from the remaining available resources of the server, calculates the idle evaluation values of the data processing threads, and sequentially adds the segmented data to the task waiting queues of the data processing threads according to the calculation result;

the data type and risk threat identification module, which uses the historical processing data of the system platform as training data to train a data classification recognition model, trains a special risk threat recognition model for each data classification based on the historical risk investigation results of that classification, establishes the risk threat-response strategy association relation, identifies the type and risk threat of the data currently being processed, and performs automatic response strategy selection and data risk threat processing;

The database query safety module carries out risk real-time adjustment on the set privacy budget according to the identification hit rate of the risk threat identification model in the currently set segmentation window, calculates noise scale parameters based on the insertion batch size of the privacy statistic field and the adjusted privacy budget, then selects a random value, generates Laplace noise according to the noise scale and the selected random value, and applies the generated noise to the query process of the privacy statistic field;

The automatic sensitive data backup module calculates a sensitive data risk assessment value according to the proportion of the sensitive data to the total processing data and the weighted sum result of the identification hit rate of the risk threat identification model in the currently set segmentation window, and stores all marked sensitive data into a cache for backup when the risk assessment value is greater than or equal to the sensitive data risk assessment value threshold;

The system resource allocation self-adaptive adjustment module, which takes the average of the ratios of the idle evaluation value to the sensitive data risk evaluation value over all data processing threads as the thread processing resource allocation reference line, and proportionally adjusts the computing resources and storage resources of each processing thread based on the deviation between that thread's resource allocation tendency coefficient and the reference line.

Compared with the prior art, the invention has the beneficial effects that:

The method acquires a real-time dynamic data set of big data, divides it, and adds it to an idle data processing queue; establishes a data threat identification model, detects abnormal data risk threats of the platform in real time, and takes automatic countermeasures based on threat type; applies differential privacy to queries on stored data and adjusts the noise amount in real time according to the data risk threat situation; screens sensitive data and backs it up to a cache based on the abnormal data risk threat situation; and detects the current processing speed and the backlog of tasks to be processed, automatically adjusting resource allocation in combination with the abnormal data threat risk.

A data security management method is thus realized that covers the complete data processing life cycle of data processing distribution, data threat identification, secure database query, automatic sensitive data backup, and adaptive adjustment of resource allocation. It meets the need for efficient protection and management of massive data and ensures the security and privacy of data during transmission, storage, and processing.

Drawings

FIG. 1 is a flow chart of a data security management method applied to a big data management platform of the present invention;

FIG. 2 is a flow chart of the present invention for partitioning a data set and adding it to an idle data processing queue;

FIG. 3 is a flow chart of automated countermeasure based on threat types in the present invention;

FIG. 4 is a flow chart of the invention for real-time adjustment of the amount of noise in a query;

FIG. 5 is a flow chart of sensitive data screening and cache backup for dynamic data according to the present invention;

FIG. 6 is a flow chart of the adjustment of resource allocation in the present invention;

FIG. 7 is a schematic structural diagram of a data security management system applied to a big data management platform according to the present invention.

Detailed Description

The following description is presented to enable one of ordinary skill in the art to make and use the invention. The preferred embodiments in the following description are by way of example only and other obvious variations will occur to those skilled in the art.

Referring to fig. 1, the data security management method applied to a big data management platform includes:

Referring to fig. 2, a real-time dynamic data set of big data is acquired, and the data set is divided and then added into an idle data processing queue.

During segmentation, each data segment must be no larger than the set segmentation window size, so as to ensure the integrity of the segmented data.
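As a rough illustration of the segmentation step, the Python sketch below derives a window size from the server's free memory and cuts the data set into segments no larger than that window. The sizing rule in `choose_window_size` (a fraction of free memory divided by an estimated record size, with an upper cap), together with all names and default values, is an assumption made for illustration; the description only states that the window size is set from the remaining available resources.

```python
from typing import List, Sequence


def choose_window_size(free_memory_bytes: int, record_bytes: int,
                       usable_fraction: float = 0.5,
                       max_records: int = 10_000) -> int:
    """Hypothetical sizing rule: spend a fraction of free memory on one segment."""
    usable = int(free_memory_bytes * usable_fraction)
    return max(1, min(max_records, usable // max(1, record_bytes)))


def split_dataset(records: Sequence, window_size: int) -> List[list]:
    """Cut the data set into consecutive segments of at most window_size records."""
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    return [list(records[i:i + window_size])
            for i in range(0, len(records), window_size)]


if __name__ == "__main__":
    size = choose_window_size(free_memory_bytes=1000, record_bytes=100)
    print(size, split_dataset(list(range(12)), size))
```

Every segment except possibly the last has exactly `window_size` records, so no segment exceeds the window and no record is dropped or duplicated.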

The load conditions of all data processing threads and the task backlog of the corresponding data processing task queues are acquired. An initial load weight is set for each thread, and the product of that weight and the ratio of the thread's task backlog to its processing speed is taken as the idle evaluation value of the data processing thread;

This process distributes processing tasks in order according to each thread's remaining processing capacity, which helps avoid overloading some threads while others sit idle, and improves system resource utilization and load balance in the data processing flow.
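The calculation above can be sketched in Python as follows. The idle evaluation value is computed as stated: load weight multiplied by the ratio of task backlog to processing speed. The `Worker` class, the single ranking pass, and the round-robin hand-out over the ranked list are assumptions of this sketch; the sort direction is exposed as a parameter because the text specifies only a reverse-order sort.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class Worker:
    """Hypothetical stand-in for one data processing thread."""
    name: str
    load_weight: float      # set initial load weight
    backlog: int            # tasks waiting in the thread's queue
    rate: float             # tasks processed per second
    queue: List[list] = field(default_factory=list)


def idle_evaluation(worker: Worker) -> float:
    """Load weight multiplied by (task backlog / processing speed)."""
    if worker.rate <= 0:
        raise ValueError("processing rate must be positive")
    return worker.load_weight * (worker.backlog / worker.rate)


def dispatch(segments: Sequence[list], workers: Sequence[Worker],
             descending: bool = True) -> List[Worker]:
    """Rank the workers once, then hand out segments in ranked order."""
    if not workers:
        raise ValueError("at least one worker is required")
    ranked = sorted(workers, key=idle_evaluation, reverse=descending)
    for i, segment in enumerate(segments):
        target = ranked[i % len(ranked)]
        target.queue.append(segment)
        target.backlog += 1
    return ranked


if __name__ == "__main__":
    pool = [Worker("a", 1.0, 4, 2.0), Worker("b", 1.0, 1, 2.0)]
    order = dispatch([[1], [2], [3]], pool)
    print([w.name for w in order], [w.queue for w in pool])
```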

Referring to fig. 3, a data threat identification model is established, abnormal data risk threats of the platform are detected in real time, and automatic treatment measures are adopted based on threat types.

Establishing the data classification feature vector involves multi-dimensional feature extraction, including data generation frequency domain features, data length features, and data regularity features;

Establishing the risk data threat classification feature vector likewise involves multi-dimensional feature extraction, including data generation frequency domain features, risk threat data keyword similarity features, data length features, and data regularity features;

A data processing thread is acquired, and the type of the real-time data being processed in it is recognized by the data classification recognition model. Based on the data type recognition result, the corresponding special risk threat recognition model is selected to perform risk threat recognition, and automatic response strategy selection and data risk threat processing are carried out according to the recognition result and the risk threat-response strategy association relation. This realizes an automated flow of classification recognition, risk threat recognition, and response processing for the data.
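The control flow of this three-stage pipeline can be sketched in Python as below. The trained convolutional neural networks are replaced by plain callables, and the keyword rules, the data type and threat names, and the "pass" and "escalate" fallbacks are all hypothetical placeholders; only the order of the stages (type recognition, selection of the type-specific threat model, lookup in the threat-response association) follows the description.

```python
from typing import Callable, Dict, Optional

Classifier = Callable[[str], str]              # record -> data type
ThreatModel = Callable[[str], Optional[str]]   # record -> threat type or None
Response = Callable[[str], str]                # record -> action taken


def handle_record(record: str, classify: Classifier,
                  threat_models: Dict[str, ThreatModel],
                  responses: Dict[str, Response]) -> str:
    """Classify the record, run the matching threat model, apply the response."""
    data_type = classify(record)
    model = threat_models.get(data_type)
    if model is None:
        return "pass"            # no dedicated model for this data type
    threat = model(record)
    if threat is None:
        return "pass"            # no threat recognized
    action = responses.get(threat)
    if action is None:
        return "escalate"        # threat with no associated response strategy
    return action(record)


# Stub models standing in for the trained networks (assumptions, not the patent's models).
def classify(record: str) -> str:
    return "sql" if record.lower().startswith("select") else "log"


def sql_threats(record: str) -> Optional[str]:
    return "injection" if "' or 1=1" in record.lower() else None


THREAT_MODELS = {"sql": sql_threats}
RESPONSES = {"injection": lambda record: "blocked"}

if __name__ == "__main__":
    print(handle_record("SELECT * FROM t WHERE a='' OR 1=1",
                        classify, THREAT_MODELS, RESPONSES))
```

Keeping the threat models and responses in dictionaries keyed by data type and threat type mirrors the association relation in the text: adding a data classification or a response strategy means adding an entry rather than changing the dispatch logic.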

Referring to fig. 4, a differential privacy technology is adopted for the data query in storage, and the real-time adjustment of the noise amount during query according to the data risk threat situation specifically includes:

[formula for the adjusted privacy budget ε' in terms of ε and f not reproduced in the source text];

When the risk threat identification hit rate is higher, the system security risk is higher; reducing the privacy budget then increases the noise scale during data queries, so that statistical queries are better protected.

[formula for the Laplace noise N not reproduced in the source text];

The generated noise is applied to the query process of the privacy statistics field to increase the privacy of statistical data queries; specifically, the calculated noise value is added to the true value of the statistical query to obtain an approximate statistical value.
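A minimal Python sketch of a risk-adjusted Laplace mechanism is given below. The exact expressions are not available in this text, so three parts are assumptions: the budget adjustment ε' = 2ε(1 − σ(f)), which is only one Sigmoid-based form that leaves ε unchanged at f = 0 and shrinks it as f grows; the noise scale λ = ΔW / ε', the usual Laplace-mechanism scale with ΔW in the role of sensitivity; and the sampler N = −λ · sgn(U) · ln(1 − 2|U|) with U drawn uniformly from (−0.5, 0.5), which is the standard inverse-CDF method and matches the symbols defined in the text.

```python
import math
import random


def adjusted_budget(epsilon: float, hit_rate: float) -> float:
    """Assumed form: eps' = 2 * eps * (1 - sigmoid(f)); higher f gives smaller eps'."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must lie in [0, 1]")
    sigmoid = 1.0 / (1.0 + math.exp(-hit_rate))
    return 2.0 * epsilon * (1.0 - sigmoid)


def noise_scale(batch_size: float, eps_adjusted: float) -> float:
    """Assumed form: lambda = delta_W / eps'."""
    return batch_size / eps_adjusted


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Inverse-CDF sampling: N = -lambda * sgn(U) * ln(1 - 2|U|)."""
    u = rng.random() - 0.5
    while abs(u) >= 0.5:          # reject U = -0.5, which would give ln(0)
        u = rng.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))


def noisy_statistic(true_value: float, epsilon: float, hit_rate: float,
                    batch_size: float, rng: random.Random) -> float:
    """True statistic plus Laplace noise scaled by the risk-adjusted budget."""
    eps_adjusted = adjusted_budget(epsilon, hit_rate)
    return true_value + laplace_noise(noise_scale(batch_size, eps_adjusted), rng)


if __name__ == "__main__":
    rng = random.Random(7)
    print(noisy_statistic(1000.0, epsilon=1.0, hit_rate=0.3, batch_size=10.0, rng=rng))
```

Under these assumptions a higher hit rate lowers ε', which raises λ and therefore widens the noise, in line with the behavior described above.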

Referring to fig. 5, sensitive data screening and cache backup are performed on dynamic data based on abnormal data risk threat conditions.

The sensitive data risk assessment threshold is set by a person skilled in the art based on the risk threat condition and the specific security requirement of the real-time data processing, and is not described herein in detail.

When a thread is processing data, termination on an uncaught exception is one of the most common causes of thread interruption. If the data formats that the system can process are leaked, an outside party can attack the system by submitting large amounts of non-compliant and risky data; the data processing threads are then interrupted frequently, and some important sensitive data operations are lost in the process.
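The screening, scoring, and backup behavior can be sketched in Python as below. The equal weights in `risk_score`, the keyword matching rule, the in-memory list used as the cache, and the returned dictionary are assumptions for illustration. The sketch follows the described sequence: mark sensitive records, compute a weighted sum of the sensitive data proportion and the hit rate, back up when the score reaches the threshold, then clear the backup on normal completion or return it with the breakpoint after an interruption.

```python
from typing import Callable, Collection, Dict, List, Sequence


def mark_sensitive(records: Sequence[str], type_of: Callable[[str], str],
                   sensitive_types: Collection[str],
                   keywords: Collection[str]) -> List[str]:
    """Keep records of a sensitive type that contain a sensitive operation keyword."""
    return [r for r in records
            if type_of(r) in sensitive_types
            and any(k in r.lower() for k in keywords)]


def risk_score(sensitive_ratio: float, hit_rate: float,
               w_ratio: float = 0.5, w_hit: float = 0.5) -> float:
    """Weighted sum of the sensitive data proportion and the threat hit rate."""
    return w_ratio * sensitive_ratio + w_hit * hit_rate


def process_with_backup(records: Sequence[str], marked: Sequence[str],
                        hit_rate: float, threshold: float,
                        process: Callable[[str], None]) -> Dict[str, object]:
    """Back up marked records when risk is high; keep the backup only on failure."""
    ratio = len(marked) / len(records) if records else 0.0
    cache = list(marked) if risk_score(ratio, hit_rate) >= threshold else []
    done = 0
    try:
        for record in records:
            process(record)
            done += 1
    except Exception:
        # Abnormal interruption: report the breakpoint and the data to replay.
        return {"breakpoint": done, "replay": cache}
    cache.clear()   # normal completion: the backup is no longer needed
    return {"breakpoint": None, "replay": cache}
```

A caller would resume from `breakpoint` and reprocess the records in `replay`; how the restored data is merged back into the thread's queue is left open here.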

Referring to fig. 6, detecting the current processing speed and the task accumulation to be processed, and automatically adjusting the resource allocation in combination with the threat risk of the abnormal data specifically includes:

When a thread's idle evaluation value increases and its sensitive data risk evaluation value decreases, the thread has sufficient remaining resources and needs no additional resources for risk data processing, so part of its idle resources can be reclaimed. Conversely, when the idle evaluation value decreases and the sensitive data risk evaluation value increases, the thread is short of remaining resources and needs more system resources to maintain risk data processing efficiency, so more resources should be allocated to it.

Based on the deviation between the resource allocation tendency coefficient of each processing thread and the thread processing resource allocation reference line, the computing resource and the storage resource of each processing thread are proportionally adjusted, and the dynamic allocation balance of the data processing resources is realized.
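One way to read this adjustment is sketched in Python below. The tendency coefficient (idle evaluation value divided by sensitive data risk evaluation value) and the reference line (the mean coefficient) follow the description. The scaling rule, in which each thread's resources are multiplied by 1 − step × relative deviation with a floor of 0.1, and the dictionary layout are assumptions of this sketch. Threads above the reference line give resources back and threads below it receive more, matching the direction described above.

```python
from typing import Dict


def rebalance(threads: Dict[str, Dict[str, float]],
              step: float = 0.5) -> Dict[str, Dict[str, float]]:
    """Scale each thread's cpu and storage by its deviation from the mean coefficient."""
    if not threads:
        raise ValueError("at least one thread is required")
    coeffs = {}
    for name, t in threads.items():
        if t["risk"] <= 0:
            raise ValueError("risk evaluation value must be positive")
        coeffs[name] = t["idle"] / t["risk"]      # resource allocation tendency
    baseline = sum(coeffs.values()) / len(coeffs)  # reference line
    result = {}
    for name, t in threads.items():
        deviation = (coeffs[name] - baseline) / baseline if baseline else 0.0
        factor = max(0.1, 1.0 - step * deviation)  # above baseline: reclaim
        result[name] = {"cpu": t["cpu"] * factor,
                        "storage": t["storage"] * factor}
    return result


if __name__ == "__main__":
    current = {
        "a": {"idle": 4.0, "risk": 1.0, "cpu": 10.0, "storage": 100.0},
        "b": {"idle": 1.0, "risk": 1.0, "cpu": 10.0, "storage": 100.0},
    }
    print(rebalance(current))
```

In the example, thread `a` sits above the reference line and is scaled down, while thread `b` sits below it and is scaled up by the same relative amount.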

Further, referring to fig. 7, based on the same inventive concept as the data security management method applied to the big data management platform, the present disclosure further proposes a data security management system applied to the big data management platform, including:

Still further, the present solution also proposes a storage medium for the data security management method applied to the big data management platform, on which a computer-readable program is stored; when the computer-readable program is invoked, the above data security management method applied to the big data management platform is executed.

It is understood that the storage medium may be a magnetic medium such as a floppy disk, a hard disk, or a magnetic tape; an optical medium such as a DVD; or a semiconductor medium such as a solid-state disk (SSD).

The foregoing has shown and described the basic principles, principal features and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, that the above embodiments and descriptions merely illustrate the principles of the present invention, and that various changes and modifications may be made therein without departing from the spirit and scope of the invention. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims

1. A data security management method applied to a big data management platform, characterized by comprising:

Obtain a real-time dynamic data set of big data, split the data set and add it to the idle data processing queue;

Establish a data threat identification model to detect abnormal data risk threats on the platform in real time and take automated response measures based on the threat type;

Differential privacy technology is used to query stored data, and the amount of noise during query is adjusted in real time according to the data risk threat situation;

Perform sensitive data screening and cache backup of dynamic data based on abnormal data risk threats;

Detect the current processing speed and the backlog of pending tasks, and automatically adjust resource allocation based on the risk of abnormal data threats.

2. The data security management method applied to a big data management platform according to claim 1 is characterized in that the step of obtaining a real-time dynamic data set of big data and dividing the data set and adding it to an idle data processing queue specifically comprises:

Get the dynamic data set within the set time window, set the data processing segmentation window size based on the available resources remaining on the server, and divide the dynamic data set according to the segmentation window size;

Obtain the load conditions of all data processing threads and the number of tasks accumulated in the corresponding data processing task queues, and calculate the idle evaluation value of the data processing thread by combining the ratio of the task accumulation amount of each processing thread to the thread processing rate with the set initial load weight of each thread;

After the idle evaluation values of all processing threads are sorted in reverse order, the segmented dynamic data are selected and added to the task waiting queue of the data processing thread in sequence according to the sorting results.

3. The data security management method applied to a big data management platform according to claim 2 is characterized in that the establishment of a data threat identification model, real-time detection of abnormal data risk threats of the platform, and taking automated response measures based on threat types specifically include:

Obtain historical processing data from the system platform as training data, classify the training data based on the set data classification, and batch-label the data after classification;

Preprocess and extract features of batch labeled data, establish data classification feature vectors, select convolutional neural networks, train data classification and recognition models based on feature-extracted training data, and deploy the trained models into the system;

Set the threat identification division for each data classification and establish a corresponding threat identification model for each set data classification, specifically:

Based on the historical risk screening results of the current data classification, set the risk threat classification corresponding to the data classification;

Retrieve the data specifically involved in the risk screening results in the training data corresponding to the current data classification, mark it as risk source data, and attribute and label all risk source data based on the set threat classification;

Preprocess and extract features of risk source data, establish risk data threat classification feature vectors, select convolutional neural networks, and train a dedicated risk threat identification model for current data classification based on the risk source data after feature extraction;

Combine and deploy the dedicated risk threat identification models corresponding to all data classifications into the system;

Set corresponding response strategies based on historical risk threat handling methods and establish risk threat-response strategy associations;

Get the data processing thread, identify the data type of the real-time data being processed in the data processing thread through the data classification recognition model, select the corresponding dedicated risk threat identification model for risk threat identification based on the data type identification result, and perform automated response strategy selection and data risk threat processing based on the risk threat identification result and the risk threat-response strategy association.

4. The data security management method applied to a big data management platform according to claim 3 is characterized in that querying the stored data with differential privacy technology and adjusting the noise amount during the query in real time according to the data risk threat situation specifically includes:

Get all the table structures of the database, select the privacy statistical fields, and use the Sigmoid function to adjust the risk of the set privacy budget in real time based on the recognition hit rate of the risk threat recognition model in the current set segmentation window. The specific expression is:

[formula for the adjusted privacy budget ε' in terms of ε and f not reproduced in the source text];

In the formula, ε and ε’ are the set privacy budget and the adjusted privacy budget respectively, and f is the recognition hit rate of the risk threat identification model;

Set the statistical field update monitoring window time, obtain the maximum batch size of the most recent time window of the privacy statistical field, and calculate the noise scale parameter based on the adjusted privacy budget ε’;

Select a random value that obeys the Laplace distribution, and generate Laplace noise according to the noise scale and the selected random value. The specific expression is:

[formula for the Laplace noise N not reproduced in the source text];

Where N is the generated Laplace noise, sgn(U) is the sign of the random value, U is the random value, ΔW is the set insertion batch size of the privacy statistics field, and λ is the noise scale parameter;

The generated noise is applied to the query process of the privacy statistical field to increase the privacy of statistical data queries.

5. The data security management method applied to a big data management platform according to claim 4 is characterized in that the sensitive data screening and cache backup of dynamic data based on abnormal data risk threat conditions specifically includes:

Obtain the data type identification result of the real-time data being processed by the data processing thread, set the sensitive data classification and sensitive operation keywords, retrieve the data matching the sensitive operation keywords in the sensitive data classification, mark the entire data corresponding to the keyword as sensitive data, and calculate the proportion of sensitive data in the total processed data;

Obtain the recognition hit rate of the risk threat recognition model within the currently set segment window, perform weighted summation on the proportion of sensitive data and the risk threat hit frequency, and calculate the risk assessment value of sensitive data;

Set a sensitive data risk assessment value threshold. When the risk assessment value is greater than or equal to the sensitive data risk assessment value threshold, store all marked sensitive data in the cache backup;

If the processing thread does not experience an abnormal interruption during the current task processing, the cache backup will be cleared after the task processing is completed. If an abnormal interruption occurs, the breakpoint will be recorded and the cache backup data will be restored and reprocessed.

6. The data security management method applied to a big data management platform according to claim 5 is characterized in that the detecting the current processing speed and the accumulation of tasks to be processed and automatically adjusting resource allocation in combination with the abnormal data threat risk specifically comprises:

Detect the processing speed of all current data processing threads and the backlog of pending tasks, and recalculate the idle evaluation values of all current data processing threads;

Obtain the sensitive data risk assessment values of all current data processing threads, use the ratio of the idle assessment value to the sensitive data risk assessment value as the resource allocation tendency coefficient, calculate the average value of the resource allocation tendency coefficients of all data processing threads, and mark it as the thread processing resource allocation reference line;

Based on the deviation of the resource allocation tendency coefficient of each processing thread and the thread processing resource allocation reference line, the computing resources and storage resources of each processing thread are adjusted in proportion.

7. A data security management system applied to a big data management platform, used to implement the data security management method applied to a big data management platform as claimed in any one of claims 1 to 6, characterized in that it includes:

A data acquisition and processing distribution module, which acquires a dynamic data set within a set time window, divides the dynamic data set according to a segmentation window size set from the remaining available resources of the server, calculates the idle evaluation value of the data processing thread, and sequentially adds the segmented data to the task waiting queue of the data processing thread according to the calculation results;

A data type and risk threat identification module, which uses the historical processing data of the system platform as training data, trains a data classification identification model, and based on the historical risk screening processing results of data classification, trains a dedicated risk threat identification model for data classification, and simultaneously establishes a risk threat-response strategy association relationship, identifies the type and risk threat of the currently processed data, and performs automated response strategy selection and data risk threat processing;

A database query security module, which performs real-time risk adjustment on a set privacy budget according to the recognition hit rate of the risk threat recognition model in the currently set segment window, calculates a noise scale parameter based on the insertion batch size of the privacy statistics field and the adjusted privacy budget, and then selects a random value, generates Laplace noise according to the noise scale and the selected random value, and applies the generated noise to the query process of the privacy statistics field;

A sensitive data automatic backup module, which calculates the sensitive data risk assessment value according to the weighted sum of the proportion of sensitive data to the total processed data and the recognition hit rate of the risk threat recognition model in the currently set segment window, and when the risk assessment value is greater than or equal to the sensitive data risk assessment value threshold, stores all marked sensitive data in the cache backup;

A system resource allocation adaptive adjustment module, which uses the average value of the ratio of the idle assessment value to the sensitive data risk assessment value of all data processing threads as the thread processing resource allocation reference line, and proportionally adjusts the computing resources and storage of each processing thread based on the deviation between the resource allocation tendency coefficient of each processing thread and the thread processing resource allocation reference line.