CN120316757B

CN120316757B - A user identity threat detection method and system based on large language model

Info

Publication number: CN120316757B
Application number: CN202510821427.6A
Authority: CN
Inventors: 杨传磊
Original assignee: Jiangsu Taihu Huiyun Data System Co ltd
Current assignee: Jiangsu Taihu Huiyun Data System Co ltd
Priority date: 2025-06-19
Filing date: 2025-06-19
Publication date: 2025-08-15
Anticipated expiration: 2045-06-19
Also published as: CN120316757A

Abstract

The invention discloses a user identity threat detection method and system based on a large language model, in the technical field of network security. The method comprises establishing a model based on an open source deep learning framework, acquiring a historical data set, and inputting it to train the model to obtain a large language model, wherein the large language model is used for distinguishing the characteristic differences between normal behavior and threat behavior of a user identity. The invention deeply processes the collected data set through the established large language model, acquires user identity threat data and user multi-source data from both a network security data platform and an enterprise internal system, and constructs user behavior semantic association map data from the converted high-dimensional semantic vectors. Multi-dimensional data analysis can thereby deeply reveal user identity threat data, achieving accurate and efficient detection and prevention of user identity threats, effectively reducing the false alarm rate and the missed alarm rate, reducing wasted effort by security management personnel, and improving the reliability of threat detection.

Description

User identity threat detection method and system based on large language model

Technical Field

The invention relates to the technical field of network security, in particular to a user identity threat detection method and system based on a large language model.

Background

User identity threats are potential security risks or malicious behaviors directed at user identity information in a network environment. These threats may originate from a variety of sources, such as hacking, malware and phishing websites, all of which aim to steal, tamper with, or abuse the identity information of the user. By constructing a large language model and deeply analyzing user behaviors, the invention can accurately identify potential user identity threats, thereby providing strong support for network security protection.

In the prior art, the Chinese patent with the patent number of CN202410721431.0 discloses a method for detecting abnormal behaviors of database users based on a graph neural network model. Although that method learns the complex relationships in a user-behavior graph through the graph neural network model, can give real-time early warning of abnormal behaviors, and improves detection accuracy through multi-dimensional feature extraction, it is mainly developed around structured data from database operation scenarios, such as SQL statement analysis and data table access statistics.

Existing user identity threat detection gives insufficient consideration to multi-source data fusion, has difficulty meeting the analysis requirements of cross-platform user identity threats, analyzes multi-source data poorly in depth, and has limited adaptive learning ability in a dynamic threat environment, so its practical effectiveness is poor.

Disclosure of Invention

The invention aims to provide a user identity threat detection method and system based on a large language model, which are used for solving the problems that the consideration on the aspect of multi-source data fusion is insufficient, the analysis requirement of cross-platform user identity threat is difficult to meet and the deep analysis effect on multi-source data is poor.

In order to achieve the above purpose, the present invention provides the following technical solutions:

In a first aspect, the present invention provides a method for detecting a threat to a user identity based on a large language model, which is characterized by comprising:

establishing a model based on an open source deep learning framework, acquiring a historical data set, inputting and training the model to obtain a large language model, wherein the large language model is used for distinguishing characteristic differences of normal behaviors and threat behaviors of a user identity;

Customizing and developing a data acquisition crawler based on a web crawler framework, deeply analyzing a web security data platform according to the large language model, acquiring user identity threat information data through the data acquisition crawler, acquiring user multi-source data of an enterprise internal system by the data acquisition crawler, compressing the user identity threat information data and the multi-source data to obtain an acquisition data set, importing the acquisition data set into a data processing platform, cleaning and marking the data through the large language model to obtain a labeled data set;

analyzing, extracting and processing the labeled data set based on a cluster computing system and a large language model to obtain a characteristic data set, wherein the characteristic data set comprises login characteristics, operation characteristics and authority characteristics;

Converting the feature data set into a high-dimensional semantic vector based on a word vector model, and inputting the high-dimensional semantic vector into a large language model to construct user behavior semantic association map data;

inputting the feature data set and the user behavior semantic association map data into the large language model, the large language model calculating threat probability data and generating a threat report;

starting a preset real-time response mechanism according to the threat report, wherein the real-time response mechanism comprises a response mechanism preset by an enterprise internal system and a response mechanism preset by a network security data platform;

and acquiring historical event data, inputting it into the large language model, and adjusting the preset parameters of the large language model and the real-time response mechanism.

Further, establishing a model based on an open source deep learning framework, acquiring a historical data set, and inputting it to train the model to obtain a large language model, wherein the large language model is used for distinguishing the characteristic differences between normal behavior and threat behavior of a user identity, comprises:

the historical data set comprises historical normal behavior data and threat behavior data of users acquired by the network security data platform, and normal behavior data and threat behavior data of users acquired by the enterprise internal system;

and inputting the historical data set and training the model to obtain the large language model, wherein the model training adopts supervised learning, labels the normal behavior data in the historical data set as 0 and the threat behavior data in the historical data set as 1, and trains and optimizes the model based on an adaptive learning rate optimization algorithm and a cross entropy loss function, so that the resulting large language model can distinguish the characteristic differences between normal behavior and threat behavior of a user identity.

Further, customizing and developing a data acquisition crawler based on a web crawler framework, deeply analyzing a network security data platform according to the large language model, acquiring user identity threat information data through the data acquisition crawler, acquiring user multi-source data of an enterprise internal system through the data acquisition crawler, and compressing the user identity threat information data and the multi-source data to obtain an acquisition data set, comprises:

the large language model deeply analyzes the page structure, the data interface specifications and the update frequency of the network security data platform, and the data acquisition crawler automatically sends data requests to the network security data platform at a preset time to acquire global user identity threat information data from the network security data platform, wherein the user identity threat information data comprises malicious behavior pattern data and identity theft data;

acquiring, through the data processing of the data acquisition crawler, user multi-source data of the enterprise internal system within a preset time period, wherein the multi-source data comprises user login time, user login log, user IP address and user operation behavior;

And compressing the user identity threat information data, the user login time, the user login log, the user IP address and the user operation behavior to obtain an acquisition data set so as to improve the transmission efficiency of the acquisition data set.

Further, importing the collected data set into a data processing platform and cleaning and labeling it through the large language model to obtain a labeled data set comprises:

Identifying the collected data set according to a large language model, and cleaning format errors and repeated redundant data;

labeling the cleaned collected data set with a labeling tool based on the large language model, wherein data involving preset financial and privacy information in the collected data set is labeled as a high-sensitivity data set, data involving preset routine business in the collected data set is labeled as a low-sensitivity data set, and a role data set is obtained according to department information of the user; the high-sensitivity data set, the low-sensitivity data set and the role data set form the labeled data set, and the labeled data set is used for determining the behavior type of the user.

Further, analyzing, extracting and processing the labeled data set based on a cluster computing system and a large language model to obtain a characteristic data set comprises:

the labeled data set is analyzed and extracted based on the cluster computing system and the large language model to obtain the precise login time, longitude and latitude, equipment type, file name, file type, file operation type, change time and authority content of a user, and the precise login time, longitude and latitude, equipment type, file name, file type, file operation type, change time and authority content are integrated into a characteristic data set, wherein the characteristic data set comprises login characteristics, operation characteristics and authority characteristics.

Further, converting the feature data set into a high-dimensional semantic vector based on a word vector model, and inputting the high-dimensional semantic vector into the large language model to construct user behavior semantic association map data, comprises:

converting the operation features of the users in the feature data set into high-dimensional semantic vectors based on a word vector model;

and inputting the high-dimensional semantic vector into the large language model, calculating cosine similarity between different high-dimensional semantic vectors based on semantic analysis of the large language model, and constructing a user behavior semantic association map.
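The cosine-similarity step above can be illustrated with a minimal sketch. It assumes the operation features have already been converted to vectors; the example vectors, the operation names and the 0.8 edge threshold are hypothetical and are not specified by the invention, and a plain dictionary of weighted edges stands in for the semantic association map.

```python
import math
from itertools import combinations


def cosine(a, b):
    # Cosine similarity of two equal-length vectors; 0.0 if either is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_association_map(vectors, threshold=0.8):
    # Connect every pair of operations whose similarity reaches the threshold.
    # The threshold value is an assumption made for this sketch.
    edges = {}
    for (name_a, vec_a), (name_b, vec_b) in combinations(sorted(vectors.items()), 2):
        similarity = cosine(vec_a, vec_b)
        if similarity >= threshold:
            edges[(name_a, name_b)] = similarity
    return edges


# Hypothetical low-dimensional vectors standing in for high-dimensional ones.
vectors = {
    "read_file": [1.0, 0.0, 0.1],
    "open_file": [0.9, 0.1, 0.1],
    "delete_logs": [0.0, 1.0, 0.0],
}
for pair, similarity in build_association_map(vectors).items():
    print(pair, round(similarity, 3))
```

In this sketch semantically close operations are linked and dissimilar ones are not; how the large language model weights or prunes such edges is not detailed in the text.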

Further, inputting the feature data set and the user behavior semantic association map data into the large language model, the large language model calculating threat probability data and generating a threat report, comprises:

inputting the processed characteristic data set and the user behavior semantic association map data into a large language model;

if the feature data set and the user behavior semantic association map data trigger a preset anomaly threshold in the large language model, threat probability data is generated; the large language model grades and classifies the threat probability data into a plurality of levels and generates a detailed threat report according to the graded and classified threat probability data, wherein the threat report comprises an external threat report for the network security data platform and an internal threat report for the enterprise internal system;

generating early warning information according to the threat report, and sending the early warning information through an Internet-based message service, wherein the early warning information comprises the time, place, user and resource information of the threat.
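The thresholding, grading and early-warning steps above can be sketched as follows. The threshold of 0.5, the level boundaries, the level names and the event field names are all assumptions made for illustration; the text only states that a preset threshold exists and that probabilities are graded into several levels.

```python
def grade(probability, threshold=0.5):
    # Return None below the preset anomaly threshold, otherwise a level name.
    # The boundaries 0.7 and 0.9 are hypothetical.
    if probability < threshold:
        return None
    if probability >= 0.9:
        return "critical"
    if probability >= 0.7:
        return "high"
    return "medium"


def build_alert(event, probability, threshold=0.5):
    # Build early-warning information carrying time, place, user and resource.
    level = grade(probability, threshold)
    if level is None:
        return None
    # Events from the network security data platform are treated as external,
    # events from the enterprise internal system as internal.
    scope = "external" if event["source"] == "security_platform" else "internal"
    return {
        "level": level,
        "scope": scope,
        "time": event["time"],
        "place": event["place"],
        "user": event["user"],
        "resource": event["resource"],
    }


event = {
    "source": "enterprise_internal",
    "time": "2025-06-19 03:12:00",
    "place": "10.0.0.7",
    "user": "alice",
    "resource": "finance/q2_report.xlsx",
}
print(build_alert(event, 0.93))
```

Sending the resulting alert through a message service is left out, since the text does not name one.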

Further, starting a preset real-time response mechanism according to the threat report, wherein the real-time response mechanism comprises a response mechanism preset by the enterprise internal system and a response mechanism preset by the network security data platform, comprises:

starting the response mechanism preset by the network security data platform according to the grading and classification corresponding to the external threat report of the network security data platform in the threat report, wherein the response mechanism preset by the network security data platform comprises freezing the relevant user account through an application programming interface of an identity authentication system;

starting the response mechanism preset by the enterprise internal system according to the grading and classification corresponding to the internal threat report of the enterprise internal system in the threat report, wherein the response mechanism preset by the enterprise internal system comprises restricting access permissions through an application programming interface of a permission management system while retaining preset permissions;

and starting a detailed investigation process based on an automation script to collect relevant operation logs and evidence.
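A minimal sketch of how the response steps above might be dispatched is given below. It only plans actions and does not call any real identity authentication or permission management API; the action names, the report fields and the retained permission are hypothetical.

```python
def plan_response(report, retained_permissions=("read_own_profile",)):
    # Translate a threat report into an ordered list of planned actions.
    actions = []
    if report["scope"] == "external":
        # External threat report: freeze the relevant user account.
        actions.append(("freeze_account", report["user"]))
    else:
        # Internal threat report: restrict access but keep preset permissions.
        actions.append(
            ("restrict_permissions", report["user"], tuple(retained_permissions))
        )
    # In both cases an automated investigation collects logs and evidence.
    actions.append(("start_investigation", report["user"], report["level"]))
    return actions


report = {"scope": "internal", "level": "high", "user": "alice"}
for action in plan_response(report):
    print(action)
```

Returning a plan rather than executing it keeps the sketch testable; a real deployment would map each planned action onto the corresponding system interface.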

Further, acquiring historical event data, inputting it into the large language model, and adjusting the preset parameters of the large language model and the real-time response mechanism comprises:

the historical event data comprises data on successfully blocked threat events and data on missed threat events, and the preset parameters of the large language model and the real-time response mechanism are adjusted according to the historical event data.
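The text does not say how the parameters are adjusted. As one possible illustration only, the sketch below lowers a detection threshold in proportion to the share of missed threat events; the rule, the step size and the bounds are assumptions and are not part of the described method.

```python
def adjust_threshold(threshold, blocked, missed, step=0.05, floor=0.2, ceiling=0.9):
    # blocked: number of successfully blocked threat events
    # missed:  number of missed threat events
    total = blocked + missed
    if total == 0:
        return threshold  # no history, nothing to adjust
    miss_rate = missed / total
    # More missed threats means a lower (more sensitive) threshold.
    adjusted = threshold - step * miss_rate
    return min(ceiling, max(floor, adjusted))


print(adjust_threshold(0.5, blocked=8, missed=2))
```

A fuller scheme would also weigh false alarms, which the two event types named in the text do not cover.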

In a second aspect, the present invention provides a large language model-based user identity threat detection system, the system comprising:

The training model building module is used for building a model based on an open source deep learning framework, acquiring a historical data set input and training the model to obtain a large language model;

The data acquisition module is used for customizing and developing a data acquisition crawler based on a web crawler framework, deeply analyzing a web security data platform according to the large language model, acquiring user identity threat information data through the data acquisition crawler, acquiring user multi-source data of an enterprise internal system by the data acquisition crawler, compressing the user identity threat information data and the multi-source data to obtain an acquisition data set, importing the acquisition data set into a data processing platform, and cleaning and labeling the data through the large language model to obtain a labeled data set;

The extraction processing module is used for analyzing, extracting and processing the labeled data set based on a cluster computing system and a large language model to obtain a characteristic data set;

the conversion module is used for converting the characteristic data set into a high-dimensional semantic vector based on a word vector model, and the high-dimensional semantic vector is input into a large language model to construct user behavior semantic association map data;

The calculation generation module is used for inputting the characteristic data set and the user behavior semantic association map data into a large language model, calculating threat probability data by the large language model and generating a threat report;

the response module is used for starting a preset real-time response mechanism according to the threat report, wherein the real-time response mechanism comprises a response mechanism preset by an enterprise internal system and a response mechanism preset by a network security data platform;

the adjustment module is used for acquiring historical event data, inputting it into the large language model, and adjusting the preset parameters of the large language model and the real-time response mechanism.

Compared with the prior art, the invention has the beneficial effects that:

1. In the invention, the collected data set is deeply processed by the established large language model; user identity threat data and user multi-source data are acquired from both the network security data platform and the enterprise internal system; and user behavior semantic association map data is constructed from the converted high-dimensional semantic vectors. Multi-dimensional data analysis can thereby deeply reveal user identity threat data, achieving accurate and efficient detection and prevention of user identity threats, effectively reducing the false alarm rate and the missed alarm rate, reducing wasted effort by security management personnel, and improving the reliability of threat detection.

2. In the invention, starting the preset real-time response mechanism according to the threat report enables the system to react quickly when a potential threat is detected, avoiding the delay and uncertainty of manual intervention and thereby effectively improving the timeliness and accuracy of threat response. The response mechanism preset by the enterprise internal system and the response mechanism preset by the network security data platform work in coordination, so that threat information can be transmitted and processed efficiently across platforms, further enhancing the overall security of the system. Threat response and preset parameter adjustment are optimized through historical event data, so that the large language model can learn adaptively and keep adapting to a dynamic threat environment, providing more comprehensive and effective protection for user identity security.

Drawings

FIG. 1 is a schematic flow chart of the method of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Examples:

Referring to FIG. 1, in an embodiment of the present invention, a method for detecting a threat of a user identity based on a large language model includes:

S1, establishing a model based on an open source deep learning framework, acquiring a historical data set input, and training the model to obtain a large language model, wherein the large language model is used for distinguishing characteristic differences of normal behaviors and threat behaviors of a user identity;

S2, customizing and developing a data acquisition crawler based on a web crawler framework, deeply analyzing a web security data platform according to a large language model, acquiring user identity threat information data through the data acquisition crawler, acquiring user multi-source data of an enterprise internal system by the data acquisition crawler, compressing the user identity threat information data and the multi-source data to obtain an acquisition data set, importing the acquisition data set into a data processing platform, and cleaning and labeling the data through the large language model to obtain a labeled data set;

S3, analyzing, extracting and processing the labeled data set based on a cluster computing system and a large language model to obtain a characteristic data set, wherein the characteristic data set comprises login characteristics, operation characteristics and authority characteristics;

S4, converting the feature data set into a high-dimensional semantic vector based on the word vector model, and inputting the high-dimensional semantic vector into a large language model to construct user behavior semantic association map data;

S5, inputting the feature data set and the user behavior semantic association map data into the large language model, the large language model calculating threat probability data and generating a threat report;

S6, starting a preset real-time response mechanism according to the threat report, wherein the real-time response mechanism comprises a response mechanism preset by an enterprise internal system and a response mechanism preset by a network security data platform;

S7, acquiring historical event data, inputting it into the large language model, and adjusting the preset parameters of the large language model and the real-time response mechanism.

Specifically, in the invention, the collected data set is deeply processed through the natural language processing and knowledge reasoning capabilities of the established large language model; user identity threat data and user multi-source data are acquired from both the network security data platform and the enterprise internal system; and user behavior semantic association map data is constructed from the converted high-dimensional semantic vectors. Multi-dimensional data analysis can thereby deeply reveal user identity threat data, achieving accurate and efficient detection and prevention of user identity threats, effectively reducing the false alarm rate and the missed alarm rate, reducing wasted effort by security management personnel, and improving the reliability of threat detection.

Furthermore, starting the preset real-time response mechanism according to the threat report enables the system to react quickly when a potential threat is detected, avoiding the delay and uncertainty of manual intervention and thereby effectively improving the timeliness and accuracy of threat response. The response mechanism preset by the enterprise internal system and the response mechanism preset by the network security data platform work in coordination, so that threat information can be transmitted and processed efficiently across platforms, further enhancing the overall security of the system. Threat response and preset parameter adjustment are optimized through historical event data, so that the large language model can learn adaptively and keep adapting to a dynamic threat environment, providing more comprehensive and effective protection for user identity security.

Preferably, establishing a model based on an open source deep learning framework, acquiring a historical data set, and inputting it to train the model to obtain a large language model, wherein the large language model is used for distinguishing the characteristic differences between normal behavior and threat behavior of a user identity, comprises:

The open source deep learning framework comprises the PyTorch framework, which is used to build the model; PyTorch is an open source deep learning framework. Large language models, such as the GPT series, learn language patterns and rules through unsupervised learning on large-scale text data. The historical data set comprises historical normal behavior data and threat behavior data of users acquired by the network security data platform, and normal behavior data and threat behavior data of users acquired by the enterprise internal system;

The historical data set is input and the model is trained to obtain the large language model, wherein the model training adopts supervised learning, labels the normal behavior data in the historical data set as 0 and the threat behavior data as 1, and trains and optimizes the model based on the adaptive learning rate optimization algorithm and the cross entropy loss function, so that the resulting large language model can distinguish the characteristic differences between normal behavior and threat behavior of a user identity.

Specifically, the adaptive learning rate optimization algorithm comprises the Adam optimizer, which is widely used in deep learning tasks such as image classification, object detection and natural language processing. Many deep learning frameworks provide an implementation of the Adam optimizer that is convenient to use. Normal behaviors are labeled 0 and threat behaviors are labeled 1; the model is trained for 100 epochs with the Adam optimizer and a cross entropy loss function, and early stopping is used to avoid overfitting, so that the model learns the characteristic differences that distinguish normal behavior from threat behavior.
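The training recipe above (labels 0 and 1, Adam, cross entropy loss, at most 100 epochs, early stopping) can be sketched in a few lines. To stay self-contained, the sketch trains a small logistic-regression classifier in plain Python rather than a large language model in PyTorch; the synthetic two-feature data, the learning rate and the patience of 5 epochs are assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(params, x):
    # params holds the feature weights followed by the bias.
    return sigmoid(sum(w * xi for w, xi in zip(params, x)) + params[-1])


def cross_entropy(params, rows):
    # Mean binary cross entropy over (features, label) pairs.
    eps = 1e-12
    total = 0.0
    for x, y in rows:
        p = predict(params, x)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(rows)


def train(train_rows, val_rows, max_epochs=100, lr=0.05, patience=5):
    dim = len(train_rows[0][0])
    params = [0.0] * (dim + 1)
    m = [0.0] * (dim + 1)  # Adam first-moment estimates
    v = [0.0] * (dim + 1)  # Adam second-moment estimates
    beta1, beta2, eps = 0.9, 0.999, 1e-8
    best_loss, best_params, stale, epochs_run = float("inf"), params[:], 0, 0
    for epoch in range(1, max_epochs + 1):
        epochs_run = epoch
        # Full-batch gradient of the mean cross entropy.
        grad = [0.0] * (dim + 1)
        for x, y in train_rows:
            err = predict(params, x) - y
            for i in range(dim):
                grad[i] += err * x[i] / len(train_rows)
            grad[dim] += err / len(train_rows)
        # Adam update with bias correction.
        for i in range(dim + 1):
            m[i] = beta1 * m[i] + (1 - beta1) * grad[i]
            v[i] = beta2 * v[i] + (1 - beta2) * grad[i] ** 2
            m_hat = m[i] / (1 - beta1 ** epoch)
            v_hat = v[i] / (1 - beta2 ** epoch)
            params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
        # Early stopping: keep the best validation loss, stop when it stalls.
        val_loss = cross_entropy(params, val_rows)
        if val_loss < best_loss - 1e-6:
            best_loss, best_params, stale = val_loss, params[:], 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_params, best_loss, epochs_run


def make_rows(n, seed):
    # Synthetic data: threat rows (label 1) cluster near 1.0, normal rows near 0.0.
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        y = rng.randint(0, 1)
        rows.append(([y + rng.gauss(0, 0.3), y + rng.gauss(0, 0.3)], y))
    return rows


params, val_loss, epochs_run = train(make_rows(200, 1), make_rows(50, 2))
print(f"stopped after {epochs_run} epochs, validation loss {val_loss:.3f}")
```

With PyTorch, the hand-written update would normally be replaced by the framework's own Adam optimizer and cross entropy loss; the loop structure and the early-stopping check stay the same.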

Preferably, customizing and developing a data acquisition crawler based on the web crawler framework, deeply analyzing the network security data platform according to the large language model, acquiring user identity threat information data through the data acquisition crawler, acquiring user multi-source data of the enterprise internal system through the data acquisition crawler, and compressing the user identity threat information data and the multi-source data to obtain an acquisition data set, comprises:

The large language model deeply analyzes the page structure, data interface specifications and update frequency of the network security data platform, and the data acquisition crawler automatically sends data requests to the network security data platform at a preset time to acquire global user identity threat information data from the network security data platform, wherein the user identity threat information data comprises malicious behavior pattern data and identity theft data;

The user identity threat information data, the user login time, the user login log, the user IP address and the user operation behavior are compressed to obtain an acquisition data set, which improves the transmission efficiency of the acquisition data set and ensures that the acquisition data set fully covers the network security data platform and the enterprise internal system and is updated within the preset time.

The web crawler framework comprises the Scrapy framework for Python, which is used to customize and develop an efficient data acquisition crawler. The large language model deeply analyzes the page structure, data interface specifications and update frequency of the network security data platform; such platforms include Kaspersky Threat Intelligence Portal, Symantec Security Response and the like. Through its strong natural language processing capability, the large language model can accurately understand and parse the complex page structure of the network security data platform, ensuring the accuracy and integrity of data acquisition. The data interface specifications of the platform are also studied carefully, so that the data acquisition crawler can interact with the platform efficiently and stably. In addition, the large language model tracks the update frequency of the platform and adjusts the data acquisition strategy in time to cope with possible changes to the page structure or data interface, ensuring the continuity and timeliness of data acquisition. The preset time is an interval of 5 to 30 minutes, preferably 10 minutes, at which the latest global user identity threat information is acquired.

The data processing of the data acquisition crawler comprises an ETL tool, ETL being an abbreviation of Extract, Transform and Load, that is, the process of extracting, transforming and loading data. Collection is carried out within a preset time period, namely from 2 a.m. to 6 a.m., during which the ETL tool performs full collection of data such as user login times, IP addresses and operation behaviors from the enterprise internal system (for example, network access logs and the permission management system). Effective use of the ETL tool can greatly improve the efficiency and accuracy of data acquisition and provides a solid data basis for subsequent user identity threat detection.

Taking Kaspersky Threat Intelligence Portal as an example, the large language model analyzes its data interface documentation to determine the parameters and format of the data request; the Scrapy scheduler settings are then used to send a data request to the platform automatically every 10 minutes and acquire the latest global user identity threat information, including data such as malicious behavior patterns and identity theft cases. Meanwhile, the ETL tool performs full collection of data such as user login times, IP addresses and operation behaviors from the enterprise internal system (for example, network access logs and the permission management system) at 2 a.m., when system load is lower, and data compression is used to improve transmission efficiency, ensuring full coverage and timely updating of internal and external data.
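The compression step and the two timing rules can be sketched as follows. The text does not name a compression format, so gzip over JSON is an assumption here, as are the record fields and the constant and function names; the crawler and ETL tool themselves are not modeled.

```python
import gzip
import json

POLL_INTERVAL_SECONDS = 10 * 60      # preferred external polling interval
INTERNAL_WINDOW_HOURS = range(2, 6)  # 2 a.m. up to (not including) 6 a.m.


def in_internal_window(hour):
    # True when the internal full collection is allowed to start.
    return hour in INTERNAL_WINDOW_HOURS


def compress_records(threat_intel, internal_logs):
    # Serialize both sources into one JSON document, then gzip it.
    payload = {"threat_intel": threat_intel, "internal": internal_logs}
    raw = json.dumps(payload, ensure_ascii=False, sort_keys=True).encode("utf-8")
    return raw, gzip.compress(raw)


def decompress_records(blob):
    # Inverse of compress_records: gunzip and parse the JSON document.
    return json.loads(gzip.decompress(blob).decode("utf-8"))


# Hypothetical sample records for both data sources.
intel = [{"pattern": "credential stuffing", "kind": "malicious_behavior"}]
logs = [
    {"user": f"user{i % 5}", "login_time": "2025-06-19 02:00:00",
     "ip": "10.0.0.1", "operation": "login"}
    for i in range(200)
]
raw, blob = compress_records(intel, logs)
print(len(raw), "bytes raw ->", len(blob), "bytes compressed")
```

Log records are highly repetitive, which is why general-purpose compression tends to shrink them substantially before transfer to the data processing platform.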

Preferably, importing the collected data set into a data processing platform, cleaning and labeling the collected data set through a large language model to obtain a labeled data set, including:

identifying the collected data set according to the large language model, and cleaning format errors and repeated redundant data;

the cleaned collected data set is labeled with a labeling tool based on the large language model, wherein data involving preset financial and privacy information in the collected data set is labeled as a high-sensitivity data set, data involving preset routine business in the collected data set is labeled as a low-sensitivity data set, and a role data set is obtained according to department information of the user; the high-sensitivity data set, the low-sensitivity data set and the role data set form the labeled data set, and the labeled data set is used for determining the user behavior type.

Specifically, the data processing platform comprises the Hadoop platform, an open source, scalable, distributed computing platform developed by the Apache Software Foundation. It is mainly used for processing large-scale data sets, can run on a cluster formed by a large number of commodity servers, and provides an efficient and reliable solution for storing and processing massive data;

The collected data set is imported into the Hadoop platform. The large language model, combining regular expressions and semantic analysis, identifies and cleans format errors and duplicate redundant data, and a labeling tool developed on the basis of the large language model labels the cleaned data through part-of-speech tagging and named entity recognition. Taking user operation behavior data as an example, part-of-speech tagging identifies operation verbs (such as log in, read and modify) to determine the user behavior type, named entity recognition identifies entities such as user names and file names, and the sensitivity of the data is judged in combination with the large language model's understanding of business logic. The preset routine business comprises routine operations on ordinary business files, which are labeled as low sensitivity to obtain the low-sensitivity data set. User roles (such as ordinary staff, department manager and system administrator) are determined from the user's department and position information using the knowledge graph technology of the large language model to obtain the role data set, providing clear and accurate labeled data for subsequent feature extraction and analysis.
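The cleaning and labeling flow can be approximated with simple rules, shown in the sketch below. A regular expression handles format errors and a set handles duplicates; keyword matching and a lookup table stand in for the large language model's semantic analysis, named entity recognition and knowledge graph, which the sketch does not attempt to reproduce. The keywords, record fields and title-to-role table are hypothetical.

```python
import re

TIME_RE = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}$")
HIGH_SENSITIVITY_KEYWORDS = ("finance", "payroll", "privacy")  # assumed keywords
ROLE_BY_TITLE = {  # assumed mapping from position title to role
    "admin": "system administrator",
    "manager": "department manager",
}


def clean(records):
    # Drop records with a malformed timestamp and exact duplicates.
    seen, cleaned = set(), []
    for record in records:
        if not TIME_RE.match(record.get("time", "")):
            continue  # format error
        key = tuple(sorted(record.items()))
        if key in seen:
            continue  # repeated redundant data
        seen.add(key)
        cleaned.append(record)
    return cleaned


def label(record):
    # Attach behavior type, sensitivity level and role to one cleaned record.
    target = record["target"].lower()
    is_high = any(word in target for word in HIGH_SENSITIVITY_KEYWORDS)
    return {
        **record,
        "behavior_type": record["action"],
        "sensitivity": "high" if is_high else "low",
        "role": ROLE_BY_TITLE.get(record.get("title", ""), "ordinary staff"),
    }


records = [
    {"user": "alice", "title": "manager", "action": "read",
     "target": "finance/q2.xlsx", "time": "2025-06-19 09:00:00"},
    {"user": "alice", "title": "manager", "action": "read",
     "target": "finance/q2.xlsx", "time": "2025-06-19 09:00:00"},
    {"user": "bob", "title": "staff", "action": "modify",
     "target": "wiki/notes.txt", "time": "19/06/2025"},
]
for row in (label(r) for r in clean(records)):
    print(row["user"], row["behavior_type"], row["sensitivity"], row["role"])
```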

Preferably, the feature data set is obtained by analyzing, extracting and processing the labeled data set based on a cluster computing system and a large language model, and the method comprises the following steps:

the labeled data set is parsed and extracted based on the cluster computing system and the large language model to obtain the user's precise login time, longitude and latitude, device type, file name, file type, file operation type, change time and permission content, which are integrated into the feature data set, wherein the feature data set comprises login features, operation features and permission features.

The cluster computing system comprises a Spark platform and the pandas library. The labeled data set is processed with the Spark platform and the pandas library, and basic features are extracted by combining the parsing capability of the large language model. From the login logs in the labeled data set, the precise login time, the longitude and latitude (obtained by calling the GeoPy tool) and the device type are parsed. From the file operation records in the labeled data set, the file name, file type and operation type are extracted. From the permission change records in the labeled data set, the change time, permission content and the like are extracted. These are integrated into a data set containing three kinds of basic features, namely login features, operation features and permission features, which supports subsequent in-depth analysis.
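The grouping of records into the three feature families can be sketched as follows. This is an illustration only: the description uses Spark and pandas, whereas the sketch uses only the Python standard library so that it runs anywhere. The record layout (`kind`, `lat`, `lon`, `op`, `content`) and the function name `extract_features` are hypothetical, and coordinates are assumed to have been resolved already, so the GeoPy call is not reproduced.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format


def extract_features(labeled_records):
    """Group labeled records into login, operation and permission features."""
    features = {"login": [], "operation": [], "permission": []}
    for rec in labeled_records:
        if rec["kind"] == "login":
            features["login"].append({
                "user": rec["user"],
                "login_time": datetime.strptime(rec["time"], TIME_FORMAT),
                "latitude": rec["lat"],
                "longitude": rec["lon"],
                "device_type": rec["device"],
            })
        elif rec["kind"] == "file_op":
            name = rec["file"]
            features["operation"].append({
                "user": rec["user"],
                "file_name": name,
                # File type is taken from the extension, if there is one.
                "file_type": name.rsplit(".", 1)[1] if "." in name else "",
                "operation_type": rec["op"],
            })
        elif rec["kind"] == "perm_change":
            features["permission"].append({
                "user": rec["user"],
                "change_time": datetime.strptime(rec["time"], TIME_FORMAT),
                "permission_content": rec["content"],
            })
    return features


records = [
    {"kind": "login", "user": "zhangsan", "time": "2024-10-10 09:05:12",
     "lat": 31.49, "lon": 120.31, "device": "laptop"},
    {"kind": "file_op", "user": "zhangsan",
     "file": "financial_quarterly_report.docx", "op": "read"},
    {"kind": "perm_change", "user": "lisi", "time": "2024-10-11 14:00:00",
     "content": "grant: finance_read"},
]
features = extract_features(records)
```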

Preferably, the feature data set is converted into a high-dimensional semantic vector based on the word vector model, and the high-dimensional semantic vector is input into the large language model to construct user behavior semantic association graph data, which comprises the following steps:

converting the operation features of the users in the feature data set into high-dimensional semantic vectors based on the word vector model;

the high-dimensional semantic vectors are input into a large language model, cosine similarity among different high-dimensional semantic vectors is calculated based on semantic analysis of the large language model, and a user behavior semantic association map is constructed.

The graph intuitively displays the associations and patterns among a user's behaviors and helps identify abnormal behaviors. By setting a threshold, an alarm mechanism is triggered when the cosine similarity crosses a preset value (falling below it or rising above it, depending on the rule), prompting a possible identity threat. In addition, the graph can be updated dynamically, and the user behavior model is continuously optimized as new data are added, which improves the accuracy and timeliness of detection.

Specifically, the word vector model is a BERT-based word vector model. BERT word vector technology converts the text of the operation features in the feature data set into high-dimensional semantic vectors (for example, "user Zhang San read the file financial quarterly report.docx on October 10, 2024" generates a 768-dimensional vector). The semantic analysis capability of the large language model is then used to calculate the cosine similarity between different operation behavior vectors and to construct the user behavior semantic association graph. A similarity threshold of 0.7 is set to identify behavior pattern associations: when the similarity of two operation behavior vectors exceeds the threshold, the two behaviors are considered to be strongly associated semantically. For example, if a user frequently reads financial statement files, then when a new operation of reading a finance-related file occurs, the similarity between the new behavior vector and the historical behavior vectors is high, indicating that the behavior conforms to the user's normal behavior pattern; conversely, if the similarity is low, an abnormality may exist. By constructing the semantic association graph, potential changes and associations in user behavior patterns can be discovered, providing a deeper semantic feature basis for threat detection.
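The similarity computation and the 0.7 threshold can be illustrated with a short Python sketch. The vectors here are toy 3-dimensional values chosen by hand; they stand in for the 768-dimensional BERT vectors, and no embedding model is called. The function names `cosine_similarity` and `build_association_graph` are hypothetical.

```python
import math

SIMILARITY_THRESHOLD = 0.7  # value given in the description


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for zero vectors in this sketch
    return dot / (norm_a * norm_b)


def build_association_graph(vectors, threshold=SIMILARITY_THRESHOLD):
    """Return edges (first, second, similarity) for pairs above the threshold."""
    names = list(vectors)
    edges = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            sim = cosine_similarity(vectors[first], vectors[second])
            if sim > threshold:
                edges.append((first, second, sim))
    return edges


# Toy vectors standing in for BERT embeddings of operation descriptions.
vectors = {
    "read financial report": [0.9, 0.1, 0.0],
    "read financial statement": [0.8, 0.2, 0.1],
    "change system config": [0.0, 0.1, 0.9],
}
edges = build_association_graph(vectors)
```

With these values only the two finance-related behaviors are linked; the configuration change has low similarity to both and would be a candidate for closer inspection if it came from a user whose history consists of financial reads.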

A user behavior profiling algorithm is trained based on the large language model, and the user's routine login period is determined through time series analysis. For example, analyzing a user's login time data over one month shows that user A generally logs in to the system between 9:00 and 18:00 on workdays, with the login peak concentrated between 9:00 and 10:30. The types and frequencies of the files the user accesses are statistically analyzed and, combined with the large language model's semantic understanding of file content, used to grasp the user's business needs and working habits. If a user frequently accesses sales data report files and often reads and analyzes them, this indicates that the user's work is probably related to sales data analysis. The user's behavior profile is refined and dynamically adjusted by combining the user's role and permission information with the knowledge reasoning capability of the large language model. For example, a system administrator has higher permissions, and the administrator's operations may involve advanced operations such as system configuration changes and user permission management, whereas an ordinary employee has lower permissions and mainly performs daily business operations. When the user's role or permissions change, the model updates the user behavior profile in real time according to the new information, so that the profile accurately reflects the user's normal behavior pattern.
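A very small sketch of deriving a routine login period from historical login times is given here. It is an assumption-based simplification: it reports only the earliest hour, latest hour and most frequent hour observed, whereas the description relies on time series analysis by the large language model. The function name `routine_login_period` is hypothetical.

```python
from collections import Counter
from datetime import datetime


def routine_login_period(login_times):
    """Return (earliest_hour, latest_hour, peak_hour) over observed logins."""
    hours = [t.hour for t in login_times]
    peak_hour, _count = Counter(hours).most_common(1)[0]
    return min(hours), max(hours), peak_hour


logins = [
    datetime(2024, 10, 8, 9, 5),
    datetime(2024, 10, 9, 9, 30),
    datetime(2024, 10, 10, 9, 50),
    datetime(2024, 10, 10, 13, 10),
    datetime(2024, 10, 11, 17, 45),
]
period = routine_login_period(logins)
```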

The large language model performs statistical analysis on the historical data and sets personalized anomaly thresholds for each user based on the 3-sigma principle: a login location deviating from the usual location by more than 5 degrees, or a login time deviating from the mean by more than 3 standard deviations, is regarded as abnormal, and a file access frequency exceeding 3 times the historical mean, or an abnormal operation type (such as an ordinary employee attempting to modify permissions), triggers an early warning. These thresholds serve as important reference standards for subsequent threat detection and are used to judge whether user behavior deviates from the normal pattern.
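The three numeric checks can be sketched in Python as follows. Several points are assumptions not stated in the description: login time is represented as a fractional hour of the day, the 5-degree rule is applied separately to latitude and longitude (longitude wrap-around is ignored), and the population standard deviation is used. All function names are hypothetical.

```python
import statistics


def login_time_is_abnormal(history_hours, new_hour, k=3.0):
    """3-sigma rule: flag a login hour more than k standard deviations from the mean."""
    mean = statistics.mean(history_hours)
    std = statistics.pstdev(history_hours)
    if std == 0.0:
        return new_hour != mean  # no spread in history: any change is flagged
    return abs(new_hour - mean) > k * std


def location_is_abnormal(usual, current, max_degrees=5.0):
    """Flag a login whose latitude or longitude differs by more than max_degrees."""
    return (abs(current[0] - usual[0]) > max_degrees
            or abs(current[1] - usual[1]) > max_degrees)


def access_frequency_is_abnormal(history_counts, new_count, factor=3.0):
    """Flag a file access count above `factor` times the historical mean."""
    return new_count > factor * statistics.mean(history_counts)


history_hours = [9.0, 9.5, 10.0, 9.25, 9.75]   # typical morning logins
night_login = login_time_is_abnormal(history_hours, 3.0)
usual_login = login_time_is_abnormal(history_hours, 10.0)
far_away = location_is_abnormal((31.5, 120.3), (39.9, 116.4))
nearby = location_is_abnormal((31.5, 120.3), (31.2, 121.5))
burst = access_frequency_is_abnormal([10, 12, 8], 31)
```

Because the thresholds are derived from each user's own history, the same 3:00 login that is flagged here could be normal for a user whose history consists of night shifts.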

Preferably, the feature data set and the user behavior semantic association map data are input into a large language model, threat probability data are calculated by the large language model, and a threat report is generated, including:

Inputting the processed feature data set and the user behavior semantic association map data into a large language model;

If the feature data set and the user behavior semantic association graph data trigger a preset anomaly threshold in the large language model, threat probability data are generated; the threat probability data are classified by category and graded into several levels by the large language model, and the large language model generates a detailed threat report according to the classified and graded threat probability data, wherein the threat report comprises an external threat report of the network security data platform and an internal threat report of the enterprise internal system;

And generating early warning information according to the threat report, and sending the early warning information based on the message service of the Internet, wherein the early warning information comprises threat time, place, user and resource information.

Specifically, the feature data set collected in real time and the user behavior semantic association graph data are input into the trained large language model, which checks them against the preset anomaly thresholds and calculates the threat probability. The Internet message service comprises a short message gateway and a mail service. If an anomaly threshold is triggered (for example, logins from 3 different locations within 10 minutes, or unauthorized access from an abnormal device), the system sends an early warning to security personnel through the short message gateway and the mail service, including the threat time (accurate to the second), location (IP resolved to street level), user and resource information, to ensure a timely response.
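The example trigger "logins from 3 different locations within 10 minutes" can be checked with a sliding window, sketched here in Python. The function names `multi_location_alert` and `format_warning` and the message layout are hypothetical; sending through a short message gateway or mail service is not shown.

```python
from datetime import datetime, timedelta


def multi_location_alert(logins, min_locations=3, window=timedelta(minutes=10)):
    """Return True if one window contains logins from min_locations distinct places.

    `logins` is a list of (timestamp, location) tuples for a single user.
    """
    ordered = sorted(logins, key=lambda item: item[0])
    start = 0
    for end in range(len(ordered)):
        # Advance the left edge until the window spans at most `window`.
        while ordered[end][0] - ordered[start][0] > window:
            start += 1
        locations = {loc for _, loc in ordered[start:end + 1]}
        if len(locations) >= min_locations:
            return True
    return False


def format_warning(timestamp, location, user, resource):
    """Build the warning text with the threat time accurate to the second."""
    return (f"time={timestamp.strftime('%Y-%m-%d %H:%M:%S')} "
            f"location={location} user={user} resource={resource}")


suspicious = [
    (datetime(2024, 10, 10, 9, 0, 0), "Wuxi"),
    (datetime(2024, 10, 10, 9, 4, 0), "Beijing"),
    (datetime(2024, 10, 10, 9, 8, 30), "Shenzhen"),
]
spread_out = [
    (datetime(2024, 10, 10, 9, 0, 0), "Wuxi"),
    (datetime(2024, 10, 10, 9, 4, 0), "Beijing"),
    (datetime(2024, 10, 10, 9, 20, 0), "Shenzhen"),
]
alert = multi_location_alert(suspicious)
no_alert = multi_location_alert(spread_out)
message = format_warning(suspicious[-1][0], "Shenzhen", "zhangsan", "finance_share")
```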

When a threat is detected, it is classified and graded using the large language model. According to the nature, source and possible impact of the threat, threats are divided into categories such as identity theft by external hackers, malicious operation by internal staff, permission abuse risk and abnormal account sharing. The text classification technology of the large language model is applied to the analysis of threat behavior characteristics to determine the category to which a threat belongs. For example, when an external IP address initiates a large number of brute-force login attempts and the targeted account is an ordinary business account, the large language model analyzes the behavior pattern and the account characteristics and judges the event to be an identity theft threat from an external hacker. Meanwhile, the threat is graded into several levels, namely high, medium and low, according to factors such as the sensitivity of the data involved, the scope of the affected business and the potential economic loss. A threat that involves a risk of leaking highly sensitive data (such as customer ID card numbers and bank card information), affects the normal operation of core business and may cause serious economic loss is judged to be a high-level threat; a threat that involves leakage of general business data, affects business operation and causes minor economic loss is judged to be a medium-level threat; and a threat that involves only a small amount of abnormal operation on non-sensitive data and has little impact on business is judged to be a low-level threat. Security managers can thus take countermeasures that match the severity of the threat.
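A rule-based stand-in for the grading step is sketched here in Python. The description names three factors (data sensitivity, affected business scope and potential economic loss) but does not say how they are combined; taking the most severe factor is an assumption made for this sketch, as are the label strings and the function name `grade_threat`.

```python
LEVELS = ["low", "medium", "high"]

# Assumed label sets for the three factors named in the description.
SENSITIVITY_LEVEL = {"non_sensitive": 0, "general": 1, "high": 2}
SCOPE_LEVEL = {"minor": 0, "partial": 1, "core": 2}
LOSS_LEVEL = {"negligible": 0, "small": 1, "serious": 2}


def grade_threat(sensitivity, scope, loss):
    """Grade a threat by its most severe factor (a conservative assumption)."""
    score = max(
        SENSITIVITY_LEVEL[sensitivity],
        SCOPE_LEVEL[scope],
        LOSS_LEVEL[loss],
    )
    return LEVELS[score]


leak_of_bank_cards = grade_threat("high", "core", "serious")
general_data_leak = grade_threat("general", "partial", "small")
odd_but_harmless = grade_threat("non_sensitive", "minor", "negligible")
```

Taking the maximum errs toward higher grades; a deployment that wants fewer high-level alerts could require two or more factors to agree before escalating.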

The large language model generates a detailed threat report according to the threat characteristics and related knowledge. The threat report comprises threat details (such as the number of brute-force login attempts and the content of abnormal operations), a hazard assessment (such as the impact on data security and business), and suggested countermeasures (such as freezing the account and tracing the IP). For example, an external identity theft threat report needs to include the attribution of the attacking IP, the permission level of the affected account and the emergency handling procedure, helping security personnel quickly formulate a sound handling plan.

Preferably, a preset real-time response mechanism is started according to the threat report, wherein the real-time response mechanism comprises a response mechanism preset by an internal system of an enterprise and a response mechanism preset by a network security data platform, and the method comprises the following steps:

The response mechanism preset by the enterprise internal system is started according to the grading and classification of the internal threat report of the enterprise internal system in the threat report, wherein the response mechanism preset by the enterprise internal system comprises restricting access permissions through the application programming interface of the permission management system while retaining preset permissions;

Specifically, after a threat early warning is received, the system automatically starts the real-time response mechanism, and different countermeasures are taken depending on the threat category and severity. For a high-risk identity theft threat from an external hacker, the system immediately freezes the relevant user account through the application programming interface (API) of the identity authentication system, preventing further operation. At the same time, the short message gateway interface is called to send an emergency notification containing the threat details to the mobile phone of the security manager, for example: "An external hacker has been found carrying out a brute-force login attack on the account of user [user name]; the account has been frozen, please handle it promptly." For a malicious operation threat from an internal employee, the system restricts the employee's access permissions through the application programming interface (API) of the permission management system (such as Windows Server Active Directory permission management) and retains only the preset permissions, which comprise the basic necessary permissions (such as permission to view personal work tasks). A detailed investigation process is started through an automation script to collect the relevant operation logs and evidence. For example, all operation records of the employee before and after the threat are extracted from the network access log system and the business operation record system and stored in a dedicated investigation database for subsequent in-depth analysis.
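The dispatch from threat category and level to countermeasures can be sketched as follows. Everything interface-related is hypothetical: `RecordingClient` is a stand-in that only records calls, and it does not correspond to any real identity authentication, Active Directory or short message gateway API. The category strings are likewise assumed.

```python
BASIC_PERMISSIONS = ("view_personal_tasks",)  # example of "basic necessary" permissions


class RecordingClient:
    """Hypothetical stand-in for the identity, permission and SMS interfaces.

    It only records the calls so the dispatch logic can be run offline.
    """

    def __init__(self):
        self.calls = []

    def freeze_account(self, user):
        self.calls.append(("freeze_account", user))

    def restrict_permissions(self, user, keep):
        self.calls.append(("restrict_permissions", user, tuple(keep)))

    def send_sms(self, message):
        self.calls.append(("send_sms", message))

    def start_investigation(self, user):
        self.calls.append(("start_investigation", user))


def respond(report, client):
    """Choose countermeasures from the threat category and level."""
    user = report["user"]
    if report["category"] == "external_identity_theft" and report["level"] == "high":
        client.freeze_account(user)
        client.send_sms(
            f"Brute-force login attack on account {user} detected; "
            "the account has been frozen, please handle promptly."
        )
    elif report["category"] == "internal_malicious_operation":
        client.restrict_permissions(user, keep=BASIC_PERMISSIONS)
        client.start_investigation(user)


client = RecordingClient()
respond({"category": "external_identity_theft", "level": "high", "user": "zhangsan"}, client)
respond({"category": "internal_malicious_operation", "level": "medium", "user": "lisi"}, client)
```

Passing the client in as a parameter keeps the decision logic testable without touching production systems; a real client would wrap the actual API calls behind the same four methods.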

Preferably, the method for acquiring the historical event data and inputting the historical event data into the large language model, and adjusting the preset parameters and the real-time response mechanism of the large language model comprises the following steps:

The historical event data is successfully prevented threat event data and missed threat event data, and parameters and a real-time response mechanism preset by the large language model are adjusted according to the historical event data.

Specifically, the large language model analyzes historical events and summarizes the experience of successful defenses and the lessons of failures. For example, a case in which an external hacking attack was successfully prevented is analyzed: the large language model examines in depth the characteristics of the attack behavior, the detection process and the response measures, and summarizes effective detection rules and response strategies, such as discovering abnormal login behavior in time and freezing the account quickly. A missed threat event is analyzed to find the shortcomings of the model in the detection process, for example a detection rule that was too loose to discover the threat in time. This experience and these lessons are fed back into the threat detection model and the response mechanism, and the parameters and response strategies of the model are adjusted. If a detection threshold is found to have been set unreasonably and to have caused missed reports, the large language model can recalculate and adjust the threshold according to the historical data. The large language model also analyzes large amounts of historical data and current threat intelligence to predict potential threat trends. New threat patterns and attack methods that may appear in the future, such as new identity theft attacks that use artificial intelligence technology to bypass detection, are predicted through time series analysis, machine learning algorithms (such as decision trees and random forests) and the knowledge reasoning capability of the large language model. Corresponding defense strategies, such as developing new detection algorithms and updating defense rules, are formulated in advance, thereby achieving active defense.
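Threshold recalculation from historical events can be illustrated with a simple sketch. The description does not state how the threshold is recalculated; the approach shown here, which picks from a list of candidate thresholds the one with the fewest missed reports plus false alarms on past events, is an assumption made for illustration, and the function name `recalibrate_threshold` is hypothetical.

```python
def recalibrate_threshold(events, candidates):
    """Pick the candidate threshold with the fewest errors on historical events.

    `events` is a list of (anomaly_score, was_threat) pairs. An event is
    flagged when its score is greater than or equal to the threshold.
    Ties are resolved in favor of the earliest candidate in the list.
    """
    def errors(threshold):
        missed = sum(1 for score, threat in events
                     if threat and score < threshold)
        false_alarms = sum(1 for score, threat in events
                           if not threat and score >= threshold)
        return missed + false_alarms

    return min(candidates, key=errors)


# Two real threats (scores 0.6 and 0.55) were missed under the old threshold of 0.8.
history = [(0.9, True), (0.6, True), (0.55, True), (0.3, False), (0.4, False)]
new_threshold = recalibrate_threshold(history, candidates=[0.8, 0.7, 0.5, 0.2])
```

In this toy history a threshold of 0.5 separates the threats from the benign events, so the overly loose setting of 0.8 is replaced without introducing false alarms.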

A user identity threat detection system based on a large language model, comprising:

The training model building module is used for building a model based on an open source deep learning framework, acquiring historical data set input and training the model to obtain a large language model;

The data acquisition module is used for customizing and developing a data acquisition crawler based on a web crawler framework, deeply analyzing the network security data platform according to the large language model, acquiring user identity threat intelligence data through the data acquisition crawler, acquiring user multi-source data of the enterprise internal system through the data acquisition crawler, compressing the user identity threat intelligence data and the multi-source data to obtain a collected data set, importing the collected data set into a data processing platform, and cleaning and labeling the data through the large language model to obtain a labeled data set;

The extraction processing module is used for analyzing, extracting and processing the tagged data set based on the cluster computing system and the large language model to obtain a characteristic data set;

The conversion module is used for converting the feature data set into a high-dimensional semantic vector based on the word vector model, and the high-dimensional semantic vector is input into the large language model to construct user behavior semantic association map data;

The calculation generation module is used for inputting the feature data set and the user behavior semantic association graph data into a large language model, calculating the large language model to obtain threat probability data and generating a threat report;

the adjusting module is used for acquiring the historical event data and inputting a large language model, and adjusting preset parameters and a real-time response mechanism of the large language model.

The foregoing is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical scheme and inventive concept, shall be covered by the scope of protection of the present invention.

Claims

1. A user identity threat detection method based on a large language model, comprising:

Building a model based on an open source deep learning framework, obtaining historical data set input and training the model to obtain a large language model, which is used to distinguish the characteristic differences between normal behavior and threatening behavior of user identities;

A data collection crawler is customized and developed based on a web crawler framework. The network security data platform is deeply analyzed according to the large language model, and user identity threat intelligence data is obtained through the data collection crawler. The data collection crawler obtains multi-source user data from the enterprise's internal system, compresses the user identity threat intelligence data and multi-source data to obtain a collection data set, imports the collection data set into the data processing platform, and cleans and annotates the data through the large language model to obtain a labeled data set.

Parsing and extracting the labeled data set based on a cluster computing system and a large language model to obtain a feature data set, wherein the feature data set includes login features, operation features, and permission features;

Based on the word vector model, the feature data set is converted into a high-dimensional semantic vector, and the high-dimensional semantic vector is input into a large language model to construct user behavior semantic association graph data;

Inputting the feature data set and user behavior semantic association graph data into a large language model, the large language model calculating threat probability data and generating a threat report;

Initiate a preset real-time response mechanism based on the threat report, the real-time response mechanism including a response mechanism preset in the enterprise's internal system and a response mechanism preset in the network security data platform;

Historical event data is acquired and input into the large language model, and the preset parameters and real-time response mechanism of the large language model are adjusted.

2. The user identity threat detection method based on a large language model according to claim 1 is characterized in that: the model is established based on an open source deep learning framework, historical data set input is obtained and the model is trained to obtain a large language model, and the large language model is used to distinguish the characteristic differences between normal behavior and threatening behavior of user identities, including:

The historical data includes the user's historical normal behavior data and threatening behavior data obtained by the network security data platform and the user's normal behavior data and threatening behavior data obtained by the enterprise's internal system;

Historical data is input into the model and trained to obtain a large language model, wherein the model training process adopts supervised learning, normal behavior data in the historical data set is marked as 0 and threatening behavior data is marked as 1, and the model is trained and optimized based on an adaptive learning rate optimization algorithm and a cross-entropy loss function to obtain a large language model, so that the large language model can distinguish the characteristic differences between normal behavior and threatening behavior of user identity.

3. The user identity threat detection method based on a large language model according to claim 2 is characterized by: the data collection crawler is customized and developed based on a web crawler framework, the network security data platform is deeply analyzed according to the large language model, and user identity threat intelligence data is obtained through the data collection crawler, the data collection crawler obtains multi-source user data from the enterprise internal system, and the user identity threat intelligence data and multi-source data are compressed to obtain a collection data set, including:

The large language model deeply analyzes the page structure, data interface specifications, and update frequency of the network security data platform, and automatically sends data requests to the network security data platform at a preset time through the data collection crawler, and obtains user identity threat intelligence data of the network security data platform worldwide, and the user identity threat intelligence data includes malicious behavior pattern data and identity theft data;

Acquiring, through the data collection crawler, multi-source data of users in the enterprise internal system within a preset time period, the multi-source data including user login time, user login log, user IP address and user operation behavior;

The user identity threat intelligence data, user login time, user login log, user IP address and user operation behavior are compressed to obtain a collection data set to improve the transmission efficiency of the collection data set.

4. The user identity threat detection method based on a large language model according to claim 3, characterized in that: importing the collected data set into a data processing platform and performing cleaning and data labeling using a large language model to obtain a labeled data set comprises:

Identify the collected data set according to the large language model and clean up format errors and redundant data;

The labeling tool based on the large language model labels the cleaned collected data set, labels the preset financial and privacy information in the collected data set as a high-sensitivity data set, and labels the preset routine business information in the collected data set as a low-sensitivity data set. The role data set is obtained according to the user's department information. The high-sensitivity data set, low-sensitivity data set and role data set form a labeled data set, and the labeled data set is used to determine the user behavior type.

5. The user identity threat detection method based on a large language model according to claim 4, wherein the step of extracting and parsing the labeled dataset based on a cluster computing system and a large language model to obtain a feature dataset comprises:

The labeled data set is parsed and extracted based on a cluster computing system and a large language model to obtain the user's precise login time, longitude and latitude, device type, file name, file type, file operation type, change time and permission content, which are integrated into a feature data set. The feature data set includes login features, operation features and permission features.

6. The user identity threat detection method based on a large language model according to claim 5, characterized in that: the method of converting the feature dataset into a high-dimensional semantic vector based on a word vector model, and inputting the high-dimensional semantic vector into the large language model to construct user behavior semantic association graph data, comprises:

Converting the user's operation features in the feature dataset into high-dimensional semantic vectors based on a word vector model;

The high-dimensional semantic vector is input into the large language model, and the cosine similarity between different high-dimensional semantic vectors is calculated based on the semantic analysis of the large language model to construct a user behavior semantic association map.

7. The user identity threat detection method based on a large language model according to claim 6, characterized in that: the inputting of the feature dataset and user behavior semantic association graph data into the large language model, the large language model calculating threat probability data and generating a threat report, comprises:

Inputting the processed feature dataset and user behavior semantic association graph data into a large language model;

If the feature data set and user behavior semantic association graph data trigger a preset abnormality threshold in the large language model, threat probability data is generated. The threat probability data is graded and classified into several levels by the large language model. The large language model generates a detailed threat report based on the graded and classified threat probability data. The threat report includes an external threat report of the network security data platform and an internal threat report of the enterprise's internal system.

Generate warning information according to the threat report, and send the warning information based on an Internet message service, wherein the warning information includes threat time, location, user and resource information.

8. The user identity threat detection method based on a large language model according to claim 7, wherein: the preset real-time response mechanism is initiated according to the threat report, and the real-time response mechanism includes a response mechanism preset in the enterprise internal system and a response mechanism preset in the network security data platform, including:

Initiating a response mechanism preset by the network security data platform based on the grading and classification corresponding to the external threat report of the network security data platform in the threat report, wherein the response mechanism preset by the network security data platform includes freezing the relevant user account through the application programming interface of the identity authentication system;

Initiating a response mechanism preset in the enterprise internal system based on the grading and classification corresponding to the internal threat report of the enterprise internal system in the threat report, wherein the response mechanism preset in the enterprise internal system includes restricting access rights through the application programming interface of the rights management system and retaining preset rights;

A detailed investigation process is initiated based on automated scripts to collect relevant operation logs and evidence.

9. The user identity threat detection method based on a large language model according to claim 8, wherein the steps of obtaining historical event data and inputting it into the large language model, and adjusting the preset parameters and real-time response mechanism of the large language model, comprise:

The historical event data includes successfully blocked threat event data and missed threat event data, and the preset parameters and real-time response mechanism of the large language model are adjusted according to the historical event data.

10. A user identity threat detection system based on a large language model, applied to the user identity threat detection method based on a large language model according to any one of claims 1 to 9, characterized in that the system comprises:

A training model establishing module, wherein the training model establishing module is used to establish a model based on an open source deep learning framework, obtain historical data set input and train the model to obtain a large language model;

A data acquisition module, wherein the data acquisition module is used to customize and develop a data acquisition crawler based on a web crawler framework, deeply analyze the network security data platform according to the large language model, and obtain user identity threat intelligence data through the data acquisition crawler. The data acquisition crawler obtains multi-source user data from the enterprise's internal system, compresses the user identity threat intelligence data and the multi-source data to obtain a collection data set, imports the collected data set into the data processing platform, cleans and annotates the data through the large language model, and obtains a labeled data set;

An extraction processing module, configured to parse and extract the labeled data set based on a cluster computing system and a large language model to obtain a feature data set;

A conversion module, which is used to convert the feature data set into a high-dimensional semantic vector based on a word vector model, and input the high-dimensional semantic vector into a large language model to construct user behavior semantic association graph data;

A calculation and generation module, the calculation and generation module is used to input the feature data set and the user behavior semantic association graph data into a large language model, the large language model calculates threat probability data and generates a threat report;

A response module, the response module is used to initiate a preset real-time response mechanism according to the threat report, the real-time response mechanism including a response mechanism preset in the enterprise internal system and a response mechanism preset in the network security data platform;

An adjustment module is used to obtain historical event data and input it into the large language model, and adjust the preset parameters and real-time response mechanism of the large language model.