
CN120316757B - A user identity threat detection method and system based on large language model - Google Patents

A user identity threat detection method and system based on large language model

Info

Publication number
CN120316757B
CN120316757B CN202510821427.6A CN202510821427A CN120316757B CN 120316757 B CN120316757 B CN 120316757B CN 202510821427 A CN202510821427 A CN 202510821427A CN 120316757 B CN120316757 B CN 120316757B
Authority
CN
China
Prior art keywords
data
language model
threat
large language
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202510821427.6A
Other languages
Chinese (zh)
Other versions
CN120316757A (en
Inventor
杨传磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Taihu Huiyun Data System Co ltd
Original Assignee
Jiangsu Taihu Huiyun Data System Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Taihu Huiyun Data System Co ltd
Priority to CN202510821427.6A
Publication of CN120316757A
Application granted
Publication of CN120316757B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/30 Authentication, i.e. establishing the identity or authorisation of security principals
    • G06F21/31 User authentication
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/20 Network architectures or network communication protocols for network security for managing network security; network security policies in general
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/40 Network security protocols

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a user identity threat detection method and system based on a large language model, in the technical field of network security. The method comprises establishing a model based on an open source deep learning framework, acquiring a historical data set, and inputting it to train the model to obtain a large language model, wherein the large language model is used for distinguishing the characteristic differences between normal behavior and threat behavior of a user identity. The invention deeply processes the acquired data set through the established large language model, acquires user identity threat data and user multi-source data from both a network security data platform and an enterprise internal system, and constructs user behavior semantic association map data from the converted high-dimensional semantic vectors. Multi-dimensional data analysis can thereby deeply reveal user identity threats, achieving accurate and efficient detection and prevention of user identity threats, effectively reducing the false alarm rate and the missed detection rate, reducing wasted effort by security management personnel, and improving the reliability of threat detection.

Description

User identity threat detection method and system based on large language model
Technical Field
The invention relates to the technical field of network security, in particular to a user identity threat detection method and system based on a large language model.
Background
User identity threats refer to potential security risks or malicious behavior with respect to user identity information in a network environment. These threats may originate from a variety of sources, such as hacking, malware, phishing websites, etc., that are intended to steal, tamper with, or abuse the identity information of the user. The invention can accurately identify potential user identity threat by constructing a large language model and carrying out deep analysis on user behaviors, thereby providing powerful support for network security protection.
In the prior art, Chinese patent No. CN202410721431.0 discloses a method for detecting abnormal behavior of database users based on a graph neural network model. Although that method learns the complex relationships between users and the behavior graph through the graph neural network model, gives real-time early warning of abnormal behavior, and improves detection accuracy through multi-dimensional feature extraction, it is mainly built around structured data from database operation scenarios, such as SQL statement analysis and data table access statistics.
Existing user identity threat detection gives insufficient consideration to multi-source data fusion, struggles to meet the analysis requirements of cross-platform user identity threats, analyzes multi-source data poorly in depth, and has limited capacity for adaptive learning in a dynamic threat environment, so its practical effectiveness is poor.
Disclosure of Invention
The invention aims to provide a user identity threat detection method and system based on a large language model, which are used for solving the problems that the consideration on the aspect of multi-source data fusion is insufficient, the analysis requirement of cross-platform user identity threat is difficult to meet and the deep analysis effect on multi-source data is poor.
In order to achieve the above purpose, the present invention provides the following technical solutions:
In a first aspect, the present invention provides a method for detecting a threat to a user identity based on a large language model, which is characterized by comprising:
establishing a model based on an open source deep learning framework, acquiring a historical data set, and inputting the historical data set to train the model to obtain a large language model, wherein the large language model is used for distinguishing characteristic differences between normal behavior and threat behavior of a user identity;
customizing and developing a data acquisition crawler based on a web crawler framework, deeply analyzing a network security data platform with the large language model, acquiring user identity threat information data through the data acquisition crawler, acquiring user multi-source data of an enterprise internal system through the data acquisition crawler, compressing the user identity threat information data and the multi-source data to obtain an acquisition data set, importing the acquisition data set into a data processing platform, and cleaning and labeling the data through the large language model to obtain a labeled data set;
analyzing, extracting and processing the labeled data set based on a cluster computing system and the large language model to obtain a feature data set, wherein the feature data set comprises login features, operation features and authority features;
converting the feature data set into high-dimensional semantic vectors based on a word vector model, and inputting the high-dimensional semantic vectors into the large language model to construct user behavior semantic association map data;
inputting the feature data set and the user behavior semantic association map data into the large language model, which calculates threat probability data and generates a threat report;
starting a preset real-time response mechanism according to the threat report, wherein the real-time response mechanism comprises a response mechanism preset by the enterprise internal system and a response mechanism preset by the network security data platform;
and acquiring historical event data, inputting the historical event data into the large language model, and adjusting the preset parameters of the large language model and the real-time response mechanism.
The invention further provides a method for establishing a model based on an open source deep learning framework, obtaining a historical data set input and training the model to obtain a large language model, wherein the large language model is used for distinguishing the characteristic difference between normal behavior and threat behavior of a user identity, and comprises the following steps:
The historical data set comprises historical normal behavior data and threat behavior data of users acquired by the network security data platform, and normal behavior data and threat behavior data of users acquired by the enterprise internal system;
and inputting the historical data set and training the model to obtain the large language model, wherein the training process adopts supervised learning: normal behavior data in the historical data set are labeled 0 and threat behavior data are labeled 1, and the model is trained and optimized based on an adaptive learning rate optimization algorithm and a cross-entropy loss function, so that the resulting large language model can distinguish the characteristic differences between normal behavior and threat behavior of a user identity.
The invention further provides a method for customizing and developing a data acquisition crawler based on a web crawler framework, deeply analyzing a web security data platform according to the large language model, acquiring user identity threat information data through the data acquisition crawler, acquiring user multi-source data of an enterprise internal system by the data acquisition crawler, and compressing the user identity threat information data and the multi-source data to obtain an acquisition data set, wherein the method comprises the following steps:
The large language model deeply analyzes the page structure, the data interface specification and the update frequency of the network security data platform; the data acquisition crawler automatically sends data requests to the network security data platform at a preset time interval and acquires worldwide user identity threat information data from the platform, wherein the user identity threat information data comprises malicious behavior pattern data and identity theft data;
acquiring and processing, through the data acquisition crawler, user multi-source data of the enterprise internal system within a preset time period, wherein the multi-source data comprises user login time, user login logs, user IP addresses and user operation behavior;
and compressing the user identity threat information data, the user login time, the user login logs, the user IP addresses and the user operation behavior to obtain the acquisition data set, thereby improving the transmission efficiency of the acquisition data set.
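The compression step can be illustrated with a minimal sketch. The patent does not name a compression algorithm or a serialization format, so the use of gzip over JSON, the record fields and the function names below are assumptions chosen only for illustration.

```python
import gzip
import json

def compress_records(records):
    """Serialize records to JSON and gzip-compress them for transfer.

    Returns both the raw and the compressed bytes so the sizes can be compared.
    """
    raw = json.dumps(records, ensure_ascii=False).encode("utf-8")
    return raw, gzip.compress(raw)

def decompress_records(blob):
    """Inverse of compress_records: gunzip and parse the JSON payload."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))

# Hypothetical collected records (field names are assumptions).
records = [
    {
        "user": "alice",
        "login_time": "2025-01-01T08:00:00",
        "ip": "10.0.0.5",
        "operation": "read report.docx",
    }
    for _ in range(200)
]

raw, packed = compress_records(records)
print(len(raw), len(packed))  # the compressed payload is smaller for repetitive logs
```

Log data of this kind is highly repetitive, which is why a general-purpose compressor shrinks it substantially before it is sent to the data processing platform.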
The invention further provides a method for importing the collected data set into a data processing platform and cleaning and labeling the collected data set through a large language model to obtain a labeled data set, comprising the following steps:
identifying the collected data set with the large language model, and cleaning out format errors and duplicate, redundant data;
labeling the cleaned collected data set with a labeling tool based on the large language model: preset financial and privacy information in the collected data set is labeled as a high-sensitivity data set, preset routine business data in the collected data set is labeled as a low-sensitivity data set, and a role data set is obtained from the department information of each user, wherein the high-sensitivity data set, the low-sensitivity data set and the role data set form the labeled data set, and the labeled data set is used for determining the behavior type of the user.
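The cleaning and labeling steps can be sketched as follows. In the patent both steps are carried out by the large language model; this sketch substitutes simple rules so that it is self-contained. The keyword list, the required fields and the function names are assumptions, not part of the patent.

```python
# Assumed presets: resources whose names contain these words count as
# financial or privacy information.
HIGH_SENSITIVITY_KEYWORDS = ("finance", "payroll", "privacy")
REQUIRED_FIELDS = ("user", "department", "resource")

def clean(records):
    """Drop malformed records and exact duplicates, keeping first occurrences."""
    seen, cleaned = set(), []
    for rec in records:
        # Format error: a required field is missing, not a string, or blank.
        if not all(isinstance(rec.get(k), str) and rec[k].strip() for k in REQUIRED_FIELDS):
            continue
        key = tuple(rec[k].strip().lower() for k in REQUIRED_FIELDS)
        if key in seen:
            continue  # duplicate, redundant data
        seen.add(key)
        cleaned.append({k: rec[k].strip() for k in REQUIRED_FIELDS})
    return cleaned

def label(rec):
    """Attach a sensitivity label and a role derived from the department."""
    resource = rec["resource"].lower()
    is_high = any(word in resource for word in HIGH_SENSITIVITY_KEYWORDS)
    return {**rec, "sensitivity": "high" if is_high else "low", "role": rec["department"]}

records = [
    {"user": "alice", "department": "Finance", "resource": "payroll_2025.xlsx"},
    {"user": "alice", "department": "Finance", "resource": "payroll_2025.xlsx"},  # duplicate
    {"user": "bob", "department": "Sales", "resource": "meeting_notes.txt"},
    {"user": "", "department": "Sales", "resource": "x"},  # format error
]
labeled = [label(r) for r in clean(records)]
```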
Further, analyzing, extracting and processing the labeled data set based on the cluster computing system and the large language model to obtain the feature data set comprises:
analyzing and extracting the labeled data set based on the cluster computing system and the large language model to obtain each user's precise login time, longitude and latitude, device type, file name, file type, file operation type, change time and authority (permission) content, and integrating them into the feature data set, wherein the feature data set comprises login features, operation features and authority features.
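One possible layout for a record of the feature data set is sketched below. The attribute names come from the list above, but the assignment of attributes to the login, operation and authority (permission) groups, the class names and the input field names are assumptions made for illustration; the cluster computing system and the large language model are not modeled.

```python
from dataclasses import dataclass, asdict

@dataclass
class LoginFeatures:
    login_time: str
    longitude: float
    latitude: float
    device_type: str

@dataclass
class OperationFeatures:
    file_name: str
    file_type: str
    operation_type: str

@dataclass
class PermissionFeatures:
    change_time: str
    permission_content: str

@dataclass
class FeatureRecord:
    user: str
    login: LoginFeatures
    operation: OperationFeatures
    permission: PermissionFeatures

def extract_features(rec):
    """Group the flat fields of one labeled record into the three feature sets."""
    return FeatureRecord(
        user=rec["user"],
        login=LoginFeatures(
            rec["login_time"], float(rec["longitude"]), float(rec["latitude"]), rec["device_type"]
        ),
        operation=OperationFeatures(rec["file_name"], rec["file_type"], rec["operation_type"]),
        permission=PermissionFeatures(rec["change_time"], rec["permission_content"]),
    )

# Hypothetical labeled record.
record = {
    "user": "alice",
    "login_time": "2025-01-01T08:00:00",
    "longitude": "120.30",
    "latitude": "31.57",
    "device_type": "laptop",
    "file_name": "payroll_2025",
    "file_type": "xlsx",
    "operation_type": "download",
    "change_time": "2024-12-30T17:45:00",
    "permission_content": "finance-read",
}
features = extract_features(record)
```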
The invention further provides a method for converting the feature data set into a high-dimensional semantic vector based on a word vector model, wherein the high-dimensional semantic vector is input into a large language model to construct user behavior semantic association graph data, and the method comprises the following steps:
converting the operation features of the users in the feature data set into high-dimensional semantic vectors based on a word vector model;
and inputting the high-dimensional semantic vector into the large language model, calculating cosine similarity between different high-dimensional semantic vectors based on semantic analysis of the large language model, and constructing a user behavior semantic association map.
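The cosine similarity between two semantic vectors $a$ and $b$ is $\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$. The sketch below computes it for every pair of operation vectors and links two operations in the association map when the similarity reaches a threshold. The three-dimensional toy vectors, the threshold of 0.8 and the function names are assumptions; the patent does not state how edges are selected, and real word vectors would have far more dimensions.

```python
import math
from itertools import combinations

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def build_association_graph(vectors, threshold=0.8):
    """Return an undirected weighted graph as a dict of dicts.

    Two operations are connected when their cosine similarity is at least
    `threshold` (an assumed edge rule).
    """
    graph = {name: {} for name in vectors}
    for u, v in combinations(vectors, 2):
        sim = cosine_similarity(vectors[u], vectors[v])
        if sim >= threshold:
            graph[u][v] = sim
            graph[v][u] = sim
    return graph

# Toy operation vectors (assumed values).
vectors = {
    "download_payroll": [0.9, 0.1, 0.0],
    "export_salary_table": [0.8, 0.2, 0.1],
    "edit_meeting_notes": [0.0, 0.1, 0.9],
}
graph = build_association_graph(vectors)
```

With these toy values the two payroll-related operations are linked, while the unrelated editing operation stays isolated.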
The invention further provides a method for inputting the feature data set and the user behavior semantic association map data into a large language model, wherein the large language model calculates threat probability data and generates a threat report, and the method comprises the following steps:
inputting the processed characteristic data set and the user behavior semantic association map data into a large language model;
if the feature data set and the user behavior semantic association map data trigger a preset anomaly threshold in the large language model, threat probability data are generated; the large language model grades and classifies the threat probability data into a plurality of levels and generates a detailed threat report from the graded threat probability data, wherein the threat report comprises an external threat report for the network security data platform and an internal threat report for the enterprise internal system;
generating early warning information according to the threat report, and sending the early warning information through an Internet-based messaging service, wherein the early warning information comprises the time, place, user and resource information of the threat.
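The thresholding, grading and early warning steps can be sketched as follows. The patent gives neither the anomaly threshold nor the number or names of the grades, so the values and names below are assumptions; the early warning fields (time, place, user, resource) follow the text.

```python
from dataclasses import dataclass

ANOMALY_THRESHOLD = 0.5  # assumed value
# Assumed grade boundaries, checked from highest to lowest.
GRADE_BOUNDS = ((0.9, "critical"), (0.7, "high"), (0.5, "medium"))

@dataclass
class Alert:
    time: str
    place: str
    user: str
    resource: str
    source: str        # "external" (security data platform) or "internal" (enterprise system)
    probability: float
    grade: str

def grade(probability):
    """Return the grade name, or None when the anomaly threshold is not reached."""
    if probability < ANOMALY_THRESHOLD:
        return None
    for lower_bound, name in GRADE_BOUNDS:
        if probability >= lower_bound:
            return name
    return None

def make_alert(event, probability):
    """Build the early warning for one event, or None if no threat is reported."""
    level = grade(probability)
    if level is None:
        return None
    return Alert(
        event["time"], event["place"], event["user"],
        event["resource"], event["source"], probability, level,
    )

event = {
    "time": "2025-01-01T03:12:00",
    "place": "unrecognized location",
    "user": "alice",
    "resource": "payroll_2025.xlsx",
    "source": "internal",
}
alert = make_alert(event, 0.93)
```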
Further, starting the preset real-time response mechanism according to the threat report, wherein the real-time response mechanism comprises a response mechanism preset by the enterprise internal system and a response mechanism preset by the network security data platform, comprises:
starting the response mechanism preset by the network security data platform according to the grade and classification of the external threat report of the network security data platform in the threat report, wherein the response mechanism preset by the network security data platform comprises freezing the relevant user account through an application programming interface of the identity authentication system;
starting the response mechanism preset by the enterprise internal system according to the grade and classification of the internal threat report of the enterprise internal system in the threat report, wherein the response mechanism preset by the enterprise internal system comprises restricting access permissions through an application programming interface of the permission management system while retaining preset permissions;
and starting a detailed investigation process based on an automation script to collect relevant operation logs and evidence.
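The dispatch between the two preset response mechanisms can be sketched as follows. The three action functions are hypothetical stand-ins that only return a description; a real deployment would call the identity authentication system and the permission management system through their own APIs, which the patent does not specify. Grade-specific handling is omitted for brevity.

```python
def freeze_account(user):
    """Hypothetical stand-in for the identity authentication system API call."""
    return f"froze account {user}"

def restrict_permissions(user, keep=("read",)):
    """Hypothetical stand-in for the permission management system API call.

    `keep` lists the preset permissions that are retained (assumed default).
    """
    return f"restricted {user} to {', '.join(keep)}"

def start_investigation(user):
    """Hypothetical stand-in for the automation script that gathers logs and evidence."""
    return f"collecting operation logs for {user}"

def respond(report):
    """Run the preset response for one threat report and return the actions taken."""
    actions = []
    if report["source"] == "external":
        actions.append(freeze_account(report["user"]))
    elif report["source"] == "internal":
        actions.append(restrict_permissions(report["user"]))
    else:
        raise ValueError(f"unknown report source: {report['source']!r}")
    actions.append(start_investigation(report["user"]))
    return actions

external_actions = respond({"source": "external", "user": "alice"})
internal_actions = respond({"source": "internal", "user": "bob"})
```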
Further, acquiring historical event data, inputting the historical event data into the large language model, and adjusting the preset parameters of the large language model and the real-time response mechanism comprises:
the historical event data comprise data on successfully blocked threat events and data on missed threat events, and the preset parameters of the large language model and the real-time response mechanism are adjusted according to the historical event data.
In a second aspect, the present invention provides a large language model-based user identity threat detection system, the system comprising:
The training model building module is used for building a model based on an open source deep learning framework, acquiring a historical data set input and training the model to obtain a large language model;
The data acquisition module is used for customizing and developing a data acquisition crawler based on a web crawler framework, deeply analyzing a network security data platform with the large language model, acquiring user identity threat information data through the data acquisition crawler, acquiring user multi-source data of an enterprise internal system through the data acquisition crawler, compressing the user identity threat information data and the multi-source data to obtain an acquisition data set, importing the acquisition data set into a data processing platform, and cleaning and labeling the data through the large language model to obtain a labeled data set;
The extraction processing module is used for analyzing, extracting and processing the tagged data set based on a cluster computing system and a large language model to obtain a characteristic data set;
the conversion module is used for converting the characteristic data set into a high-dimensional semantic vector based on a word vector model, and the high-dimensional semantic vector is input into a large language model to construct user behavior semantic association map data;
The calculation generation module is used for inputting the characteristic data set and the user behavior semantic association map data into a large language model, calculating threat probability data by the large language model and generating a threat report;
the response module is used for starting a preset real-time response mechanism according to the threat report, wherein the real-time response mechanism comprises a response mechanism preset by an enterprise internal system and a response mechanism preset by a network security data platform;
the adjustment module is used for acquiring historical event data and inputting the large language model, and adjusting preset parameters and a real-time response mechanism of the large language model.
Compared with the prior art, the invention has the beneficial effects that:
1. In the invention, the collected data set is deeply processed through the established large language model; user identity threat data and user multi-source data are acquired from both the network security data platform and the enterprise internal system; and user behavior semantic association map data are constructed from the converted high-dimensional semantic vectors. Multi-dimensional data analysis can thereby deeply reveal user identity threats, achieving accurate and efficient detection and prevention of user identity threats, effectively reducing the false alarm rate and the missed detection rate, reducing wasted effort by security management personnel, and improving the reliability of threat detection.
2. By starting the preset real-time response mechanism according to the threat report, the system can react quickly when a potential threat is detected, avoiding the delay and uncertainty of manual intervention and thereby improving the timeliness and accuracy of threat response. The response mechanism preset by the enterprise internal system and the response mechanism preset by the network security data platform work in coordination, so that threat information can be transmitted and processed efficiently across platforms, further enhancing the overall security of the system. Threat response and preset parameters are adjusted using historical event data, so that the large language model learns adaptively and keeps pace with a dynamic threat environment, providing more comprehensive and effective protection for user identity security.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Examples:
Referring to FIG. 1, in an embodiment of the present invention, a method for detecting a user identity threat based on a large language model includes:
S1, establishing a model based on an open source deep learning framework, acquiring a historical data set input, and training the model to obtain a large language model, wherein the large language model is used for distinguishing characteristic differences of normal behaviors and threat behaviors of a user identity;
S2, customizing and developing a data acquisition crawler based on a web crawler framework, deeply analyzing a network security data platform with the large language model, acquiring user identity threat information data through the data acquisition crawler, acquiring user multi-source data of an enterprise internal system through the data acquisition crawler, compressing the user identity threat information data and the multi-source data to obtain an acquisition data set, importing the acquisition data set into a data processing platform, and cleaning and labeling the data through the large language model to obtain a labeled data set;
S3, analyzing, extracting and processing the labeled data set based on a cluster computing system and a large language model to obtain a characteristic data set, wherein the characteristic data set comprises login characteristics, operation characteristics and authority characteristics;
S4, converting the feature data set into a high-dimensional semantic vector based on the word vector model, and inputting the high-dimensional semantic vector into a large language model to construct user behavior semantic association map data;
S5, inputting the feature data set and the user behavior semantic association map data into the large language model, which calculates threat probability data and generates a threat report;
S6, starting a preset real-time response mechanism according to the threat report, wherein the real-time response mechanism comprises a response mechanism preset by an enterprise internal system and a response mechanism preset by a network security data platform;
S7, acquiring historical event data, inputting the historical event data into the large language model, and adjusting the preset parameters of the large language model and the real-time response mechanism.
Specifically, in the invention, the collected data set is deeply processed through the natural language processing and knowledge reasoning capabilities of the established large language model; user identity threat data and user multi-source data are acquired from both the network security data platform and the enterprise internal system; and user behavior semantic association map data are constructed from the converted high-dimensional semantic vectors. Multi-dimensional data analysis can thereby deeply reveal user identity threats, achieving accurate and efficient detection and prevention of user identity threats, effectively reducing the false alarm rate and the missed detection rate, reducing wasted effort by security management personnel, and improving the reliability of threat detection.
Furthermore, by starting the preset real-time response mechanism according to the threat report, the system can react quickly when a potential threat is detected, avoiding the delay and uncertainty of manual intervention and thereby improving the timeliness and accuracy of threat response. The response mechanism preset by the enterprise internal system and the response mechanism preset by the network security data platform work in coordination, so that threat information can be transmitted and processed efficiently across platforms, further enhancing the overall security of the system. Threat response and preset parameters are adjusted using historical event data, so that the large language model learns adaptively and keeps pace with a dynamic threat environment, providing more comprehensive and effective protection for user identity security.
Preferably, a model is built based on an open source deep learning framework, a historical data set is acquired and input and the model is trained to obtain a large language model, and the large language model is used for distinguishing characteristic differences between normal behaviors and threat behaviors of a user identity, and the method comprises the following steps:
The model is built with the PyTorch framework, an open source deep learning framework. The model is, for example, of the GPT series, which learns language patterns and rules through unsupervised learning on large-scale text data. The historical data set comprises historical normal behavior data and threat behavior data of users acquired by the network security data platform, and normal behavior data and threat behavior data of users acquired by the enterprise internal system;
and inputting the historical data set and training the model to obtain the large language model, wherein the training process adopts supervised learning: normal behavior data in the historical data set are labeled 0 and threat behavior data are labeled 1, and the model is trained and optimized based on the adaptive learning rate optimization algorithm and the cross-entropy loss function, so that the resulting large language model can distinguish the characteristic differences between normal behavior and threat behavior of a user identity.
Specifically, the adaptive learning rate optimization algorithm comprises the Adam optimizer, which is widely used in deep learning tasks such as image classification, object detection and natural language processing, and which many deep learning frameworks implement, making it convenient to use. Normal behavior is labeled 0 and threat behavior is labeled 1; the model is trained for 100 epochs with the Adam optimizer and a cross-entropy loss function, and early stopping is used to avoid overfitting, so that the model learns the feature differences that distinguish normal behavior from threat behavior.
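The training procedure (labels 0 and 1, Adam, cross-entropy loss, at most 100 epochs, early stopping) can be illustrated with a minimal sketch. To stay runnable without PyTorch, it swaps the large language model for a one-feature logistic classifier and implements the standard Adam update by hand. The toy data, learning rate, patience value and function names are all assumptions made for illustration and are not taken from the patent.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(params, data):
    """Mean binary cross-entropy of the logistic model (w, b) on (x, y) pairs."""
    w, b = params
    tiny = 1e-12  # guards against log(0)
    total = 0.0
    for x, y in data:
        p = sigmoid(w * x + b)
        total -= y * math.log(p + tiny) + (1 - y) * math.log(1 - p + tiny)
    return total / len(data)

def gradient(params, data):
    """Gradient of the mean cross-entropy with respect to (w, b)."""
    w, b = params
    gw = gb = 0.0
    for x, y in data:
        err = sigmoid(w * x + b) - y
        gw += err * x
        gb += err
    return [gw / len(data), gb / len(data)]

def train(train_set, val_set, max_epochs=100, lr=0.1, patience=5,
          beta1=0.9, beta2=0.999, eps=1e-8):
    """Full-batch Adam with early stopping on the validation loss."""
    params, m, v = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]
    best_loss, best_params, bad_epochs = float("inf"), list(params), 0
    for t in range(1, max_epochs + 1):
        g = gradient(params, train_set)
        for i in range(2):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]       # first-moment estimate
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2  # second-moment estimate
            m_hat = m[i] / (1 - beta1 ** t)                # bias correction
            v_hat = v[i] / (1 - beta2 ** t)
            params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
        val_loss = cross_entropy(params, val_set)
        if val_loss < best_loss - 1e-6:
            best_loss, best_params, bad_epochs = val_loss, list(params), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:  # early stopping
                break
    return best_params, best_loss

# Toy feature: a normalized anomaly indicator; label 0 = normal, 1 = threat.
train_set = [(0.1, 0), (0.2, 0), (0.3, 0), (0.7, 1), (0.8, 1), (0.9, 1)]
val_set = [(0.25, 0), (0.75, 1)]
params, loss = train(train_set, val_set)
```

Keeping the parameters from the epoch with the lowest validation loss, rather than the last epoch, is what lets early stopping limit overfitting.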
Preferably, customizing and developing a data acquisition crawler based on a web crawler framework, deeply analyzing the network security data platform with the large language model, acquiring user identity threat information data through the data acquisition crawler, acquiring user multi-source data of the enterprise internal system through the data acquisition crawler, and compressing the user identity threat information data and the multi-source data to obtain an acquisition data set comprises:
the large language model deeply analyzes the page structure, the data interface specification and the update frequency of the network security data platform; the data acquisition crawler automatically sends data requests to the network security data platform at a preset time interval and acquires worldwide user identity threat information data from the platform, wherein the user identity threat information data comprises malicious behavior pattern data and identity theft data;
acquiring and processing, through the data acquisition crawler, user multi-source data of the enterprise internal system within a preset time period, wherein the multi-source data comprises user login time, user login logs, user IP addresses and user operation behavior;
and compressing the user identity threat information data, the user login time, the user login logs, the user IP addresses and the user operation behavior to obtain the acquisition data set, thereby improving the transmission efficiency of the acquisition data set and ensuring that the acquisition data set fully covers the network security data platform and the enterprise internal system and is updated within the preset time.
The web crawler framework comprises the Scrapy framework of Python, which is used to custom-develop an efficient data acquisition crawler. The large language model deeply analyzes the page structure, data interface specifications, and update frequency of the network security data platform; the network security data platform comprises Kaspersky Threat Intelligence Portal, Symantec Security Response, and the like. With its strong natural language processing capability, the large language model can accurately understand and parse the complex page structure of the network security data platform, ensuring the accuracy and completeness of data acquisition. Meanwhile, the data interface specifications of the platform are carefully studied so that the data acquisition crawler can interact with the platform efficiently and stably. In addition, the large language model closely tracks the update frequency of the platform and adjusts the data acquisition strategy in time to cope with possible changes in the page structure or data interface of the platform, ensuring the continuity and timeliness of data acquisition. The preset time is an interval of 5 to 30 minutes, preferably 10 minutes, at which the latest global user identity threat intelligence is acquired;
The data acquisition crawler data comprises an ETL tool, where ETL stands for Extract, Transform, and Load, that is, the process of extracting, transforming, and loading data. Data acquisition is carried out within a preset time period, which is from 2 a.m. to 6 a.m., during which the ETL tool collects in full the user login time, IP address, operation behavior, and other data from the enterprise internal system (such as the network access log and the rights management system). Effective use of the ETL tool improves the efficiency and accuracy of data acquisition and provides a solid data basis for subsequent user identity threat detection.
Taking Kaspersky Threat Intelligence Portal as an example, the large language model analyzes its data interface documentation and determines the parameters and format of the data request; the Scrapy scheduler is configured to send a data request to the platform automatically every 10 minutes and acquire the latest global user identity threat intelligence, including malicious behavior patterns, identity theft cases, and other data. Meanwhile, at 2 a.m., when the system load is low, the ETL tool collects in full the user login time, IP address, operation behavior, and other data from the enterprise internal system (such as the network access log and the rights management system), and data compression is used to improve transmission efficiency, ensuring full coverage and timely updating of internal and external data.
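The compression step that turns the external intelligence and the internal records into one collected data set can be illustrated as follows. The embodiment does not name a compression algorithm or a serialization format, so the use of JSON with gzip, the function names, and the sample records are all assumptions made for this sketch.

```python
import gzip
import json

def compress_collection(threat_intel, internal_records):
    """Bundle both sources into one JSON document and gzip it."""
    payload = {"threat_intel": threat_intel, "internal": internal_records}
    raw = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return raw, gzip.compress(raw)

def decompress_collection(blob):
    """Inverse of compress_collection, used on the receiving side."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))

# Hypothetical sample data standing in for crawler and ETL output.
threat_intel = [{"type": "identity_theft", "pattern": "credential stuffing"}]
internal_records = [
    {"user": f"user{i % 5}", "ip": f"10.0.0.{i % 5}", "action": "login",
     "time": "2024-10-10 02:00:00"}
    for i in range(200)
]

raw, blob = compress_collection(threat_intel, internal_records)
restored = decompress_collection(blob)
print(len(raw), len(blob))  # compressed size is much smaller for repetitive logs
```

Log-style records are highly repetitive, which is why a general-purpose compressor already gives a useful reduction before transmission.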
Preferably, importing the collected data set into a data processing platform, cleaning and labeling the collected data set through a large language model to obtain a labeled data set, including:
identifying the collected data set according to the large language model, and cleaning format errors and repeated redundant data;
the cleaned collected data set is labeled by a labeling tool based on the large language model: data in the collected data set involving preset financial and privacy information is labeled as a high-sensitivity data set, data involving preset routine business is labeled as a low-sensitivity data set, and a role data set is obtained according to the department information of the user; the high-sensitivity data set, the low-sensitivity data set, and the role data set form the labeled data set, which is used for determining the user behavior type.
Specifically, the data processing platform comprises the Hadoop platform, an open-source, scalable, distributed computing platform developed by the Apache Software Foundation. It is mainly used for processing large-scale data sets, can run on a cluster formed by a large number of commodity servers, and provides an efficient and reliable solution for storing and processing massive data;
The collected data set is imported into the Hadoop platform. The large language model, combining regular expressions and semantic analysis, identifies and cleans format errors and duplicate redundant data, and a labeling tool developed on the large language model labels the cleaned data using part-of-speech tagging and named entity recognition. Taking user operation behavior data as an example, part-of-speech tagging identifies operation verbs (such as log in, read, and modify) to determine the user behavior type, named entity recognition identifies entities such as user names and file names, and the sensitivity of the data is judged in combination with the large language model's understanding of business logic. The preset routine business comprises routine operations on common business files, which are labeled as low sensitivity to obtain the low-sensitivity data set. The user role (such as ordinary employee, department manager, or system administrator) is determined from the user's department and position information using the knowledge graph technology of the large language model to obtain the role data set, providing clear and accurate labeled data for subsequent feature extraction and analysis.
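The cleaning and labeling steps can be sketched with plain rules. This is not the model-based labeling tool of the embodiment: a regular expression and a keyword list stand in for the large language model's semantic analysis, and the field names, keywords, and sample records are hypothetical.

```python
import re

IP_RE = re.compile(r"^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$")
# Hypothetical keywords that mark financial or privacy-related files.
HIGH_SENSITIVITY_KEYWORDS = ("financial", "payroll", "id_card", "bank")

def valid_ip(ip):
    """Reject IPv4 strings that are malformed or have octets above 255."""
    match = IP_RE.match(ip)
    return bool(match) and all(int(part) <= 255 for part in match.groups())

def clean(records):
    """Drop records with format errors and exact duplicates, keeping order."""
    seen = set()
    cleaned = []
    for r in records:
        key = (r["user"], r["ip"], r["time"], r["action"], r["file"])
        if not valid_ip(r["ip"]) or key in seen:
            continue
        seen.add(key)
        cleaned.append(r)
    return cleaned

def label(record):
    """Attach a sensitivity label based on the file name."""
    name = record["file"].lower()
    high = any(keyword in name for keyword in HIGH_SENSITIVITY_KEYWORDS)
    return {**record, "sensitivity": "high" if high else "low"}

records = [
    {"user": "zhangsan", "ip": "10.0.0.8", "time": "2024-10-10 09:05:00",
     "action": "read", "file": "financial_q3_report.docx"},
    {"user": "zhangsan", "ip": "10.0.0.8", "time": "2024-10-10 09:05:00",
     "action": "read", "file": "financial_q3_report.docx"},   # duplicate
    {"user": "lisi", "ip": "10.0.0.999", "time": "2024-10-10 09:10:00",
     "action": "read", "file": "meeting_notes.txt"},           # bad IP
    {"user": "lisi", "ip": "10.0.0.9", "time": "2024-10-10 09:12:00",
     "action": "modify", "file": "meeting_notes.txt"},
]
labeled = [label(r) for r in clean(records)]
```

A keyword list is easy to audit but misses paraphrased or renamed files, which is the gap the semantic analysis in the embodiment is meant to cover.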
Preferably, the feature data set is obtained by analyzing, extracting and processing the labeled data set based on a cluster computing system and a large language model, and the method comprises the following steps:
the labeled data set is parsed and extracted based on the cluster computing system and the large language model to obtain the user's precise login time, longitude and latitude, device type, file name, file type, file operation type, change time, and permission content, which are integrated and processed into the feature data set, wherein the feature data set comprises login features, operation features, and permission features.
The cluster computing system comprises the Spark platform and the pandas library, which are used to process the labeled data set and, combined with the parsing capability of the large language model, extract basic features. The precise login time, longitude and latitude (obtained by calling the GeoPy tool), and device type are parsed from the login logs in the labeled data set. The file name, file type, and operation type are extracted from the file operation records in the labeled data set. The change time, permission content, and the like are extracted from the permission change records in the labeled data set. These are integrated into a data set containing three basic feature groups, namely login features, operation features, and permission features, which supports in-depth analysis.
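The three feature groups can be illustrated with a small standard-library sketch. It assumes the longitude and latitude have already been resolved (the embodiment calls GeoPy for that step, which is not invoked here), and the record layout and field names are hypothetical; a production pipeline would run equivalent logic on Spark or pandas.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"

def login_features(entry):
    """Login features: precise time, coordinates, and device type."""
    ts = datetime.strptime(entry["time"], TIME_FORMAT)
    return {"login_time": ts, "hour": ts.hour, "weekday": ts.weekday(),
            "lat": entry["lat"], "lon": entry["lon"],
            "device": entry["device"]}

def operation_features(entry):
    """Operation features: file name, file type, and operation type."""
    name = entry["file"]
    file_type = name.rsplit(".", 1)[-1].lower() if "." in name else ""
    return {"file_name": name, "file_type": file_type,
            "operation": entry["action"]}

def permission_features(entry):
    """Permission features: change time and permission content."""
    return {"change_time": datetime.strptime(entry["time"], TIME_FORMAT),
            "permission": entry["permission"]}

def build_feature_record(login, operation, permission):
    """Integrate the three groups into one feature record."""
    return {"login": login_features(login),
            "operation": operation_features(operation),
            "permission": permission_features(permission)}

record = build_feature_record(
    {"time": "2024-10-10 09:30:00", "lat": 31.5, "lon": 120.3,
     "device": "laptop"},
    {"file": "financial_quarterly_report.docx", "action": "read"},
    {"time": "2024-10-09 17:00:00", "permission": "finance_read"},
)
```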
Preferably, the feature data set is converted into a high-dimensional semantic vector based on the word vector model, and the high-dimensional semantic vector is input into the large language model to construct user behavior semantic association graph data, which comprises the following steps:
converting the operation features of the users in the feature data set into high-dimensional semantic vectors based on the word vector model;
the high-dimensional semantic vectors are input into a large language model, cosine similarity among different high-dimensional semantic vectors is calculated based on semantic analysis of the large language model, and a user behavior semantic association map is constructed.
The map can intuitively display the relevance and the mode between the behaviors of the user and help identify the abnormal behaviors. By setting a threshold, when the cosine similarity is lower than or higher than a preset value, an alarm mechanism is triggered to prompt the possible identity threat. In addition, the map can be dynamically updated, and the user behavior model is continuously optimized along with the addition of new data, so that the detection accuracy and timeliness are improved.
Specifically, the word vector model is a BERT-based word vector model. BERT word vector technology converts the text of the operation features in the feature data set into high-dimensional semantic vectors (for example, "user Zhang San read the file financial quarterly report.docx on October 10, 2024" generates a 768-dimensional vector). The semantic analysis capability of the large language model is then used to calculate the cosine similarity between different operation behavior vectors and construct the user behavior semantic association graph. A similarity threshold of 0.7 is set to identify associated behavior patterns: when the similarity of two operation behavior vectors exceeds the threshold, the two behaviors are considered to be strongly associated semantically. For example, if a user frequently reads financial statement files, then when a new operation of reading a finance-related file occurs, the similarity between its behavior vector and the historical behavior vectors is high, which indicates that the behavior matches the user's normal behavior pattern; conversely, if the similarity is low, an anomaly may exist. Constructing the semantic association graph reveals potential changes and associations in user behavior patterns and provides a deeper semantic feature basis for threat detection.
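The graph construction reduces to pairwise cosine similarity followed by thresholding at 0.7. The sketch below shows that step only; the three-dimensional vectors are hypothetical stand-ins for the 768-dimensional BERT vectors, and the behavior names are invented for illustration.

```python
import math

def cosine_similarity(a, b):
    """cos(a, b) = a.b / (|a| |b|); returns 0.0 if either vector is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def build_association_graph(vectors, threshold=0.7):
    """Return {(behavior_u, behavior_v): similarity} for associated pairs."""
    edges = {}
    names = sorted(vectors)
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            similarity = cosine_similarity(vectors[u], vectors[v])
            if similarity > threshold:
                edges[(u, v)] = similarity
    return edges

# Toy vectors; real ones would come from the word vector model.
vectors = {
    "read_q3_report": [1.0, 0.0, 0.0],
    "read_q4_report": [0.9, 0.1, 0.0],
    "change_permissions": [0.0, 0.0, 1.0],
}
graph = build_association_graph(vectors)
```

Here the two report-reading behaviors are linked, while the permission change has no edge to either, which is the kind of isolated node that would merit a closer look.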
A user behavior profiling algorithm is trained based on the large language model, and each user's routine login period is determined through time series analysis. For example, analysis of one month of login time data shows that user A generally logs in to the system between 9:00 and 18:00 on workdays, with the login peak concentrated between 9:00 and 10:30. The types and frequency of files accessed by the user are statistically analyzed and, combined with the large language model's semantic understanding of the file content, used to capture the user's business needs and working habits. If a user frequently accesses sales data report files and often reads and analyzes them, the user's work is probably related to sales data analysis. The user's behavior profile is refined and dynamically adjusted using the knowledge reasoning capability of the large language model in combination with the user's role and permission information. For example, a system administrator has higher permissions, and the administrator's operations may involve advanced operations such as system configuration changes and user permission management, whereas an ordinary employee has lower permissions and mainly performs daily business operations. When the user's role or permissions change, the model updates the behavior profile in real time according to the new information, so that the profile accurately reflects the user's normal behavior pattern.
The large language model performs statistical analysis on the historical data and sets personalized anomaly thresholds for each user based on the 3σ principle: a login place that deviates from the user's usual place by more than 5 degrees, or a login time that deviates from the mean by more than 3 standard deviations, is regarded as abnormal; a file access frequency exceeding 3 times the historical mean, or an abnormal operation type (such as an ordinary employee attempting to modify permissions), triggers an early warning. These thresholds serve as important reference standards for subsequent threat detection and are used to judge whether user behavior deviates from the normal pattern.
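The per-user thresholds can be sketched as three independent checks. The sketch makes several assumptions that the text does not settle: the 5-degree rule is applied separately to latitude and longitude, the population standard deviation is used, and the profile layout and sample values are hypothetical.

```python
import statistics

def anomaly_reasons(profile, event):
    """Return the list of threshold rules that the event violates."""
    reasons = []
    mean_hour = statistics.mean(profile["login_hours"])
    sd_hour = statistics.pstdev(profile["login_hours"])
    # 3-sigma rule on login time (hours as decimal numbers).
    if abs(event["login_hour"] - mean_hour) > 3 * sd_hour:
        reasons.append("login_time")
    # Location more than 5 degrees from the usual place (assumed per axis).
    home_lat, home_lon = profile["home"]
    if abs(event["lat"] - home_lat) > 5 or abs(event["lon"] - home_lon) > 5:
        reasons.append("login_location")
    # File access count above 3 times the historical daily mean.
    mean_accesses = statistics.mean(profile["daily_file_accesses"])
    if event["file_accesses"] > 3 * mean_accesses:
        reasons.append("access_frequency")
    return reasons

profile = {
    "login_hours": [9.0, 9.5, 9.25, 10.0, 9.75, 9.5],
    "home": (31.5, 120.3),
    "daily_file_accesses": [10, 12, 8, 11, 9],
}
normal_event = {"login_hour": 9.8, "lat": 31.6, "lon": 120.4,
                "file_accesses": 12}
suspicious_event = {"login_hour": 3.0, "lat": 40.0, "lon": 116.4,
                    "file_accesses": 45}

print(anomaly_reasons(profile, normal_event))
print(anomaly_reasons(profile, suspicious_event))
```

Login hours near midnight wrap around the clock, so a production version would need a circular statistic rather than a plain mean; the sketch ignores that case.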
Preferably, the feature data set and the user behavior semantic association map data are input into a large language model, threat probability data are calculated by the large language model, and a threat report is generated, including:
Inputting the processed feature data set and the user behavior semantic association map data into a large language model;
If the feature data set and the user behavior semantic association graph data trigger a preset anomaly threshold in the large language model, threat probability data is generated; the large language model grades the threat probability data into several levels and classifies it, and generates a detailed threat report from the graded and classified threat probability data, the threat report comprising an external threat report of the network security data platform and an internal threat report of the enterprise internal system;
And generating early warning information according to the threat report, and sending the early warning information based on the message service of the Internet, wherein the early warning information comprises threat time, place, user and resource information.
Specifically, the feature data set collected in real time and the user behavior semantic association graph data are input into the trained large language model, which calculates the threat probability. If a preset anomaly threshold is triggered (for example, logins from 3 different places within 10 minutes, or unauthorized access from an abnormal device), the system sends an early warning to security personnel through the Internet-based message service, which comprises an SMS gateway and a mail service. The warning includes the threat time (accurate to the second), the place (resolved from the IP address to street level), and the user and resource information, ensuring a timely response.
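The example rule "logins from 3 different places within 10 minutes" can be checked with a sliding time window. The sketch below covers only that one rule; the function name, the event layout, and the city names are hypothetical, and place resolution from IP addresses is assumed to have happened already.

```python
from datetime import datetime, timedelta

def multi_location_alert(logins, window=timedelta(minutes=10), min_places=3):
    """logins: (timestamp, place) pairs for one user, in any order.

    Returns True if at least min_places distinct places appear within
    any time span no longer than window.
    """
    events = sorted(logins)
    start = 0
    for end in range(len(events)):
        # Shrink the window until it spans at most `window`.
        while events[end][0] - events[start][0] > window:
            start += 1
        places = {place for _, place in events[start:end + 1]}
        if len(places) >= min_places:
            return True
    return False

base = datetime(2024, 10, 10, 9, 0, 0)
burst = [(base, "Wuxi"),
         (base + timedelta(minutes=4), "Beijing"),
         (base + timedelta(minutes=8), "Chengdu")]
spread = [(base, "Wuxi"),
          (base + timedelta(minutes=30), "Beijing"),
          (base + timedelta(minutes=60), "Chengdu")]

print(multi_location_alert(burst))   # logins are 8 minutes apart in total
print(multi_location_alert(spread))  # logins are an hour apart in total
```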
When a threat is detected, it is classified and graded using the large language model. According to the nature, source, and possible impact of the threat, threats are classified into categories such as identity theft by external hackers, malicious operation by internal staff, risk of permission abuse, and abnormal account sharing. The category is determined by analyzing the characteristics of the threat behavior with the text classification capability of the large language model. For example, when an external IP address initiates a large number of brute-force login attempts and the targeted account is an ordinary business account, the large language model analyzes the behavior pattern and the account characteristics and judges the event to be an identity theft threat from an external hacker. Meanwhile, the threat is graded into several levels, namely high, medium, and low, according to factors such as the sensitivity of the data involved, the scope of the affected business, and the potential economic loss. A threat that involves a risk of leaking highly sensitive data (such as customer ID card numbers and bank card information), affects the normal operation of core business, and may cause serious economic loss is judged to be a high-level threat; a threat that involves leakage of general business data, affects part of the business operation, and causes a small economic loss is judged to be a medium-level threat; a threat that involves only a small amount of abnormal operation on non-sensitive data and has little impact on the business is judged to be a low-level threat. Security administrators can then take countermeasures that match the severity of the threat.
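Classification and grading can be sketched as two small rule functions. These rules only stand in for the text classification of the large language model: the brute-force attempt count, the event fields, and the choice to take the worst of the three grading factors are assumptions, not details given in the embodiment.

```python
LEVELS = ("low", "medium", "high")
BRUTE_FORCE_ATTEMPTS = 20  # hypothetical cut-off for "a large number"

def classify_threat(event):
    """Map an event to one of the threat categories named in the text."""
    if event["source"] == "external":
        if event.get("failed_logins", 0) >= BRUTE_FORCE_ATTEMPTS:
            return "external_identity_theft"
        return "unclassified"
    if event.get("permission_change_attempt"):
        return "permission_abuse"
    return "internal_malicious_operation"

def grade_threat(data_sensitivity, business_impact, economic_loss):
    """Grade by the worst of the three factors (an assumed policy)."""
    factors = (data_sensitivity, business_impact, economic_loss)
    worst = max(LEVELS.index(factor) for factor in factors)
    return LEVELS[worst]

external_event = {"source": "external", "failed_logins": 150}
internal_event = {"source": "internal", "permission_change_attempt": True}

category = classify_threat(external_event)
level = grade_threat("high", "high", "high")
```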
The large language model generates a detailed threat report according to the threat features and related knowledge. The threat report comprises threat details (such as the number of brute-force login attempts and the content of abnormal operations), a hazard assessment (such as the impact on data security and on the business), and suggested countermeasures (such as freezing the account and tracing the IP address). For example, an external identity theft report needs to include the attribution of the attacking IP address, the permission level of the affected account, and the emergency handling procedure, helping security personnel quickly draw up a sound handling plan.
Preferably, a preset real-time response mechanism is started according to the threat report, wherein the real-time response mechanism comprises a response mechanism preset by an internal system of an enterprise and a response mechanism preset by a network security data platform, and the method comprises the following steps:
starting the response mechanism preset by the network security data platform according to the grading and classification corresponding to the external threat report of the network security data platform in the threat report, wherein the response mechanism preset by the network security data platform comprises freezing the relevant user account through the application programming interface of the identity authentication system;
starting the response mechanism preset by the enterprise internal system according to the grading and classification corresponding to the internal threat report of the enterprise internal system in the threat report, wherein the response mechanism preset by the enterprise internal system comprises restricting access permissions through the application programming interface of the rights management system while retaining preset permissions;
And starting a detailed investigation flow based on the automation script, and collecting relevant operation logs and evidences.
Specifically, after a threat warning is received, the system automatically starts the real-time response mechanism and takes different countermeasures depending on the category and severity of the threat. For a high-risk identity theft threat from an external hacker, the system immediately freezes the relevant user account through the application programming interface (API) of the identity authentication system to prevent further operation. At the same time, the SMS gateway interface is called to send an emergency notification containing the threat details to the security administrator's mobile phone, for example: "An external hacker has been found carrying out a brute-force login attack on the account of user [user name]; the account has been frozen, please handle it promptly." For a malicious operation threat from internal staff, the system restricts the employee's access permissions through the application programming interface (API) of the rights management system (such as Windows Server Active Directory rights management) and retains only the preset permissions, which comprise the basic necessary permissions (such as permission to view personal work tasks). A detailed investigation procedure is started through an automation script to collect the relevant operation logs and evidence. For example, all operation records of the employee before and after the threat are extracted from the network access log system and the business operation record system and stored in a dedicated investigation database for subsequent in-depth analysis.
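The dispatch logic of the response mechanism can be sketched as follows. Every interface here is a hypothetical stand-in: `RecordingBackend` only records the calls that a real integration would make to the identity authentication system, the rights management system, the SMS gateway, and the investigation script, none of whose actual APIs are specified in the embodiment.

```python
class RecordingBackend:
    """Stand-in for the real systems; it only records the calls made."""

    def __init__(self):
        self.calls = []

    def freeze_account(self, user):
        self.calls.append(("freeze_account", user))

    def restrict_permissions(self, user, keep):
        self.calls.append(("restrict_permissions", user, tuple(keep)))

    def notify(self, message):
        self.calls.append(("notify", message))

    def collect_evidence(self, user):
        self.calls.append(("collect_evidence", user))

# Hypothetical name for the basic permission that is retained.
BASIC_PERMISSIONS = ("view_personal_tasks",)

def respond(threat, backend):
    """Choose countermeasures from the threat category and level."""
    user = threat["user"]
    if (threat["category"] == "external_identity_theft"
            and threat["level"] == "high"):
        backend.freeze_account(user)
        backend.notify(
            f"Brute-force login attack on the account of user {user}; "
            "the account has been frozen, please handle it promptly.")
    elif threat["category"] == "internal_malicious_operation":
        backend.restrict_permissions(user, BASIC_PERMISSIONS)
    # Evidence collection runs for every handled threat.
    backend.collect_evidence(user)

external_backend = RecordingBackend()
respond({"category": "external_identity_theft", "level": "high",
         "user": "zhangsan"}, external_backend)

internal_backend = RecordingBackend()
respond({"category": "internal_malicious_operation", "level": "medium",
         "user": "lisi"}, internal_backend)
```

Passing the backend in as a parameter keeps the decision logic testable without touching production accounts.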
Preferably, the method for acquiring the historical event data and inputting the historical event data into the large language model, and adjusting the preset parameters and the real-time response mechanism of the large language model comprises the following steps:
The historical event data is successfully prevented threat event data and missed threat event data, and parameters and a real-time response mechanism preset by the large language model are adjusted according to the historical event data.
Specifically, the large language model analyzes historical events and summarizes the experience of successful defenses and the lessons of failures. For example, for a case in which an external hacker attack was successfully prevented, the large language model analyzes in depth the characteristics of the attack behavior, the detection process, and the response measures, and summarizes effective detection rules and response strategies, such as promptly discovering abnormal login behavior and quickly freezing the account. A missed threat event is analyzed to find the shortcomings of the model in the detection process, for example a detection rule that was too loose for the threat to be found in time. This experience and these lessons are fed back into the threat detection model and the response mechanism, and the parameters and response strategies of the model are adjusted accordingly. If an unreasonably set detection threshold is found to have caused a missed report, the large language model recalculates and adjusts the threshold according to the historical data. The large language model also analyzes a large amount of historical data and current threat intelligence to predict potential threat trends. Novel threat patterns and attack methods that may appear in the future, such as new identity theft attacks that use artificial intelligence to bypass detection, are predicted through time series analysis, machine learning algorithms (such as decision trees and random forests), and the knowledge reasoning capability of the large language model. Corresponding defense strategies, such as developing new detection algorithms and updating defense rules, are formulated in advance to achieve active defense.
A user identity threat detection system based on a large language model, comprising:
The training model building module is used for building a model based on an open source deep learning framework, acquiring historical data set input and training the model to obtain a large language model;
The data acquisition module is used for custom-developing a data acquisition crawler based on a web crawler framework, deeply analyzing a network security data platform according to the large language model and acquiring user identity threat intelligence data through the data acquisition crawler, acquiring user multi-source data of an enterprise internal system through the data acquisition crawler, compressing the user identity threat intelligence data and the multi-source data to obtain a collected data set, importing the collected data set into a data processing platform, and cleaning and labeling the data through the large language model to obtain a labeled data set;
The extraction processing module is used for analyzing, extracting and processing the tagged data set based on the cluster computing system and the large language model to obtain a characteristic data set;
The conversion module is used for converting the feature data set into a high-dimensional semantic vector based on the word vector model, and the high-dimensional semantic vector is input into the large language model to construct user behavior semantic association map data;
The calculation generation module is used for inputting the feature data set and the user behavior semantic association graph data into a large language model, calculating the large language model to obtain threat probability data and generating a threat report;
The response module is used for starting a preset real-time response mechanism according to the threat report, wherein the real-time response mechanism comprises a response mechanism preset by an enterprise internal system and a response mechanism preset by a network security data platform;
The adjusting module is used for acquiring historical event data, inputting it into the large language model, and adjusting the preset parameters of the large language model and the real-time response mechanism.
The foregoing is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art within the technical scope disclosed by the present invention, according to the technical scheme of the present invention and its inventive concept, shall be covered by the scope of protection of the present invention.

Claims (10)

1.一种基于大语言模型的用户身份威胁检测方法,其特征在于,包括:1. A user identity threat detection method based on a large language model, comprising: 基于开源深度学习框架建立模型,获取历史数据集输入并训练所述模型得到大语言模型,所述大语言模型用于区分用户身份的正常行为与威胁行为的特征差异;Building a model based on an open source deep learning framework, obtaining historical data set input and training the model to obtain a large language model, which is used to distinguish the characteristic differences between normal behavior and threatening behavior of user identities; 基于网络爬虫框架定制开发数据采集爬虫,根据所述大语言模型深度分析网络安全数据平台并通过所述数据采集爬虫获取用户身份威胁情报数据,所述数据采集爬虫获取企业内部系统的用户多源数据,将所述用户身份威胁情报数据和多源数据进行压缩得到采集数据集,将所述采集数据集导入数据处理平台并通过大语言模型进行清洗和数据标注得到标签化数据集;A data collection crawler is customized and developed based on a web crawler framework. The network security data platform is deeply analyzed according to the large language model, and user identity threat intelligence data is obtained through the data collection crawler. The data collection crawler obtains multi-source user data from the enterprise's internal system, compresses the user identity threat intelligence data and multi-source data to obtain a collection data set, imports the collection data set into the data processing platform, and cleans and annotates the data through the large language model to obtain a labeled data set. 
将所述标签化数据集基于集群计算系统和大语言模型解析提取处理得到特征数据集,其中,所述特征数据集包含登录特征、操作特征和权限特征;Parsing and extracting the labeled data set based on a cluster computing system and a large language model to obtain a feature data set, wherein the feature data set includes login features, operation features, and permission features; 基于词向量模型将所述特征数据集转化为高维语义向量,所述高维语义向量输入大语言模型构建用户行为语义关联图谱数据;Based on the word vector model, the feature data set is converted into a high-dimensional semantic vector, and the high-dimensional semantic vector is input into a large language model to construct user behavior semantic association graph data; 将所述特征数据集和用户行为语义关联图谱数据输入大语言模型,所述大语言模型计算得到威胁概率数据并生成威胁报告;Inputting the feature data set and user behavior semantic association graph data into a large language model, the large language model calculating threat probability data and generating a threat report; 根据所述威胁报告启动预设的实时响应机制,所述实时响应机制包括企业内部系统预设的响应机制和网络安全数据平台预设的响应机制;Initiate a preset real-time response mechanism based on the threat report, the real-time response mechanism including a response mechanism preset in the enterprise's internal system and a response mechanism preset in the network security data platform; 获取历史事件数据并输入所述大语言模型,调整所述大语言模型预设的参数和实时响应机制。Historical event data is acquired and input into the large language model, and the preset parameters and real-time response mechanism of the large language model are adjusted. 2.根据权利要求1所述的基于大语言模型的用户身份威胁检测方法,其特征在于:所述基于开源深度学习框架建立模型,获取历史数据集输入并训练所述模型得到大语言模型,所述大语言模型用于区分用户身份的正常行为与威胁行为的特征差异,包括:2. 
The user identity threat detection method based on a large language model according to claim 1 is characterized in that: the model is established based on an open source deep learning framework, historical data set input is obtained and the model is trained to obtain a large language model, and the large language model is used to distinguish the characteristic differences between normal behavior and threatening behavior of user identities, including: 所述历史数据包括网络安全数据平台获取的用户历史的正常行为数据与威胁行为和企业内部系统获取的用户的正常行为数据与威胁行为数据;The historical data includes the user's historical normal behavior data and threatening behavior data obtained by the network security data platform and the user's normal behavior data and threatening behavior data obtained by the enterprise's internal system; 将历史数据输入并训练所述模型得到大语言模型,其中,所述模型训练过程采用监督学习,将历史数据集中的正常行为数据标记为0与威胁行为数据标记为1,基于自适应学习率优化算法与交叉熵损失函数训练优化所述模型,得到大语言模型,使所述大语言模型区分用户身份的正常行为与威胁行为的特征差异。Historical data is input into the model and trained to obtain a large language model, wherein the model training process adopts supervised learning, normal behavior data in the historical data set is marked as 0 and threatening behavior data is marked as 1, and the model is trained and optimized based on an adaptive learning rate optimization algorithm and a cross-entropy loss function to obtain a large language model, so that the large language model can distinguish the characteristic differences between normal behavior and threatening behavior of user identity. 3.根据权利要求2所述的基于大语言模型的用户身份威胁检测方法,其特征在于:所述基于网络爬虫框架定制开发数据采集爬虫,根据所述大语言模型深度分析网络安全数据平台并通过所述数据采集爬虫获取用户身份威胁情报数据,所述数据采集爬虫获取企业内部系统的用户多源数据,将所述用户身份威胁情报数据和多源数据进行压缩得到采集数据集,包括:3. 
The user identity threat detection method based on a large language model according to claim 2 is characterized by: the data collection crawler is customized and developed based on a web crawler framework, the network security data platform is deeply analyzed according to the large language model, and user identity threat intelligence data is obtained through the data collection crawler, the data collection crawler obtains multi-source user data from the enterprise internal system, and the user identity threat intelligence data and multi-source data are compressed to obtain a collection data set, including: 所述大语言模型深度分析网络安全数据平台的页面结构、数据接口规范和更新频率,并通过所述数据采集爬虫在预设时间自动发送网络安全数据平台的发送数据请求,并获取全球范围内网络安全数据平台的用户身份威胁情报数据,所述用户身份威胁情报数据包括恶意行为模式数据和身份盗用数据;The large language model deeply analyzes the page structure, data interface specifications, and update frequency of the network security data platform, and automatically sends data requests to the network security data platform at a preset time through the data collection crawler, and obtains user identity threat intelligence data of the network security data platform worldwide, and the user identity threat intelligence data includes malicious behavior pattern data and identity theft data; 根据所述数据采集爬虫数据在预设时间段内处理获取企业内部系统的用户多源数据,所述多源数据包括用户登录时间、用户登录日志、用户IP地址和用户操作行为;Processing and acquiring multi-source data of users in the enterprise internal system within a preset time period based on the data acquisition crawler data, the multi-source data including user login time, user login log, user IP address and user operation behavior; 将所述用户身份威胁情报数据、用户登录时间、用户登录日志、用户IP地址和用户操作行为进行压缩得到采集数据集,以提升所述采集数据集传输效率。The user identity threat intelligence data, user login time, user login log, user IP address and user operation behavior are compressed to obtain a collection data set to improve the transmission efficiency of the collection data set. 4.根据权利要求3所述的基于大语言模型的用户身份威胁检测方法,其特征在于:所述将所述采集数据集导入数据处理平台并通过大语言模型进行清洗和数据标注得到标签化数据集,包括:4. 
The user identity threat detection method based on a large language model according to claim 3, characterized in that: importing the collected data set into a data processing platform and performing cleaning and data labeling using a large language model to obtain a labeled data set comprises: 将所述采集数据集根据大语言模型进行识别并清洗格式错误、重复冗余数据;Identify the collected data set according to the large language model and clean up format errors and redundant data; 基于所述大语言模型的标注工具将清洗后的采集数据集进行标注,将所述采集数据集中涉及预设的财务和隐私信息标注为高敏感数据集,将所述采集数据集中涉及预设常规业务为低敏感数据集,根据用户的部门信息得到角色数据集,所述高敏感数据集、低敏感数据集和角色数据集形成标签化数据集,所述标签化数据集用于确定用户行为类型。The labeling tool based on the large language model labels the cleaned collected data set, labels the preset financial and privacy information in the collected data set as a high-sensitivity data set, and labels the preset routine business information in the collected data set as a low-sensitivity data set. The role data set is obtained according to the user's department information. The high-sensitivity data set, low-sensitivity data set and role data set form a labeled data set, and the labeled data set is used to determine the user behavior type. 5.根据权利要求4所述的基于大语言模型的用户身份威胁检测方法,其特征在于:所述将所述标签化数据集基于集群计算系统和大语言模型解析提取处理得到特征数据集,包括:5. The user identity threat detection method based on a large language model according to claim 4, wherein the step of extracting and parsing the labeled dataset based on a cluster computing system and a large language model to obtain a feature dataset comprises: 所述标签化数据集基于集群计算系统和大语言模型解析提取得到用户的精确登录时间、经纬度、设备类型、文件名、文件类型、文件操作类型、变更时间和权限内容并整合成处理得到特征数据集,所述特征数据集包含登录特征、操作特征和权限特征。The labeled data set is based on a cluster computing system and a large language model to parse and extract the user's precise login time, longitude and latitude, device type, file name, file type, file operation type, change time and permission content and integrate them into a feature data set. The feature data set includes login features, operation features and permission features. 
6. The user identity threat detection method based on a large language model according to claim 5, characterized in that converting the feature data set into high-dimensional semantic vectors based on a word vector model and inputting the high-dimensional semantic vectors into the large language model to construct user behavior semantic association graph data comprises:
converting the users' operation features in the feature data set into high-dimensional semantic vectors based on the word vector model;
inputting the high-dimensional semantic vectors into the large language model, and computing the cosine similarity between different high-dimensional semantic vectors based on the semantic analysis of the large language model to construct the user behavior semantic association graph.

7. The user identity threat detection method based on a large language model according to claim 6, characterized in that inputting the feature data set and the user behavior semantic association graph data into the large language model, which calculates threat probability data and generates a threat report, comprises:
inputting the processed feature data set and the user behavior semantic association graph data into the large language model;
if the feature data set and the user behavior semantic association graph data trigger a preset anomaly threshold in the large language model, generating threat probability data, wherein the large language model grades the threat probability data into several levels and classifies it, and generates a detailed threat report from the graded and classified threat probability data, the threat report comprising an external threat report for the network security data platform and an internal threat report for the enterprise internal system;
generating warning information from the threat report and sending the warning information through an Internet-based message service, wherein the warning information comprises the threat time, location, user, and resource information.

8. The user identity threat detection method based on a large language model according to claim 7, characterized in that initiating a preset real-time response mechanism according to the threat report, the real-time response mechanism comprising a response mechanism preset for the enterprise internal system and a response mechanism preset for the network security data platform, comprises:
initiating the response mechanism preset for the network security data platform according to the level and classification of the external threat report in the threat report, this response mechanism comprising freezing the relevant user accounts through the application programming interface of the identity authentication system;
initiating the response mechanism preset for the enterprise internal system according to the level and classification of the internal threat report in the threat report, this response mechanism comprising restricting access rights through the application programming interface of the permission management system while retaining preset permissions;
and launching a detailed investigation process through automated scripts to collect relevant operation logs and evidence.

9. The user identity threat detection method based on a large language model according to claim 8, characterized in that obtaining historical event data, inputting it into the large language model, and adjusting the preset parameters of the large language model and the real-time response mechanism comprises:
the historical event data comprises data on successfully blocked threat events and data on missed threat events, and the preset parameters of the large language model and the real-time response mechanism are adjusted according to the historical event data.

10. A user identity threat detection system based on a large language model, applied to the user identity threat detection method based on a large language model according to any one of claims 1 to 9, characterized in that the system comprises:
a model training module, configured to build a model based on an open-source deep learning framework, obtain a historical data set as input, and train the model to obtain the large language model;
a data acquisition module, configured to custom-develop a data collection crawler based on a web crawler framework, deeply analyze the network security data platform with the large language model and obtain user identity threat intelligence data through the data collection crawler, obtain multi-source user data from the enterprise internal system through the data collection crawler, compress the user identity threat intelligence data and the multi-source data to obtain a collected data set, import the collected data set into a data processing platform, and clean and label it with the large language model to obtain a labeled data set;
an extraction processing module, configured to parse and extract the labeled data set based on a cluster computing system and the large language model to obtain a feature data set;
a conversion module, configured to convert the feature data set into high-dimensional semantic vectors based on a word vector model and input the high-dimensional semantic vectors into the large language model to construct user behavior semantic association graph data;
a calculation and generation module, configured to input the feature data set and the user behavior semantic association graph data into the large language model, which calculates threat probability data and generates a threat report;
a response module, configured to initiate a preset real-time response mechanism according to the threat report, the real-time response mechanism comprising a response mechanism preset for the enterprise internal system and a response mechanism preset for the network security data platform;
an adjustment module, configured to obtain historical event data, input it into the large language model, and adjust the preset parameters of the large language model and the real-time response mechanism.
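
For orientation, the cosine-similarity step recited in claim 6 can be illustrated with a small sketch. The claims attribute the computation to the large language model's semantic analysis and do not say how graph edges are chosen, so several things here are assumptions: the vectors are hand-written toy values rather than word-vector output, the similarity is computed directly with the standard formula cos(a, b) = a·b / (|a||b|), and an edge is kept only when the similarity reaches a hypothetical `threshold` parameter.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Standard cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_association_graph(vectors, threshold=0.8):
    """Return an undirected weighted graph as {node: {neighbor: similarity}}.

    `vectors` maps a behavior name to its embedding. The threshold is a
    hypothetical cut-off; the patent does not specify how edges are selected.
    """
    graph = {name: {} for name in vectors}
    for u, v in combinations(vectors, 2):
        sim = cosine_similarity(vectors[u], vectors[v])
        if sim >= threshold:
            graph[u][v] = sim
            graph[v][u] = sim
    return graph


if __name__ == "__main__":
    # Toy 3-dimensional embeddings; real word-vector output would be much larger.
    behaviors = {
        "open_report": [0.9, 0.1, 0.0],
        "edit_report": [0.8, 0.2, 0.1],
        "export_customer_db": [0.0, 0.2, 0.9],
    }
    for node, neighbors in build_association_graph(behaviors).items():
        print(node, neighbors)
```

A behavior that ends up with no neighbors, or whose neighbors differ sharply from a user's usual cluster, is the kind of signal the threat-probability step of claim 7 could consume, though the patent leaves that mapping to the model.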
CN202510821427.6A 2025-06-19 2025-06-19 A user identity threat detection method and system based on large language model Active CN120316757B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510821427.6A CN120316757B (en) 2025-06-19 2025-06-19 A user identity threat detection method and system based on large language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202510821427.6A CN120316757B (en) 2025-06-19 2025-06-19 A user identity threat detection method and system based on large language model

Publications (2)

Publication Number Publication Date
CN120316757A CN120316757A (en) 2025-07-15
CN120316757B CN120316757B (en) 2025-08-15

Family

ID=96328699

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202510821427.6A Active CN120316757B (en) 2025-06-19 2025-06-19 A user identity threat detection method and system based on large language model

Country Status (1)

Country Link
CN (1) CN120316757B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118246008A (en) * 2024-03-11 2024-06-25 浪潮卓数大数据产业发展有限公司 Data security intelligent inspection method and system based on large language model and AI-Agent
CN119094192A (en) * 2024-08-28 2024-12-06 广东电网有限责任公司东莞供电局 An AI-enhanced network security threat intelligence analysis method and system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102307632B1 (en) * 2021-05-31 2021-10-05 주식회사 아미크 Unusual Insider Behavior Detection Framework on Enterprise Resource Planning Systems using Adversarial Recurrent Auto-encoder
CN120151436A (en) * 2025-03-12 2025-06-13 哈尔滨工业大学 A large model-enhanced method and system for identifying high-risk users of telephone fraud

Also Published As

Publication number Publication date
CN120316757A (en) 2025-07-15

Similar Documents

Publication Publication Date Title
CN113923037B (en) Anomaly detection optimization device, method and system based on trusted computing
US12328330B1 (en) Alarm data processing method, apparatus, medium and electronic device
CN119047836B (en) A method, device, terminal equipment and storage medium for asset risk assessment of power monitoring system
CN120050097B (en) A network attack analysis method, system, program product, device and medium
CN117807590B (en) Information security prediction and monitoring system and method based on artificial intelligence
CN113704328B (en) User behavior big data mining method and system based on artificial intelligence
CN120528657A (en) A real-time monitoring method and system for data security incidents
CN115328934A (en) Database auditing method and device, electronic equipment and readable storage medium
CN120582899A (en) User entity behavior abnormality analysis and processing method, device, equipment and medium
CN118627516B (en) A natural language threat intelligence extraction and analysis method and system
CN120729556A (en) SecGPT-based network security alarm method, system, device and storage medium
CN120316757B (en) A user identity threat detection method and system based on large language model
CN120263442A (en) Network information security protection method based on big data
CN119941266A (en) Risk operation identification method, behavior record acquisition method and weight allocation method
CN119167358A (en) An effective network security incident monitoring method and system based on big data model
CN113888183A (en) Anti-fraud detection and analysis system based on multi-dimensional aggregated data
CN120389896B (en) DNS anomaly detection and domain name hijacking early warning method based on multi-source NLP
CN119172117B (en) A honeypot generation method for Web API
CN119675900B (en) A large-scale intrusion detection method and device based on user behavior data
US20250184335A1 (en) Security breach detection and mitigation in a cloud-based environment
CN121504630A (en) Risk prediction methods, devices, equipment, and storage media based on multimodal data
CN121000516A (en) Methods, devices, media, and program products for detecting fraudulent applications based on traffic behavior analysis
CN120930133A (en) Data flow supervision method and system
CN121502296A (en) A data security management method and system
CN121418153A (en) Abnormal behavior detection method, electronic device, storage medium, and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant