WO2019057048A1 - Low-frequency crawler identification method, device, readable storage medium and equipment - Google Patents
- Publication number
- WO2019057048A1 (PCT/CN2018/106370)
- Authority
- WO
- WIPO (PCT)
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1408—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
- H04L63/1416—Event detection, e.g. attack signature detection
- H04L63/1425—Traffic logging, e.g. anomaly detection
- H04L63/1441—Countermeasures against malicious traffic
- H04L63/145—Countermeasures against malicious traffic the attack involving the propagation of malware through the network, e.g. viruses, trojans or worms
Definitions
- This document relates to, but is not limited to, the field of Internet technology, and in particular, to a low frequency crawler identification method, apparatus, readable storage medium and device.
- The Internet is full of crawlers, and crawlers keep evolving in response to anti-crawler measures.
- The evolution of crawlers includes three stages: primary crawlers, browser crawlers, and low-frequency crawlers.
- A primary crawler does not disguise itself while crawling the target page. It can be accurately identified by features such as User-Agent and access frequency.
- A browser crawler disguises itself with the User-Agent of common browsers such as Firefox, Opera and Chrome, and behaves similarly to a normal user.
- Browser crawlers can be identified by access frequency, timeline, and similar features.
- A low-frequency crawler uses a large pool of proxy IPs to mimic ordinary users while crawling data. It is closer to an ordinary user in User-Agent, frequency, timeline and other characteristics; in particular, its access frequency is often only a single-digit number of requests per hour.
- The prior art generally identifies low-frequency crawlers by collecting a proxy IP library.
- The prior art has the following disadvantages:
- The recognition recall rate is limited by the coverage of the proxy IP library. At present there are hundreds of millions of Internet proxy IPs, and the proxy IP library can cover only a small part of them;
- Proxy IPs are not static, so the proxy IP library must be updated frequently. Customers are generally reluctant to accept online updates, while offline updates suffer from update delay.
- an embodiment of the present invention provides a low frequency crawler identification method, apparatus, readable storage medium, and device.
- The behavior characteristics include multiple of the following characteristics: average request transmission bytes, requests per unit time period, GET request number ratio, request path set space ratio, path maximum similar proportion, path maximum repeat ring ratio, Referer maximum similar proportion, dangerous user agent (UA) ratio, UA maximum similar proportion, UA collection space, 404 status code ratio, 2XX status code ratio, 5XX status code ratio, maximum similar proportion of URL type, average number of visits to similar URLs, average number of URL types, standard deviation of the proportion of HTML requests, standard deviation of the proportion of other requests, request response time, request response length, request return length, and number of page views.
- Determining the inspection rule includes: determining N target behavior characteristics, and setting a judgment logic and a threshold for each of the N target behavior characteristics;
- Determining a cluster that satisfies the corresponding inspection rule includes: calculating, for each of the N target behavior characteristics, the average value over all user IPs in the current cluster, and determining that the average values of the N target behavior characteristics satisfy the corresponding judgment logic and threshold;
- Determining the inspection rule includes: determining N target behavior characteristics, and setting a judgment logic, a weight, and a threshold for each of the N target behavior characteristics;
- Determining a cluster that satisfies the corresponding inspection rule includes: calculating, for each of the N target behavior characteristics, the average value over all user IPs in the current cluster, calculating the product of each average value and the corresponding weight, and determining that the products of the average values of the N target behavior characteristics and the corresponding weights all satisfy the corresponding judgment logic and threshold.
- Determining the inspection rule includes: determining N target behavior characteristics, and setting a judgment logic and a threshold for each of the N target behavior characteristics, together with an access threshold and/or an access interval duration;
- Determining a cluster that satisfies the corresponding inspection rule includes: calculating the average number of accesses and the average access interval of all IPs in the current cluster, and determining that the average number of accesses is greater than the access threshold and/or the average access interval is greater than the access interval duration; then calculating, for each of the N target behavior characteristics, the average value over all user IPs in the current cluster, and determining that the products of the average values of the N target behavior characteristics and the corresponding weights satisfy the corresponding judgment logic and threshold.
- Determining the N target behavior characteristics includes selecting a plurality of target behavior characteristics using a random forest algorithm or a principal component analysis algorithm.
- a feature calculation module configured to calculate, according to a network application log of each user IP, a behavior feature vector of each user IP in a preset time period
- the clustering module is configured to cluster the behavior feature vectors of each user IP to obtain a plurality of clusters
- a rule determination module configured to determine an inspection rule
- the identification module is configured to determine a cluster that satisfies the corresponding inspection rule, and each user IP in the cluster is determined to be a crawler.
- The behavior characteristics include multiple of the following characteristics: average request transmission bytes, requests per unit time period, GET request number ratio, request path set space ratio, path maximum similar proportion, path maximum repeat ring ratio, Referer maximum similar proportion, dangerous user agent (UA) ratio, UA maximum similar proportion, UA collection space, 404 status code ratio, 2XX status code ratio, 5XX status code ratio, maximum similar proportion of URL type, average number of visits to similar URLs, average number of URL types, standard deviation of the proportion of HTML requests, standard deviation of the proportion of other requests, request response time, request response length, request return length, and number of page views.
- The rule determination module is configured to determine N target behavior characteristics, and to set a judgment logic and a threshold for each of the N target behavior characteristics;
- The identification module is configured to calculate, for each of the N target behavior characteristics, the average value over all user IPs in the current cluster, and to determine that the average values of the N target behavior characteristics satisfy the corresponding judgment logic and threshold;
- The rule determination module is configured to determine N target behavior characteristics, and to set a judgment logic, a weight, and a threshold for each of the N target behavior characteristics;
- The identification module is configured to calculate, for each of the N target behavior characteristics, the average value over all user IPs in the current cluster, to calculate the product of each average value and the corresponding weight, and to determine that the products of the average values of the N target behavior characteristics and the corresponding weights all satisfy the corresponding judgment logic and threshold;
- The rule determination module is configured to determine N target behavior characteristics, and to set a judgment logic and a threshold for each of the N target behavior characteristics, together with an access threshold and/or an access interval duration;
- The identification module is configured to calculate the average number of accesses and the average access interval of all IPs in the current cluster, to determine that the average number of accesses is greater than the access threshold and/or the average access interval is greater than the access interval duration, and, for each of the N target behavior characteristics in the current cluster, to calculate the average value over all user IPs and determine that the products of the average values of the N target behavior characteristics and the corresponding weights satisfy the corresponding judgment logic and threshold.
- the computer readable storage medium provided by the embodiment of the present invention stores a computer program, and when the program is executed by the processor, the steps of the foregoing method are implemented.
- The computer device includes a memory, a processor, and a computer program stored on the memory and operable on the processor.
- When the processor executes the program, the following is implemented: calculating a behavior feature vector of each user IP in a preset time period according to the network application log of each user IP; clustering the behavior feature vectors of the user IPs to obtain a plurality of clusters; determining an inspection rule, determining a cluster that satisfies the corresponding inspection rule, and determining each user IP in the cluster as a crawler.
- The behavior characteristics include multiple of the following characteristics: average request transmission bytes, requests per unit time period, GET request number ratio, request path set space ratio, path maximum similar proportion, path maximum repeat ring ratio, Referer maximum similar proportion, dangerous user agent (UA) ratio, UA maximum similar proportion, UA collection space, 404 status code ratio, 2XX status code ratio, 5XX status code ratio, maximum similar proportion of URL type, average number of visits to similar URLs, average number of URL types, standard deviation of the proportion of HTML requests, standard deviation of the proportion of other requests, request response time, request response length, request return length, and number of page views.
- Determining the inspection rule includes: determining N target behavior characteristics, and setting a judgment logic and a threshold for each of the N target behavior characteristics;
- Determining a cluster that satisfies the corresponding inspection rule includes: calculating, for each of the N target behavior characteristics, the average value over all user IPs in the current cluster, and determining that the average values of the N target behavior characteristics satisfy the corresponding judgment logic and threshold;
- Determining the inspection rule includes: determining N target behavior characteristics, and setting a judgment logic, a weight, and a threshold for each of the N target behavior characteristics;
- Determining a cluster that satisfies the corresponding inspection rule includes: calculating, for each of the N target behavior characteristics, the average value over all user IPs in the current cluster, calculating the product of each average value and the corresponding weight, and determining that the products of the average values of the N target behavior characteristics and the corresponding weights all satisfy the corresponding judgment logic and threshold;
- Determining the inspection rule includes: determining N target behavior characteristics, and setting a judgment logic and a threshold for each of the N target behavior characteristics, together with an access threshold and/or an access interval duration;
- Determining a cluster that satisfies the corresponding inspection rule includes: calculating the average number of accesses and the average access interval of all IPs in the current cluster, and determining that the average number of accesses is greater than the access threshold and/or the average access interval is greater than the access interval duration; then calculating, for each of the N target behavior characteristics, the average value over all user IPs in the current cluster, and determining that the products of the average values of the N target behavior characteristics and the corresponding weights satisfy the corresponding judgment logic and threshold.
- FIG. 1 is a flow chart of a low-frequency crawler identification method in the embodiment;
- FIG. 2 is a structural diagram of a low-frequency crawler identification device in the embodiment;
- FIG. 3 is a structural diagram of a computer device for low-frequency crawler identification in the embodiment.
- The low-frequency crawler identification method includes:
- Step 1: calculate a behavior feature vector of each user IP in a preset time period according to the network application log of each user IP.
- Step 2: cluster the behavior feature vectors of the user IPs to obtain multiple clusters.
- Step 3: determine an inspection rule, determine a cluster that satisfies the corresponding inspection rule, and determine each user IP in that cluster as a crawler.
- The behavior features in step 1 include multiple of the following features: average request transmission bytes, requests per unit time period, GET request number ratio, request path set space ratio, path maximum similar proportion, path maximum repeat ring ratio, Referer maximum similar proportion, dangerous user agent (UA) ratio, UA maximum similar proportion, UA collection space, 404 status code ratio, 2XX status code ratio, 5XX status code ratio, maximum similar proportion of URL type, average number of visits to similar URLs, average number of URL types, standard deviation of the proportion of HTML requests, standard deviation of the proportion of other requests, request response time, request response length, request return length, and number of page views.
- the calculated behavior characteristics are sorted in a preset order to form a behavior feature vector.
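As an illustration of step 1, the sketch below computes a handful of the listed behavior features for one user IP and arranges them in a preset order. The log record fields, the feature key names, and the exact formula behind each feature (for example, treating the request path set space ratio as distinct paths divided by requests) are assumptions made for this example; the source names the features but does not define how they are calculated.

```python
from collections import Counter

# Hypothetical parsed log records for a single user IP (field names are assumptions).
records = [
    {"method": "GET", "path": "/item/1", "status": 200, "bytes": 512, "ua": "Mozilla/5.0"},
    {"method": "GET", "path": "/item/2", "status": 200, "bytes": 488, "ua": "Mozilla/5.0"},
    {"method": "POST", "path": "/login", "status": 404, "bytes": 120, "ua": "Mozilla/5.0"},
    {"method": "GET", "path": "/item/1", "status": 500, "bytes": 80, "ua": "curl/7.58"},
]

# Preset order, so that every user IP yields a vector with the same layout.
FEATURE_ORDER = [
    "avg_bytes", "get_ratio", "path_space_ratio", "ua_max_ratio",
    "ratio_404", "ratio_2xx", "ratio_5xx",
]

def behavior_features(recs):
    """Compute a few behavior features; the formulas are assumed, not from the source."""
    n = len(recs)
    ua_counts = Counter(r["ua"] for r in recs)
    return {
        "avg_bytes": sum(r["bytes"] for r in recs) / n,            # average request transmission bytes
        "get_ratio": sum(r["method"] == "GET" for r in recs) / n,  # GET request number ratio
        "path_space_ratio": len({r["path"] for r in recs}) / n,    # distinct paths / requests
        "ua_max_ratio": ua_counts.most_common(1)[0][1] / n,        # share of the most frequent UA
        "ratio_404": sum(r["status"] == 404 for r in recs) / n,
        "ratio_2xx": sum(200 <= r["status"] < 300 for r in recs) / n,
        "ratio_5xx": sum(500 <= r["status"] < 600 for r in recs) / n,
    }

def feature_vector(recs):
    """Arrange the computed features in the preset order."""
    feats = behavior_features(recs)
    return [feats[name] for name in FEATURE_ORDER]

vector = feature_vector(records)
print(vector)
```

Keeping the order in one constant means the vectors of different user IPs stay comparable when they are passed to the clustering step.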
- The clustering algorithm in step 2 may be any clustering algorithm commonly used in the prior art, such as K-Means, K-Medoids, GMM, spectral clustering, or Ncu.
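Step 2 can be sketched with a plain K-Means (Lloyd's algorithm) written with the standard library only. The two-dimensional vectors, the example IP addresses, the choice of k = 2, and the use of the first k vectors as initial centroids are all assumptions chosen to keep the example small and deterministic; a production system would more likely use an existing clustering library.

```python
import math

def kmeans(vectors, k, iterations=20):
    """Plain Lloyd's K-Means; the first k vectors seed the centroids (assumption)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical behavior feature vectors keyed by user IP.
vectors_by_ip = {
    "192.0.2.1": [0.98, 0.60],
    "192.0.2.2": [0.20, 0.10],
    "192.0.2.3": [0.99, 0.55],
    "192.0.2.4": [0.25, 0.15],
}

ips = list(vectors_by_ip)
labels, centroids = kmeans([vectors_by_ip[ip] for ip in ips], k=2)

# Group the user IPs by cluster label.
clusters = {}
for ip, label in zip(ips, labels):
    clusters.setdefault(label, []).append(ip)
print(clusters)
```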
- This method supports three identification methods.
- In the first identification method, determining the inspection rule in step 3 includes: determining N target behavior characteristics, and setting a judgment logic and a threshold for each of the N target behavior characteristics.
- Determining a cluster that satisfies the corresponding inspection rule includes: calculating, for each of the N target behavior characteristics, the average value over all user IPs in the current cluster, and determining that the average values of the N target behavior characteristics satisfy the corresponding judgment logic and threshold.
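A minimal sketch of this first kind of rule check follows. Representing a rule as a mapping from feature name to a (judgment logic, threshold) pair, and the logic names "gt", "ge", "lt", "le", are assumptions made for illustration; the feature values are invented.

```python
import operator
from statistics import mean

# Judgment logic names mapped to comparison functions (the names are assumptions).
LOGIC = {"gt": operator.gt, "ge": operator.ge, "lt": operator.lt, "le": operator.le}

def cluster_means(cluster, features):
    """Average each target feature over all user IPs in the cluster."""
    return {f: mean(ip[f] for ip in cluster) for f in features}

def satisfies_rule(cluster, rule):
    """rule maps feature -> (logic, threshold); every one of the N features must pass."""
    means = cluster_means(cluster, rule)
    return all(LOGIC[logic](means[f], threshold) for f, (logic, threshold) in rule.items())

rule = {"referer_max_ratio": ("gt", 0.95), "ratio_2xx": ("gt", 0.5)}

# Each dict holds the target behavior features of one user IP (hypothetical values).
cluster_a = [
    {"referer_max_ratio": 1.0, "ratio_2xx": 0.75},
    {"referer_max_ratio": 1.0, "ratio_2xx": 0.5},
]
cluster_b = [
    {"referer_max_ratio": 0.5, "ratio_2xx": 0.75},
    {"referer_max_ratio": 0.75, "ratio_2xx": 0.75},
]

a_is_crawler = satisfies_rule(cluster_a, rule)
b_is_crawler = satisfies_rule(cluster_b, rule)
print(a_is_crawler, b_is_crawler)
```

The decision is made on the cluster averages, so every user IP in a matching cluster is labelled together, including an individual IP whose own values would not pass the thresholds.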
- In the second identification method, determining the inspection rule in step 3 includes: determining N target behavior characteristics, and setting a judgment logic, a weight, and a threshold for each of the N target behavior characteristics.
- Determining a cluster that satisfies the corresponding inspection rule includes: calculating, for each of the N target behavior characteristics, the average value over all user IPs in the current cluster, calculating the product of each average value and the corresponding weight, and determining that the products of the average values of the N target behavior characteristics and the corresponding weights all satisfy the corresponding judgment logic and threshold.
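The weighted variant differs only in that each cluster average is multiplied by its weight before the comparison. The sketch below assumes a rule layout of feature name mapped to (comparison, weight, threshold); the weights, thresholds and feature values are invented for illustration.

```python
import operator
from statistics import mean

def satisfies_weighted_rule(cluster, rule):
    """rule maps feature -> (compare, weight, threshold); all products must pass."""
    for feature, (compare, weight, threshold) in rule.items():
        # Product of the cluster average and the feature's weight.
        weighted = mean(ip[feature] for ip in cluster) * weight
        if not compare(weighted, threshold):
            return False
    return True

# Hypothetical weights and thresholds.
weighted_rule = {
    "referer_max_ratio": (operator.gt, 2.0, 1.5),
    "ratio_404": (operator.lt, 4.0, 1.0),
}

cluster_a = [
    {"referer_max_ratio": 1.0, "ratio_404": 0.125},
    {"referer_max_ratio": 0.75, "ratio_404": 0.125},
]
cluster_b = [
    {"referer_max_ratio": 0.5, "ratio_404": 0.125},
    {"referer_max_ratio": 0.5, "ratio_404": 0.125},
]

a_is_crawler = satisfies_weighted_rule(cluster_a, weighted_rule)  # 0.875 * 2.0 = 1.75 and 0.125 * 4.0 = 0.5
b_is_crawler = satisfies_weighted_rule(cluster_b, weighted_rule)  # 0.5 * 2.0 = 1.0 is not greater than 1.5
print(a_is_crawler, b_is_crawler)
```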
- The third identification method is as follows.
- Determining the inspection rule in step 3 includes: determining N target behavior characteristics, and setting a judgment logic and a threshold for each of the N target behavior characteristics, together with an access threshold and/or an access interval duration.
- Determining a cluster that satisfies the corresponding inspection rule includes: calculating the average number of accesses and the average access interval of all IPs in the current cluster, and determining that the average number of accesses is greater than the access threshold and/or the average access interval is greater than the access interval duration; then calculating, for each of the N target behavior characteristics, the average value over all user IPs in the current cluster, and determining that the products of the average values of the N target behavior characteristics and the corresponding weights satisfy the corresponding judgment logic and threshold.
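The third method adds an access gate in front of the weighted feature check. In the sketch below the "and/or" is modelled by optional parameters: whichever of the two limits is configured must be exceeded. That reading, the field names `visits` and `interval_s`, and all numeric values are assumptions.

```python
import operator
from statistics import mean

def passes_access_gate(cluster, access_threshold=None, interval_duration=None):
    """Check only the limits that are configured (one reading of the 'and/or')."""
    if access_threshold is not None and mean(ip["visits"] for ip in cluster) <= access_threshold:
        return False
    if interval_duration is not None and mean(ip["interval_s"] for ip in cluster) <= interval_duration:
        return False
    return True

def satisfies_third_rule(cluster, feature_rule, access_threshold=None, interval_duration=None):
    """Access gate first, then the weighted check on the N target features."""
    if not passes_access_gate(cluster, access_threshold, interval_duration):
        return False
    return all(
        compare(mean(ip[feature] for ip in cluster) * weight, threshold)
        for feature, (compare, weight, threshold) in feature_rule.items()
    )

feature_rule = {"referer_max_ratio": (operator.gt, 1.0, 0.95)}

# Hypothetical clusters: access count and mean interval (seconds) per user IP.
slow_cluster = [
    {"visits": 6, "interval_s": 600, "referer_max_ratio": 1.0},
    {"visits": 8, "interval_s": 900, "referer_max_ratio": 1.0},
]
sparse_cluster = [
    {"visits": 2, "interval_s": 600, "referer_max_ratio": 1.0},
    {"visits": 3, "interval_s": 900, "referer_max_ratio": 1.0},
]

slow_is_crawler = satisfies_third_rule(
    slow_cluster, feature_rule, access_threshold=5, interval_duration=300)
sparse_is_crawler = satisfies_third_rule(
    sparse_cluster, feature_rule, access_threshold=5, interval_duration=300)
print(slow_is_crawler, sparse_is_crawler)
```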
- The method for determining the N target behavior characteristics comprises: selecting N target behavior features by using a random forest algorithm or a principal component analysis algorithm.
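The source does not say how principal component analysis is turned into a feature choice. One possible reading, sketched below, ranks the features by the absolute size of their loading on the first principal component and keeps the top N. The power-iteration implementation, the function name `top_features_by_pca`, and the sample data are all assumptions; in practice features are usually standardized first, and a numerical library would replace the hand-written linear algebra.

```python
def top_features_by_pca(rows, names, n, iterations=100):
    """Rank features by absolute loading on the first principal component."""
    m, d = len(rows), len(names)
    means = [sum(r[j] for r in rows) / m for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the features.
    cov = [
        [sum(c[i] * c[j] for c in centered) / (m - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration converges to the dominant eigenvector of the covariance matrix.
    vec = [1.0] * d
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in nxt) ** 0.5
        if norm == 0:
            break
        vec = [x / norm for x in nxt]
    ranked = sorted(range(d), key=lambda j: abs(vec[j]), reverse=True)
    return [names[j] for j in ranked[:n]]

# Hypothetical behavior feature vectors, one row per user IP.
names = ["referer_max_ratio", "ratio_2xx", "ratio_5xx"]
rows = [
    [0.1, 0.2, 0.5],
    [0.9, 0.6, 0.5],
    [0.2, 0.3, 0.5],
    [0.8, 0.5, 0.5],
]

selected = top_features_by_pca(rows, names, n=2)
print(selected)
```

In this sample the third feature is constant across user IPs, so it carries no variance and is ranked last.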
- In one example, the inspection rule uses three target behavior characteristics: the Referer maximum similar proportion, the request path set space ratio, and the 2XX status code ratio.
- The judgment logic for the Referer maximum similar proportion is "greater than", and the threshold is 95%.
- The judgment logic for the request path set space ratio is "greater than", and the threshold is 50%.
- The judgment logic for the 2XX status code ratio is "greater than", and the threshold is 50%.
- The averages of the three target behavior characteristics over all user IPs are calculated for each of the two clusters.
- In the first cluster, the averages of the three target behavior characteristics are 100%, 50%, and 50%, respectively.
- In the second cluster, the averages of the three target behavior characteristics are 80%, 40%, and 50%, respectively.
- The second cluster does not satisfy the inspection rule, so all user IPs in that cluster are normal users.
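The numbers of this example can be checked mechanically. The sketch below encodes the three thresholds and the two sets of cluster averages from the text, then evaluates them under a strict comparison and under an inclusive one. The feature key names are assumptions. The first cluster sits exactly on the 50% thresholds, so its result depends on whether the comparison includes equality, which the text does not settle; the second cluster fails either way.

```python
import operator

# Thresholds from the example (95%, 50%, 50%).
THRESHOLDS = {"referer_max_ratio": 0.95, "path_space_ratio": 0.50, "ratio_2xx": 0.50}

# Cluster averages from the example.
CLUSTER_MEANS = {
    "first": {"referer_max_ratio": 1.00, "path_space_ratio": 0.50, "ratio_2xx": 0.50},
    "second": {"referer_max_ratio": 0.80, "path_space_ratio": 0.40, "ratio_2xx": 0.50},
}

def passes(means, compare):
    """True when every feature average passes its threshold under `compare`."""
    return all(compare(means[f], t) for f, t in THRESHOLDS.items())

results = {
    name: {"strict": passes(m, operator.gt), "inclusive": passes(m, operator.ge)}
    for name, m in CLUSTER_MEANS.items()
}
print(results)
```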
- On the software interface, options for the various behavior features, options for the various clustering algorithms, a display item for data security, and a display item for crawler threats are provided.
- The behavior features and the clustering algorithm may be selected according to usage requirements.
- The clusters obtained by clustering may be displayed on the software interface.
- Each cluster is drawn with its own area, and the size of each cluster corresponds to the number of user IPs in the cluster.
- The area of each cluster also changes as the user IPs in the cluster change.
- The crawling condition of the current system is evaluated to determine whether the system is in a data security state or a crawler threat state, and the result is indicated at the corresponding display item.
- Fig. 2 is a structural diagram of a low frequency crawler identification device in the embodiment.
- the low frequency crawler identification device includes a feature calculation module, a clustering module, a rule determination module, and an identification module.
- a feature calculation module configured to calculate, according to a network application log of each user IP, a behavior feature vector of each user IP in a preset time period
- the clustering module is configured to cluster the behavior feature vectors of each user IP to obtain a plurality of clusters
- a rule determination module configured to determine an inspection rule
- the identification module is configured to determine a cluster that satisfies the corresponding inspection rule, and each user IP in the cluster is determined to be a crawler.
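The division into four modules can be pictured as a small pipeline in which each module is a pluggable callable. The class name `CrawlerIdentifier`, the callable signatures, and the stub modules below are placeholders invented for illustration; they only show how the modules hand data to one another and are not real detection logic.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Vector = List[float]

@dataclass
class CrawlerIdentifier:
    compute_features: Callable[[Dict[str, list]], Dict[str, Vector]]  # feature calculation module
    cluster: Callable[[Dict[str, Vector]], List[List[str]]]           # clustering module
    determine_rule: Callable[[], Callable[[List[Vector]], bool]]      # rule determination module

    def identify(self, logs_by_ip):
        """Identification module: every user IP in a matching cluster is a crawler."""
        vectors = self.compute_features(logs_by_ip)
        rule = self.determine_rule()
        crawlers = []
        for members in self.cluster(vectors):
            if rule([vectors[ip] for ip in members]):
                crawlers.extend(members)
        return crawlers

# Stub modules (placeholders): one feature per IP, a fixed two-way split, a mean check.
identifier = CrawlerIdentifier(
    compute_features=lambda logs: {ip: [float(len(reqs))] for ip, reqs in logs.items()},
    cluster=lambda vecs: [
        [ip for ip, v in vecs.items() if v[0] >= 3],
        [ip for ip, v in vecs.items() if v[0] < 3],
    ],
    determine_rule=lambda: (lambda vs: bool(vs) and sum(v[0] for v in vs) / len(vs) >= 3),
)

logs = {
    "192.0.2.1": ["/a", "/b", "/c"],
    "192.0.2.2": ["/a"],
    "192.0.2.3": ["/a", "/b", "/c", "/d"],
}
crawlers = identifier.identify(logs)
print(crawlers)
```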
- The behavior characteristics include multiple of the following characteristics: average request transmission bytes, requests per unit time period, GET request number ratio, request path set space ratio, path maximum similar proportion, path maximum repeat ring ratio, Referer maximum similar proportion, dangerous user agent (UA) ratio, UA maximum similar proportion, UA collection space, 404 status code ratio, 2XX status code ratio, 5XX status code ratio, maximum similar proportion of URL type, average number of visits to similar URLs, average number of URL types, standard deviation of the proportion of HTML requests, standard deviation of the proportion of other requests, request response time, request response length, request return length, and number of page views.
- This device supports three recognition methods.
- In the first recognition method, the rule determination module is configured to determine N target behavior characteristics, and to set a judgment logic and a threshold for each of the N target behavior characteristics;
- The identification module is configured to calculate, for each of the N target behavior characteristics, the average value over all user IPs in the current cluster, and to determine that the average values of the N target behavior characteristics satisfy the corresponding judgment logic and threshold.
- In the second recognition method, the rule determination module is configured to determine N target behavior characteristics, and to set a judgment logic, a weight, and a threshold for each of the N target behavior characteristics;
- The identification module is configured to calculate, for each of the N target behavior characteristics, the average value over all user IPs in the current cluster, to calculate the product of each average value and the corresponding weight, and to determine that the products of the average values of the N target behavior characteristics and the corresponding weights all satisfy the corresponding judgment logic and threshold.
- The third recognition method is as follows.
- The rule determination module is configured to determine N target behavior characteristics, and to set a judgment logic and a threshold for each of the N target behavior characteristics, together with an access threshold and/or an access interval duration;
- The identification module is configured to calculate the average number of accesses and the average access interval of all IPs in the current cluster, to determine that the average number of accesses is greater than the access threshold and/or the average access interval is greater than the access interval duration, and, for each of the N target behavior characteristics in the current cluster, to calculate the average value over all user IPs and determine that the products of the average values of the N target behavior characteristics and the corresponding weights satisfy the corresponding judgment logic and threshold.
- the rule determination module is further configured to select N target behavior features using a random forest algorithm or a principal component analysis algorithm.
- the embodiment of the present invention further provides a computer readable storage medium, where the computer program is stored on the storage medium, and the steps of the foregoing method are implemented when the program is executed by the processor.
- FIG. 3 is a structural diagram of a computer device for low-frequency crawler identification in an embodiment. The computer device includes a memory, a processor, and a computer program stored on the memory and operable on the processor. When executing the program, the processor implements the following: calculating the behavior feature vector of each user IP in the preset time period according to the network application log of each user IP; clustering the behavior feature vectors of the user IPs to obtain multiple clusters; determining the inspection rule, determining a cluster that satisfies the corresponding inspection rule, and determining each user IP in that cluster as a crawler.
- The behavior characteristics include multiple of the following characteristics: average request transmission bytes, requests per unit time period, GET request number ratio, request path set space ratio, path maximum similar proportion, path maximum repeat ring ratio, Referer maximum similar proportion, dangerous user agent (UA) ratio, UA maximum similar proportion, UA collection space, 404 status code ratio, 2XX status code ratio, 5XX status code ratio, maximum similar proportion of URL type, average number of visits to similar URLs, average number of URL types, standard deviation of the proportion of HTML requests, standard deviation of the proportion of other requests, request response time, request response length, request return length, and number of page views.
- Determining the inspection rule includes: determining N target behavior characteristics, and setting a judgment logic and a threshold for each of the N target behavior characteristics;
- Determining a cluster that satisfies the corresponding inspection rule includes: calculating, for each of the N target behavior characteristics, the average value over all user IPs in the current cluster, and determining that the average values of the N target behavior characteristics satisfy the corresponding judgment logic and threshold.
- Determining the inspection rule includes: determining N target behavior characteristics, and setting a judgment logic, a weight, and a threshold for each of the N target behavior characteristics;
- Determining a cluster that satisfies the corresponding inspection rule includes: calculating, for each of the N target behavior characteristics, the average value over all user IPs in the current cluster, calculating the product of each average value and the corresponding weight, and determining that the products of the average values of the N target behavior characteristics and the corresponding weights all satisfy the corresponding judgment logic and threshold;
- Determining the inspection rule includes: determining N target behavior characteristics, and setting a judgment logic and a threshold for each of the N target behavior characteristics, together with an access threshold and/or an access interval duration;
- Determining a cluster that satisfies the corresponding inspection rule includes: calculating the average number of accesses and the average access interval of all IPs in the current cluster, and determining that the average number of accesses is greater than the access threshold and/or the average access interval is greater than the access interval duration; then calculating, for each of the N target behavior characteristics, the average value over all user IPs in the current cluster, and determining that the products of the average values of the N target behavior characteristics and the corresponding weights satisfy the corresponding judgment logic and threshold.
- the embodiment of the invention has the following advantages:
- computer storage medium includes volatile and nonvolatile, implemented in any method or technology for storing information, such as computer readable instructions, data structures, program modules or other data. Sex, removable and non-removable media.
- Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disc (DVD) or other optical disc storage, magnetic cartridge, magnetic tape, magnetic disk storage or other magnetic storage device, or may Any other medium used to store the desired information and that can be accessed by the computer.
- communication media typically includes computer readable instructions, data structures, program modules or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and can include any information delivery media. .
- The low-frequency crawler identification method, apparatus, readable storage medium and device can effectively identify low-frequency crawlers. Data modeling is based on user behavior and requires no manual analysis or configuration; unsupervised clustering automatically identifies various deep-level threats and can address gang threats, low-frequency threats, associated threats and persistent threats that traditional security products cannot recognize.
Landscapes
- Engineering & Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- Computer Hardware Design (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Virology (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Disclosed herein are a low-frequency crawler identification method, a device, a readable storage medium and equipment, the method comprising: computing a behavior feature vector of each user IP within a preset time period according to the network application log of each user IP; clustering the behavior feature vectors of the user IPs to acquire a plurality of clusters; and determining an inspection rule, determining a cluster that meets the corresponding inspection rule, and determining each user IP in that cluster to be a crawler. The embodiments of the present invention can effectively identify low-frequency crawlers and can address group threats, low-frequency threats, associated threats and persistent threats, which traditional security products cannot identify. Public cloud and private cloud deployment are supported, and threats can be identified and blocked without changing the network topology or embedding any code; custom blocking interfaces can be connected, and even a complete shutdown of the deployment environment in an extreme case will not affect the normal operation of the original service.
Description
This application claims priority to Chinese Patent Application No. 201710857222.9, entitled "Low-Frequency Crawler Identification Method and Device", filed with the Chinese Patent Office on September 20, 2017, the entire contents of which are incorporated herein by reference.
This document relates to, but is not limited to, the field of Internet technology, and in particular to a low-frequency crawler identification method, apparatus, readable storage medium and device.
The Internet is flooded with crawlers, and crawlers keep evolving as anti-crawler measures develop. Their evolution spans three stages: primary crawlers, browser crawlers and low-frequency crawlers. A primary crawler crawls target pages without disguising itself and can be identified accurately by features such as User-Agent and request frequency. A browser crawler disguises its User-Agent as that of a browser such as Firefox, Opera or Chrome and behaves much like a normal user, but it can still be identified by features such as access frequency and timeline. A low-frequency crawler uses a large pool of proxy IPs to imitate ordinary users while crawling data; its User-Agent, frequency and timeline are all closer to those of ordinary users, and in particular its access frequency is often only a single-digit number of requests per hour.
The prior art generally identifies low-frequency crawlers by collecting a proxy IP library. This approach has the following disadvantages:
(1) The recall rate is limited by the coverage of the proxy IP library. There are currently hundreds of millions of proxy IPs on the Internet, and a proxy IP library can cover only a very small part of them;
(2) Proxy IPs are not static, so the proxy IP library must be updated frequently. Customers are generally reluctant to accept online updates, while offline updates suffer from update delays;
(3) Proxy IPs obtained by disconnecting and redialing, or by multi-dialing, on ADSL and residential broadband lines are harder to detect, and such IPs are also used by many real users, so a proxy IP library faces problems such as mistaken blocking and inaccurate identification.
Summary of the Invention
The following is an overview of the subject matter described in detail herein. This overview is not intended to limit the scope of protection of the claims.
To solve the above technical problems, embodiments of the present invention provide a low-frequency crawler identification method, apparatus, readable storage medium and device.
The low-frequency crawler identification method provided by an embodiment of the present invention includes:
calculating a behavior feature vector of each user IP within a preset time period according to the network application log of each user IP; clustering the behavior feature vectors of the user IPs to obtain a plurality of clusters; and determining an inspection rule, determining a cluster that satisfies the corresponding inspection rule, and determining each user IP in that cluster to be a crawler.
The above low-frequency crawler identification method also has the following features:
The behavior characteristics include a plurality of the following: average number of bytes sent per request, number of requests per unit time period, proportion of GET requests, proportion of the request path set space, maximum path similarity proportion, maximum repeated path loop proportion, maximum Referer similarity proportion, proportion of dangerous user agents (UA), maximum UA similarity proportion, UA set space, proportion of 404 status codes, proportion of 2XX status codes, proportion of 5XX status codes, maximum URL type similarity proportion, average number of visits to URLs of the same type, average number of URL types, standard deviation of the proportion of HTML requests, standard deviation of the proportion of other requests, request response time, request response length, request return length, and page views.
The above low-frequency crawler identification method also has the following features:
Determining the inspection rule includes: determining N target behavior characteristics, and setting the judgment logic and threshold corresponding to each of the N target behavior characteristics;
Determining a cluster that satisfies the corresponding inspection rule includes: for each of the N target behavior characteristics, calculating the average value over all user IPs in the current cluster, and determining that the averages of all N target behavior characteristics satisfy the corresponding judgment logic and thresholds;
or,
Determining the inspection rule includes: determining N target behavior characteristics, and setting the judgment logic, weight and threshold corresponding to each of the N target behavior characteristics;
Determining a cluster that satisfies the corresponding inspection rule includes: for each of the N target behavior characteristics, calculating the average value over all user IPs in the current cluster, calculating the product of that average and the corresponding weight, and determining that the products for all N target behavior characteristics satisfy the corresponding judgment logic and thresholds.
The above low-frequency crawler identification method also has the following features:
Determining the inspection rule includes: determining N target behavior characteristics, and setting the judgment logic and threshold corresponding to each of the N target behavior characteristics, together with an access count threshold and/or an access interval duration;
Determining a cluster that satisfies the corresponding inspection rule includes: calculating the average access count and the average access interval of all IPs in the current cluster; after determining that the average access count is greater than the access count threshold and/or that the average access interval is greater than the access interval duration, calculating, for each of the N target behavior characteristics, the average value over all user IPs in the current cluster, and determining that the products of the averages of the N target behavior characteristics and the corresponding weights all satisfy the corresponding judgment logic and thresholds.
The above low-frequency crawler identification method also has the following features:
Determining the N target behavior characteristics includes: selecting the N target behavior characteristics using a random forest algorithm or a principal component analysis algorithm.
The low-frequency crawler identification apparatus provided by an embodiment of the present invention includes:
a feature calculation module, configured to calculate a behavior feature vector of each user IP within a preset time period according to the network application log of each user IP;
a clustering module, configured to cluster the behavior feature vectors of the user IPs to obtain a plurality of clusters;
a rule determination module, configured to determine an inspection rule;
an identification module, configured to determine a cluster that satisfies the corresponding inspection rule and to determine each user IP in that cluster to be a crawler.
The above low-frequency crawler identification apparatus also has the following features:
The behavior characteristics include a plurality of the following: average number of bytes sent per request, number of requests per unit time period, proportion of GET requests, proportion of the request path set space, maximum path similarity proportion, maximum repeated path loop proportion, maximum Referer similarity proportion, proportion of dangerous user agents (UA), maximum UA similarity proportion, UA set space, proportion of 404 status codes, proportion of 2XX status codes, proportion of 5XX status codes, maximum URL type similarity proportion, average number of visits to URLs of the same type, average number of URL types, standard deviation of the proportion of HTML requests, standard deviation of the proportion of other requests, request response time, request response length, request return length, and page views.
The above low-frequency crawler identification apparatus also has the following features:
The rule determination module is configured to determine N target behavior characteristics and to set the judgment logic and threshold corresponding to each of the N target behavior characteristics;
The identification module is configured to determine a cluster that satisfies the corresponding inspection rule by: for each of the N target behavior characteristics, calculating the average value over all user IPs in the current cluster, and determining that the averages of all N target behavior characteristics satisfy the corresponding judgment logic and thresholds;
or,
The rule determination module is configured to determine N target behavior characteristics and to set the judgment logic, weight and threshold corresponding to each of the N target behavior characteristics;
The identification module is configured to calculate, for each of the N target behavior characteristics, the average value over all user IPs in the current cluster, to calculate the product of that average and the corresponding weight, and to determine that the products for all N target behavior characteristics satisfy the corresponding judgment logic and thresholds;
or,
The rule determination module is configured to determine N target behavior characteristics and to set the judgment logic and threshold corresponding to each of the N target behavior characteristics, together with an access count threshold and/or an access interval duration;
The identification module is configured to calculate the average access count and the average access interval of all IPs in the current cluster and, after determining that the average access count is greater than the access count threshold and/or that the average access interval is greater than the access interval duration, to calculate, for each of the N target behavior characteristics, the average value over all user IPs in the current cluster, and to determine that the products of the averages of the N target behavior characteristics and the corresponding weights all satisfy the corresponding judgment logic and thresholds.
The computer-readable storage medium provided by an embodiment of the present invention stores a computer program which, when executed by a processor, implements the steps of the above method.
The computer device provided by an embodiment of the present invention includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When executing the program, the processor implements the following: calculating a behavior feature vector of each user IP within a preset time period according to the network application log of each user IP; clustering the behavior feature vectors of the user IPs to obtain a plurality of clusters; and determining an inspection rule, determining a cluster that satisfies the corresponding inspection rule, and determining each user IP in that cluster to be a crawler.
The above computer device also has the following features:
The behavior characteristics include a plurality of the following: average number of bytes sent per request, number of requests per unit time period, proportion of GET requests, proportion of the request path set space, maximum path similarity proportion, maximum repeated path loop proportion, maximum Referer similarity proportion, proportion of dangerous user agents (UA), maximum UA similarity proportion, UA set space, proportion of 404 status codes, proportion of 2XX status codes, proportion of 5XX status codes, maximum URL type similarity proportion, average number of visits to URLs of the same type, average number of URL types, standard deviation of the proportion of HTML requests, standard deviation of the proportion of other requests, request response time, request response length, request return length, and page views.
Determining the inspection rule includes: determining N target behavior characteristics, and setting the judgment logic and threshold corresponding to each of the N target behavior characteristics;
Determining a cluster that satisfies the corresponding inspection rule includes: for each of the N target behavior characteristics, calculating the average value over all user IPs in the current cluster, and determining that the averages of all N target behavior characteristics satisfy the corresponding judgment logic and thresholds;
or,
Determining the inspection rule includes: determining N target behavior characteristics, and setting the judgment logic, weight and threshold corresponding to each of the N target behavior characteristics;
Determining a cluster that satisfies the corresponding inspection rule includes: for each of the N target behavior characteristics, calculating the average value over all user IPs in the current cluster, calculating the product of that average and the corresponding weight, and determining that the products for all N target behavior characteristics satisfy the corresponding judgment logic and thresholds;
Determining the inspection rule includes: determining N target behavior characteristics, and setting the judgment logic and threshold corresponding to each of the N target behavior characteristics, together with an access count threshold and/or an access interval duration;
Determining a cluster that satisfies the corresponding inspection rule includes: calculating the average access count and the average access interval of all IPs in the current cluster; after determining that the average access count is greater than the access count threshold and/or that the average access interval is greater than the access interval duration, calculating, for each of the N target behavior characteristics, the average value over all user IPs in the current cluster, and determining that the products of the averages of the N target behavior characteristics and the corresponding weights all satisfy the corresponding judgment logic and thresholds.
The embodiments of the present invention have the following advantages:
(1) Low-frequency crawlers can be identified effectively.
(2) Data modeling is based on user behavior and requires no manual analysis or configuration; unsupervised clustering automatically identifies various deep-level threats, addressing gang threats, low-frequency threats, associated threats, persistent threats and other threats that traditional security products cannot identify.
(3) Public cloud and private cloud deployment are supported. Threats can be identified and blocked without changing the network topology or embedding any code, and custom blocking interfaces can be connected. In the extreme case, even if the entire deployment environment loses power, the normal operation of the original service is not affected.
The drawings described here provide a further understanding of the embodiments of the present invention and form part of this application. The illustrative embodiments and their descriptions explain the embodiments of the present invention and do not unduly limit them. In the drawings:
Figure 1 is a flowchart of the low-frequency crawler identification method in an embodiment;
Figure 2 is a structural diagram of the low-frequency crawler identification apparatus in an embodiment;
Figure 3 is a structural diagram of a computer device for low-frequency crawler identification in an embodiment.
The embodiments of the present invention are further described below with reference to the drawings and specific implementations.
Figure 1 is a flowchart of the low-frequency crawler identification method in an embodiment. The method includes:
Step 1: calculating a behavior feature vector of each user IP within a preset time period according to the network application log of each user IP;
Step 2: clustering the behavior feature vectors of the user IPs to obtain a plurality of clusters;
Step 3: determining an inspection rule, determining a cluster that satisfies the corresponding inspection rule, and determining each user IP in that cluster to be a crawler.
Here,
the behavior characteristics in step 1 include a plurality of the following: average number of bytes sent per request, number of requests per unit time period, proportion of GET requests, proportion of the request path set space, maximum path similarity proportion, maximum repeated path loop proportion, maximum Referer similarity proportion, proportion of dangerous user agents (User Agent, UA), maximum UA similarity proportion, UA set space, proportion of 404 status codes, proportion of 2XX status codes, proportion of 5XX status codes, maximum URL type similarity proportion, average number of visits to URLs of the same type, average number of URL types, standard deviation of the proportion of HTML requests, standard deviation of the proportion of other requests, request response time, request response length, request return length, and page views.
For example:
Behavior characteristic | Value |
Average number of bytes sent per request | 3128 |
Number of requests | 291 |
Proportion of GET requests | 100% |
Maximum UA similarity proportion | 100% |
Maximum Referer similarity proportion | 100% |
Proportion of the request path set space | 56% |
Proportion of 2XX status codes | 50% |
Maximum URL type similarity proportion | 49% |
Average number of URL types | 28.68 |
Standard deviation of the proportion of HTML requests | 0.02 |
Standard deviation of the proportion of other requests | 0 |
Average number of visits to URLs of the same type | 0 |
The calculated behavior characteristics are arranged in a preset order to form the behavior feature vector.
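As a minimal sketch of step 1, the snippet below groups log records by user IP, derives a handful of the listed characteristics, and arranges them in a fixed order. The record layout `(ip, method, path, status, bytes_sent)`, the feature names in `FEATURE_ORDER`, and the reading of "request path set space" as distinct paths divided by requests are all assumptions made for illustration; the source does not define them.

```python
from collections import defaultdict

# Hypothetical, simplified log record: (ip, method, path, status, bytes_sent).
# The preset order of the features inside each vector (names are illustrative).
FEATURE_ORDER = ["avg_bytes_sent", "request_count", "get_ratio",
                 "path_set_ratio", "status_2xx_ratio", "status_404_ratio"]

def feature_vectors(records):
    """Group log records by IP and return {ip: [features in FEATURE_ORDER]}."""
    by_ip = defaultdict(list)
    for ip, method, path, status, bytes_sent in records:
        by_ip[ip].append((method, path, status, bytes_sent))

    vectors = {}
    for ip, rows in by_ip.items():
        n = len(rows)
        features = {
            "avg_bytes_sent": sum(r[3] for r in rows) / n,
            "request_count": n,
            "get_ratio": sum(r[0] == "GET" for r in rows) / n,
            # Assumed definition: distinct paths divided by number of requests.
            "path_set_ratio": len({r[1] for r in rows}) / n,
            "status_2xx_ratio": sum(200 <= r[2] < 300 for r in rows) / n,
            "status_404_ratio": sum(r[2] == 404 for r in rows) / n,
        }
        vectors[ip] = [features[name] for name in FEATURE_ORDER]
    return vectors

logs = [
    ("10.0.0.1", "GET", "/a", 200, 100),
    ("10.0.0.1", "GET", "/b", 404, 300),
    ("10.0.0.2", "POST", "/login", 200, 50),
]
vectors = feature_vectors(logs)
print(vectors)
```

Keeping the order in one list means every user IP yields a vector with the same layout, which is what the clustering step needs.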
The clustering algorithm in step 2 may be any clustering algorithm commonly used in the prior art, for example K-Means, K-Medoids, GMM, spectral clustering or Ncu.
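To make step 2 concrete, here is a small standard-library sketch of K-Means (Lloyd's algorithm), one of the algorithms named above. It is not the patent's implementation: the function names, the fixed seed and the toy vectors are assumptions, and it presumes the features have already been scaled to comparable ranges.

```python
import random

def nearest(vector, centroids):
    """Index of the centroid closest to `vector` (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda c: sum((x - y) ** 2 for x, y in zip(vector, centroids[c])))

def kmeans(vectors, k, iterations=50, seed=0):
    """Cluster equal-length numeric vectors; returns one cluster index per vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(iterations):
        new_labels = [nearest(v, centroids) for v in vectors]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the previous centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy behavior feature vectors: two IPs with near-identical behavior, two others.
vectors = [[1.0, 0.5], [0.98, 0.52], [0.2, 0.1], [0.25, 0.12]]
labels = kmeans(vectors, k=2)
print(labels)
```

User IPs that share a label form one cluster, and each cluster is then evaluated as a whole in step 3.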
This method supports three identification methods.
The first:
Determining the inspection rule in step 3 includes: determining N target behavior characteristics, and setting the judgment logic and threshold corresponding to each of the N target behavior characteristics. Determining a cluster that satisfies the corresponding inspection rule includes: for each of the N target behavior characteristics, calculating the average value over all user IPs in the current cluster, and determining that the averages of all N target behavior characteristics satisfy the corresponding judgment logic and thresholds.
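The first identification method can be sketched as a per-feature comparison of cluster averages against thresholds. The rule layout, the logic names `"greater"` and `"less"`, and the sample values are hypothetical; the source only says that each target characteristic has a judgment logic and a threshold.

```python
import operator

# Judgment logic names mapped to comparison functions (names are illustrative).
LOGIC = {"greater": operator.gt, "less": operator.lt}

def cluster_means(cluster, names):
    """Average each named feature over all user IPs (dicts) in the cluster."""
    return {n: sum(ip[n] for ip in cluster) / len(cluster) for n in names}

def satisfies_rule(cluster, rule):
    """rule: {feature_name: (logic_name, threshold)}; every feature must pass."""
    means = cluster_means(cluster, rule)
    return all(LOGIC[logic](means[name], threshold)
               for name, (logic, threshold) in rule.items())

# Hypothetical cluster of two user IPs with two target behavior characteristics.
cluster = [{"get_ratio": 1.0, "status_404_ratio": 0.02},
           {"get_ratio": 0.96, "status_404_ratio": 0.04}]
rule = {"get_ratio": ("greater", 0.95), "status_404_ratio": ("less", 0.10)}
print(satisfies_rule(cluster, rule))
```

Because the decision is made on cluster averages, a single IP whose values look ordinary is still flagged when the cluster it belongs to satisfies the rule.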
The second:
Determining the inspection rule in step 3 includes: determining N target behavior characteristics, and setting the judgment logic, weight and threshold corresponding to each of the N target behavior characteristics. Determining a cluster that satisfies the corresponding inspection rule includes: for each of the N target behavior characteristics, calculating the average value over all user IPs in the current cluster, calculating the product of that average and the corresponding weight, and determining that the products for all N target behavior characteristics satisfy the corresponding judgment logic and thresholds.
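For the second identification method, the only change is that each cluster average is multiplied by its weight before the comparison. The sketch below assumes a rule of the form `{feature: (weight, logic, threshold)}`; the weights, thresholds and feature names are made up for illustration.

```python
def satisfies_weighted_rule(cluster, rule):
    """rule: {feature: (weight, logic, threshold)}.

    For each feature, average it over the cluster's user IPs, multiply the
    average by the weight, and compare the product with the threshold.
    The cluster satisfies the rule only if every feature passes.
    """
    for name, (weight, logic, threshold) in rule.items():
        mean = sum(ip[name] for ip in cluster) / len(cluster)
        product = mean * weight
        passed = product > threshold if logic == "greater" else product < threshold
        if not passed:
            return False
    return True

# Hypothetical cluster and weighted rule.
cluster = [{"referer_similarity": 1.0, "path_set_ratio": 0.6},
           {"referer_similarity": 0.9, "path_set_ratio": 0.5}]
rule = {"referer_similarity": (2.0, "greater", 1.8),
        "path_set_ratio": (0.5, "greater", 0.25)}
print(satisfies_weighted_rule(cluster, rule))
```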
The third:
Determining the inspection rule in step 3 includes: determining N target behavior characteristics, and setting the judgment logic and threshold corresponding to each of the N target behavior characteristics, together with an access count threshold and/or an access interval duration. Determining a cluster that satisfies the corresponding inspection rule includes: calculating the average access count and the average access interval of all IPs in the current cluster; after determining that the average access count is greater than the access count threshold and/or that the average access interval is greater than the access interval duration, calculating, for each of the N target behavior characteristics, the average value over all user IPs in the current cluster, and determining that the products of the averages of the N target behavior characteristics and the corresponding weights all satisfy the corresponding judgment logic and thresholds.
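The third identification method adds a gate in front of the per-feature check. In this sketch the field names `access_count` and `access_interval_s` are hypothetical, an unset limit means that condition is not configured, and when both limits are set the code takes the "and" reading of the source's "and/or", which is an assumption.

```python
def gate_passes(cluster, count_threshold=None, interval_duration=None):
    """Pre-check on the cluster's average access count and/or access interval."""
    n = len(cluster)
    checks = []
    if count_threshold is not None:
        avg_count = sum(ip["access_count"] for ip in cluster) / n
        checks.append(avg_count > count_threshold)
    if interval_duration is not None:
        avg_interval = sum(ip["access_interval_s"] for ip in cluster) / n
        checks.append(avg_interval > interval_duration)
    # At least one limit must be configured, and every configured one must hold.
    return bool(checks) and all(checks)

def identify(cluster, feature_rule, **gate):
    """Run the gate first; evaluate the per-feature rule only if it passes."""
    if not gate_passes(cluster, **gate):
        return False
    return feature_rule(cluster)

# Hypothetical cluster and a one-feature rule expressed as a callable.
cluster = [{"access_count": 6, "access_interval_s": 700, "get_ratio": 1.0},
           {"access_count": 4, "access_interval_s": 500, "get_ratio": 0.98}]

def get_ratio_rule(c):
    return sum(ip["get_ratio"] for ip in c) / len(c) > 0.95

print(identify(cluster, get_ratio_rule, count_threshold=3, interval_duration=300))
```

Passing the feature rule as a callable lets the same gate sit in front of either the plain or the weighted check.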
In this method, determining the N target behavior characteristics includes: selecting the N target behavior characteristics using a random forest algorithm or a principal component analysis algorithm.
Specific embodiment:
The network application logs of each user IP within a given month are collected, and the behavior feature vector of each user IP for that month is calculated. The behavior feature vectors of the user IPs are clustered to obtain two clusters.
The inspection rule is determined as follows: the three target behavior characteristics are the maximum Referer similarity proportion, the proportion of the request path set space, and the proportion of 2XX status codes.
For the maximum Referer similarity proportion, the judgment logic is "greater than" and the threshold is 95%.
For the proportion of the request path set space, the judgment logic is "greater than" and the threshold is 50%.
For the proportion of 2XX status codes, the judgment logic is "greater than" and the threshold is 50%.
For each of the two clusters, the averages of these three target behavior characteristics over all user IPs are calculated. In the first cluster the averages are 100%, 50% and 50% respectively, so the first cluster satisfies the inspection rule and all user IPs in it are crawlers. In the second cluster the averages are 80%, 40% and 50% respectively, so the second cluster does not satisfy the inspection rule and all user IPs in it are normal users.
Software implementing this method provides selection items for the various behavior characteristics, selection items for the various clustering algorithms, a display item indicating data security and a display item indicating a crawler threat. When using the software, the user selects the behavior characteristics and the clustering algorithm as needed. After the method is executed, the software interface displays the number of clusters obtained. The clusters are drawn with different areas, the area of each cluster corresponding to the number of user IPs in it, and as the computation proceeds the area of each cluster changes with the user IPs it contains. Based on the result of the method, the software determines whether the current system is in a data-secure state or a crawler-threat state and indicates this at the corresponding display item.
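The two-cluster example in this embodiment can be checked numerically. The averages and thresholds below are the ones stated in the example, expressed in percent; the dictionary keys are hypothetical names. One assumption deserves attention: the first cluster's averages of 50% sit exactly on the 50% thresholds yet the cluster is reported as satisfying the rule, so this sketch treats the comparison as inclusive and shows the strict reading for contrast.

```python
import operator

# Thresholds and cluster averages from the example, in percent.
thresholds = {"referer_max_similarity": 95, "path_set_ratio": 50, "status_2xx_ratio": 50}
cluster_1 = {"referer_max_similarity": 100, "path_set_ratio": 50, "status_2xx_ratio": 50}
cluster_2 = {"referer_max_similarity": 80, "path_set_ratio": 40, "status_2xx_ratio": 50}

def satisfies(averages, compare):
    """True when every cluster average passes `compare` against its threshold."""
    return all(compare(averages[name], limit) for name, limit in thresholds.items())

# Inclusive comparison (an assumption) reproduces the outcome stated in the example.
print("cluster 1, inclusive:", satisfies(cluster_1, operator.ge))
print("cluster 2, inclusive:", satisfies(cluster_2, operator.ge))
# A strict comparison rejects cluster 1, because 50 is not greater than 50.
print("cluster 1, strict:", satisfies(cluster_1, operator.gt))
```

When configuring thresholds in practice, it is worth deciding explicitly whether boundary values should count as satisfying the judgment logic.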
Figure 2 is a structural diagram of the low-frequency crawler identification apparatus in an embodiment. The apparatus includes a feature calculation module, a clustering module, a rule determination module and an identification module.
The feature calculation module is configured to calculate a behavior feature vector of each user IP within a preset time period according to the network application log of each user IP;
The clustering module is configured to cluster the behavior feature vectors of the user IPs to obtain a plurality of clusters;
The rule determination module is configured to determine an inspection rule;
The identification module is configured to determine a cluster that satisfies the corresponding inspection rule and to determine each user IP in that cluster to be a crawler.
Here,
the behavior characteristics include a plurality of the following: average number of bytes sent per request, number of requests per unit time period, proportion of GET requests, proportion of the request path set space, maximum path similarity proportion, maximum repeated path loop proportion, maximum Referer similarity proportion, proportion of dangerous user agents (UA), maximum UA similarity proportion, UA set space, proportion of 404 status codes, proportion of 2XX status codes, proportion of 5XX status codes, maximum URL type similarity proportion, average number of visits to URLs of the same type, average number of URL types, standard deviation of the proportion of HTML requests, standard deviation of the proportion of other requests, request response time, request response length, request return length, and page views.
The device supports three identification modes.
First mode:
The rule determination module is configured to determine N target behavior features and to set the judgment logic and threshold corresponding to each of the N target behavior features;
the identification module is configured to identify a cluster that satisfies the corresponding inspection rule by: calculating, for each of the N target behavior features, the average value over all user IPs in the current cluster, and determining that the averages of all N target behavior features satisfy the corresponding judgment logic and thresholds.
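The first mode can be sketched as follows. This is a minimal illustration, assuming that the "judgment logic" is a comparison operator and that each IP's feature vector is a dictionary; the function names and the rule format are hypothetical.

```python
import operator
from statistics import mean

# Judgment logic is modeled as a comparison operator; this mapping is an assumption.
COMPARATORS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


def cluster_means(cluster, features):
    """Average each target feature over all IP vectors in the cluster."""
    return {f: mean(vec[f] for vec in cluster) for f in features}


def cluster_is_crawler(cluster, rules):
    """rules maps a feature name to (comparator symbol, threshold).

    The cluster is flagged only when every target feature's average
    satisfies its judgment logic and threshold.
    """
    means = cluster_means(cluster, rules.keys())
    return all(COMPARATORS[op](means[f], thr) for f, (op, thr) in rules.items())
```

For example, with rules `{"get_ratio": (">", 0.7)}` a cluster whose average GET proportion is 0.75 would be flagged, and every user IP in it would be treated as a crawler.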
Second mode:
The rule determination module is configured to determine N target behavior features and to set the judgment logic, weight, and threshold corresponding to each of the N target behavior features;
the identification module is configured to calculate, for each of the N target behavior features, the average value over all user IPs in the current cluster, to calculate the product of each average and its corresponding weight, and to determine that the products for all N target behavior features satisfy the corresponding judgment logic and thresholds.
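A sketch of the second mode differs from the first only in that each average is multiplied by its weight before the comparison. As before, the comparison-operator reading of "judgment logic", the rule format, and the function name are assumptions made for illustration.

```python
import operator
from statistics import mean

# Judgment logic is modeled as a comparison operator; this mapping is an assumption.
COMPARATORS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


def weighted_cluster_is_crawler(cluster, rules):
    """rules maps a feature name to (weight, comparator symbol, threshold).

    For every target feature, the cluster average times the weight must
    satisfy the judgment logic and threshold.
    """
    for feature, (weight, op, threshold) in rules.items():
        weighted = mean(vec[feature] for vec in cluster) * weight
        if not COMPARATORS[op](weighted, threshold):
            return False
    return True
```

Because each feature still has its own threshold, the weight here acts as a per-feature scaling factor rather than as a term in a combined score.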
Third mode:
The rule determination module is configured to determine N target behavior features and to set the judgment logic and threshold corresponding to each of the N target behavior features, together with an access count threshold and/or an access interval duration;
the identification module is configured to calculate the average access count and the average access interval over all IPs in the current cluster and, after determining that the average access count is greater than the access count threshold and/or the average access interval is greater than the access interval duration, to calculate, for each of the N target behavior features, the average value over all user IPs in the current cluster, and to determine that the product of each average and its corresponding weight satisfies the corresponding judgment logic and threshold.
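The third mode adds a gate on access count and access interval before the feature checks. The sketch below is one possible reading: the "and/or" is modeled by a `combine` argument (`all` or `any`) applied to whichever gates are configured, and, since this mode's rule lists no weights, the feature averages are compared unweighted. The record layout and all names are hypothetical.

```python
import operator
from statistics import mean

# Judgment logic is modeled as a comparison operator; this mapping is an assumption.
COMPARATORS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


def gated_cluster_is_crawler(cluster, rules, min_accesses=None, min_interval=None, combine=all):
    """Apply the access gates first, then the per-feature threshold checks.

    Each cluster member is a dict with "access_count", "avg_interval_s",
    and "features" (a dict of behavior features); this layout is assumed.
    """
    gates = []
    if min_accesses is not None:
        gates.append(mean(ip["access_count"] for ip in cluster) > min_accesses)
    if min_interval is not None:
        gates.append(mean(ip["avg_interval_s"] for ip in cluster) > min_interval)
    if gates and not combine(gates):
        return False
    return all(
        COMPARATORS[op](mean(ip["features"][f] for ip in cluster), thr)
        for f, (op, thr) in rules.items()
    )
```

Passing `combine=any` gives the "or" reading, in which a cluster proceeds to the feature checks as soon as either gate is exceeded.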
The rule determination module is further configured to select the N target behavior features using a random forest algorithm or a principal component analysis algorithm.
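The source does not say how principal component analysis is turned into a choice of N features. One possible reading, sketched below with the standard library only, is to rank features by the magnitude of their loading on the first principal component and keep the top N. The function names are hypothetical, and the power iteration used here is a simple stand-in for a full PCA routine.

```python
from statistics import mean


def covariance_matrix(rows):
    """Sample covariance matrix of the feature columns (rows are IP vectors)."""
    n, d = len(rows), len(rows[0])
    mus = [mean(r[j] for r in rows) for j in range(d)]
    centered = [[r[j] - mus[j] for j in range(d)] for r in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def first_component(cov, iterations=200):
    """Approximate the leading eigenvector by power iteration.

    The uniform start vector is an assumption; it fails if it happens to be
    orthogonal to the leading eigenvector.
    """
    d = len(cov)
    v = [1.0 / d ** 0.5] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def select_features(rows, names, n):
    """Keep the n features with the largest absolute loading on the first component."""
    loadings = first_component(covariance_matrix(rows))
    ranked = sorted(zip(names, loadings), key=lambda p: abs(p[1]), reverse=True)
    return [name for name, _ in ranked[:n]]
```

In practice the features would usually be standardized first, since otherwise features with large numeric ranges dominate the ranking.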
An embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the steps of the above method are implemented.
Fig. 3 is a structural diagram of the computer device for low-frequency crawler identification in the embodiment. The computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When executing the program, the processor implements the following: calculating, from the network application log of each user IP, a behavior feature vector for that user IP within a preset time period; clustering the behavior feature vectors of the user IPs to obtain a plurality of clusters; and determining inspection rules, identifying a cluster that satisfies the corresponding inspection rule, and determining every user IP in that cluster to be a crawler.
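The overall flow (feature vectors, clustering, rule check per cluster) can be sketched end to end as follows. The source does not name a clustering algorithm, so plain k-means is used here purely as an assumption; the function names and the callback used for the inspection rule are hypothetical.

```python
from statistics import mean


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=50):
    """Plain Lloyd's k-means; the first k vectors seed the centroids (deterministic)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [mean(col) for col in zip(*members)]
    return labels


def identify_crawlers(ip_vectors, k, is_crawler_cluster):
    """Cluster the per-IP vectors and flag every IP in each cluster that passes the rule."""
    ips = list(ip_vectors)
    labels = kmeans([ip_vectors[ip] for ip in ips], k)
    flagged = []
    for c in range(k):
        members = [ip for ip, lab in zip(ips, labels) if lab == c]
        if members and is_crawler_cluster([ip_vectors[ip] for ip in members]):
            flagged.extend(members)
    return flagged
```

The decision is made per cluster rather than per IP, which is what lets a group of individually quiet IPs be flagged together.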
The behavior features include several of the following: average bytes sent per request, number of requests per unit time period, proportion of GET requests, proportion of the request path set space, maximum path similarity proportion, maximum repeated path loop proportion, maximum Referer similarity proportion, proportion of dangerous user agents (UA), maximum UA similarity proportion, UA set space, proportion of 404 status codes, proportion of 2XX status codes, proportion of 5XX status codes, maximum URL type similarity proportion, average number of accesses to URLs of the same type, average number of URL types, standard deviation of the proportion of HTML requests, standard deviation of the proportion of other requests, request response time, request response length, request return length, and page views.
Determining the inspection rules includes: determining N target behavior features, and setting the judgment logic and threshold corresponding to each of the N target behavior features;
identifying a cluster that satisfies the corresponding inspection rule includes: calculating, for each of the N target behavior features, the average value over all user IPs in the current cluster, and determining that the averages of all N target behavior features satisfy the corresponding judgment logic and thresholds.
Or,
determining the inspection rules includes: determining N target behavior features, and setting the judgment logic, weight, and threshold corresponding to each of the N target behavior features;
identifying a cluster that satisfies the corresponding inspection rule includes: calculating, for each of the N target behavior features, the average value over all user IPs in the current cluster, calculating the product of each average and its corresponding weight, and determining that the products for all N target behavior features satisfy the corresponding judgment logic and thresholds;
or,
determining the inspection rules includes: determining N target behavior features, and setting the judgment logic and threshold corresponding to each of the N target behavior features, together with an access count threshold and/or an access interval duration;
identifying a cluster that satisfies the corresponding inspection rule includes: calculating the average access count and the average access interval over all IPs in the current cluster and, after determining that the average access count is greater than the access count threshold and/or the average access interval is greater than the access interval duration, calculating, for each of the N target behavior features, the average value over all user IPs in the current cluster, and determining that the product of each average and its corresponding weight satisfies the corresponding judgment logic and threshold.
Compared with the prior art, the embodiments of the present invention have the following advantages:
(1) Low-frequency crawlers can be identified effectively.
(2) Data modeling is based on user behavior and requires no manual analysis or configuration. Unsupervised clustering automatically identifies a variety of deep-level threats, addressing gang threats, low-frequency threats, associated threats, persistent threats, and similar threats that traditional security products cannot identify.
(3) Deployment on a public or private cloud is supported. Threats can be identified and blocked without changing the network topology or embedding any code, and custom blocking interfaces can be connected. In the extreme case, even if the entire deployment environment loses power, normal operation of the original service is not affected.
Those of ordinary skill in the art should understand that modifications or equivalent substitutions may be made to the technical solutions of the present invention without departing from their spirit and scope, and all such modifications and substitutions shall fall within the scope of the claims.
Those of ordinary skill in the art will appreciate that all or some of the steps of the methods disclosed above, and the functional modules/units of the systems and devices, may be implemented as software, firmware, hardware, or suitable combinations thereof. In a hardware implementation, the division between the functional modules/units mentioned above does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed cooperatively by several physical components. Some or all of the components may be implemented as software executed by a processor, such as a digital signal processor or a microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information (such as computer-readable instructions, data structures, program modules, or other data). Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disc (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. In addition, as is well known to those of ordinary skill in the art, communication media typically contain computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.
The low-frequency crawler identification method, device, readable storage medium, and equipment described herein can effectively identify low-frequency crawlers. Data modeling is based on user behavior and requires no manual analysis or configuration, and unsupervised clustering automatically identifies a variety of deep-level threats, addressing gang threats, low-frequency threats, associated threats, persistent threats, and similar threats that traditional security products cannot identify.
Claims (12)
- A low-frequency crawler identification method, comprising: calculating, from the network application log of each user IP, a behavior feature vector for that user IP within a preset time period; clustering the behavior feature vectors of the user IPs to obtain a plurality of clusters; and determining inspection rules, identifying a cluster that satisfies the corresponding inspection rule, and determining every user IP in that cluster to be a crawler.
- The low-frequency crawler identification method according to claim 1, wherein the behavior features include several of the following: average bytes sent per request, number of requests per unit time period, proportion of GET requests, proportion of the request path set space, maximum path similarity proportion, maximum repeated path loop proportion, maximum Referer similarity proportion, proportion of dangerous user agents (UA), maximum UA similarity proportion, UA set space, proportion of 404 status codes, proportion of 2XX status codes, proportion of 5XX status codes, maximum URL type similarity proportion, average number of accesses to URLs of the same type, average number of URL types, standard deviation of the proportion of HTML requests, standard deviation of the proportion of other requests, request response time, request response length, request return length, and page views.
- The low-frequency crawler identification method according to claim 1, wherein determining the inspection rules comprises: determining N target behavior features, and setting the judgment logic and threshold corresponding to each of the N target behavior features; and identifying a cluster that satisfies the corresponding inspection rule comprises: calculating, for each of the N target behavior features, the average value over all user IPs in the current cluster, and determining that the averages of all N target behavior features satisfy the corresponding judgment logic and thresholds; or, determining the inspection rules comprises: determining N target behavior features, and setting the judgment logic, weight, and threshold corresponding to each of the N target behavior features; and identifying a cluster that satisfies the corresponding inspection rule comprises: calculating, for each of the N target behavior features, the average value over all user IPs in the current cluster, calculating the product of each average and its corresponding weight, and determining that the products for all N target behavior features satisfy the corresponding judgment logic and thresholds.
- The low-frequency crawler identification method according to claim 1, wherein determining the inspection rules comprises: determining N target behavior features, and setting the judgment logic and threshold corresponding to each of the N target behavior features, together with an access count threshold and/or an access interval duration; and identifying a cluster that satisfies the corresponding inspection rule comprises: calculating the average access count and the average access interval over all IPs in the current cluster and, after determining that the average access count is greater than the access count threshold and/or the average access interval is greater than the access interval duration, calculating, for each of the N target behavior features, the average value over all user IPs in the current cluster, and determining that the product of each average and its corresponding weight satisfies the corresponding judgment logic and threshold.
- The low-frequency crawler identification method according to claim 3 or 4, wherein determining the N target behavior features comprises: selecting the N target behavior features using a random forest algorithm or a principal component analysis algorithm.
- A low-frequency crawler identification device, comprising: a feature calculation module configured to calculate, from the network application log of each user IP, a behavior feature vector for that user IP within a preset time period; a clustering module configured to cluster the behavior feature vectors of the user IPs to obtain a plurality of clusters; a rule determination module configured to determine inspection rules; and an identification module configured to identify a cluster that satisfies the corresponding inspection rule and to determine every user IP in that cluster to be a crawler.
- The low-frequency crawler identification device according to claim 6, wherein the behavior features include several of the following: average bytes sent per request, number of requests per unit time period, proportion of GET requests, proportion of the request path set space, maximum path similarity proportion, maximum repeated path loop proportion, maximum Referer similarity proportion, proportion of dangerous user agents (UA), maximum UA similarity proportion, UA set space, proportion of 404 status codes, proportion of 2XX status codes, proportion of 5XX status codes, maximum URL type similarity proportion, average number of accesses to URLs of the same type, average number of URL types, standard deviation of the proportion of HTML requests, standard deviation of the proportion of other requests, request response time, request response length, request return length, and page views.
- The low-frequency crawler identification device according to claim 6, wherein the rule determination module is configured to determine N target behavior features and to set the judgment logic and threshold corresponding to each of the N target behavior features; and the identification module is configured to identify a cluster that satisfies the corresponding inspection rule by: calculating, for each of the N target behavior features, the average value over all user IPs in the current cluster, and determining that the averages of all N target behavior features satisfy the corresponding judgment logic and thresholds; or, the rule determination module is configured to determine N target behavior features and to set the judgment logic, weight, and threshold corresponding to each of the N target behavior features; and the identification module is configured to calculate, for each of the N target behavior features, the average value over all user IPs in the current cluster, to calculate the product of each average and its corresponding weight, and to determine that the products for all N target behavior features satisfy the corresponding judgment logic and thresholds; or, the rule determination module is configured to determine N target behavior features and to set the judgment logic and threshold corresponding to each of the N target behavior features, together with an access count threshold and/or an access interval duration; and the identification module is configured to calculate the average access count and the average access interval over all IPs in the current cluster and, after determining that the average access count is greater than the access count threshold and/or the average access interval is greater than the access interval duration, to calculate, for each of the N target behavior features, the average value over all user IPs in the current cluster, and to determine that the product of each average and its corresponding weight satisfies the corresponding judgment logic and threshold.
- A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 5.
- A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the following: calculating, from the network application log of each user IP, a behavior feature vector for that user IP within a preset time period; clustering the behavior feature vectors of the user IPs to obtain a plurality of clusters; and determining inspection rules, identifying a cluster that satisfies the corresponding inspection rule, and determining every user IP in that cluster to be a crawler.
- The computer device according to claim 10, wherein the behavior features include several of the following: average bytes sent per request, number of requests per unit time period, proportion of GET requests, proportion of the request path set space, maximum path similarity proportion, maximum repeated path loop proportion, maximum Referer similarity proportion, proportion of dangerous user agents (UA), maximum UA similarity proportion, UA set space, proportion of 404 status codes, proportion of 2XX status codes, proportion of 5XX status codes, maximum URL type similarity proportion, average number of accesses to URLs of the same type, average number of URL types, standard deviation of the proportion of HTML requests, standard deviation of the proportion of other requests, request response time, request response length, request return length, and page views.
- The computer device according to claim 10, wherein determining the inspection rules comprises: determining N target behavior features, and setting the judgment logic and threshold corresponding to each of the N target behavior features; and identifying a cluster that satisfies the corresponding inspection rule comprises: calculating, for each of the N target behavior features, the average value over all user IPs in the current cluster, and determining that the averages of all N target behavior features satisfy the corresponding judgment logic and thresholds; or, determining the inspection rules comprises: determining N target behavior features, and setting the judgment logic, weight, and threshold corresponding to each of the N target behavior features; and identifying a cluster that satisfies the corresponding inspection rule comprises: calculating, for each of the N target behavior features, the average value over all user IPs in the current cluster, calculating the product of each average and its corresponding weight, and determining that the products for all N target behavior features satisfy the corresponding judgment logic and thresholds; or, determining the inspection rules comprises: determining N target behavior features, and setting the judgment logic and threshold corresponding to each of the N target behavior features, together with an access count threshold and/or an access interval duration; and identifying a cluster that satisfies the corresponding inspection rule comprises: calculating the average access count and the average access interval over all IPs in the current cluster and, after determining that the average access count is greater than the access count threshold and/or the average access interval is greater than the access interval duration, calculating, for each of the N target behavior features, the average value over all user IPs in the current cluster, and determining that the product of each average and its corresponding weight satisfies the corresponding judgment logic and threshold.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710857222.9 | 2017-09-20 | ||
CN201710857222.9A CN107800684B (en) | 2017-09-20 | 2017-09-20 | A kind of low frequency reptile recognition methods and device |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2019057048A1 true WO2019057048A1 (en) | 2019-03-28 |
Family
ID=61532421
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2018/106370 WO2019057048A1 (en) | 2017-09-20 | 2018-09-19 | Low-frequency crawler identification method, device, readable storage medium and equipment |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN107800684B (en) |
WO (1) | WO2019057048A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112597372A (en) * | 2020-12-25 | 2021-04-02 | 北京知因智慧科技有限公司 | Distributed crawler implementation method and device |
Families Citing this family (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107800684B (en) * | 2017-09-20 | 2018-09-18 | 贵州白山云科技有限公司 | A kind of low frequency reptile recognition methods and device |
CN108763274B (en) * | 2018-04-09 | 2021-06-11 | 北京三快在线科技有限公司 | Access request identification method and device, electronic equipment and storage medium |
CN110912861B (en) * | 2018-09-18 | 2022-02-15 | 北京数安鑫云信息技术有限公司 | AI detection method and device for deeply tracking group attack behavior |
CN109446398A (en) * | 2018-11-06 | 2019-03-08 | 杭州安恒信息技术股份有限公司 | The method, apparatus and electronic equipment of intelligent measurement web crawlers behavior |
CN109992960B (en) * | 2018-12-06 | 2021-09-10 | 北京奇艺世纪科技有限公司 | Counterfeit parameter detection method and device, electronic equipment and storage medium |
CN110147271B (en) * | 2019-05-15 | 2020-04-28 | 重庆八戒传媒有限公司 | Method and device for improving quality of crawler proxy and computer readable storage medium |
CN112800419A (en) * | 2019-11-13 | 2021-05-14 | 北京数安鑫云信息技术有限公司 | Method, apparatus, medium and device for identifying IP group |
CN110995714B (en) * | 2019-12-06 | 2022-07-26 | 杭州安恒信息技术股份有限公司 | A method, device and medium for detecting gang attacks on Web sites |
CN112989157B (en) * | 2019-12-13 | 2024-11-12 | 网宿科技股份有限公司 | A method and device for detecting crawler requests |
CN111831881B (en) * | 2020-07-04 | 2023-03-21 | 西安交通大学 | Malicious crawler detection method based on website traffic log data and optimized spectral clustering algorithm |
CN111914905B (en) * | 2020-07-09 | 2021-07-20 | 北京人人云图信息技术有限公司 | Anti-crawler system based on semi-supervision and design method |
CN113452685B (en) * | 2021-06-22 | 2024-04-09 | 上海明略人工智能(集团)有限公司 | Processing method, system, storage medium and electronic equipment for recognition rule |
CN114338099B (en) * | 2021-12-10 | 2024-08-16 | 壹药网科技(上海)股份有限公司 | Recognition method and prevention system for crawler behaviors |
CN114969678A (en) * | 2021-12-23 | 2022-08-30 | 中国太平洋保险(集团)股份有限公司 | Multi-technology fused intelligent anti-crawler method and system thereof |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140337005A1 (en) * | 2013-05-08 | 2014-11-13 | Microsoft Corporation | Cross-lingual automatic query annotation |
CN105930727A (en) * | 2016-04-25 | 2016-09-07 | 无锡中科富农物联科技有限公司 | Web-based crawler identification algorithm |
CN106682118A (en) * | 2016-12-08 | 2017-05-17 | 华中科技大学 | Social network site false fan detection method achieved on basis of network crawler by means of machine learning |
CN106790175A (en) * | 2016-12-29 | 2017-05-31 | 北京神州绿盟信息安全科技股份有限公司 | The detection method and device of a kind of worm event |
CN107092660A (en) * | 2017-03-28 | 2017-08-25 | 成都优易数据有限公司 | A kind of Website server reptile recognition methods and device |
CN107800684A (en) * | 2017-09-20 | 2018-03-13 | 贵州白山云科技有限公司 | A kind of low frequency reptile recognition methods and device |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2391346A (en) * | 2002-07-31 | 2004-02-04 | Hewlett Packard Co | On-line recognition of robots |
CN102495861B (en) * | 2011-11-24 | 2013-09-04 | 中国科学院计算技术研究所 | System and method for identifying web crawler |
CN104391979B (en) * | 2014-12-05 | 2017-12-19 | 北京国双科技有限公司 | Network malice reptile recognition methods and device |
CN106202108B (en) * | 2015-05-06 | 2019-09-06 | 阿里巴巴集团控股有限公司 | Web crawlers grabs method for allocating tasks and device and data grab method and device |
CN106487708B (en) * | 2015-08-25 | 2020-03-13 | 阿里巴巴集团控股有限公司 | Network access request control method and device |
CN105577701B (en) * | 2016-03-09 | 2018-11-09 | 携程计算机技术(上海)有限公司 | The recognition methods of web crawlers and system |
CN107147640B (en) * | 2017-05-09 | 2019-12-31 | 网宿科技股份有限公司 | Method and system for identifying web crawlers |
- 2017-09-20 CN CN201710857222.9A patent/CN107800684B/en active Active
- 2018-09-19 WO PCT/CN2018/106370 patent/WO2019057048A1/en active Application Filing
Non-Patent Citations (1)
Title |
---|
LIU YU ET AL: "Crawler recognition technology based on decision tree algorithm", COMPUTER ENGINEERING AND SOFTWARE, vol. 38, no. 7, 15 July 2017 (2017-07-15), pages 122-125 * |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112597372A (en) * | 2020-12-25 | 2021-04-02 | 北京知因智慧科技有限公司 | Distributed crawler implementation method and device |
Also Published As
Publication number | Publication date |
---|---|
CN107800684A (en) | 2018-03-13 |
CN107800684B (en) | 2018-09-18 |
Similar Documents
Publication | Title |
---|---|
WO2019057048A1 (en) | Low-frequency crawler identification method, device, readable storage medium and equipment |
US11087329B2 (en) | Method and apparatus of identifying a transaction risk |
CN109922032B (en) | Method, device, equipment and storage medium for determining risk of logging in account |
US10496678B1 (en) | Systems and methods for generating and implementing knowledge graphs for knowledge representation and analysis |
WO2018170454A2 (en) | Using different data sources for a predictive model |
US20190171724A1 (en) | Method and apparatus for determining hot event |
CN109413175B (en) | Information processing method and device and electronic equipment |
US20170171336A1 (en) | Method and electronic device for information recommendation |
US10810106B1 (en) | Automated application security maturity modeling |
CN106682906B (en) | Risk identification and service processing method and equipment |
WO2016015444A1 (en) | Target user determination method, device and network server |
WO2017070305A8 (en) | System and method for detecting interaction and influence in networks |
CN108665293B (en) | Feature importance acquisition method and device |
US9866454B2 (en) | Generating anonymous data from web data |
US9860328B2 (en) | Associating web page requests in a web access system |
US20170199803A1 (en) | Duplicate bug report detection using machine learning algorithms and automated feedback incorporation |
US20180091375A1 (en) | Event-based data path detection |
CN111797319A (en) | Recommendation method, device, equipment and storage medium |
CN112231571A (en) | Information data processing method, device, equipment and storage medium |
CN108133046B (en) | Data analysis method and device |
CN117370969A (en) | Data anomaly detection method, device, computer equipment and storage medium |
CN104572820A (en) | Method and device for generating model and method and device for acquiring importance degree |
US20170171330A1 (en) | Method for pushing information and electronic device |
KR102136222B1 (en) | System and method for clustering graph data and computer program for the same |
CN113138960A (en) | Data storage method and system based on cloud storage space adjustment |
Legal Events
Code | Title | Description |
---|---|---|
121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 18858059; Country of ref document: EP; Kind code of ref document: A1 |
NENP | Non-entry into the national phase | Ref country code: DE |
32PN | EP: public notification in the EP Bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205N DATED 24/06/2020) |
122 | EP: PCT application non-entry in European phase | Ref document number: 18858059; Country of ref document: EP; Kind code of ref document: A1 |