CN110519280B

CN110519280B - Crawler identification method and device, computer equipment and storage medium

Info

Publication number: CN110519280B
Application number: CN201910816727.XA
Authority: CN
Inventors: 欧二强; 邓鑫鑫; 沈仁奎
Original assignee: Beijing Mind Creation Information Technology Co ltd
Current assignee: Beijing Mind Creation Information Technology Co ltd
Priority date: 2019-08-30
Filing date: 2019-08-30
Publication date: 2022-01-04
Anticipated expiration: 2039-08-30
Also published as: CN110519280A

Abstract

The embodiment of the invention discloses a crawler identification method and device, computer equipment, and a storage medium. The method comprises the following steps: acquiring network identification information of a suspected crawler object; when an access request of the suspected crawler object is received, sending a verification message to the suspected crawler object, wherein the verification message is used by the client of the suspected crawler object to invoke a user interaction plug-in; and obtaining multiple interaction feedback results of the user interaction plug-in, and updating the crawler identification result of the suspected crawler object according to the interaction feedback results. The technical scheme of the embodiment of the invention can improve the identification rate of crawler objects.

Description

Crawler identification method and device, computer equipment and storage medium

Technical Field

The embodiment of the invention relates to the technical field of computer networks, in particular to a crawler identification method and device, computer equipment and a storage medium.

Background

A web crawler is a program or script that automatically captures web information according to certain rules. According to statistics, crawler traffic already exceeds the traffic of access requests from real human users.

At present, the main prior-art methods for identifying crawlers include: 1. Components such as a WAF (Web Application Firewall), a firewall, or a gateway identify crawlers according to whether the request frequency of an identifier such as an IP (Internet Protocol) address or a device ID (identity) exceeds a threshold derived from the access frequency of normal users. 2. Identification is performed according to whether the request header and the JWT (JSON Web Token) parameters contain preset hidden values and whether the parameters are encrypted. 3. Identification is performed according to the distribution of the accessed interfaces and the path followed through the pages, since the page path of a normal user differs markedly from that of a crawler. 4. Crawlers are identified through machine learning, for example by aggregating multiple IP and device blacklists and learning various crawler characteristics.

In the process of implementing the invention, the inventor finds that the prior art has the following defects:

The misjudgment rate of identifying crawlers according to IP and device ID is high, and a crawler can avoid identification by using an IP pool and device IDs. A technically more sophisticated crawler can also decompile the application code to inspect the request scheme and then implement the encrypted and decrypted requests itself in order to crawl content. Most current crawler identification methods are based on Web requests, whereas the various application-emulator crawlers derived from mobile APPs (applications) can simulate access paths similar to those of normal users. Existing crawler identification methods can therefore only raise the technical threshold for crawlers and cannot accurately identify crawler behavior.

Disclosure of Invention

The embodiment of the invention provides a crawler identification method and device, computer equipment and a storage medium, which are used for improving the identification rate of a crawler object.

In a first aspect, an embodiment of the present invention provides a crawler identification method, including:

acquiring network identification information of a suspected crawler object;

when an access request of the suspected crawler object is received, sending a verification message to the suspected crawler object; the verification message is used for the client side of the suspected crawler object to call a user interaction plug-in;

and obtaining multiple interactive feedback results of the user interactive plug-ins, and updating the crawler identification result of the suspected crawler object according to the interactive feedback results.

In a second aspect, an embodiment of the present invention further provides a crawler identification apparatus, including:

the network identification information acquisition module is used for acquiring network identification information of a suspected crawler object;

the verification message sending module is used for sending a verification message to the suspected crawler object when receiving an access request of the suspected crawler object; the verification message is used for the client side of the suspected crawler object to call a user interaction plug-in;

and the crawler identification result updating module is used for acquiring a plurality of interactive feedback results of the user interactive plug-ins and updating the crawler identification result of the suspected crawler object according to the interactive feedback results.

In a third aspect, an embodiment of the present invention further provides a computer device, where the computer device includes:

one or more processors;

storage means for storing one or more programs;

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the crawler identification method provided by any of the embodiments of the invention.

In a fourth aspect, an embodiment of the present invention further provides a computer storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the crawler identification method provided in any embodiment of the present invention.

According to the embodiment of the invention, by acquiring the network identification information of the suspected crawler object, when the access request of the suspected crawler object is received, the verification message for calling the user interaction plug-in at the client side is sent to the suspected crawler object, and the multiple interaction feedback results of the user interaction plug-in are acquired, so that the crawler identification result of the suspected crawler object is updated according to the multiple interaction feedback results, the problem of low identification rate of the existing crawler identification method is solved, and the identification rate of the crawler object is improved.

Drawings

FIG. 1 is a flowchart of a crawler identification method according to an embodiment of the present invention;

FIG. 2a is a flowchart of a crawler identification method according to a second embodiment of the present invention;

FIG. 2b is a schematic diagram illustrating a gesture verification identification effect according to a second embodiment of the present invention;

FIG. 3a is a flowchart of a crawler identification method according to a third embodiment of the present invention;

FIG. 3b is a schematic flowchart of a crawler object countering method according to a third embodiment of the present invention;

FIG. 4 is a schematic view of a crawler identification apparatus according to a fourth embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a computer device according to a fifth embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention.

It should be further noted that, for the convenience of description, only some but not all of the relevant aspects of the present invention are shown in the drawings. Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the operations (or steps) as a sequential process, many of the operations can be performed in parallel, concurrently or simultaneously. In addition, the order of the operations may be re-arranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figure. The processes may correspond to methods, functions, procedures, subroutines, and the like.

Example one

FIG. 1 is a flowchart of a crawler identification method according to an embodiment of the present invention. This embodiment is applicable to accurately identifying crawler objects. The method may be executed by a crawler identification apparatus, which may be implemented in software and/or hardware, is generally integrated in a computer device, and is used in cooperation with a client to perform the crawler identification function. As shown in FIG. 1, the method comprises the following operations:

S110, network identification information of the suspected crawler object is obtained.

The suspected crawler object may be a network program, script, or the like that an existing crawler identification method has flagged as being suspected of crawling. The network identification information may be information that identifies the suspected crawler object in the network, such as its IP, device ID, or user ID.

In the embodiment of the invention, crawler objects can first be identified by a series of existing crawler identification methods, and the network identification information of each suspected crawler object is obtained. For example, an object identified according to its IP and device ID is taken as a suspected crawler object, and its network identification information is obtained. Any method capable of identifying a crawler object can be used as the crawler identification method for acquiring the network identification information of the suspected crawler object, which is not limited in the embodiment of the present invention.

S120, when receiving an access request of the suspected crawler object, sending a verification message to the suspected crawler object; wherein the verification message is used for the client of the suspected crawler object to invoke a user interaction plug-in.

The verification message may be a message for verifying the identity of the suspected crawler object. The user interaction plug-in may be used for a user to interact with the server through the client. For example, the user inputs a verification code through the client or performs a verification operation specified by the server to realize interaction with the server.

Specifically, after the suspected crawler object has been identified and its network identification information acquired, the server determines more precisely whether it is a crawler object as follows: when the server receives an access request from the suspected crawler object again, it sends the suspected crawler object a verification message capable of invoking the user interaction plug-in. Correspondingly, after the associated platform APP on the client of the suspected crawler object receives the verification message, it can invoke the user interaction plug-in through a callback.

In an optional embodiment of the present invention, the sending a verification message to the suspected crawler object when receiving the access request of the suspected crawler object may include: if the access request of the suspected crawler object is determined to meet the preset interaction condition, sending a verification message to the suspected crawler object; wherein the preset interaction condition comprises: and the associated information of the access request reaches an interaction benchmark.

The preset interaction condition may be a condition for deciding whether to further identify the suspected crawler object by using the user interaction plug-in. The associated information of the access request may be network information related to the access request, for example the number or frequency of access requests, or the network bandwidth they occupy. The interaction benchmark may be a condition under which it is determined that the suspected crawler object can be identified, for example that the number of access requests reaches a set threshold, where the set threshold may be a value set according to actual requirements, such as 100. The embodiment of the invention does not limit the specific content of the associated information of the access request or of the interaction benchmark.

Optionally, only when it is determined that the access request of the suspected crawler object to the server meets the preset interaction condition, the verification message is sent to the suspected crawler object. Illustratively, after the network identification information of a certain suspected crawler object is acquired, if the number of access requests accumulated by the server for the suspected crawler object reaches a set threshold, an operation of further identifying the suspected crawler object by using a user interaction plug-in is triggered, and a verification message capable of calling the user interaction plug-in is sent to the suspected crawler object.
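The preset interaction condition described above can be illustrated with a minimal sketch. It assumes that the associated information is simply an accumulated request count per network identifier and that the interaction benchmark is the example threshold of 100; the class name `InteractionGate` and its method are hypothetical and do not come from the patent.

```python
from collections import defaultdict

# Example benchmark taken from the text ("such as 100"); any value may be used.
INTERACTION_THRESHOLD = 100


class InteractionGate:
    """Hypothetical helper: counts access requests per suspected crawler object."""

    def __init__(self, threshold: int = INTERACTION_THRESHOLD) -> None:
        self.threshold = threshold
        self.request_counts = defaultdict(int)

    def should_send_verification(self, network_id: str) -> bool:
        """Record one access request and report whether the benchmark is reached."""
        self.request_counts[network_id] += 1
        return self.request_counts[network_id] >= self.threshold


gate = InteractionGate(threshold=3)
decisions = [gate.should_send_verification("203.0.113.7") for _ in range(4)]
print(decisions)
```

In this sketch every request at or beyond the threshold triggers a verification message; an implementation could equally reset the counter or use request frequency or bandwidth instead.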

S130, obtaining multiple interactive feedback results of the user interactive plug-ins, and updating the crawler identification result of the suspected crawler object according to the interactive feedback results.

The interaction feedback result may be an execution result fed back by the suspected crawler object for the user interaction plug-in.

Correspondingly, after the associated platform APP on the client of the suspected crawler object receives the verification message sent by the server, it can invoke the user interaction plug-in through a callback. A real user and a crawler object then act differently on the user interaction plug-in, so the interaction feedback results they return to the server differ. For example, if the suspected crawler object is a real user, the matching interactive operation can be performed on the user interaction plug-in, and a response message is fed back for the verification message sent by the server. If the suspected crawler object is indeed a crawler object, it will not perform the matching interactive operation on the user interaction plug-in; at the same time, the user interaction plug-in does not block the subsequent crawling behavior of the crawler object. The crawler object can therefore ignore the user interaction plug-in, continue to crawl network data, and feed back no response message for the verification message sent by the server. The server can thus update the crawler identification result of the suspected crawler object according to its multiple interaction feedback results for the user interaction plug-in, and determine from the final crawler identification result whether the suspected crawler object is indeed a crawler object.

For example, if the suspected crawler object executes the matching interactive operation on the user interaction plug-in invoked by the client and feeds back a response message for the verification message sent by the server, the crawler identification result is updated to: the suspected crawler object is temporarily determined to be a real user. If the server receives an access request from the suspected crawler object again, it sends the verification message again and obtains the interaction feedback result of the user interaction plug-in. If the suspected crawler object again executes the matching interactive operation and feeds back a response message, the crawler identification result is updated to: the suspected crawler object is determined to be a real user. Multi-round interactive confirmation between the server and the client is thereby realized.

Therefore, the embodiment of the invention realizes multi-round interactive confirmation that combines machine and user to further identify whether the suspected crawler object is a crawler object. It determines the real identity of the suspected crawler object from the essential difference between how a real user and a crawler object respond to the user interaction plug-in, which effectively improves the accuracy of crawler identification and thus the identification rate of crawler objects.

According to the embodiment of the invention, by acquiring the network identification information of the suspected crawler object, when the access request of the suspected crawler object is received, the verification message for calling the user interaction plugin by the client is sent to the suspected crawler object, and the multiple interaction feedback results of the user interaction plugin are acquired, so that the crawler identification result of the suspected crawler object is updated according to the interaction feedback results, the problem of low identification rate of the existing crawler identification method is solved, and the identification rate of the crawler object is improved.

Example two

Fig. 2a is a flowchart of a crawler recognition method according to a second embodiment of the present invention, which is embodied based on the above embodiments, and in this embodiment, a specific processing manner of network identification information of a suspected crawler object and a specific implementation manner of updating a crawler recognition result of the suspected crawler object according to the interaction feedback result are provided. Accordingly, as shown in fig. 2a, the method of the present embodiment may include:

S210, network identification information of the suspected crawler object is obtained.

Wherein the network identification information may include, but is not limited to, an IP, a device ID, and a user ID.

In the embodiment of the present invention, optionally, the IP, the device ID, and the user ID may be used as the network identification information at the same time.

S220, adding the network identification information into a preset attention list, and identifying the network identification information through suspicious degree values; and the suspicious degree value is used for identifying a crawler identification result of the suspected crawler object.

The preset attention list may be a preset storage list for storing network identification information of the suspected crawler object. The suspicious degree value may be used to identify a crawler identification result for the suspected crawler object. For example, the suspicious degree value is marked by a percentage value, and the higher the probability that the suspected crawler object is the crawler object, the larger the percentage value corresponding to the suspicious degree value is.

In the embodiment of the present invention, optionally, in order to implement multiple identifications of suspected crawler objects, network identification information may be added to a preset attention list, and the network identification information in the preset attention list is identified according to a suspicious degree value. It is to be understood that the preset focus list may include network identification information of a plurality of suspected crawler objects.
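As a rough illustration of the preset attention list, the sketch below keys each entry by the combination of IP, device ID, and user ID and stores a percentage as the suspicious degree value. The class name, the method names, and the initial value of 50.0 are assumptions made only for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# (ip, device_id, user_id), used together as the network identification information
NetworkId = Tuple[str, str, str]


@dataclass
class AttentionList:
    """Hypothetical in-memory attention list: network id -> suspicious degree (percent)."""

    entries: Dict[NetworkId, float] = field(default_factory=dict)

    def add(self, network_id: NetworkId, suspicion: float = 50.0) -> None:
        # Keep the existing value if the object is already being watched.
        self.entries.setdefault(network_id, suspicion)

    def remove(self, network_id: NetworkId) -> None:
        self.entries.pop(network_id, None)

    def is_watched(self, network_id: NetworkId) -> bool:
        return network_id in self.entries


watch = AttentionList()
suspect = ("203.0.113.7", "device-42", "user-1001")
watch.add(suspect, suspicion=70.0)
print(watch.is_watched(suspect), watch.entries[suspect])
```

A production system would more likely keep this list in a shared store so that every server instance sees the same values; the in-memory dictionary here is only for illustration.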

S230, sending a verification message to the suspected crawler object when receiving the access request of the suspected crawler object.

In an optional embodiment of the present invention, the sending a verification message to the suspected crawler object may include: generating a verification identification character string through a preset encryption algorithm, and adding the verification identification character string to header information to form the verification message; and feeding back the verification message to the client of the suspected crawler object.

The preset encryption algorithm may be a reversible encryption algorithm, such as AES (Advanced Encryption Standard) or RSA (Rivest-Shamir-Adleman). Any reversible encryption algorithm can be used as the preset encryption algorithm, and the embodiment of the invention does not limit the specific content of the preset encryption algorithm. The verification identification string may be a string for verification generated by the preset encryption algorithm.

Specifically, in the embodiment of the present invention, a preset encryption algorithm may be used to generate the verification identification string, for example X-Dedao-Security: encrypt(id, timestamp, rand). The verification identification string is then added to the header information (header) that the server returns to the client, forming the corresponding verification message, and the verification message is fed back to the client of the suspected crawler object.
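The header construction can be sketched as follows. Because the Python standard library does not include AES or RSA, this example deliberately swaps in a different technique: the (id, timestamp, rand) payload is base64-encoded and signed with HMAC-SHA256 rather than encrypted. The key, the function names, and the token layout are all hypothetical; only the header name `X-Dedao-Security` and the three fields come from the text.

```python
import base64
import hashlib
import hmac
import json
import secrets
import time

SECRET_KEY = b"demo-key-change-me"  # hypothetical server-side secret


def make_verification_string(object_id: str) -> str:
    """Stand-in for encrypt(id, timestamp, rand): encoded payload plus HMAC tag."""
    payload = {"id": object_id, "timestamp": int(time.time()), "rand": secrets.token_hex(8)}
    body = base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()
    tag = hmac.new(SECRET_KEY, body.encode(), hashlib.sha256).hexdigest()
    return f"{body}.{tag}"


def build_verification_headers(object_id: str) -> dict:
    """Attach the verification identification string to the response headers."""
    return {"X-Dedao-Security": make_verification_string(object_id)}


headers = build_verification_headers("device-42")
print(headers)
```

With a real reversible cipher the payload would be unreadable to the client; the signed form shown here only guarantees that the client cannot alter it undetected.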

S240, obtaining multiple interactive feedback results of the user interactive plug-ins, and updating the crawler identification result of the suspected crawler object according to the interactive feedback results.

In an optional embodiment of the present invention, the obtaining of the multiple interactive feedback results of the user interactive plugin may include: and if the suspected crawler object completes the response operation of the user interaction plug-in, receiving a response message fed back by the suspected crawler object as an interaction feedback result.

Correspondingly, if the suspected crawler object is a real user, the client of the suspected crawler object can complete the matching response operation aiming at the user interaction plug-in after calling the user interaction plug-in. At this time, the server may receive a response message fed back by the suspected crawler object through the client as an interactive feedback result.

In an optional embodiment of the invention, the response message comprises the verification identification string; after the receiving the response message fed back by the suspected crawler object, the method may further include: verifying the response message to confirm the validity of the response message.

Specifically, the response message fed back by the suspected crawler object through the client may also include the verification identification string. Correspondingly, after receiving the response message fed back by the suspected crawler object, the server may decode and verify the response message to confirm its validity.
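One possible shape of the server-side validity check is sketched below. It assumes the verification identification string is a base64 payload followed by an HMAC-SHA256 tag (a signing scheme substituted here for the reversible encryption named in the text) and that a response is only accepted within 120 seconds of issue. The helper names, the key, and the age limit are assumptions for illustration.

```python
import base64
import hashlib
import hmac
import json

SECRET_KEY = b"demo-key-change-me"  # hypothetical server-side secret
MAX_AGE_SECONDS = 120               # assumed freshness window


def _tag(body: str) -> str:
    return hmac.new(SECRET_KEY, body.encode(), hashlib.sha256).hexdigest()


def issue_string(object_id: str, issued_at: int) -> str:
    """Create a verification identification string (hypothetical layout)."""
    payload = json.dumps({"id": object_id, "timestamp": issued_at}).encode()
    body = base64.urlsafe_b64encode(payload).decode()
    return f"{body}.{_tag(body)}"


def is_valid_response(token: str, expected_id: str, now: int) -> bool:
    """Check integrity, owner, and freshness of the string echoed by the client."""
    try:
        body, tag = token.split(".")
    except ValueError:
        return False  # not in the "<body>.<tag>" layout
    if not hmac.compare_digest(tag.encode(), _tag(body).encode()):
        return False  # tampered with or forged
    payload = json.loads(base64.urlsafe_b64decode(body))
    age = now - payload["timestamp"]
    return payload["id"] == expected_id and 0 <= age <= MAX_AGE_SECONDS


token = issue_string("device-42", issued_at=1_000)
print(is_valid_response(token, "device-42", now=1_060))
print(is_valid_response(token, "device-42", now=2_000))
```

The constant-time comparison via `hmac.compare_digest` avoids leaking how many leading characters of a forged tag were correct.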

Accordingly, S240 may specifically include the following operations:

S241, judging whether a response message fed back by the suspected crawler object after executing the user interaction plug-in is received within a preset time; if so, executing S242, and if not, executing S246.

S242, updating the suspicious degree value according to a first updating rule.

The preset time may be a time value set according to an actual requirement, such as 2 minutes, and the embodiment of the present invention does not limit a specific value of the preset time. The first update rule may be an update rule for the crawler identification result, formulated for a suspected crawler object that feeds back a response message for the user interaction plug-in.

Specifically, if the server receives, within the preset time, a response message fed back by the suspected crawler object after executing the user interaction plug-in, the suspicious degree value of the network identification information of the suspected crawler object may be updated according to the first update rule, for example by decreasing the suspicious degree value.

S243, judging whether the suspicious degree value meets the interaction suspension condition; if so, executing S244, and if not, returning to execute S241.

S244, continuously updating the suspicious degree value according to the degree value influence factors.

The interaction suspension condition may be a condition for determining that the interaction between the server and the suspected crawler object is suspended, for example that the suspicious degree value of the suspected crawler object reaches a preset threshold such as 60%. The degree value influence factor may be a factor in the network that influences the suspicious degree value, such as the number or frequency of access requests.

Optionally, when the server determines that the suspicious degree value of the suspected crawler object satisfies the interaction suspension condition, it may be temporarily determined that the suspected crawler object is not a crawler object. However, in order to accurately identify the suspected crawler object, the suspicious degree value may be continuously updated according to the influence factor of the degree value.

S245, when the suspicious degree value is determined to meet a first identification termination condition, terminating updating of the suspicious degree value, and deleting the network identification information of the suspected crawler object from the preset attention list.

The first identification termination condition may be a condition under which the suspected crawler object is determined to be a real user rather than a crawler object, so that identification of the crawler object can be terminated. For example, when the suspicious degree value reaches another preset threshold, such as 50%, the identification may be terminated and the suspected crawler object determined to be a real user.

Correspondingly, for a suspected crawler object that is temporarily considered not to be a crawler object, while the suspicious degree value continues to be updated according to the degree value influence factors, the updating can be terminated once the suspicious degree value is determined to meet the first identification termination condition. That is, the updating of the crawler identification result of the suspected crawler object is terminated, the suspected crawler object is determined to be a real user rather than a crawler object, its network identification information is deleted from the preset attention list, and the identification process for it ends. If, under the influence of the degree value influence factors, the suspicious degree value triggers the identification start condition again, for example by reaching 30%, the identification process can be restarted; that is, a verification message is sent to the suspected crawler object when its access request is received.

S246, updating the suspicious degree value according to a second updating rule.

The second update rule may be an update rule for the crawler identification result, formulated for a suspected crawler object that does not feed back a response message for the user interaction plug-in. For example, the suspicious degree value is increased.

Specifically, if it is determined that a response message fed back by the suspected crawler object executing the user interaction plug-in is not received within the preset time, the suspicious degree value of the network identifier of the suspected crawler object may be updated according to the second update rule.

S247, judging whether the suspicious degree value meets a second identification termination condition; if so, executing S248, and if not, returning to execute S230.

The second recognition termination condition may be a condition for determining that the suspected crawler object is indeed a crawler object and terminating recognition of the crawler object.

S248, terminating updating the suspicious degree value, and confirming that the suspected crawler object is a crawler object.

Correspondingly, if the suspicious degree value of the suspected crawler object is determined to meet the second identification termination condition, the updating of the suspicious degree value can be terminated, that is, the updating of the crawler identification result of the suspected crawler object is terminated, and the suspected crawler object is determined to be a crawler object. Otherwise, when the suspicious degree value does not meet the second identification termination condition, the suspected crawler object may be temporarily considered a crawler object; however, in order to identify it accurately, the method returns to the operation of sending a verification message to the suspected crawler object when its access request is received, and continues to identify it until it is determined to be indeed a crawler object.
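The branching in S241 to S248 can be condensed into a small sketch. It assumes the suspicious degree value is a percentage, that each round moves it by a fixed step of 10, and that "reaching" a threshold means falling to or below it on the real-user side and rising to or above it on the crawler side. The thresholds of 60% and 50% are the examples given in the text; the 90% crawler threshold, the step size, and the status names are hypothetical.

```python
from dataclasses import dataclass

SUSPEND_AT = 60.0    # interaction suspension condition (example from the text)
REAL_USER_AT = 50.0  # first identification termination condition (example)
CRAWLER_AT = 90.0    # second identification termination condition (assumed)
STEP = 10.0          # assumed size of one update


@dataclass
class Suspect:
    suspicion: float
    status: str = "under_identification"


def apply_round(s: Suspect, responded_in_time: bool) -> Suspect:
    """One verification round: S241, then S242/S243 or S246/S247/S248."""
    if s.status in ("crawler", "real_user"):
        return s  # identification already terminated
    if responded_in_time:  # S242: first update rule lowers the value
        s.suspicion = max(0.0, s.suspicion - STEP)
        s.status = "interaction_suspended" if s.suspicion <= SUSPEND_AT else "under_identification"
    else:  # S246: second update rule raises the value
        s.suspicion = min(100.0, s.suspicion + STEP)
        s.status = "crawler" if s.suspicion >= CRAWLER_AT else "under_identification"
    return s


def apply_influence(s: Suspect, delta: float) -> Suspect:
    """S244/S245: background update from influence factors while suspended."""
    if s.status != "interaction_suspended":
        return s
    s.suspicion = min(100.0, max(0.0, s.suspicion + delta))
    if s.suspicion <= REAL_USER_AT:
        s.status = "real_user"  # caller would also drop it from the attention list
    return s


visitor = Suspect(suspicion=80.0)
for answered in (True, True):
    apply_round(visitor, answered)
apply_influence(visitor, -10.0)
print(visitor)
```

Keeping the round logic separate from the background influence update mirrors the split between the interactive steps and S244 in the flow.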

In an optional embodiment of the invention, the suspicious degree value comprises a suspicious weight value and a validity time; the suspicious weight value is used for identifying whether the suspected crawler object is a crawler object, and the validity time is used for identifying the period during which the suspicious weight value is in effect; the first update rule includes: decreasing the suspicious weight value and resetting the validity time; the second update rule includes: increasing the suspicious weight value and resetting the validity time; the interaction suspension condition comprises: the suspicious weight value reaches a first preset threshold value; the first identification termination condition includes: the suspicious weight value reaches a second preset threshold value; the second identification termination condition includes: the suspicious weight value reaches a third preset threshold value.

The suspicious weight value may be used to identify whether the suspected crawler object is a crawler object, for example, to identify the probability that the suspected crawler object is a crawler object by a percentage value. For example, when the suspicious weight value is higher than 60%, it indicates that the suspected crawler object is a crawler object; when the suspicious weight value is lower than 30%, indicating that the suspected crawler object is not a crawler object; when the suspicious weight value is higher than 30% and less than 60%, it indicates that the suspected crawler object is tentatively not a crawler object. Or, directly marking whether the suspected crawler object is a crawler object by setting a format value, such as a positive integer. For example, when the suspicious weight value is higher than 100, it indicates that the suspected crawler object is a crawler object; when the suspicious weight value is lower than 30, the suspected crawler object is not a crawler object; when the suspicious weight value is higher than 30 and less than 100, it indicates that the suspected crawler object is tentatively not a crawler object. The validity time may be used to identify the time of the suspect weight value to take effect. For example, assuming that the validity time is 12 hours and the timer starts at 0:00, 8/9, and 14/h, the suspicious weight value corresponding to the current suspected crawler object is 80. If the suspicious weight value corresponding to the current suspected crawler object is 50 or 100 at 8:00 of 8, 14 and 8 of 2019, the valid time is reset at 8, 14 and 8:00 of 2019. That is, the validity period starts at 8 months and 14 days 8:00 in 2019. . The first preset threshold, the second preset threshold and the third preset threshold may be values set according to actual requirements, such as 80%, 50%, 90%, and the like. 
Meanwhile, other preset thresholds, such as a fourth preset threshold and the like, can be set according to actual requirements, and are used for identifying more identification stages in the identification process of the suspected crawler object.
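
The relationship between the suspicious weight value and the validity time can be illustrated with a small record type. The following is a minimal sketch only, assuming the percentage-style example values given above (60% and 30%); the names `SuspicionValue`, `set_weight` and `classify` are hypothetical and are not defined by the embodiment.

```python
from dataclasses import dataclass

# Illustrative cut-offs taken from the percentage example in the text;
# an actual deployment would set these according to its own requirements.
CRAWLER_ABOVE = 0.60
NOT_CRAWLER_BELOW = 0.30


@dataclass
class SuspicionValue:
    """Hypothetical record: a suspicious weight value plus its validity time."""
    weight: float         # probability-like value in [0, 1]
    valid_from: float     # time (in seconds) at which the validity time started
    valid_seconds: float  # length of the validity time

    def set_weight(self, new_weight: float, now: float) -> None:
        """Change the weight and restart the validity time at `now`."""
        self.weight = new_weight
        self.valid_from = now

    def expires_at(self) -> float:
        """Time at which the current validity time ends."""
        return self.valid_from + self.valid_seconds


def classify(value: SuspicionValue) -> str:
    """Map the weight to the three outcomes described in the text.

    The text does not say how exactly 30% or 60% is treated; this sketch
    assumes both boundary values fall into the tentative band.
    """
    if value.weight > CRAWLER_ABOVE:
        return "crawler"
    if value.weight < NOT_CRAWLER_BELOW:
        return "not crawler"
    return "tentatively not crawler"
```

With a 12-hour validity time started at hour 0, changing the weight at hour 8 moves the end of the validity time from hour 12 to hour 20, which mirrors the reset described above.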

Specifically, if the server determines that a response message fed back by the suspected crawler object after executing the user interaction plug-in has been received, the server decreases the suspicious weight value of the suspected crawler object and resets the validity time. If the suspicious degree value does not meet the interaction suspension condition, that is, the suspicious weight value has not reached the first preset threshold, the server returns to the operation of obtaining the interaction feedback result of the user interaction plug-in and re-enters the identification stage to update the suspicious weight value of the suspected crawler object. When the suspicious degree value meets the interaction suspension condition, that is, the suspicious weight value reaches the first preset threshold, the server continues to update the suspicious degree value according to the degree-value influence factors. When the suspicious degree value is determined to meet the first identification termination condition, that is, the suspicious weight value reaches the second preset threshold, the server stops updating the suspicious degree value and deletes the network identification information of the suspected crawler object from the preset attention list. If the suspicious degree value is determined to trigger the identification starting condition again, then when an access request of the suspected crawler object is received, the server again sends a verification message to the suspected crawler object to re-enter the identification process.
If the server does not receive a response message fed back by the suspected crawler object executing the user interaction plug-in, the server increases the suspicious weight value of the suspected crawler object, resets the validity time, and, when access requests of the suspected crawler object are received, may send verification messages to the suspected crawler object at irregular intervals to continue the identification process. Once the suspicious weight value is determined to reach the third preset threshold, the server stops updating the suspicious degree value and determines that the suspected crawler object is a crawler object.
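
The two update rules and the three threshold conditions can be read as a small state transition. The sketch below is one possible reading under stated assumptions: integer-style weights, a fixed step size, and the threshold values 50, 30 and 100 are all hypothetical, "reaches" is taken to mean "at or below" for the first and second thresholds and "at or above" for the third, and the validity time reset is omitted for brevity. The names `Stage` and `update_on_feedback` are illustrative only.

```python
from enum import Enum
from typing import Tuple

# Hypothetical thresholds and step size (integer-style weights).
FIRST_THRESHOLD = 50    # at or below: suspend interaction for a period of time
SECOND_THRESHOLD = 30   # at or below: treat as a real user, leave the attention list
THIRD_THRESHOLD = 100   # at or above: determined to be a crawler object
STEP = 20


class Stage(Enum):
    IDENTIFYING = "identifying"          # keep sending verification messages
    SUSPENDED = "interaction suspended"  # update only from degree-value influence factors
    REAL_USER = "real user"              # first identification termination condition
    CRAWLER = "crawler"                  # second identification termination condition


def update_on_feedback(weight: int, responded: bool) -> Tuple[int, Stage]:
    """Apply the first or second update rule, then evaluate the thresholds."""
    if responded:
        weight = max(0, weight - STEP)   # first update rule: decrease the weight
    else:
        weight = weight + STEP           # second update rule: increase the weight

    if weight >= THIRD_THRESHOLD:
        return weight, Stage.CRAWLER
    if weight <= SECOND_THRESHOLD:
        return weight, Stage.REAL_USER
    if weight <= FIRST_THRESHOLD:
        return weight, Stage.SUSPENDED
    return weight, Stage.IDENTIFYING
```

In the text, the second threshold is reached through continued updates from the degree-value influence factors after interaction is suspended; the sketch collapses that into the same function for simplicity.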

In an optional embodiment of the present invention, the updating the crawler identification result of the suspected crawler object according to the interaction feedback result may further include: if it is determined that the suspicious weight value has not changed within the validity time, decreasing the suspicious weight value.

Correspondingly, if the suspicious weight value has not changed within the validity time, the probability that the suspected crawler object is a crawler object is low, and the suspicious weight value can be decreased. For example, assume that the validity time is 24 hours, that timing starts at 0:00 on August 14, 2019, and that the suspicious weight value corresponding to the current suspected crawler object is 50. If the suspicious weight value corresponding to the current suspected crawler object stays at 50 from 0:00 on August 14, 2019 to 0:00 on August 15, 2019, the suspicious weight value is decreased to 30, and monitoring of the suspicious weight value of the current suspected crawler object restarts at 0:00 on August 15, 2019.
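
The idle rule above can be sketched with the same numbers as the example. This is a minimal illustration, assuming a fixed decrement of 20 (chosen so that 50 becomes 30); the function name `decay_if_unchanged` and its parameters are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Tuple


def decay_if_unchanged(
    weight: int,
    last_change: datetime,
    now: datetime,
    validity: timedelta,
    decay: int,
) -> Tuple[int, datetime]:
    """Decrease the weight if it stayed unchanged for a full validity time.

    Returns the (possibly decreased) weight and the start of the next
    monitoring period.
    """
    if now - last_change >= validity:
        return max(0, weight - decay), now   # decrease and restart monitoring
    return weight, last_change               # validity time has not elapsed yet


start = datetime(2019, 8, 14, 0, 0)
weight, restart = decay_if_unchanged(
    50, start, datetime(2019, 8, 15, 0, 0), timedelta(hours=24), decay=20
)
```

Here `weight` ends at 30 and `restart` is 0:00 on August 15, 2019, matching the example in the text.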

In an optional embodiment of the present invention, the user interaction plugin is configured to display a verification identifier to the client of the suspected crawler object through a set rule.

The set rule may be a preset display rule of the verification identifier, such as directly displaying in a display page, or displaying in a mask form. The verification identifier may be an identifier used for verifying the identity of the suspected crawler object, such as a gesture diagram, a verification code, or a mathematical calculation formula. The embodiment of the invention does not limit the specific form of the verification identifier.

Optionally, in the embodiment of the present invention, the user interaction plug-in may display the verification identifier on the client of the suspected crawler object through a set rule.

In an optional embodiment of the present invention, the verification identifier comprises a gesture verification identifier map; the set rule comprises: displaying the verification identifier in the interface synchronously or asynchronously in the form of an overlay layer.

Fig. 2b is a schematic diagram of a gesture verification identification effect according to the second embodiment of the present invention. In a specific example, as shown in fig. 2b, the verification identifier may be a gesture verification identifier map. Accordingly, the gesture verification identifier map may be displayed synchronously or asynchronously in the client interface. Synchronous display means that the client displays the map immediately after receiving the verification message, and asynchronous display means that the client displays the map after a delay following receipt of the verification message. Optionally, the gesture verification identifier map may be set to be displayed only within a preset time period, for example 1 minute; once the display time expires, the gesture verification identifier map is no longer displayed, so as to prevent the crawler object from simulating manual operation to verify the gesture verification identifier map.
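
The synchronous and asynchronous display modes and the display time limit can be expressed as a simple time-window check. The sketch below is illustrative only: the function `should_display` and its parameters are hypothetical, times are plain seconds, and a delay of zero is assumed to model synchronous display.

```python
def should_display(
    received_at: float,
    now: float,
    delay_seconds: float = 0.0,
    display_seconds: float = 60.0,
) -> bool:
    """Return True while the gesture verification identifier map should be shown.

    delay_seconds == 0 models synchronous display (shown on receipt of the
    verification message); delay_seconds > 0 models asynchronous display.
    The map is shown for `display_seconds` and then hidden.
    """
    shown_at = received_at + delay_seconds
    return shown_at <= now < shown_at + display_seconds
```

With the default 1-minute window, a message received at second 100 is displayed from second 100 up to, but not including, second 160.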

In a specific example, the IP, device ID and user ID of a suspected crawler object identified by various existing crawler identification means are put into an attention ID list and marked with a suspicious weight value and a validity time for subsequent accurate identification. When the server confirms that the access requests of a suspected crawler object in the attention ID list reach a certain number, it generates a verification identification string using a reversible encryption algorithm and adds it to the header returned to the client, such as X-Dedao-Security: encrypt (ID, timestamp, rand). Correspondingly, the App (web, Android, iOS, iPad or e-book) parses the header through its network library and, once X-Dedao-Security is recognized, asynchronously calls back the in-platform component to pop up the gesture verification identifier map shown in FIG. 2b. The gesture verification identifier map may be displayed in the display page of the client as an overlay layer; this display mode can block the user's display interface without affecting the crawler object's crawling of the current data content. If the suspected crawler object is a real user, the user only needs to slide the gesture verification identifier map in time to complete the verification; the verification identification string X-Dedao-Security is then returned to the server, which decrypts it and verifies its validity. After the user completes the verification within the preset time interval, the server resets the life cycle of the suspected crawler object, for example by decreasing the suspicious weight value and resetting the validity time, and enters the identification stage again. As this cycle repeats, if the suspicious weight value becomes smaller than the first preset threshold, the header containing the verification identification string is no longer issued for a period of time.
At this point, the server may continue to update the suspicious degree value according to the degree-value influence factors. For example, when the number of access requests of the suspected crawler object within 12 hours does not differ much from that of normal users, the suspicious weight value may be further decreased and the validity time reset. Once the suspicious weight value is lower than the second preset threshold, the suspected crawler object is considered a real user, and its network identification information can be removed from the attention ID list. If the suspected crawler object is in fact a crawler object, it cannot correctly verify the gesture verification identifier map within the preset time interval. In this case, the server increases the suspicious weight value of the suspected crawler object, resets the validity time, and issues gesture verification requests for subsequent access requests of the suspected crawler object at irregular intervals. As the suspicious weight value of the suspected crawler object increases, the frequency of issuing gesture verification requests also increases, and once the suspicious weight value reaches the third preset threshold, the suspected crawler object can be determined to be a crawler object.
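
The header round trip described above (generate `encrypt(ID, timestamp, rand)`, return it in `X-Dedao-Security`, then decrypt and validate it on the server) can be sketched as follows. The text does not name the reversible encryption algorithm, so this sketch substitutes a toy XOR keystream built from the standard library purely to show reversibility; it is not secure, and a real system would use a vetted authenticated cipher. `SECRET_KEY`, `is_valid` and the 60-second age limit are assumptions.

```python
import base64
import hashlib
import json
import secrets

SECRET_KEY = b"demo-key-not-for-production"  # hypothetical server-side secret
HEADER_NAME = "X-Dedao-Security"             # header name taken from the text


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Derive `length` bytes from key and nonce (toy construction, not secure)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]


def encrypt(object_id: str, timestamp: int, rand: str) -> str:
    """Reversibly encode (ID, timestamp, rand) into a header-safe string."""
    plaintext = json.dumps([object_id, timestamp, rand]).encode()
    nonce = secrets.token_bytes(8)
    stream = _keystream(SECRET_KEY, nonce, len(plaintext))
    cipher = bytes(p ^ s for p, s in zip(plaintext, stream))
    return base64.urlsafe_b64encode(nonce + cipher).decode()


def decrypt(token: str) -> tuple:
    """Invert `encrypt`; raises ValueError or TypeError on malformed input."""
    raw = base64.urlsafe_b64decode(token.encode())
    nonce, cipher = raw[:8], raw[8:]
    stream = _keystream(SECRET_KEY, nonce, len(cipher))
    plaintext = bytes(c ^ s for c, s in zip(cipher, stream))
    object_id, timestamp, rand = json.loads(plaintext)
    return object_id, timestamp, rand


def is_valid(token: str, expected_id: str, now: int, max_age_seconds: int = 60) -> bool:
    """Server-side check: the string decodes, matches the ID, and is recent."""
    try:
        object_id, timestamp, _ = decrypt(token)
        return object_id == expected_id and 0 <= now - timestamp <= max_age_seconds
    except (ValueError, TypeError):
        return False


token = encrypt("user-42", 1000, "abc")
response_headers = {HEADER_NAME: token}
```

A client that completes the gesture in time returns the same string, and `is_valid(token, "user-42", now=1030)` accepts it; a late or tampered string is rejected.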

Therefore, the crawler identification method provided by the embodiment of the invention identifies crawler objects by combining machine and user through multiple rounds of interactive confirmation. Repeated identification and confirmation avoids misidentification and improves identification accuracy, while sparing the user complex operations and reducing interference to the user. In addition, information such as a token or verification code is used when a subsequent request from the application requires it, so as to avoid increasing the possibility of being cracked by a crawler. Furthermore, because only the platform App integrates the customized gesture verification identifier map together with its callback, even if the crawler identification rule is leaked, a crawler object would have to integrate the built-in pop-up component itself, which entails a higher cracking cost. Moreover, even if the crawler object uses a simulator, it is difficult for it to complete the verification gesture accurately without manual intervention, so the difficulty of cracking by the crawler object is effectively increased, and the whole process can be carried out automatically.

According to the embodiment of the invention, the network identification information of the suspected crawler object is added into the preset attention list, and the network identification information is identified through the suspicious degree value, so that the suspicious degree value is updated according to the multiple interactive feedback results of the acquired user interactive plug-in, the updating process of the crawler identification result of the suspected crawler object is realized, and the identification rate of the crawler object can be effectively improved.

It should be noted that any permutation and combination of the technical features in the above embodiments also falls within the scope of the present invention.

Example three

Fig. 3a is a flowchart of a crawler identification method according to a third embodiment of the present invention, which is embodied on the basis of the foregoing embodiments, and in this embodiment, specific operations after updating the crawler identification result of the suspected crawler object according to the interaction feedback result are given. Accordingly, as shown in fig. 3a, the method of the present embodiment may include:

S310, acquiring network identification information of the suspected crawler object.

S320, adding the network identification information into a preset attention list, and identifying the network identification information through suspicious degree values.

S330, when receiving the access request of the suspected crawler object, sending a verification message to the suspected crawler object.

S340, obtaining multiple interactive feedback results of the user interactive plug-ins, and updating the crawler identification result of the suspected crawler object according to the interactive feedback results.

S350, if the suspected crawler object is determined to be the crawler object according to the crawler identification result, constructing preset simulation data according to the access request of the crawler object.

The preset simulation data may be simulation data generated according to a data structure of an interface requested by the crawler object.

In the embodiment of the invention, if the suspected crawler object is determined to be the crawler object according to the crawler identification result, the preset simulation data can be constructed according to the access request of the crawler object.

Fig. 3b is a schematic flow chart of a crawler object countering method according to a third embodiment of the present invention. For example, as shown in fig. 3b, the crawler countering server is provided with a mock service and a management service, and may generate mock data according to the data structure of the interface requested by the crawler object. For example, if the price of a product is of a numerical type, the mock service randomly generates a numerical value as the product price. Likewise, if a link address is of a string type, the mock service randomly composes meaningless content or some other wrong address as the link address.
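
Generating mock data from an interface's data structure can be sketched as a lookup from field type to random generator. This is a minimal sketch, assuming the interface is described by a flat mapping from field name to a type label; the labels `"number"`, `"string"` and `"url"`, the function names, and the `example.invalid` host are all hypothetical.

```python
import random
import string
from typing import Optional, Union


def mock_value(field_type: str, rng: random.Random) -> Union[float, str]:
    """Produce a plausible-looking but meaningless value for one field."""
    if field_type == "number":
        # e.g. a product price: any random numerical value
        return round(rng.uniform(1, 1000), 2)
    if field_type == "string":
        return "".join(rng.choices(string.ascii_lowercase, k=12))
    if field_type == "url":
        # a wrong address that resolves to nothing useful
        path = "".join(rng.choices(string.ascii_lowercase, k=8))
        return "https://example.invalid/" + path
    raise ValueError(f"unsupported field type: {field_type}")


def build_mock_response(schema: dict, rng: Optional[random.Random] = None) -> dict:
    """Build preset simulation data with the same fields as the real interface."""
    rng = rng or random.Random()
    return {name: mock_value(field_type, rng) for name, field_type in schema.items()}


response = build_mock_response(
    {"price": "number", "link": "url", "title": "string"},
    random.Random(0),
)
```

Because the response has the same field names and types as the real interface, a crawler cannot tell from its shape alone that the values are fabricated.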

S360, sending the preset simulation data to the crawler object.

Correspondingly, the server can send the generated preset simulation data to the crawler object. As the crawler object crawls more data, the amount of erroneous dirty data formed by the preset simulation data also grows and becomes mixed with the previously crawled data, so the attacker has to spend more labor on screening. This raises the cost of crawling and achieves the effect of countering the crawler object.

S370, if the crawling behavior of the crawler object is determined to meet the blocking processing condition, performing blocking processing on the crawler object.

The blocking processing condition may be a trigger condition for blocking the crawler object. For example, crawling behavior of crawler objects occupies a major network bandwidth.

Correspondingly, in the embodiment of the invention, in order to restrain excessive crawling by the crawler object and prevent an attacker from maliciously attacking the server, a crawler object that meets the blocking processing condition can be blocked. For example, the IP, device ID and user ID of a crawler object that maliciously occupies network bandwidth are blocked.
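
One way to express a bandwidth-based blocking processing condition is as a share of total traffic. The sketch below is illustrative only: the 50% share, the function names `should_block` and `block`, and the identity keys `ip`, `device_id` and `user_id` are assumptions rather than values given in the embodiment.

```python
from typing import Dict, Set, Tuple


def should_block(crawler_bytes: int, total_bytes: int, max_share: float = 0.5) -> bool:
    """Blocking condition: the crawler uses more than `max_share` of the bandwidth."""
    if total_bytes <= 0:
        return False
    return crawler_bytes / total_bytes > max_share


def block(identity: Dict[str, str], blocklist: Set[Tuple[str, str]]) -> None:
    """Add every known identifier of the crawler object to the blocklist."""
    for kind in ("ip", "device_id", "user_id"):
        value = identity.get(kind)
        if value:
            blocklist.add((kind, value))


blocklist: Set[Tuple[str, str]] = set()
crawler = {"ip": "203.0.113.7", "device_id": "dev-1", "user_id": "u-9"}
if should_block(crawler_bytes=800, total_bytes=1000):
    block(crawler, blocklist)
```

Blocking all three identifiers together makes it harder for the crawler object to return by changing only one of them.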

It should be noted that fig. 3a is only a schematic diagram of one implementation, and there is no required order between S350-S360 and S370: S350-S360 may be performed before S370, S370 may be performed before S350-S360, or the two may be performed in parallel or as alternatives.

According to the embodiment of the invention, the constructed preset simulation data is sent to the crawler object, and when the crawling behavior of the crawler object is determined to meet the blocking processing condition, the crawler object is blocked, so that the crawler object can be effectively countered.

Example four

Fig. 4 is a schematic view of a crawler recognition apparatus according to a fourth embodiment of the present invention, and as shown in fig. 4, the apparatus includes: a network identification information obtaining module 410, a verification message sending module 420 and a crawler identification result updating module 430, wherein:

a network identification information obtaining module 410, configured to obtain network identification information of a suspected crawler object;

a verification message sending module 420, configured to send a verification message to the suspected crawler object when receiving an access request of the suspected crawler object; the verification message is used for the client side of the suspected crawler object to call a user interaction plug-in;

and a crawler identification result updating module 430, configured to obtain multiple interaction feedback results of the user interaction plug-in, and update a crawler identification result of the suspected crawler object according to the interaction feedback results.

Optionally, the network identification information includes an IP, a device ID, and a user ID; the device further comprises: the network identification information identification module is used for adding the network identification information into a preset attention list and identifying the network identification information through suspicious degree values; and the suspicious degree value is used for identifying a crawler identification result of the suspected crawler object.

Optionally, the crawler identification result updating module 430 is specifically configured to: if it is determined that a response message fed back by the suspected crawler object executing the user interaction plug-in is received within a preset time, update the suspicious degree value according to a first update rule; and return to the operation of obtaining multiple interaction feedback results of the user interaction plug-in until the suspicious degree value meets the interaction suspension condition.

Optionally, the crawler identification result updating module 430 is specifically configured to: if the suspicious degree value is determined to meet the interaction suspension condition, continuously updating the suspicious degree value according to the influence factors of the degree value; and when the suspicious degree value is determined to meet a first recognition termination condition, terminating updating the suspicious degree value, and deleting the network identification information of the suspected crawler object from the preset attention list.

Optionally, the crawler identification result updating module 430 is specifically configured to: if it is determined that a response message fed back by the suspected crawler object executing the user interaction plug-in is not received within the preset time, update the suspicious degree value according to a second update rule; and return to the operation of sending a verification message to the suspected crawler object when an access request of the suspected crawler object is received, until the suspicious degree value is determined to meet a second identification termination condition.

Optionally, the suspicious degree value includes a suspicious weight value and a validity time; the suspicious weight value is used for identifying whether the suspected crawler object is a crawler object, and the validity time is used for identifying the period during which the suspicious weight value is in effect; the first update rule includes: decreasing the suspicious weight value and resetting the validity time; the second update rule includes: increasing the suspicious weight value and resetting the validity time; the interaction suspension condition comprises: the suspicious weight value reaches a first preset threshold; the first identification termination condition includes: the suspicious weight value reaches a second preset threshold; the second identification termination condition includes: the suspicious weight value reaches a third preset threshold.

Optionally, the crawler identification result updating module 430 is further configured to: if it is determined that the suspicious weight value has not changed within the validity time, decrease the suspicious weight value.

Optionally, the verification message sending module 420 is specifically configured to: if the access request of the suspected crawler object is determined to meet the preset interaction condition, sending a verification message to the suspected crawler object; wherein the preset interaction condition comprises: and the associated information of the access request reaches an interaction benchmark.

Optionally, the verification message sending module 420 is specifically configured to: generating a verification identification character string through a preset encryption algorithm, and adding the verification identification character string to header information to form the verification message; and feeding back the verification message to the client of the suspected crawler object.

Optionally, the crawler identification result updating module 430 is specifically configured to: and if the suspected crawler object completes the response operation of the user interaction plug-in, receiving a response message fed back by the suspected crawler object as an interaction feedback result.

Optionally, the response message includes the verification identification string; the crawler identification result updating module 430 is further configured to: and verifying the response message to confirm the validity of the response message.

Optionally, the user interaction plug-in is configured to display a verification identifier to the client of the suspected crawler object through a set rule.

Optionally, the verification identifier includes a gesture verification identifier map; the set rule comprises: displaying the verification identifier in the interface synchronously or asynchronously in the form of an overlay layer.

Optionally, the apparatus further comprises: the preset simulation data construction module is used for constructing preset simulation data according to an access request of the crawler object if the suspected crawler object is determined to be the crawler object according to the crawler identification result; and the preset simulation data sending module is used for sending the preset simulation data to the crawler object.

Optionally, the apparatus further comprises: and the crawler object blocking processing module is used for carrying out blocking processing on the crawler object if the crawling behavior of the crawler object is determined to meet the blocking processing condition.

The crawler recognition device can execute the crawler recognition method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the crawler identification method provided in any embodiment of the present invention.

Since the above-described crawler recognition apparatus is an apparatus capable of executing the crawler recognition method in the embodiment of the present invention, based on the crawler recognition method described in the embodiment of the present invention, those skilled in the art can understand the specific implementation manner of the crawler recognition apparatus in the embodiment and various variations thereof, and therefore, how the crawler recognition apparatus implements the crawler recognition method in the embodiment of the present invention is not described in detail herein. The device used by the crawler recognition method in the embodiment of the present invention is all within the scope of the protection of the present application.

Example five

Fig. 5 is a schematic structural diagram of a computer device according to a fifth embodiment of the present invention. FIG. 5 illustrates a block diagram of a computer device 512 suitable for use in implementing embodiments of the present invention. The computer device 512 shown in FIG. 5 is only an example and should not bring any limitations to the functionality or scope of use of embodiments of the present invention. Computer device 512 is typically a computer device that assumes the functionality of a server.

As shown in FIG. 5, computer device 512 is in the form of a general purpose computing device. Components of computer device 512 may include, but are not limited to: one or more processors 516, a storage device 528, and a bus 518 that couples the various system components including the storage device 528 and the processors 516.

Bus 518 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an enhanced ISA bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnect (PCI) bus.

Computer device 512 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by computer device 512 and includes both volatile and nonvolatile media, removable and non-removable media.

Storage 528 may include computer system readable media in the form of volatile Memory, such as Random Access Memory (RAM) 530 and/or cache Memory 532. The computer device 512 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 534 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 5, and commonly referred to as a "hard drive"). Although not shown in FIG. 5, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a Compact disk-Read Only Memory (CD-ROM), a Digital Video disk (DVD-ROM), or other optical media) may be provided. In these cases, each drive may be connected to bus 518 through one or more data media interfaces. Storage 528 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.

Program 536 having a set (at least one) of program modules 526 may be stored, for example, in storage 528, such program modules 526 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination may include an implementation of a network environment. Program modules 526 generally perform the functions and/or methodologies of the described embodiments of the invention.

Computer device 512 may also communicate with one or more external devices 514 (e.g., keyboard, pointing device, camera, display 524, etc.), with one or more devices that enable a user to interact with computer device 512, and/or with any devices (e.g., network card, modem, etc.) that enable computer device 512 to communicate with one or more other computing devices. Such communication may be through an Input/Output (I/O) interface 522. Further, computer device 512 may also communicate with one or more networks (e.g., a Local Area Network (LAN), Wide Area Network (WAN), and/or a public Network, such as the internet) via Network adapter 520. As shown, the network adapter 520 communicates with the other modules of the computer device 512 via the bus 518. It should be appreciated that although not shown, other hardware and/or software modules may be used in conjunction with the computer device 512, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, Redundant Arrays of Independent Disks (RAID) systems, tape drives, and data backup storage systems, to name a few.

The processor 516 executes various functional applications and data processing, such as implementing the crawler recognition method provided by the above-described embodiments of the present invention, by executing programs stored in the storage device 528.

That is, the processor implements, when executing the program: acquiring network identification information of a suspected crawler object; when an access request of the suspected crawler object is received, sending a verification message to the suspected crawler object; the verification message is used for the client side of the suspected crawler object to call a user interaction plug-in; and obtaining multiple interactive feedback results of the user interactive plug-ins, and updating the crawler identification result of the suspected crawler object according to the interactive feedback results.

Example six

An embodiment of the present invention further provides a computer storage medium storing a computer program, where the computer program is used to execute the crawler identification method according to any one of the above embodiments of the present invention when executed by a computer processor: acquiring network identification information of a suspected crawler object; when an access request of the suspected crawler object is received, sending a verification message to the suspected crawler object; the verification message is used for the client side of the suspected crawler object to call a user interaction plug-in; and obtaining multiple interactive feedback results of the user interactive plug-ins, and updating the crawler identification result of the suspected crawler object according to the interactive feedback results.

Computer storage media for embodiments of the invention may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM or flash memory), an optical fiber, a portable compact disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, Radio Frequency (RF), etc., or any suitable combination of the foregoing.

Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims

1. A crawler identification method, comprising:

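acquiring network identification information of a suspected crawler object;

when an access request of the suspected crawler object is received, sending a verification message to the suspected crawler object, wherein the verification message is used by a client of the suspected crawler object to call a user interaction plug-in;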

obtaining multiple interaction feedback results of the user interaction plug-in, and updating the crawler identification result of the suspected crawler object according to the interaction feedback results;

after updating the crawler identification result of the suspected crawler object according to the interaction feedback result, the method further comprises the following steps:

if the suspected crawler object is determined to be a crawler object according to the crawler identification result, constructing preset simulation data according to an access request of the crawler object;

and sending the preset simulation data to the crawler object, wherein the preset simulation data is mixed with the data previously crawled by the crawler object.
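
By way of non-limiting illustration, the simulation-data steps of claim 1 might be sketched as follows. The claim does not specify how the preset simulation data is constructed or mixed; the record schema, the function names `build_simulation_data` and `mix_with_previous`, and the shuffle-based mixing are all assumptions made for this sketch.

```python
import random
from typing import Dict, List

Record = Dict[str, str]


def build_simulation_data(requested_fields: List[str], count: int,
                          rng: random.Random) -> List[Record]:
    """Construct fabricated records shaped like the crawler's access request.

    The field names come from the request; the values are placeholders
    (hypothetical format, not taken from the patent).
    """
    return [
        {name: f"sim-{name}-{rng.randrange(10**6):06d}" for name in requested_fields}
        for _ in range(count)
    ]


def mix_with_previous(simulated: List[Record], previously_crawled: List[Record],
                      rng: random.Random) -> List[Record]:
    """Interleave fabricated records with records the crawler already obtained."""
    mixed = list(simulated) + list(previously_crawled)
    rng.shuffle(mixed)  # random order, so position does not reveal the fabricated records
    return mixed
```

In this sketch the fabricated records carry the same fields as the real ones, so a crawler that stores the response cannot separate the two by structure alone.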

2. The method of claim 1, wherein the network identification information comprises an IP, a device ID, and a user ID;

after the network identification information of the suspected crawler object is obtained, the method further comprises the following steps:

adding the network identification information into a preset attention list, and identifying the network identification information through a suspicious degree value; and the suspicious degree value is used for identifying a crawler identification result of the suspected crawler object.

3. The method of claim 2, wherein the updating the crawler identification result of the suspected crawler object according to the interaction feedback result comprises:

if it is determined that a response message, fed back after the suspected crawler object executes the user interaction plug-in, is received within a preset time, updating the suspicious degree value according to a first updating rule;

and returning to execute the operation of obtaining the multiple interaction feedback results of the user interaction plug-in until the suspicious degree value meets an interaction suspension condition.

4. The method of claim 3, wherein the updating the crawler identification result of the suspected crawler object according to the interaction feedback result comprises:

if the suspicious degree value is determined to meet the interaction suspension condition, continuously updating the suspicious degree value according to the influence factors of the degree value;

and when the suspicious degree value is determined to meet a first recognition termination condition, terminating updating the suspicious degree value, and deleting the network identification information of the suspected crawler object from the preset attention list.

5. The method of claim 4, wherein the updating the crawler identification result of the suspected crawler object according to the interaction feedback result comprises:

if it is determined that no response message, fed back after the suspected crawler object executes the user interaction plug-in, is received within the preset time, updating the suspicious degree value according to a second updating rule;

and returning to execute the operation of sending a verification message to the suspected crawler object when the access request of the suspected crawler object is received until the suspicious degree value is determined to meet a second identification termination condition.

6. The method of claim 5, wherein:

the suspicious degree value comprises a suspicious weight value and a validity time; the suspicious weight value is used for identifying whether the suspected crawler object is a crawler object, and the validity time is used for identifying the period during which the suspicious weight value is valid;

the first update rule includes: decreasing the suspicious weight value and resetting the validity time;

the second update rule includes: increasing the suspicious weight value and resetting the validity time;

the interaction suspension condition comprises: the suspicious weight value reaches a first preset threshold value;

the first recognition termination condition includes: the suspicious weight value reaches a second preset threshold value;

the second recognition termination condition includes: the suspicious weight value reaches a third preset threshold value.

7. The method of claim 6, wherein the updating the crawler identification result of the suspected crawler object according to the interaction feedback result further comprises:

if it is determined that the suspicious weight value has not changed within the validity time, decreasing the suspicious weight value.
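
By way of non-limiting illustration, the suspicious degree value of claims 2 to 7 might be maintained as sketched below. The claims only require that three preset thresholds and a validity time exist; every numeric value, the step size of 1, the status strings, and the names `AttentionList` and `SuspicionEntry` are assumptions made for this sketch. Resetting the validity time after a decay step is likewise an assumption.

```python
import time
from dataclasses import dataclass

# Hypothetical values; the claims do not fix any of them.
SUSPEND_THRESHOLD = 3      # first preset threshold: stop sending verification messages
CLEAR_THRESHOLD = 0        # second preset threshold: remove from the attention list
CRAWLER_THRESHOLD = 10     # third preset threshold: treat as a crawler object
VALIDITY_SECONDS = 600.0   # length of the validity time


@dataclass
class SuspicionEntry:
    weight: int
    expires_at: float


class AttentionList:
    """Preset attention list keyed by (IP, device ID, user ID)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}

    def add(self, network_id, weight=5):
        self._entries[network_id] = SuspicionEntry(weight, self._clock() + VALIDITY_SECONDS)

    def is_watched(self, network_id):
        return network_id in self._entries

    def on_feedback(self, network_id, responded_in_time):
        """First rule (response received): decrease. Second rule (none): increase."""
        entry = self._entries[network_id]
        entry.weight += -1 if responded_in_time else 1
        entry.expires_at = self._clock() + VALIDITY_SECONDS  # reset the validity time
        return self._status(network_id)

    def on_tick(self, network_id):
        """Decay the weight if it has not changed within the validity time."""
        entry = self._entries[network_id]
        if self._clock() >= entry.expires_at:
            entry.weight -= 1
            entry.expires_at = self._clock() + VALIDITY_SECONDS
        return self._status(network_id)

    def _status(self, network_id):
        entry = self._entries[network_id]
        if entry.weight >= CRAWLER_THRESHOLD:
            return "crawler"        # second recognition termination condition
        if entry.weight <= CLEAR_THRESHOLD:
            del self._entries[network_id]
            return "cleared"        # first recognition termination condition
        if entry.weight <= SUSPEND_THRESHOLD:
            return "suspended"      # interaction suspension condition
        return "challenge"          # keep sending verification messages
```

A caller would invoke `on_feedback` after each verification round and `on_tick` periodically; passing a custom `clock` makes the decay behaviour easy to exercise without waiting.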

8. The method of claim 1, wherein the sending a verification message to the suspected crawler object when an access request of the suspected crawler object is received comprises:

if the access request of the suspected crawler object is determined to meet the preset interaction condition, sending a verification message to the suspected crawler object;

wherein the preset interaction condition comprises: the associated information of the access request reaches an interaction benchmark.

9. The method of claim 8, wherein the sending a verification message to the suspected crawler object comprises:

generating a verification identification character string through a preset encryption algorithm, and adding the verification identification character string to header information to form the verification message;

and feeding back the verification message to the client of the suspected crawler object.

10. The method of claim 9, wherein obtaining multiple interactive feedback results of the user interactive plugin comprises:

and if the suspected crawler object completes the response operation of the user interaction plug-in, receiving a response message fed back by the suspected crawler object as an interaction feedback result.

11. The method of claim 10, wherein the response message includes the verification identification character string;

after the receiving the response message fed back by the suspected crawler object, the method further includes:

and verifying the response message to confirm the validity of the response message.
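
By way of non-limiting illustration, the verification identification character string of claims 9 to 11 might be generated and checked as sketched below. The claims refer only to "a preset encryption algorithm"; this sketch substitutes an HMAC-SHA256 keyed hash, which is an assumption rather than the claimed algorithm. The secret key, the header name `X-Verify-Token`, and the function names are also hypothetical.

```python
import hashlib
import hmac
import secrets
from typing import Dict, Optional

SECRET_KEY = b"hypothetical-server-secret"  # assumption: a server-side secret
HEADER_NAME = "X-Verify-Token"              # assumption: not named in the patent


def _sign(network_id: str, nonce: str) -> str:
    message = f"{network_id}:{nonce}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def make_verification_string(network_id: str, nonce: Optional[str] = None) -> str:
    """Generate the verification identification character string."""
    nonce = nonce or secrets.token_hex(8)
    return f"{nonce}.{_sign(network_id, nonce)}"


def build_verification_message(network_id: str) -> Dict[str, str]:
    """Add the string to header information to form the verification message."""
    return {HEADER_NAME: make_verification_string(network_id)}


def verify_response(network_id: str, response_headers: Dict[str, str]) -> bool:
    """Confirm that the response message carries a valid string for this object."""
    token = response_headers.get(HEADER_NAME, "")
    nonce, separator, mac = token.partition(".")
    if not separator:
        return False
    expected = _sign(network_id, nonce)
    # Constant-time comparison on bytes avoids timing side channels.
    return hmac.compare_digest(mac.encode(), expected.encode())
```

Binding the string to the network identification information means a string issued to one suspected crawler object does not validate when replayed by another.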

12. The method of any of claims 1-11, wherein the user interaction plug-in is configured to display a verification identifier to the client of the suspected crawler object according to a set rule.

13. The method of claim 12, wherein the verification identifier comprises a gesture verification identifier image;

the set rule comprises: displaying the verification identifier in the interface as an overlay layer, either synchronously or asynchronously.

14. The method of claim 1, further comprising:

if the crawling behavior of the crawler object is determined to meet a ban condition, banning the crawler object.

15. A crawler identification apparatus, comprising:

the crawler identification result updating module is used for acquiring multiple interaction feedback results of the user interaction plug-in and updating the crawler identification result of the suspected crawler object according to the interaction feedback results;

the crawler identification apparatus further includes: the preset simulation data construction module is used for constructing preset simulation data according to an access request of the crawler object if the suspected crawler object is determined to be the crawler object according to the crawler identification result; and the preset simulation data sending module is used for sending the preset simulation data to the crawler object, wherein the preset simulation data is mixed with the data previously crawled by the crawler object.

16. A computer device, the device comprising:

one or more processors;

storage means for storing one or more programs;

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the crawler identification method of any of claims 1-14.

17. A computer storage medium having stored thereon a computer program, characterized in that the program, when being executed by a processor, implements the crawler identification method according to any one of claims 1-14.