
CN110851680B - Web crawler identification method and device - Google Patents


Info

Publication number
CN110851680B
CN110851680B CN201910957170.1A CN201910957170A CN110851680B CN 110851680 B CN110851680 B CN 110851680B CN 201910957170 A CN201910957170 A CN 201910957170A CN 110851680 B CN110851680 B CN 110851680B
Authority
CN
China
Prior art keywords
client
web crawler
url
connection information
library
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910957170.1A
Other languages
Chinese (zh)
Other versions
CN110851680A (en
Inventor
周高明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201910957170.1A
Publication of CN110851680A
Application granted
Publication of CN110851680B
Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)
  • Computer And Data Communications (AREA)

Abstract

The application provides a web crawler identification method and device. The method comprises: receiving a picture of a webpage and the URL of the webpage, sent by a client after the webpage has been rendered; acquiring a sample picture according to the URL; and identifying whether the client is a web crawler by comparing a similarity with a preset threshold, where the similarity is the similarity between the picture of the webpage and the sample picture. The method identifies web crawlers with higher reliability and does not affect how smoothly normal users browse web pages. Even if a web crawler circumvents the identification method, it must spend substantial resources to do so, which reduces the frequency at which it can access web pages.

Description

Web crawler identification method and device
Technical Field
The application relates to the technical field of Internet, in particular to a method and a device for identifying web crawlers.
Background
At present, web pages are accessed both by normal users, who browse through clients such as browsers, and by web crawlers. A web crawler is a computer program that automatically crawls web pages.
Because a web crawler does not need to render the page and only needs to obtain the file content and the uniform resource locators (Uniform Resource Locator; hereinafter referred to as URL) in the file, it can access a web server at a very high frequency, which affects access by normal users. In addition, the owners of some web pages do not want them to be crawled. It is therefore necessary to identify whether the current visitor is a crawler or a normal user, so that crawler access can be blocked or its frequency reduced.
However, existing techniques for identifying web crawlers suffer from low reliability and accuracy, and they affect how smoothly normal users browse web pages.
Disclosure of Invention
The object of the present application is to solve at least one of the technical problems in the related art to some extent.
To this end, a first object of the present application is to propose a web crawler identification method. The method identifies web crawlers with higher reliability and does not affect how smoothly normal users browse web pages. Even if a web crawler circumvents the identification method, it must spend substantial resources to do so, which reduces the frequency at which it can access web pages.
A second object of the present application is to propose an identification device for a web crawler.
In order to achieve the above object, a web crawler identification method according to an embodiment of the first aspect of the present application includes: receiving a picture of a webpage and a URL of the webpage, which are sent by a client after the webpage is rendered; acquiring a sample picture according to the URL; and according to the comparison of the similarity and a preset threshold, identifying whether the client is a web crawler, wherein the similarity is the similarity between the picture of the webpage and the sample picture.
According to the web crawler identification method, after receiving the picture of the webpage and the URL of the webpage sent by the client after the webpage is rendered, the server obtains a sample picture according to the URL, and then identifies whether the client is a web crawler according to the comparison of the similarity of the picture of the webpage and the sample picture with the preset threshold value.
In order to achieve the above object, a web crawler identification method according to an embodiment of a second aspect of the present application includes: after the webpage is rendered, the client acquires the picture of the webpage which is rendered currently and the URL of the webpage; the client sends the picture of the webpage and the URL of the webpage to a server so that the server can acquire a sample picture according to the URL, and identify whether the client is a web crawler or not according to comparison of similarity and a preset threshold, wherein the similarity is the similarity between the picture of the webpage and the sample picture.
According to the web crawler identification method, after webpage rendering is completed, a client acquires a picture of a webpage which is currently rendered and a URL of the webpage, and sends the picture of the webpage and the URL of the webpage to a server, so that the server acquires a sample picture according to the URL, and identifies whether the client is a web crawler according to comparison of similarity of the picture of the webpage and the sample picture with a preset threshold value. The method has higher reliability on the recognition of the web crawlers, does not influence the fluency of browsing the web pages of normal users, and can greatly consume the resources of the web crawlers even if the web crawlers crack the recognition method, thereby reducing the frequency of the web crawlers to access the web pages.
In order to achieve the above object, a web crawler identification method according to an embodiment of a third aspect of the present application includes: receiving connection information of a client, wherein the connection information of the client comprises an IP address of the client and connection time of the client; and if the connection information of the client is in the client library to be verified and the time of the connection information of the client in the client library to be verified exceeds the preset duration, identifying the client as a web crawler.
According to the web crawler identification method, after the connection information of the client is received, if the connection information of the client is in the client library to be verified and the time of the connection information of the client in the client library to be verified exceeds the preset duration, the client is identified as the web crawler. The method has high reliability on the identification of the web crawlers, does not influence the fluency of browsing the web pages of normal users, and has good user experience.
In order to achieve the above object, a web crawler identification apparatus according to an embodiment of the fourth aspect of the present application includes: the receiving module is used for receiving the picture of the webpage and the URL of the webpage, which are sent by the client after the webpage is rendered; the acquisition module is used for acquiring a sample picture according to the URL; the identification module is used for identifying whether the client is a web crawler or not according to the comparison of the similarity and a preset threshold, wherein the similarity is the similarity between the picture of the webpage received by the receiving module and the sample picture acquired by the acquisition module.
According to the web crawler identification device, after the receiving module receives the picture of the webpage and the URL of the webpage sent by the client after the webpage is rendered, the obtaining module obtains a sample picture according to the URL, and the identification module then identifies whether the client is a web crawler by comparing the similarity between the picture of the webpage and the sample picture with the preset threshold. The device identifies web crawlers with higher reliability and does not affect how smoothly normal users browse web pages. Even if a web crawler circumvents the identification method, it must spend substantial resources to do so, which reduces the frequency at which it can access web pages.
To achieve the above object, a web crawler identification apparatus according to an embodiment of a fifth aspect of the present application includes: the acquisition module is used for acquiring the picture of the currently rendered webpage and the URL of the webpage after the webpage is rendered; the sending module is used for sending the picture of the webpage and the URL of the webpage acquired by the acquiring module to a server, so that the server acquires a sample picture according to the URL, and identifies whether the client is a web crawler or not according to comparison of similarity and a preset threshold, wherein the similarity is the similarity between the picture of the webpage and the sample picture.
According to the web crawler identification device, after webpage rendering is completed, the acquisition module acquires the picture of the currently rendered webpage and the URL of the webpage, and the sending module sends the picture of the webpage and the URL of the webpage to the server, so that the server acquires a sample picture according to the URL and identifies whether the client is a web crawler by comparing the similarity between the picture of the webpage and the sample picture with the preset threshold. The device identifies web crawlers with higher reliability and does not affect how smoothly normal users browse web pages. Even if a web crawler circumvents the identification method, it must spend substantial resources to do so, which reduces the frequency at which it can access web pages.
In order to achieve the above object, a web crawler identification apparatus according to an embodiment of a sixth aspect of the present application includes: the receiving module is used for receiving the connection information of the client, wherein the connection information of the client comprises the IP address of the client and the connection time of the client; and the identification module is used for identifying the client as a web crawler when the connection information of the client received by the receiving module is in a client library to be verified and the connection information of the client exists in the client library to be verified for more than a preset duration.
According to the web crawler identification device, after the receiving module receives the connection information of the client, if the connection information of the client is in the client library to be verified and has been in the client library to be verified for longer than the preset duration, the identification module identifies the client as a web crawler. The device identifies web crawlers with higher reliability, does not affect how smoothly normal users browse web pages, and provides a better user experience.
Additional aspects and advantages of the application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a flow chart of one embodiment of a web crawler identification method of the present application;
FIG. 2 is a flow chart of another embodiment of a web crawler identification method of the present application;
FIG. 3 is a flow chart of yet another embodiment of a web crawler identification method of the present application;
FIG. 4 is a flow chart of yet another embodiment of a web crawler identification method of the present application;
FIG. 5 is a flow chart of yet another embodiment of a web crawler identification method of the present application;
FIG. 6 is a schematic diagram illustrating the structure of an embodiment of a web crawler recognition device according to the present application;
FIG. 7 is a schematic diagram illustrating a configuration of another embodiment of a web crawler recognition device according to the present application;
FIG. 8 is a schematic diagram illustrating a configuration of a web crawler recognition device according to another embodiment of the present application;
FIG. 9 is a schematic diagram illustrating a configuration of a web crawler recognition device according to another embodiment of the present application;
FIG. 10 is a schematic structural diagram of a web crawler recognition device according to still another embodiment of the present application.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below by referring to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application. On the contrary, the embodiments of the present application include all alternatives, modifications, and equivalents as may be included within the spirit and scope of the appended claims.
FIG. 1 is a flowchart of one embodiment of a web crawler identification method according to the present application. As shown in FIG. 1, the method may include:
Step 101, receiving a picture of the webpage and a URL of the webpage, which are sent by a client after the webpage is rendered.
Step 102, acquiring a sample picture according to the URL.
Specifically, the sample picture may be obtained according to the URL as follows: the server searches a sample picture library for a picture matching both the URL and the size of the picture of the webpage; if such a picture is found, the server outputs it; if no picture matching both the URL and the size is found, the server searches the sample picture library for pictures matching the URL, selects from them the picture closest in size, and outputs that picture.
Further, if no picture matching the URL is found in the sample picture library, or if a page of an existing URL in the sample picture library is modified, the server generates a picture of at least one rendering size supported by the URL, and stores the generated picture in the sample picture library as a sample picture of the URL.
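The lookup order described above (exact match on URL and size, then the closest size for the same URL) can be sketched as follows. This is only an illustration under stated assumptions: the `SamplePicture` record, the `find_sample` function, and the use of squared width/height distance as the meaning of "closest to the size" are hypothetical and are not specified in the application.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SamplePicture:
    """Hypothetical record for one entry of the sample picture library."""
    url: str
    width: int
    height: int
    pixels: bytes  # placeholder for the rendered image data


def find_sample(library: List[SamplePicture], url: str,
                width: int, height: int) -> Optional[SamplePicture]:
    """Return the sample for `url`, preferring an exact size match.

    Returns None when the library holds no picture for the URL, in which
    case the server would generate samples for that URL.
    """
    candidates = [s for s in library if s.url == url]
    if not candidates:
        return None
    for sample in candidates:
        if (sample.width, sample.height) == (width, height):
            return sample
    # Assumption: "closest in size" is measured by squared distance
    # between (width, height) pairs.
    return min(candidates,
               key=lambda s: (s.width - width) ** 2 + (s.height - height) ** 2)
```

A `None` result corresponds to the case in which the server generates pictures of at least one supported rendering size and stores them as samples for the URL.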
Step 103, identifying whether the client is a web crawler according to the comparison between the similarity and a preset threshold, wherein the similarity is the similarity between the picture of the webpage and the sample picture.
Specifically, if the similarity between the picture of the web page and the sample picture is greater than a preset threshold, it is identified that the client is not a web crawler.
The preset threshold may be dynamically set during specific implementation, and the size of the preset threshold is not limited in this embodiment.
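The application does not say how the similarity is computed. As one possible illustration, assuming both pictures have already been reduced to equal-length sequences of grayscale values (0 to 255), the similarity could be the fraction of pixels that differ by no more than a small tolerance. The function names, the tolerance of 8, and the threshold of 0.9 are all hypothetical.

```python
from typing import Sequence


def similarity(page_pixels: Sequence[int], sample_pixels: Sequence[int],
               tolerance: int = 8) -> float:
    """Fraction of pixels whose grayscale values differ by at most `tolerance`.

    Assumption: both inputs are equal-length grayscale sequences; pictures of
    different sizes (or empty pictures) are treated as having similarity 0.
    """
    if not page_pixels or len(page_pixels) != len(sample_pixels):
        return 0.0
    close = sum(1 for a, b in zip(page_pixels, sample_pixels)
                if abs(a - b) <= tolerance)
    return close / len(page_pixels)


def passes_picture_check(page_pixels: Sequence[int],
                         sample_pixels: Sequence[int],
                         threshold: float = 0.9) -> bool:
    """True when the similarity is strictly greater than the preset threshold."""
    return similarity(page_pixels, sample_pixels) > threshold
```

The comparison is strict, matching the "greater than a preset threshold" wording above, so a similarity exactly equal to the threshold does not pass.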
Further, before step 102, the server may further determine whether the connection information of the client and the URL are in a client library to be verified; if so, the server performs step 102 of obtaining a sample picture according to the URL. The connection information of the client may include an internet protocol (Internet Protocol; hereinafter abbreviated as IP) address of the client, a connection time of the client, a user agent (user agent), a user identifier of the client, and the like, which may mark the client.
Further, in step 103, if the similarity between the picture of the web page and the sample picture is greater than the preset threshold, then before identifying that the client is not a web crawler, the server deletes the URL of the web page from the client library to be verified and determines whether the client library to be verified still contains other URLs to be verified for that client; if not, the server identifies that the client is not a web crawler.
Further, before step 101, the server may further process a normal page access request of the client, which specifically includes: the server receives connection information of the client and a URL (uniform resource locator) currently accessed by the client, wherein the connection information of the client can comprise information which can mark the client, such as an IP (Internet protocol) address of the client, connection time of the client, a user agent, a user identifier of the client and the like; then the server judges whether the connection information of the client is in a client library to be verified; if not, the server stores the connection information of the client and the URL currently accessed by the client into a client library to be verified.
Further, after judging whether the connection information of the client is in the client library to be verified, if the connection information of the client is in the client library to be verified, the server judges whether the time of the connection information of the client in the client library to be verified exceeds a preset time length; if yes, the server identifies the client as a web crawler, and stores the connection information of the client into a web crawler library; and if the time of the connection information of the client in the client library to be verified does not exceed the preset time, the server stores the connection information of the client and the URL currently accessed by the client in the client library to be verified.
The preset duration may be dynamically set according to a service form when the preset duration is specifically implemented, and in this embodiment, the length of the preset duration is not limited, for example, the preset duration may be 10 seconds.
Further, before judging whether the connection information of the client is in the client library to be verified, the server may also judge whether the connection information of the client is in the web crawler library; if yes, the client is identified as a web crawler; if the connection information of the client is not in the web crawler library, the server executes the step of judging whether the connection information of the client is in the client library to be verified.
In the web crawler identification method, after receiving the picture of the webpage and the URL of the webpage sent by the client after the webpage is rendered, the server acquires a sample picture according to the URL, and then identifies whether the client is a web crawler according to the comparison of the similarity of the picture of the webpage and the sample picture with a preset threshold value.
FIG. 2 is a flowchart of another embodiment of a web crawler identification method of the present application. As shown in FIG. 2, the web crawler identification method may include:
Step 201, the server receives the picture of the web page and the URL of the web page sent by the client after the web page is rendered.
Step 202, determining whether the connection information of the client and the URL are in a client library to be verified. If not, the client's report does not need to be processed, and the flow ends; if the connection information of the client and the URL are in the client library to be verified, step 203 is performed.
The connection information of the client may include an IP address of the client, a connection time of the client, a user agent (user agent), a user identifier of the client, and the like, which may mark the client.
In step 203, the server obtains a sample picture according to the size of the picture of the web page and the URL.
Specifically, the server may obtain the sample picture according to the size of the picture of the web page and the URL as follows: the server searches a sample picture library for a picture matching both the URL and the size; if such a picture is found, the server outputs it.
If no picture matching both the URL and the size is found in the sample picture library, the server searches the sample picture library for pictures matching the URL. If none is found, the URL does not need sample picture comparison. If pictures matching the URL are found but none matches the size, the server selects from them the picture closest in size and outputs it.
Further, if no picture matching the URL is found in the sample picture library (that is, there is a newly added URL), or if a page of an existing URL in the sample picture library is modified, the server generates a picture of at least one rendering size supported by the URL, stores the generated picture in the sample picture library as a sample picture of the URL, and provides a picture retrieval interface for use in finding a sample picture.
Step 204, determining whether the similarity between the picture of the web page and the sample picture is greater than a preset threshold. If yes, go to step 205; if the similarity between the picture of the webpage and the sample picture is smaller than or equal to a preset threshold value, ending the flow.
The preset threshold may be dynamically set during specific implementation, and the size of the preset threshold is not limited in this embodiment.
In step 205, the server deletes the URL of the web page from the client library to be verified.
Step 206, judging whether other URLs which are required to be verified and correspond to the client are in the client library to be verified; if yes, ending the flow; if there are no other URLs to be verified corresponding to the client in the client library to be verified, step 207 is executed.
In step 207, the server identifies that the client is not a web crawler. The flow ends.
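Steps 202 and 204 to 207 can be summarized in a short sketch. It assumes the client library to be verified is a dictionary from a client key to the set of URLs still awaiting verification, and that the similarity from steps 203 and 204 has already been computed; the function name, the returned labels, and the default threshold are hypothetical.

```python
from typing import Dict, Set


def handle_report(pending: Dict[str, Set[str]], client_key: str, url: str,
                  similarity: float, threshold: float = 0.9) -> str:
    """Process one picture report and return a label for the outcome.

    `pending` stands in for the client library to be verified.
    """
    urls = pending.get(client_key)
    if urls is None or url not in urls:
        return "ignored"        # step 202: report need not be processed
    if similarity <= threshold:
        return "failed"         # step 204: flow ends, URL stays pending
    urls.discard(url)           # step 205: URL verified, remove it
    if urls:
        return "pending"        # step 206: other URLs still to verify
    return "not_crawler"        # step 207
```

What happens to the client's entry after step 207 is not stated in the text, so the sketch leaves the (now empty) entry in place.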
Further, before step 201, the server also receives and processes the normal web page access request of the client, and the flow of processing the normal web page access request of the client by the server may be as shown in fig. 3. FIG. 3 is a flowchart of a further embodiment of a web crawler identification method of the present application, which may include:
In step 301, the server receives the connection information of the client and the URL currently accessed by the client. The connection information of the client may include an IP address of the client, a connection time of the client, a user agent, a user identifier of the client, and the like, which may mark the client.
Step 302, determining whether the connection information of the client is in a web crawler library. If so, then step 303 is performed; if the connection information of the client is not in the web crawler library, step 304 is performed.
In step 303, the server identifies the client as a web crawler, and the current flow ends.
That is, in this embodiment, the client may be quickly identified as a web crawler by the connection information of the client appearing in the web crawler library.
Step 304, judging whether the connection information of the client is in a client library to be verified; if not, then step 305 is performed; if the connection information of the client is in the client library to be verified, step 306 is performed.
Step 305, the server stores the connection information of the client and the URL currently accessed by the client in the client library to be verified, and the flow ends.
Step 306, determining whether the connection information of the client has been in the client library to be verified for longer than a preset duration. If so, step 307 is performed; if not, step 305 is performed.
The preset duration may be dynamically set according to a service form when the preset duration is specifically implemented, and in this embodiment, the length of the preset duration is not limited, for example, the preset duration may be 10 seconds.
In step 307, the server identifies the client as a web crawler, and stores the connection information of the client in a web crawler library. The flow ends.
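The flow of steps 301 to 307 can be sketched as below. The sketch assumes the web crawler library is a set of client keys and the client library to be verified is a dictionary recording when each client was first stored and which URLs await verification; how the connection information is turned into a key, the field names, and the function name are hypothetical. The 10-second default follows the example duration given above.

```python
from typing import Any, Dict, Set


def handle_access(crawlers: Set[str], pending: Dict[str, Dict[str, Any]],
                  client_key: str, url: str, now: float,
                  max_pending_seconds: float = 10.0) -> str:
    """Process one normal page access request and return an outcome label."""
    if client_key in crawlers:                      # steps 302 and 303
        return "crawler"
    entry = pending.get(client_key)                 # step 304
    if entry is not None and now - entry["first_seen"] > max_pending_seconds:
        crawlers.add(client_key)                    # steps 306 and 307
        return "crawler"
    if entry is None:
        entry = {"first_seen": now, "urls": set()}
        pending[client_key] = entry
    entry["urls"].add(url)                          # step 305
    return "pending"
```

A client that never reports a matching picture stays in the pending dictionary, so its first request after the preset duration is classified as coming from a crawler, and later requests are short-circuited by the crawler set.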
According to the web crawler identification method, the client is required to report a picture of the rendered webpage, and the server verifies the reported picture against the sample picture. If the similarity between the reported picture and the sample picture exceeds the preset threshold, the access is considered a normal webpage access. The method thereby ensures that the client actually renders the webpage. Even if a web crawler circumvents the identification method, it still has to render the webpage, and rendering takes more time than merely parsing the page, so the web crawler cannot crawl webpages at high frequency.
In summary, the web crawler identification method provided by the application has the following advantages:
1. The problem of a web crawler forging its user agent is avoided. Because the method does not rely on user agent information, it identifies web crawlers more reliably.
2. The browsing experience of normal users is not affected. The client has to render the webpage anyway when a normal user browses it, and the method captures the picture of the webpage and reports it to the server only after the client's normal rendering is complete, so how smoothly a normal user browses is not affected.
3. Even if a web crawler circumvents the web crawler identification method provided by the application, it still has to render the webpage and report the picture of the rendered webpage, which consumes substantial crawler resources and reduces the frequency at which the crawler can access webpages.
FIG. 4 is a flowchart of a web crawler identification method according to another embodiment of the present application. As shown in FIG. 4, the method may include:
Step 401, after the webpage is rendered, the client obtains the picture of the currently rendered webpage and the URL of the webpage.
Step 402, the client sends the picture of the web page and the URL of the web page to the server, so that the server obtains a sample picture according to the URL, and identifies whether the client is a web crawler according to the comparison between the similarity and a preset threshold, where the similarity is the similarity between the picture of the web page and the sample picture.
The preset threshold may be dynamically set during specific implementation, and the size of the preset threshold is not limited in this embodiment.
In the web crawler identification method, after the web page is rendered, the client acquires the picture of the currently rendered web page and the URL of the web page, and sends the picture and the URL to the server, so that the server acquires a sample picture according to the URL and identifies whether the client is a web crawler by comparing the similarity between the picture of the web page and the sample picture with a preset threshold. The method identifies web crawlers with higher reliability and does not affect how smoothly normal users browse web pages. Even if a web crawler circumvents the identification method, it must spend substantial resources to do so, which reduces the frequency at which it can access web pages.
FIG. 5 is a flowchart of a web crawler identification method according to another embodiment of the present application. As shown in FIG. 5, the web crawler identification method may include:
Step 501, receiving connection information of a client.
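The report sent in step 402 has no format defined in the application. Purely as an illustration, the client could package the URL, the picture size, and the picture data into a JSON message; the function name and every field name below are hypothetical, and capturing the rendered picture itself is outside the scope of the sketch.

```python
import base64
import json


def build_report(url: str, width: int, height: int,
                 picture_bytes: bytes) -> str:
    """Serialize a rendered-page report as JSON (hypothetical format).

    The picture is base64-encoded so that binary image data can travel
    inside a text message.
    """
    return json.dumps({
        "url": url,
        "width": width,
        "height": height,
        "picture": base64.b64encode(picture_bytes).decode("ascii"),
    })
```

Including the width and height lets the server look up a sample picture of matching size for the URL, as described for the server-side embodiments.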
The connection information of the client comprises an IP address of the client and connection time of the client; further, the connection information of the client may further include information that may mark the client, such as a user agent (user agent) and a user identifier of the client.
Step 502, if the connection information of the client is in the client library to be verified, and the time that the connection information of the client exists in the client library to be verified exceeds a preset duration, identifying the client as a web crawler.
The preset duration may be dynamically set according to a service form when the preset duration is specifically implemented, and in this embodiment, the length of the preset duration is not limited, for example, the preset duration may be 10 seconds.
Further, after identifying the client as a web crawler, the server may store connection information of the client in a web crawler library.
Further, after receiving the connection information of the client, if the connection information of the client is not in the client library to be verified, the server may store the connection information of the client and the URL currently accessed by the client in the client library to be verified.
Further, in this embodiment, after step 501 and before step 502, the server may also determine whether the connection information of the client is in the web crawler library. If it is, the server identifies the client as a web crawler; if it is not, the server performs step 502.
According to the web crawler identification method, after the connection information of the client is received, if the connection information of the client is in the client library to be verified and the connection information of the client exists in the client library to be verified for more than a preset duration, the server identifies the client as a web crawler. The method has high reliability on the identification of the web crawlers, does not influence the fluency of browsing the web pages of normal users, and has good user experience.
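The flow of steps 501 and 502, together with the optional web crawler library check, can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `handle_connection`, `crawler_library`, and `pending_library` are hypothetical, the client is keyed by IP address only, and the 10-second value is the example duration given in the text.

```python
import time

PRESET_DURATION = 10.0  # seconds; the text gives 10 seconds only as an example

crawler_library = set()  # keys of clients already identified as web crawlers
pending_library = {}     # key -> (first_seen, set of URLs awaiting verification)


def handle_connection(client_ip, url, now=None):
    """Return True if the client is identified as a web crawler."""
    now = time.time() if now is None else now
    # Optional check between step 501 and step 502: known crawler?
    if client_ip in crawler_library:
        return True
    entry = pending_library.get(client_ip)
    if entry is None:
        # Unknown client: store its connection information and current URL.
        pending_library[client_ip] = (now, {url})
        return False
    first_seen, urls = entry
    if now - first_seen > PRESET_DURATION:
        # Step 502: the client has stayed unverified for too long.
        crawler_library.add(client_ip)
        return True
    # Still within the preset duration: record the newly accessed URL.
    urls.add(url)
    return False
```

Passing `now` explicitly makes the timing behavior easy to exercise without waiting; a real server would more likely rely on its own clock and a shared store.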
Fig. 6 is a schematic structural diagram of one embodiment of the web crawler identification apparatus of the present application. The web crawler identification apparatus of this embodiment may be used as a server, or as part of a server, to implement the flow of the embodiment shown in Fig. 1 of the present application. As shown in Fig. 6, the web crawler identification apparatus may include: a receiving module 61, an obtaining module 62 and an identifying module 63;
the receiving module 61 is configured to receive a picture of the web page and a URL of the web page, which are sent by the client after the web page is rendered;
an obtaining module 62, configured to obtain a sample picture according to the URL;
the identifying module 63 is configured to identify whether the client is a web crawler according to a comparison between the similarity and a preset threshold, where the similarity is a similarity between the picture of the web page received by the receiving module 61 and the sample picture acquired by the acquiring module 62. Specifically, the identifying module 63 is configured to identify that the client is not a web crawler when the similarity between the picture of the web page and the sample picture acquired by the acquiring module 62 is greater than a preset threshold. The preset threshold may be dynamically set during specific implementation, and the size of the preset threshold is not limited in this embodiment.
In the web crawler recognition device, after the receiving module 61 receives the picture of the web page and the URL of the web page sent by the client after the web page is rendered, the obtaining module 62 obtains a sample picture according to the URL, and the identifying module 63 identifies whether the client is a web crawler by comparing the similarity between the picture of the web page and the sample picture with a preset threshold.
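The threshold comparison performed by the identifying module can be illustrated with a small sketch. The text does not specify how similarity is computed, so the pixel-matching metric below is only a stand-in; `similarity`, `passes_picture_check`, and the threshold value 0.9 are hypothetical.

```python
PRESET_THRESHOLD = 0.9  # hypothetical value; the text leaves the threshold open


def similarity(picture, sample):
    """Fraction of positions whose pixel values match (a stand-in metric)."""
    if not picture or len(picture) != len(sample):
        return 0.0
    same = sum(1 for a, b in zip(picture, sample) if a == b)
    return same / len(picture)


def passes_picture_check(picture, sample, threshold=PRESET_THRESHOLD):
    """True when the similarity is greater than the preset threshold."""
    return similarity(picture, sample) > threshold
```

Here pictures are modeled as flat sequences of pixel values; a practical system would more likely use a perceptual comparison that tolerates small rendering differences.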
Fig. 7 is a schematic structural diagram of another embodiment of the web crawler recognition device of the present application. The web crawler recognition device of this embodiment may be used as a server, or as part of a server, to implement the flows of the embodiments shown in Figs. 1 to 3 of the present application. Compared with the web crawler recognition device shown in Fig. 6, the web crawler recognition device shown in Fig. 7 may further include: a judging module 64;
a judging module 64, configured to judge whether the connection information of the client and the URL are in a client library to be verified before the obtaining module 62 obtains the sample picture; the obtaining module 62 is specifically configured to perform the step of obtaining the sample picture according to the URL when the judging module 64 determines that the connection information of the client and the URL are in the client library to be verified. The connection information of the client may include information that can identify the client, such as an IP address of the client, a connection time of the client, a user agent, and a user identifier of the client.
Further, the web crawler recognition device may further include: a deletion module 65;
a deletion module 65, configured to delete, before the identification module 63 identifies that the client is not a web crawler, a URL of the web page from the client library to be verified when a similarity between a picture of the web page and the sample picture is greater than a preset threshold;
the judging module 64 is further configured to judge whether other URLs to be verified corresponding to the client exist in the client library to be verified;
at this time, the identifying module 63 is specifically configured to perform the step of identifying that the client is not a web crawler when the judging module 64 determines that there are no other URLs to be verified corresponding to the client in the client library to be verified.
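The interplay of the deletion module, the judging module, and the identifying module can be sketched as below. The function name `verify_url` and the dictionary layout are hypothetical, and removing the client's now-empty entry from the library is an assumption that the text does not state.

```python
def verify_url(pending_library, client_key, url, similarity, threshold):
    """Return True once the client is identified as not being a web crawler.

    pending_library maps a client key to the set of its URLs to be verified.
    """
    urls = pending_library.get(client_key)
    if urls is None or url not in urls:
        return False  # nothing to verify for this client and URL
    if similarity <= threshold:
        return False  # picture check failed; the URL stays pending
    urls.discard(url)  # deletion module 65 removes the verified URL
    if urls:
        return False  # other URLs of this client still await verification
    del pending_library[client_key]  # assumption: drop the now-empty entry
    return True
```

A client is cleared only after every URL it accessed has been verified, which is why a single successful picture check does not end the process.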
Further, the web crawler recognition device may further include: a saving module 66;
the receiving module 61 is further configured to receive, before receiving the picture of the web page and the URL of the web page, connection information of the client and the URL currently visited by the client, where the connection information of the client may include an IP address of the client, connection time of the client, a user agent, a user identifier of the client, and other information that may mark the client;
The judging module 64 is further configured to judge whether the connection information of the client is in a client library to be verified;
and a saving module 66, configured to store the connection information of the client and the URL currently accessed by the client into the client library to be verified when the judging module 64 determines that the connection information of the client is not in the client library to be verified.
Further, the judging module 64 is further configured to, after judging whether the connection information of the client is in the client library to be verified, judge whether the time when the connection information of the client exists in the client library to be verified exceeds a preset duration if the connection information of the client is in the client library to be verified;
the identifying module 63 is further configured to identify the client as a web crawler when the judging module 64 determines that the time of the connection information of the client in the client library to be verified exceeds a preset duration;
the saving module 66 is further configured to store the connection information of the client into a web crawler library after the identifying module 63 identifies the client as a web crawler, and to store the connection information of the client and the URL currently accessed by the client into the client library to be verified when the time for which the connection information of the client has existed in the client library to be verified does not exceed the preset duration.
In a specific implementation, the preset duration may be set dynamically according to the service form; this embodiment does not limit its length. For example, the preset duration may be 10 seconds.
Further, the judging module 64 is further configured to judge whether the connection information of the client is in the web crawler library before judging whether the connection information of the client is in the client library to be verified;
the identifying module 63 is further configured to identify the client as a web crawler when the judging module 64 determines that the connection information of the client is in the web crawler library;
the judging module 64 is specifically configured to perform the step of judging whether the connection information of the client is in the client library to be verified after determining that the connection information of the client is not in the web crawler library.
In this embodiment, the obtaining module 62 may include: a search sub-module 621 and an output sub-module 622;
wherein, the searching sub-module 621 is configured to search a sample picture library for a picture matching the URL and the size according to the size of the picture of the web page and the URL;
an output sub-module 622, configured to output the picture matching the URL and the size after the searching sub-module 621 finds a picture matching the URL and the size;
The searching sub-module 621 is further configured to, when no picture matching the URL and the size is found in the sample picture library, search a picture matching the URL in the sample picture library, and search a picture closest to the size in the searched picture matching the URL;
the output sub-module 622 is further configured to output the picture closest to the size that is found by the searching sub-module 621.
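The two-stage lookup performed by the searching and output sub-modules can be sketched as follows. The library layout, the name `find_sample`, and the distance used to decide which size is "closest" (sum of width and height differences) are assumptions made for illustration; the text does not define them.

```python
def find_sample(sample_library, url, size):
    """Look up a sample picture for a URL and a (width, height) size.

    sample_library maps url -> {(width, height): picture}.
    """
    by_size = sample_library.get(url)
    if not by_size:
        return None  # no sample for this URL; one would have to be generated
    if size in by_size:
        return by_size[size]  # exact match on both URL and size
    # Fall back to the stored size closest to the requested one.
    width, height = size
    closest = min(by_size, key=lambda s: abs(s[0] - width) + abs(s[1] - height))
    return by_size[closest]
```

Returning `None` for an unknown URL leaves the caller free to trigger sample generation.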
Further, the web crawler recognition device may further include: a generation module 67;
a generating module 67, configured to generate a picture of at least one rendering size supported by the URL when no picture matching the URL is found in the sample picture library, or when a page of an existing URL in the sample picture library is modified;
the saving module 66 is further configured to store the picture generated by the generating module 67 in the sample picture library as a sample picture of the URL.
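The cooperation between the generating module and the saving module can be sketched as below. The `render` callable is hypothetical and stands in for whatever mechanism produces a picture of the page at a given size (for example, a server-side browser); `ensure_samples` and its parameters are likewise illustrative names.

```python
def ensure_samples(sample_library, url, supported_sizes, render, page_modified=False):
    """Generate and store sample pictures when none exist or the page changed.

    Returns True if new samples were generated, False otherwise.
    """
    if url in sample_library and not page_modified:
        return False  # existing samples are still valid
    # One picture per rendering size supported by the URL.
    sample_library[url] = {size: render(url, size) for size in supported_sizes}
    return True
```

Regenerating all sizes when a page is modified keeps the stored samples consistent with one another, at the cost of extra rendering work.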
The web crawler identification device has higher reliability on web crawler identification, does not influence the fluency of browsing the webpage of a normal user, and can greatly consume the resources of the web crawler even if the web crawler breaks the identification method, so that the frequency of the web crawler to access the webpage is reduced.
Fig. 8 is a schematic structural diagram of still another embodiment of the web crawler recognition device of the present application. The web crawler recognition device of this embodiment may be used as a client, or as part of a client, to implement the flow of the embodiment shown in Fig. 4 of the present application. As shown in Fig. 8, the web crawler recognition device may include: an obtaining module 81 and a sending module 82;
the obtaining module 81 is configured to obtain, after the webpage is rendered, a picture of the currently rendered webpage and a URL of the webpage;
the sending module 82 is configured to send the picture of the web page and the URL of the web page obtained by the obtaining module 81 to a server, so that the server obtains a sample picture according to the URL, and identifies whether the client is a web crawler according to comparison between a similarity and a preset threshold, where the similarity is a similarity between the picture of the web page and the sample picture.
The preset threshold may be dynamically set during specific implementation, and the size of the preset threshold is not limited in this embodiment.
In the web crawler identification device, after the web page is rendered, the obtaining module 81 obtains the picture of the currently rendered web page and the URL of the web page, and the sending module 82 sends the picture of the web page and the URL of the web page to a server, so that the server obtains a sample picture according to the URL and identifies whether the client is a web crawler by comparing the similarity between the picture of the web page and the sample picture with the preset threshold. The device identifies web crawlers with high reliability and does not affect the fluency with which normal users browse web pages. Even if a web crawler cracks the identification method, doing so consumes a large amount of the web crawler's resources, thereby reducing the frequency with which the web crawler accesses web pages.
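What the sending module transmits can be illustrated with a small sketch of a message body. The text does not define a wire format, so the JSON layout, the field names, and the function `build_report` are all hypothetical; the picture size is included on the assumption that the server uses it when looking up a sample picture.

```python
import base64
import json


def build_report(url, picture_bytes, width, height):
    """Package the rendered picture and its URL for sending to the server."""
    return json.dumps({
        "url": url,
        "width": width,
        "height": height,
        # Binary picture data is base64-encoded so it can travel inside JSON.
        "picture": base64.b64encode(picture_bytes).decode("ascii"),
    })
```

In a browser the equivalent logic would typically run in page script; Python is used here only to keep the sketch self-contained.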
Fig. 9 is a schematic structural diagram of still another embodiment of a web crawler recognition device in the present application, where the web crawler recognition device may be used as a server, or a part of the server to implement the flow of the embodiment of fig. 5 of the present invention. As shown in fig. 9, the web crawler recognition device may include: a receiving module 91 and an identifying module 92;
the receiving module 91 is configured to receive connection information of a client. The connection information of the client comprises an IP address of the client and a connection time of the client; further, the connection information of the client may also include other information that can identify the client, such as a user agent and a user identifier of the client.
And the identifying module 92 is configured to identify the client as a web crawler when the connection information of the client received by the receiving module 91 is in a client library to be verified and the connection information of the client exists in the client library to be verified for more than a preset duration.
In a specific implementation, the preset duration may be set dynamically according to the service form; this embodiment does not limit its length. For example, the preset duration may be 10 seconds.
In the above web crawler identification apparatus, after the receiving module 91 receives the connection information of the client, if the connection information of the client is in the client library to be verified and has existed there for longer than the preset duration, the identifying module 92 identifies the client as a web crawler. The apparatus identifies web crawlers with high reliability, does not affect the fluency with which normal users browse web pages, and provides a good user experience.
Fig. 10 is a schematic structural diagram of still another embodiment of the web crawler recognition device of the present application, which is different from the web crawler recognition device shown in Fig. 9 in that the web crawler recognition device shown in Fig. 10 may further include: a saving module 93 and a judging module 94;
a saving module 93, configured to store the connection information of the client into a web crawler library after the identifying module 92 identifies the client as a web crawler.
The saving module 93 is further configured to store, when the connection information of the client received by the receiving module 91 is not in the client library to be verified, the connection information of the client and the URL currently accessed by the client into the client library to be verified.
A judging module 94, configured to judge whether the connection information of the client received by the receiving module 91 is in a web crawler library;
the identifying module 92 is further configured to identify the client as a web crawler when the judging module 94 determines that the connection information of the client is in the web crawler library.
The web crawler recognition device has high reliability on web crawler recognition, does not influence the fluency of browsing the web pages of a normal user, and has good user experience.
It should be noted that in the description of the present application, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Furthermore, in the description of the present application, unless otherwise indicated, the meaning of "a plurality" is two or more.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and further implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the embodiments of the present application.
It is to be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques, which are well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application specific integrated circuits having suitable combinational logic gates, programmable gate arrays (Programmable Gate Array; hereinafter PGA), field programmable gate arrays (Field Programmable Gate Array; hereinafter FPGA), and the like.
Those of ordinary skill in the art will appreciate that all or a portion of the steps carried out in the method of the above-described embodiments may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, and where the program, when executed, includes one or a combination of the steps of the method embodiments.
In addition, each functional module in each embodiment of the present application may be integrated in one processing module, or each module may exist alone physically, or two or more modules may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product.
The above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, or the like.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present application have been shown and described above, it will be understood that the above embodiments are illustrative and are not to be construed as limiting the application, and that changes, modifications, substitutions, and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the application.

Claims (7)

1. A web crawler identification method, comprising:
receiving connection information of a client, wherein the connection information of the client comprises an IP address of the client and connection time of the client;
If the connection information of the client is in the client library to be verified and the time that the connection information of the client exists in the client library to be verified exceeds a preset duration, identifying the client as a web crawler, otherwise, identifying whether the client is the web crawler or not through the following steps:
receiving a picture of a webpage and a URL of the webpage, which are sent by the client after the webpage is rendered; judging whether the connection information of the client and the URL are in the client library to be verified; if they are in the client library to be verified, acquiring a sample picture according to the size of the picture of the webpage and the URL; judging whether the similarity between the picture of the webpage and the sample picture is greater than a preset threshold; if the similarity is greater than the preset threshold, deleting the URL of the webpage from the client library to be verified; judging whether other URLs which correspond to the client and need to be verified exist in the client library to be verified; and if no other URL needing to be verified exists, identifying that the client is not a web crawler.
2. The method of claim 1, wherein after the identifying the client as a web crawler, further comprising:
storing the connection information of the client into a web crawler library.
3. The method according to claim 1, further comprising, after the receiving the connection information of the client:
and if the connection information of the client is not in the client library to be verified, storing the connection information of the client and the URL currently accessed by the client into the client library to be verified.
4. The method according to claim 1, further comprising, after the receiving the connection information of the client:
judging whether the connection information of the client is in a web crawler library or not;
if yes, identifying the client as a web crawler;
and if the connection information of the client is not in the web crawler library, executing the step of identifying the client as a web crawler.
5. A web crawler identification apparatus, comprising:
the receiving module is used for receiving the connection information of the client, wherein the connection information of the client comprises the IP address of the client and the connection time of the client;
the identifying module is used for identifying the client as a web crawler when the connection information of the client received by the receiving module is in a client library to be verified and the time of the connection information of the client in the client library to be verified exceeds a preset duration, otherwise, identifying whether the client is the web crawler or not through the following steps:
receiving a picture of a webpage and a URL of the webpage, which are sent by the client after the webpage is rendered; judging whether the connection information of the client and the URL are in the client library to be verified; if they are in the client library to be verified, acquiring a sample picture according to the size of the picture of the webpage and the URL; judging whether the similarity between the picture of the webpage and the sample picture is greater than a preset threshold; if the similarity is greater than the preset threshold, deleting the URL of the webpage from the client library to be verified; judging whether other URLs which correspond to the client and need to be verified exist in the client library to be verified; and if no other URL needing to be verified exists, identifying that the client is not a web crawler.
6. The apparatus as recited in claim 5, further comprising:
and the storage module is used for storing the connection information of the client into a web crawler library after the identification module identifies the client as the web crawler.
7. The apparatus of claim 6, wherein
and the storage module is further used for storing the connection information of the client and the URL currently accessed by the client into the client library to be verified when the connection information of the client received by the receiving module is not in the client library to be verified.
CN201910957170.1A 2015-05-15 2015-05-15 Web crawler identification method and device Active CN110851680B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910957170.1A CN110851680B (en) 2015-05-15 2015-05-15 Web crawler identification method and device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510250481.6A CN106294368B (en) 2015-05-15 2015-05-15 Web spider identification method and device
CN201910957170.1A CN110851680B (en) 2015-05-15 2015-05-15 Web crawler identification method and device

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201510250481.6A Division CN106294368B (en) 2015-05-15 2015-05-15 Web spider identification method and device

Publications (2)

Publication Number Publication Date
CN110851680A CN110851680A (en) 2020-02-28
CN110851680B true CN110851680B (en) 2023-06-30

Family

ID=57632270

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201910957170.1A Active CN110851680B (en) 2015-05-15 2015-05-15 Web crawler identification method and device
CN201510250481.6A Active CN106294368B (en) 2015-05-15 2015-05-15 Web spider identification method and device

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN201510250481.6A Active CN106294368B (en) 2015-05-15 2015-05-15 Web spider identification method and device

Country Status (1)

Country Link
CN (2) CN110851680B (en)

Also Published As

Publication number Publication date
CN106294368B (en) 2019-11-05
CN106294368A (en) 2017-01-04
CN110851680A (en) 2020-02-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant