
CN110399546B - Link duplicate removal method, device, equipment and storage medium based on web crawler - Google Patents

Link duplicate removal method, device, equipment and storage medium based on web crawler

Info

Publication number
CN110399546B
CN110399546B (application CN201910670803.0A)
Authority
CN
China
Prior art keywords: link, url, url link, crawled, queue
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910670803.0A
Other languages
Chinese (zh)
Other versions
CN110399546A (en)
Inventor
雷建云
王锦群
郑禄
毛腾跃
孙翀
马尧
张蕾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South Central Minzu University
Original Assignee
South Central University for Nationalities
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South Central University for Nationalities
Priority to CN201910670803.0A
Publication of CN110399546A
Application granted
Publication of CN110399546B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention relates to the technical field of the internet, and discloses a web crawler-based link deduplication method, device, equipment and storage medium. The method comprises the following steps: when a data capture request for an agricultural product to be analyzed is received, extracting a first Uniform Resource Locator (URL) link of a platform to be accessed from the data capture request; sending an access request to the platform to be accessed according to the first URL link; after receiving the response made by the platform to be accessed to the access request, capturing the data information in the page corresponding to the first URL link; analyzing the data information to obtain a second URL link embedded in the page, and adding the second URL link to a URL queue to be crawled; and performing joint deduplication on the second URL links in the URL queue to be crawled by adopting a counting bloom filter of link features combined with multiple hashing. By optimizing the link deduplication scheme, the invention improves the performance of the web crawler, ensures that the web crawler can rapidly acquire the information users need, and improves user experience.

Description

Link duplicate removal method, device, equipment and storage medium based on web crawler
Technical Field
The invention relates to the technical field of internet, in particular to a link duplicate removal method, a link duplicate removal device, link duplicate removal equipment and a link duplicate removal storage medium based on a web crawler.
Background
During crawling, a web crawler inevitably encounters repeated downloads of the same web page; to prevent the reduced efficiency and wasted server resources that such repeated crawling causes, Uniform Resource Locator (URL) deduplication filtering is required. The link deduplication methods commonly used at present are: link compression deduplication based on the MD5 message-digest algorithm (MD5), storage deduplication based on a hash algorithm, link deduplication based on a bloom filter, and the like.
The MD5-based link compression deduplication scheme reduces the storage space occupied by URLs. However, as the number of URLs increases, the memory occupancy still grows, and although MD5 collisions are rare, they nevertheless reduce the accuracy of duplicate checking, which seriously affects the performance of the web crawler.
Although the storage deduplication method based on the hash algorithm is high in deduplication speed and accuracy, a good hash function needs to be designed, and a hash table needs to be maintained. In addition, as the scale of crawling web pages increases, the memory consumption is too high, and therefore, the performance of the web crawler is also seriously affected.
The bloom-filter-based link deduplication method solves the space-complexity problem, but it has a certain false-positive rate and cannot delete elements that have already been inserted. That is, the more elements are inserted, the higher the false-positive rate becomes (see the standard approximation below), so the performance of the web crawler is again severely affected.
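For reference, the textbook approximation for a standard bloom filter's false-positive probability (general background knowledge, not a formula taken from the patent itself) shows why the error rate climbs as more elements are inserted into a fixed-size filter:

```latex
% False-positive probability of a bloom filter with m bits, k hash functions
% and n inserted elements; k_opt minimises p for given m and n.
p \approx \left(1 - e^{-kn/m}\right)^{k}, \qquad k_{\mathrm{opt}} = \frac{m}{n}\ln 2
```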
Therefore, it is urgently needed to provide a link deduplication method based on a web crawler to improve the performance of the web crawler, so that the web crawler can quickly acquire information required by people, and further improve user experience.
The above is only for the purpose of assisting understanding of the technical aspects of the present invention, and does not represent an admission that the above is prior art.
Disclosure of Invention
The invention mainly aims to provide a web crawler-based link deduplication method, device, equipment and storage medium, with the goal of improving web crawler performance by optimizing the link deduplication scheme, so that the web crawler can quickly acquire the information users need and user experience is improved.
In order to achieve the above object, the present invention provides a link deduplication method based on web crawlers, the method comprising the following steps:
when a data capture request of an agricultural product to be analyzed is received, extracting a first Uniform Resource Locator (URL) link of a platform to be accessed from the data capture request;
according to the first URL link, sending an access request to the platform to be accessed;
after receiving a response made by the platform to be accessed according to the access request, capturing data information in a page corresponding to the first URL link;
analyzing the data information to obtain a second URL link embedded in the page, and adding the second URL link to a URL queue to be crawled;
and performing joint deduplication on the second URL link in the URL queue to be crawled by adopting a counting bloom filter of link features combined with multiple hashing.
Preferably, before the step of jointly deduplicating the second URL link in the URL queue to be crawled by using a counting bloom filter of a link feature and combining multiple hashes, the method further includes:
traversing the URL queue to be crawled, performing characteristic analysis on a traversed current second URL link, and extracting a protocol type part, a path part and an inquiry part of the current second URL link;
obtaining an integral characteristic URL link corresponding to the current second URL link according to the protocol type part, the path part and the inquiry part;
and establishing a corresponding relation between the current second URL link and the integral characteristic URL link, and updating the corresponding relation into the URL queue to be crawled.
Preferably, the step of jointly removing the duplicate of the second URL link in the URL queue to be crawled by using a counting bloom filter of a link feature and combining multiple hashes includes:
traversing the URL queue to be crawled, and acquiring an integral characteristic URL link corresponding to a traversed current second URL link;
carrying out integral duplicate checking on the integral characteristic URL link by adopting a counting bloom filter of the link characteristic to obtain a duplicate checking mark corresponding to the integral characteristic URL link;
according to the duplicate checking mark, carrying out feature identification on the integral feature URL link to obtain a plurality of feature segments;
recombining the plurality of characteristic segments according to a preset URL link recombination rule to obtain N recombined URL link segments, wherein N is an integer greater than or equal to 1;
performing multiple hash duplicate checking on the N recombined URL link segments to obtain a duplicate checking result corresponding to the current second URL link;
and according to the duplicate checking result, reserving or discarding the second URL link in the URL queue to be crawled.
Preferably, after the step of recombining the plurality of feature segments according to a preset URL link recombination rule to obtain N recombined URL link segments, the method further includes:
on the basis of an MD5 algorithm, respectively compressing the obtained N recombined URL link segments to obtain character string ciphertexts corresponding to the N recombined URL link segments;
and replacing the content in the corresponding recombined URL link segment with the character string ciphertext.
Preferably, the step of performing multiple hash duplicate checking on the N recombined URL link segments to obtain a duplicate checking result corresponding to the current second URL link includes:
extracting character string ciphertexts corresponding to the N recombined URL link segments, and selecting any one character string cipher text from the N character string cipher texts to carry out Hash processing for K times to obtain K Hash values, wherein K is an integer greater than or equal to 2;
hashing the K hash values to a bit vector space which is constructed in advance to serve as reference hash values, and setting an initial count value for a space variable counter corresponding to each reference hash value;
respectively carrying out Hash processing on the remaining N-1 character string ciphertext for K times to obtain K Hash values corresponding to each remaining character string ciphertext;
randomly hashing K hash values corresponding to each residual character string ciphertext to the bit vector space, wherein the K hash values are adjacent to any one reference hash value;
inserting a preset character for each hash value newly hashed to the bit vector space before the initial count value corresponding to the adjacent reference hash value by adopting a head insertion method;
and counting the number of preset characters before the initial value corresponding to each reference hash value, and determining the duplicate checking result corresponding to the current second URL link according to the number of the preset characters.
Preferably, after the step of jointly de-duplicating the second URL link in the URL queue to be crawled by using a counting bloom filter of a link feature and combining multiple hashes, the method further includes:
based on an MD5 algorithm, compressing each second URL link in the de-duplicated URL queue to be crawled to obtain a character string ciphertext corresponding to each second URL link;
and replacing the content in the corresponding second URL link with the character string ciphertext.
Preferably, after the step of jointly de-duplicating the second URL link in the URL queue to be crawled by using a counting bloom filter of a link feature and combining multiple hashes, the method further includes:
judging whether an accessed second URL link exists in the URL queue to be crawled after duplication removal;
and if the accessed second URL link exists in the URL queue to be crawled, deleting the accessed second URL link from the URL queue to be crawled.
In addition, in order to achieve the above object, the present invention further provides a web crawler-based link deduplication device, including:
the system comprises an extraction module, a data acquisition module and a data acquisition module, wherein the extraction module is used for extracting a first Uniform Resource Locator (URL) link of a platform to be accessed from a data acquisition request when the data acquisition request of an agricultural product to be analyzed is received;
the sending module is used for sending an access request to the platform to be accessed according to the first URL link;
the grabbing module is used for grabbing data information in a page corresponding to the first URL link after receiving a response made by the platform to be visited according to the visit request;
the analysis module is used for analyzing the data information to obtain a second URL link embedded in the page and adding the second URL link to a URL queue to be crawled;
and the deduplication module is used for performing joint deduplication on the second URL link in the URL queue to be crawled by adopting a counting bloom filter of link features combined with multiple hashing.
In addition, in order to achieve the above object, the present invention further provides a link deduplication device based on a web crawler, including: a memory, a processor, and a web crawler-based link deduplication program stored on the memory and executable on the processor, the web crawler-based link deduplication program being configured to implement the steps of the web crawler-based link deduplication method as described above.
In addition, to achieve the above object, the present invention further provides a computer-readable storage medium, on which a web crawler-based link deduplication program is stored, which, when executed by a processor, implements the steps of the web crawler-based link deduplication method as described above.
According to the web crawler-based link deduplication scheme, a counting bloom filter with link features, combined with multiple hashing, is adopted to perform joint deduplication on the second URL links cached in the URL queue to be crawled, so that the misjudgment rate of the counting bloom filter is reduced as much as possible, the performance of the web crawler is noticeably improved, the web crawler can rapidly obtain the information users need, and user experience is improved.
Drawings
FIG. 1 is a schematic diagram of a web crawler-based link deduplication device in a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a first embodiment of a web crawler-based link deduplication method according to the present invention;
FIG. 3 is a flowchart illustrating a second embodiment of a web crawler-based link deduplication method according to the present invention;
FIG. 4 is a block diagram of a first embodiment of a web crawler-based link deduplication device of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1, fig. 1 is a schematic diagram of a web crawler-based link deduplication device in a hardware operating environment according to an embodiment of the present invention.
As shown in fig. 1, the web crawler-based link deduplication device may include: a processor 1001, such as a Central Processing Unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and optionally may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless-Fidelity (Wi-Fi) interface). The memory 1005 may be a Random Access Memory (RAM), or may be a Non-Volatile Memory (NVM) such as disk storage. The memory 1005 may alternatively be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the architecture shown in FIG. 1 does not constitute a limitation of web crawler-based link deduplication equipment, and may include more or fewer components than shown, or combine certain components, or a different arrangement of components.
As shown in fig. 1, a memory 1005, which is a storage medium, may include therein an operating system, a network communication module, a user interface module, and a web crawler-based link deduplication program.
In the web crawler-based link deduplication apparatus shown in fig. 1, the network interface 1004 is mainly used for data communication with a web server; the user interface 1003 is mainly used for data interaction with a user; the processor 1001 and the memory 1005 of the web crawler-based link deduplication device of the present invention may be disposed in the web crawler-based link deduplication device, and the web crawler-based link deduplication device calls the web crawler-based link deduplication program stored in the memory 1005 through the processor 1001, and executes the web crawler-based link deduplication method provided in the embodiment of the present invention.
An embodiment of the present invention provides a link deduplication method based on a web crawler, and referring to fig. 2, fig. 2 is a schematic flow diagram of a first embodiment of a link deduplication method based on a web crawler according to the present invention.
In this embodiment, the link deduplication method based on the web crawler includes the following steps:
step S10, when a data grabbing request of an agricultural product to be analyzed is received, extracting a first Uniform Resource Locator (URL) link of a platform to be visited from the data grabbing request.
Specifically, the execution main body of the embodiment is a terminal device arbitrarily deployed or installed with a web crawler system.
It should be noted that, in this embodiment, in order to improve operations such as a capturing speed and an analyzing speed of data corresponding to an agricultural product to be analyzed as much as possible, the web crawler system described in this embodiment is preferably a distributed web crawler system.
In addition, it should be understood that, in practical applications, the terminal device may be a client device or a server device, and is not limited herein.
In addition, the platform to be accessed can be a network mall displaying agricultural products to be analyzed in practical application.
Accordingly, the Uniform Resource Locator (URL) is a network address required for accessing the network mall.
In addition, it should be understood that the agricultural products to be analyzed are only a general term for various common agricultural products at present, and in practical applications, the agricultural products to be analyzed may be tea products, fruit and vegetable products, food products, and the like, which are not listed here, and no limitation is made thereto.
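By way of illustration only, step S10 might be sketched in Python as follows; the structure of the data capture request and the field name target_url are assumptions, since the patent does not fix a request format:

```python
from urllib.parse import urlparse

def extract_first_url(grab_request: dict) -> str:
    """Extract the first URL link of the platform to be accessed from a
    data capture request (the 'target_url' field name is hypothetical)."""
    url = grab_request.get("target_url", "")
    parsed = urlparse(url)
    if not parsed.scheme or not parsed.netloc:
        raise ValueError(f"request does not carry a valid URL: {url!r}")
    return url

# Example: a capture request for an agricultural product to be analyzed
request = {"product": "tea", "target_url": "https://mall.example.com/agri/index.html"}
first_url = extract_first_url(request)
```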
And step S20, sending an access request to the platform to be accessed according to the first URL link.
Specifically, in practical applications, the web crawler may send an access request to the platform to be accessed (substantially, a server of the platform) by using a HyperText Transfer Protocol (HTTP) that transmits data based on a Transmission Control Protocol/Internet Protocol (TCP/IP).
It should be understood that the above is only a specific implementation manner of sending the access request to the platform to be accessed, and the technical solution of the present invention is not limited at all, and in practical applications, those skilled in the art may set the implementation manner as needed, and the implementation manner is not limited herein.
Step S30, after receiving a response from the platform to be accessed according to the access request, capturing data information in a page corresponding to the first URL link.
It should be understood that, in practical applications, if the access request sent to the platform to be accessed is successful, and the platform to be accessed successfully verifies the first URL link carried in the access request, a successful response is made, and the data information in the page corresponding to the first URL link is fed back. At this time, the web crawler may capture the data information in the page corresponding to the first URL link, which is fed back by the platform to be accessed.
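A minimal sketch of steps S20 and S30, assuming the widely used requests library as the HTTP client (the patent does not prescribe a particular library):

```python
import requests

def fetch_page(first_url: str, timeout: float = 10.0) -> str:
    """Send an HTTP access request to the platform to be accessed and, once the
    platform responds successfully, grab the content of the corresponding page."""
    response = requests.get(first_url, timeout=timeout)
    response.raise_for_status()  # proceed only when the platform makes a successful response
    return response.text         # data information of the page behind the first URL link
```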
And step S40, analyzing the data information to obtain a second URL link embedded in the page, and adding the second URL link to a URL queue to be crawled.
It should be understood that, in practical applications, besides displaying the same data information as the agricultural product to be analyzed, a plurality of URL links related to the data information may be displayed in the page corresponding to the first URL link, which is referred to as a second URL link herein for convenience of distinction.
For example, a web mall homepage including the agricultural product to be analyzed is displayed in a page corresponding to the first URL link, four types of agricultural product information including an agricultural product a, an agricultural product B, an agricultural product C, an agricultural product D and the like are mainly displayed in the homepage, meanwhile, each type of agricultural product corresponds to a second URL link, and a small type of agricultural product included in the corresponding agricultural product is mainly displayed in a page corresponding to the second URL link.
For example, agricultural products A-1, agricultural products A-2 and agricultural products A-3 are mainly displayed in a page corresponding to a second URL link corresponding to the agricultural products A; agricultural products B-1 and B-2 are mainly displayed in a page corresponding to a second URL link corresponding to the agricultural product B; agricultural products C-1, C-2, C-3 and C4 are mainly displayed in a page corresponding to a second URL link corresponding to the agricultural product C; and the agricultural product D-1 and the agricultural product D-2 are mainly displayed in the page corresponding to the second URL link corresponding to the agricultural product D.
It should be understood that the above is only an example, and the technical solution of the present invention is not limited in any way, and in practical applications, those skilled in the art can make settings according to needs, and the present invention is not limited herein.
In addition, in this embodiment, the reason that the second URL link embedded in the page is to be added to the URL queue to be crawled is that in practical application, the number of the second URL links obtained through analysis is relatively large because the data crawled by the web crawler is large. And each crawling and analyzing of a second URL link consumes much time, so that a large number of second URL links cannot be visited in a short time, and the second URL links acquired each time need to be added into a URL queue to be crawled.
In addition, the "first" of the "first URL link" and the "second" of the "second URL link" are only used for distinguishing the URL link corresponding to the platform to be visited from the URL link embedded in the page corresponding to the URL link, and do not limit the URL link itself. In practical applications, any "second URL link" may be regarded as a "first URL link" with respect to the URL link embedded in the corresponding page.
And step S50, performing joint duplicate removal on the second URL link in the URL queue to be crawled by adopting a counting bloom filter of link characteristics and combining multiple hashes.
Specifically, the joint deduplication of the second URL link in the URL queue to be crawled by using the counting bloom filter with link characteristics and combining multiple hashes is mainly divided into deduplication of the URL link with overall characteristics corresponding to the URL link and deduplication of a URL link fragment.
Since the URL link segment is obtained according to the global feature URL link, in order to ensure that the joint deduplication operation can be performed smoothly, the corresponding relationship between the second URL link and the global feature URL link needs to be determined first.
For ease of understanding, this embodiment provides a specific implementation for determining the correspondence between the second URL link and the whole-feature URL link, roughly as follows (a code sketch follows this three-step procedure):
(1) and traversing the URL queue to be crawled, performing characteristic analysis on the traversed current second URL link, and extracting a protocol type part, a path part and an inquiry part of the current second URL link.
Specifically, in practical applications a URL link is used to uniquely identify a resource on the network. In general, a URL link contains the following five components: a protocol type part (usually denoted by Protocol), a server address part (usually denoted by Host), a port number part (usually denoted by Port), a path part (usually denoted by Path), and a query part (usually denoted by Fragment).
Wherein, the three parts of the protocol type part, the path part and the inquiry part can usually embody the characteristics of a URL link.
Therefore, in this embodiment, the URL queue to be crawled is traversed and feature analysis is performed on the traversed current second URL link, so as to extract its protocol type part (denoted p1 in the following description), its path part (denoted p2) and its query part (denoted p3).
(2) And obtaining an integral characteristic URL link corresponding to the current second URL link according to the protocol type part, the path part and the inquiry part.
Specifically, since the three parts p1, p2 and p3 together embody the overall characteristics of the current second URL link, they are combined to obtain the global characteristic URL link corresponding to the current second URL link; hereinafter, p1p2p3 denotes the global characteristic URL link corresponding to each second URL link.
(3) And establishing a corresponding relation between the current second URL link and the integral characteristic URL link, and updating the corresponding relation into the URL queue to be crawled.
Specifically, in this embodiment, the correspondence between the current second URL link and the overall characteristic URL link is to be established, and the correspondence is updated to the to-be-crawled URL queue, so that in a subsequent process of deduplication of the second URL link, the correspondence can be used to quickly find the overall characteristic URL link corresponding to the current second URL link, and further, the URL link segment corresponding to the current second URL link is obtained according to the overall URL link.
In addition, in practical application, the corresponding relation may not be updated to the URL queue to be crawled, but may be stored separately. And when the second URL link in the URL queue to be crawled is subjected to joint duplicate removal, searching the integral characteristic URL link corresponding to the current second URL link from the separately stored corresponding relation table according to the traversed current second URL link.
It should be understood that the above is only an example, and the technical solution of the present invention is not limited in any way, and in practical applications, those skilled in the art can make settings according to needs, and the present invention is not limited herein.
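The three-step correspondence procedure above might look like the following sketch, using urllib.parse; mapping the patent's protocol type, path and query parts onto urlparse's scheme, path and query components is an assumption made only for illustration:

```python
from urllib.parse import urlparse

def whole_feature_link(second_url: str) -> str:
    """Concatenate the protocol type part (p1), path part (p2) and query part (p3)
    of a second URL link into its whole-feature URL link p1p2p3."""
    parts = urlparse(second_url)
    p1, p2, p3 = parts.scheme, parts.path, parts.query
    return p1 + p2 + p3

# Correspondence table: second URL link -> whole-feature URL link
second_urls = [
    "https://mall.example.com/agri/list.html?cat=tea",
    "http://mall.example.com/agri/list.html?cat=tea",
]
correspondence = {url: whole_feature_link(url) for url in second_urls}
```

As noted above, the correspondence can either be updated into the URL queue to be crawled or, as in this sketch, kept as a separately stored table.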
Further, after the correspondence and the overall characteristic URL link corresponding to each second URL link have been obtained, the joint deduplication of the second URL links in the URL queue to be crawled, performed with the counting bloom filter of link features combined with multiple hashing, may specifically proceed as follows (a simplified code sketch is given after step (6) below):
(1) and traversing the URL queue to be crawled, and acquiring an integral characteristic URL link corresponding to the traversed current second URL link.
Specifically, the whole characteristic URL link corresponding to the traversed current second URL link is obtained according to the above correspondence.
(2) And performing integral duplicate checking on the integral characteristic URL link by adopting a counting bloom filter of the link characteristic to obtain a duplicate checking mark corresponding to the integral characteristic URL link.
Specifically, the counting bloom filter used in the present embodiment is not a counting bloom filter used in the existing link deduplication, but a counting bloom filter based on the link characteristics of URL links.
That is to say, when the counting bloom filter of this embodiment deduplicates a link, it first performs feature recognition on the overall characteristic URL link corresponding to each second URL link in the URL queue to be crawled, and then performs whole-link duplicate checking according to the recognized features; in other words, each newly entered second URL link is compared feature by feature during deduplication, so that whole-link duplicate checking is achieved.
In addition, in order to conveniently identify the URL link segment which is subsequently recombined according to the feature segment, a corresponding duplication checking mark is distributed to the whole feature URL link.
(3) And according to the duplicate checking mark, carrying out feature identification on the integral feature URL link to obtain a plurality of feature segments.
Specifically, still taking the global feature URL link p1p2p3 as an example, after feature recognition is performed on the whole feature URL link, the obtained feature segments are the segments corresponding to the protocol type part, the path part and the query part respectively, that is, feature segment p1, feature segment p2 and feature segment p3.
(4) And recombining the plurality of characteristic segments according to a preset URL link recombination rule to obtain N recombined URL link segments.
It should be understood that since an entire characteristic URL link is composed of three parts, a protocol type part, a path part and a query part, at least 1 recombined URL link segment is obtained, and N is an integer greater than or equal to 1 in this embodiment.
In addition, in practical applications, the URL link recombination rule may be set by those skilled in the art as required; for example, the recombined URL link segment must include the feature segment p1, or the recombined URL link segment cannot include the feature segment p3, and so on, which are not listed here one by one and impose no limitation.
Accordingly, if the URL link recombination rule is that the recombined URL link segment must include the feature segment p1, the resulting recombined URL link segments roughly comprise: a segment containing only feature segment p1, a segment containing feature segments p1 and p2, and a segment containing feature segments p1 and p3.
If the URL link recombination rule is instead that the recombined URL link segment cannot include the feature segment p3, the resulting recombined URL link segments roughly comprise: a segment containing only feature segment p1 and a segment containing feature segments p1 and p2.
It should be understood that the above is only an example, and the technical solution of the present invention is not limited in any way, and in practical applications, those skilled in the art can set the technical solution according to actual needs, and the technical solution is not limited herein.
(5) And performing multiple hash duplicate checking on the N recombined URL link segments to obtain a duplicate checking result corresponding to the current second URL link.
It should be noted that, in practical applications, a large number of second URL links may be cached in the URL queue to be crawled, so the number of URL link segments obtained after recombination is even larger. Therefore, in this embodiment, in order to reduce as much as possible the storage space occupied by the second URL links cached in the URL queue to be crawled, after the feature segments are recombined according to the preset URL link recombination rule into N recombined URL link segments, the N recombined URL link segments may each be compressed based on the MD5 algorithm to obtain the character string ciphertext corresponding to each recombined URL link segment, and finally the character string ciphertext replaces the content of the corresponding recombined URL link segment.
It should be understood that the above is only a specific compression method, and the technical solution of the present invention is not limited in any way, and in practical applications, a person skilled in the art can select a suitable compression method according to actual needs, and is not limited herein.
Correspondingly, the operation of performing multiple hash duplicate checking on the N recombined URL link segments to obtain the duplicate checking result corresponding to the current second URL link specifically includes:
(5-1) extracting the character string ciphertexts corresponding to the N recombined URL link segments, and selecting any one character string cipher text from the N character string cipher texts to carry out Hash processing for K times to obtain K Hash values.
It should be understood that, in the link deduplication scheme based on the web crawler provided in this embodiment, multiple hashes are specifically combined when performing joint deduplication on links, that is, at least 2 hash processes need to be performed on a string ciphertext, so that K is an integer greater than or equal to 2.
And (5-2) hashing the K hash values to a pre-constructed bit vector space to serve as reference hash values, and setting an initial count value for a space variable counter corresponding to each reference hash value.
Specifically, in the present embodiment, the initial count value displayed on the spatially variable counter corresponding to each reference hash value is represented by "0".
And (5-3) carrying out Hash processing on the remaining N-1 character string ciphertext for K times respectively to obtain K Hash values corresponding to each remaining character string ciphertext.
And (5-4) randomly hashing the K hash values corresponding to each residual character string ciphertext to the bit vector space, wherein the K hash values are adjacent to any one reference hash value.
Specifically, in order to determine whether the hash value newly hashed into the bit vector space is adjacent to the reference hash value, a determination criterion may be preset, for example, when a new hash value is inserted between two adjacent reference hash values, the reference hash value closest to the newly inserted hash value may be selected as the adjacent reference hash value.
It should be understood that the above is only an example, and the technical solution of the present invention is not limited in any way, and in practical applications, those skilled in the art can set the technical solution according to actual needs, and the technical solution is not limited herein.
And (5-5) inserting a preset character for each newly hashed hash value to the bit vector space before the initial count value corresponding to the adjacent reference hash value by adopting a head insertion method.
Specifically, in this embodiment, the preset character is represented by "1".
For example, for a reference hash value, the initial count value displayed on the corresponding spatially variable counter is "0". When a new hash value is hashed to a position adjacent to the new hash value, a preset character "1" needs to be inserted in front of "0" by using a header insertion method, and the count value displayed on the space variable counter becomes "10".
Accordingly, if two new hash values are hashed to positions adjacent to the reference hash value, a two-bit preset character "1" needs to be inserted in front of the "0" by the head insertion method, and the count value displayed on the space variable counter becomes "110".
And (5-6) counting the number of preset characters before the initial value corresponding to each reference hash value, and determining the duplicate checking result corresponding to the current second URL link according to the number of the preset characters.
Specifically, the determined duplicate checking result may be:
if the number of the preset characters '1' in front of the initial count value '0' is more than 1, determining that the recombined URL segment is repeated and needs to be discarded;
otherwise, determining that the recombined URL segment is not repeated and can be reserved.
(6) And according to the duplicate checking result, reserving or discarding the second URL link in the URL queue to be crawled.
It should be understood that the above is only a specific implementation manner of joint deduplication, and the technical solution of the present invention is not limited in any way, and in practical applications, those skilled in the art may reasonably adjust the implementation manner according to needs, and the implementation manner is not limited herein.
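To make the procedure above concrete, here is a heavily simplified sketch of the joint deduplication: a counting bloom filter keyed on the whole-feature URL link performs the whole-link duplicate check, and the recombined segments are checked through the same multi-hash structure. The duplicate-check mark, the bit vector space with space variable counters and the head-insertion bookkeeping described above are abstracted into plain integer counters, and the recombination rule ("every segment must contain p1") and the salted-MD5 hash family are assumptions made only for illustration:

```python
import hashlib
from urllib.parse import urlparse

class CountingBloomFilter:
    """Counting bloom filter over link features: K salted hashes map a key into a
    counter array; counters (rather than single bits) also allow later deletion."""
    def __init__(self, size: int = 1 << 20, k: int = 4):
        self.size, self.k = size, k
        self.counters = [0] * size

    def _positions(self, key: str):
        for i in range(self.k):  # K hash values per key, via salted MD5
            digest = hashlib.md5(f"{i}:{key}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def contains(self, key: str) -> bool:
        return all(self.counters[p] > 0 for p in self._positions(key))

    def add(self, key: str) -> None:
        for p in self._positions(key):
            self.counters[p] += 1

def recombined_segments(p1: str, p2: str, p3: str):
    """One possible recombination rule (assumed): every recombined segment contains p1."""
    return [p1, p1 + p2, p1 + p3]

def joint_dedup(second_urls, correspondence, whole_filter, segment_filter):
    """Keep a second URL link only if its whole-feature link and its recombined
    segments have not all been seen before; otherwise discard it as a duplicate."""
    kept = []
    for url in second_urls:
        feature = correspondence[url]
        parts = urlparse(url)
        segments = recombined_segments(parts.scheme, parts.path, parts.query)
        duplicate = whole_filter.contains(feature) and all(
            segment_filter.contains(s) for s in segments)
        if duplicate:
            continue  # duplicate-check result: repeated, discard from the queue
        whole_filter.add(feature)
        for s in segments:
            segment_filter.add(s)
        kept.append(url)  # duplicate-check result: not repeated, keep in the queue
    return kept

# Using second_urls and correspondence from the earlier sketch:
deduped = joint_dedup(second_urls, correspondence,
                      CountingBloomFilter(), CountingBloomFilter())
```

The and-combination of the whole-link check and the segment check is what drives the misjudgment rate below that of a single bloom filter: a link is discarded only when both checks report it as already seen.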
In addition, in practical application, in order to further reduce the occupation of a storage space, after a counting bloom filter of link characteristics is adopted and multiple hashes are combined to perform joint deduplication on the second URL links in the URL queue to be crawled, each second URL link in the URL queue to be crawled after deduplication is performed can be compressed based on an MD5 algorithm, and then a character string ciphertext corresponding to each second URL link is obtained; and finally, replacing the content in the corresponding second URL link with the character string ciphertext, so that the second URL link in the URL queue to be crawled is compressed as much as possible, and the occupation of a storage space is reduced.
As can be seen from the above description, in the link deduplication method based on the web crawler provided by this embodiment, the counting bloom filter with the link characteristic is adopted, and multiple hashes are combined to perform integral and partial joint deduplication on the second URL link cached in the URL queue to be crawled, so that the misjudgment rate of the counting bloom filter is reduced as much as possible, the performance of the web crawler is effectively improved, the web crawler can quickly acquire information required by people, and the user experience is improved as much as possible.
In addition, in the deduplication process, the URL link is compressed based on a compression algorithm, such as the MD5 algorithm, so that the occupation of the storage space is reduced as much as possible.
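A sketch of that MD5 compression step, using Python's hashlib; the text only fixes the use of MD5, so storing the 32-character hex digest as the "character string ciphertext" is an assumption about representation:

```python
import hashlib

def compress_url(url: str) -> str:
    """MD5-based compression: replace a (possibly very long) URL with its
    fixed-length 32-character hexadecimal string ciphertext."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()

# Replace the content of each second URL link in the deduplicated queue with its ciphertext
deduplicated_queue = ["https://mall.example.com/agri/list.html?cat=tea"]  # illustrative only
compressed_queue = [compress_url(url) for url in deduplicated_queue]
```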
Referring to fig. 3, fig. 3 is a flowchart illustrating a second embodiment of a web crawler-based link deduplication method according to the present invention.
Based on the first embodiment, after the step S50, the web crawler-based link deduplication method in this embodiment further includes:
step S60, determining whether there is an accessed second URL link in the URL queue to be crawled after deduplication is performed.
Specifically, if it is determined that an accessed second URL link exists in the deduplicated URL queue to be crawled, that is, the web crawler has already accessed the page corresponding to that second URL link and captured the data information in it, then, in order to avoid capturing the same data repeatedly and wasting web crawler resources by accessing the second URL link again, the operation of step S70 needs to be executed; otherwise, execution continues with step S60.
And step S70, deleting the accessed second URL link from the URL queue to be crawled.
Specifically, in practical applications, a deletion operation may be performed when it is detected that one second URL link is accessed, or all the currently marked second URL links may be deleted together when the accessed second URL links are marked first and then the marked accessed second URL links reach a predetermined number or a predetermined deletion time.
It should be understood that the above is only an example, and the technical solution of the present invention is not limited in any way, and in practical applications, those skilled in the art can make settings according to needs, and the present invention is not limited herein.
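One possible sketch of steps S60 and S70; deferring deletion until a batch of visited links has accumulated follows the batched option described above, and the threshold value is an assumption:

```python
from collections import deque

def purge_visited(url_queue: deque, visited: set, batch_threshold: int = 100) -> deque:
    """Delete second URL links that have already been accessed from the URL queue
    to be crawled, once enough visited links have been marked."""
    if len(visited) < batch_threshold:
        return url_queue  # keep monitoring; too few marked links to trigger deletion
    return deque(url for url in url_queue if url not in visited)
```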
As can be seen from the above description, according to the link deduplication method based on the web crawler provided by this embodiment, the access condition of the second URL link in the URL queue to be crawled is detected in a timed or real-time manner, and when it is detected that the second URL link that has been accessed exists in the URL queue to be crawled, the second URL link that has been accessed is deleted from the URL queue to be crawled, so that it can be ensured that the second URL links cached in the URL queue to be crawled are all the second URL links that have not been accessed, and therefore, the web crawler is prevented from repeatedly crawling the same data according to the same second URL link, and the performance of the web crawler is further improved.
Furthermore, an embodiment of the present invention further provides a computer-readable storage medium, where a web crawler-based link deduplication program is stored on the computer-readable storage medium, and when executed by a processor, the web crawler-based link deduplication program implements the steps of the web crawler-based link deduplication method described above.
Referring to fig. 4, fig. 4 is a block diagram illustrating a first embodiment of a web crawler-based link deduplication device of the present invention.
As shown in fig. 4, a web crawler-based link deduplication apparatus according to an embodiment of the present invention includes: an extraction module 4001, a sending module 4002, a grabbing module 4003, an analysis module 4004 and a de-duplication module 4005.
The extraction module 4001 is configured to, when a data capture request of an agricultural product to be analyzed is received, extract a first Uniform Resource Locator (URL) link of a platform to be visited from the data capture request; a sending module 4002, configured to send an access request to the platform to be accessed according to the first URL link; the grabbing module 4003 is configured to grab data information in a page corresponding to the first URL link after receiving a response made by the platform to be visited according to the access request; the analyzing module 4004 is configured to analyze the data information to obtain a second URL link embedded in the page, and add the second URL link to a URL queue to be crawled; the duplication removing module 4005 is configured to perform joint duplication removal on the second URL link in the URL queue to be crawled by using a counting bloom filter of link characteristics in combination with multiple hashes.
It should be noted that, all the modules involved in this embodiment are logic modules, and in practical application, one logic unit may be one physical unit, may also be a part of one physical unit, and may also be implemented by a combination of multiple physical units. In addition, in order to highlight the innovative part of the present invention, a unit which is not so closely related to solve the technical problem proposed by the present invention is not introduced in the present embodiment, but it does not indicate that there is no other unit in the present embodiment.
In addition, it is worth mentioning that, in this embodiment, when the duplication removing module 4005 adopts a counting bloom filter of link characteristics and performs joint duplication removal on the second URL link in the URL queue to be crawled by combining multiple hashes, the joint duplication removal is specifically divided into duplication removal on the whole characteristic URL link corresponding to the URL link and duplication removal on a URL link fragment.
Since the URL link segment is obtained from the global feature URL link, the correspondence between the second URL link and the global feature URL link needs to be determined in order to ensure that the deduplication module 4005 can perform the above operations smoothly.
Regarding the manner of determining the correspondence between the second URL link and the global characteristic URL link, the following may be roughly described:
firstly, traversing the URL queue to be crawled, performing characteristic analysis on a traversed current second URL link, and extracting a protocol type part, a path part and an inquiry part of the current second URL link;
then, obtaining an integral characteristic URL link corresponding to the current second URL link according to the protocol type part, the path part and the inquiry part;
and finally, establishing a corresponding relation between the current second URL link and the integral characteristic URL link, and updating the corresponding relation into the URL queue to be crawled.
Accordingly, after obtaining the above correspondence, the operation performed by the deduplication module 4005 is specifically:
firstly, traversing the URL queue to be crawled, and acquiring an integral characteristic URL link corresponding to a traversed current second URL link;
then, carrying out integral duplicate checking on the integral characteristic URL link by adopting a counting bloom filter of the link characteristic to obtain a duplicate checking mark corresponding to the integral characteristic URL link;
then, according to the duplication checking mark, carrying out feature identification on the integral feature URL link to obtain a plurality of feature segments;
secondly, recombining the plurality of characteristic segments according to a preset URL link recombination rule to obtain N recombined URL link segments;
then, performing multiple hash duplicate checking on the N recombined URL link segments to obtain a duplicate checking result corresponding to the current second URL link;
and finally, according to the duplicate checking result, reserving or discarding the second URL link in the URL queue to be crawled.
In this embodiment, N is an integer of 1 or more.
In addition, it should be understood that what is given above is only a specific implementation manner of determining a corresponding relationship between a second URL link and an overall characteristic URL link, and using a counting bloom filter of a link characteristic, and combining multiple hashes to perform joint deduplication on the second URL link in the URL queue to be crawled, and the technical scheme of the present invention is not limited at all.
Further, in practical applications, in order to reduce as much as possible the storage space occupied by the second URL links cached in the URL queue to be crawled, after the feature segments are recombined according to the preset URL link recombination rule into N recombined URL link segments, the N recombined URL link segments may each be compressed based on the MD5 algorithm to obtain the corresponding character string ciphertexts, and finally each character string ciphertext replaces the content of the corresponding recombined URL link segment.
Correspondingly, the operation of performing multiple hash duplicate checking on the N recombined URL link segments to obtain a duplicate checking result corresponding to the current second URL link specifically includes:
firstly, extracting character string ciphertexts corresponding to N recombined URL link segments, and selecting any one character string cipher text from the N character string cipher texts to carry out Hash processing for K times to obtain K Hash values;
then, hashing the K hash values to a pre-constructed bit vector space to serve as reference hash values, and setting an initial count value for a space variable counter corresponding to each reference hash value;
then, carrying out Hash processing on the remaining N-1 character string ciphertext for K times respectively to obtain K Hash values corresponding to each remaining character string ciphertext;
then, randomly hashing K hash values corresponding to each residual character string ciphertext to the bit vector space, wherein the K hash values are adjacent to any one reference hash value;
then, inserting a preset character for each hash value newly hashed to the bit vector space before the initial count value corresponding to the adjacent reference hash value by adopting a head insertion method;
and finally, counting the number of preset characters before the initial value corresponding to each reference hash value, and determining the duplicate checking result corresponding to the current second URL link according to the number of the preset characters.
In this embodiment, K is an integer of 2 or more.
In addition, it should be understood that the above is only a specific implementation manner for obtaining the duplicate checking result corresponding to the current second URL link, and the technical solution of the present invention is not limited at all, and in a specific application, a person skilled in the art may set the duplicate checking result as needed, and the present invention is not limited to this.
In addition, in practical application, in order to further reduce the occupation of a storage space, after the second URL links in the URL queue to be crawled are subjected to joint duplicate removal, each second URL link in the URL queue to be crawled after the duplicate removal can be compressed based on an MD5 algorithm, so as to obtain a character string ciphertext corresponding to each second URL link; and finally, replacing the content in the corresponding second URL link with the character string ciphertext, so that the second URL link in the URL queue to be crawled is compressed as much as possible, and the occupation of a storage space is reduced.
As can easily be seen from the above description, the web crawler-based link deduplication device provided by this embodiment adopts the counting bloom filter of link features, combined with multiple hashing, to perform whole and partial joint deduplication on the second URL links cached in the URL queue to be crawled, thereby reducing the misjudgment rate of the counting bloom filter as much as possible, effectively improving the performance of the web crawler, enabling the web crawler to rapidly acquire the information users need, and improving user experience as much as possible.
In addition, in the deduplication process, the URL link is compressed based on a compression algorithm, such as the MD5 algorithm, so that the occupation of the storage space is reduced as much as possible.
It should be noted that the above-described work flows are only exemplary, and do not limit the scope of the present invention, and in practical applications, a person skilled in the art may select some or all of them to achieve the purpose of the solution of the embodiment according to actual needs, and the present invention is not limited herein.
In addition, the technical details that are not described in detail in this embodiment may be referred to a web crawler-based link deduplication method provided in any embodiment of the present invention, and are not described herein again.
Based on the first embodiment of the web crawler-based link deduplication device, a second embodiment of the web crawler-based link deduplication device of the present invention is provided.
In this embodiment, the web crawler-based link deduplication device further includes a deletion module.
Specifically, the deleting module is configured to determine whether an accessed second URL link exists in the URL queue to be crawled after the duplication is removed.
Correspondingly, if the accessed second URL link exists in the URL queue to be crawled, deleting the accessed second URL link from the URL queue to be crawled; and if not, continuously monitoring a second URL link in the URL queue to be crawled, and judging whether the accessed second URL link exists.
It should be noted that, all the modules involved in this embodiment are logic modules, and in practical application, one logic unit may be one physical unit, may also be a part of one physical unit, and may also be implemented by a combination of multiple physical units. In addition, in order to highlight the innovative part of the present invention, a unit which is not so closely related to solve the technical problem proposed by the present invention is not introduced in the present embodiment, but it does not indicate that there is no other unit in the present embodiment.
In addition, it should be understood that the above is only an example, and the technical solution of the present invention is not limited at all, and in a specific application, a person skilled in the art may set the technical solution as needed, and the present invention is not limited to this.
It can be easily seen from the above description that, in the link deduplication device based on the web crawler provided in this embodiment, the access condition of the second URL link in the URL queue to be crawled is detected in a timed or real-time manner, and when it is detected that the second URL link which has been accessed exists in the URL queue to be crawled, the accessed second URL link is deleted from the URL queue to be crawled, so that it can be ensured that the second URL links cached in the URL queue to be crawled are all the second URL links which have not been accessed, and therefore, the web crawler is prevented from repeatedly crawling the same data according to the same second URL link, and the performance of the web crawler is further improved.
It should be noted that the above workflows are merely illustrative and do not limit the scope of the present invention; in practical applications, a person skilled in the art may select some or all of them as needed to achieve the purpose of the solution of this embodiment, and the present invention is not limited in this respect.
In addition, for technical details not described in detail in this embodiment, reference may be made to the web crawler-based link deduplication method provided in any embodiment of the present invention, and details are not repeated here.
Further, it should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a/an ..." does not exclude the presence of other identical elements in the process, method, article, or system that comprises the element.
The above serial numbers of the embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, or the part that contributes to the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium (e.g., Read-Only Memory (ROM)/RAM, magnetic disk, or optical disc) and includes several instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the methods of the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention and is not intended to limit the scope of the present invention. All equivalent structural or process modifications made using the contents of the present specification and the accompanying drawings, or direct or indirect applications in other related technical fields, are likewise included within the scope of the present invention.

Claims (9)

1. A link deduplication method based on web crawlers is characterized by comprising the following steps:
when a data capture request of an agricultural product to be analyzed is received, extracting a first Uniform Resource Locator (URL) link of a platform to be accessed from the data capture request;
according to the first URL link, sending an access request to the platform to be accessed;
after receiving a response made by the platform to be accessed according to the access request, capturing data information in a page corresponding to the first URL link;
analyzing the data information to obtain a second URL link embedded in the page, and adding the second URL link to a URL queue to be crawled;
performing joint duplicate removal on the second URL link in the URL queue to be crawled by adopting a counting bloom filter of link characteristics and combining multiple hashing;
the step of adopting a counting bloom filter of link characteristics and combining multiple hashes to perform joint deduplication on the second URL link in the URL queue to be crawled comprises the following steps of:
traversing the URL queue to be crawled, and acquiring an integral characteristic URL link corresponding to a traversed current second URL link;
carrying out integral duplicate checking on the integral characteristic URL link by adopting a counting bloom filter of the link characteristic to obtain a duplicate checking mark corresponding to the integral characteristic URL link;
according to the duplicate checking mark, carrying out feature identification on the integral feature URL link to obtain a plurality of feature segments;
recombining the plurality of characteristic segments according to a preset URL link recombination rule to obtain N recombined URL link segments, wherein N is an integer greater than or equal to 1;
performing multiple hash duplicate checking on the N recombined URL link segments to obtain a duplicate checking result corresponding to the current second URL link;
and according to the duplicate checking result, reserving or discarding the second URL link in the URL queue to be crawled.
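Purely as an illustrative sketch of the flow recited in claim 1, and not as part of the claims, the joint whole-and-partial deduplication could be organized roughly as follows in Python; the recombination rule, the use of a plain set in place of the counting bloom filter, and the rule that a link is discarded only when every recombined segment has already been seen are all simplifying assumptions by the editor:

from urllib.parse import urlsplit

def recombine_segments(url_link):
    # Split the link into protocol, path and query parts and recombine them
    # into N = 3 partial link segments (a simplified recombination rule).
    p = urlsplit(url_link)
    return [p.scheme + "://" + p.netloc,
            p.scheme + "://" + p.netloc + p.path,
            p.path + "?" + p.query]

def joint_deduplicate(url_queue_to_crawl, seen_segments):
    # seen_segments is a plain set standing in for the counting bloom filter;
    # a link is discarded only when every recombined segment has been seen before.
    kept = []
    for url in url_queue_to_crawl:
        segments = recombine_segments(url)
        if all(seg in seen_segments for seg in segments):
            continue                      # duplicate -> discard
        seen_segments.update(segments)    # record the new segments
        kept.append(url)                  # reserve the link in the queue
    return kept

queue = ["http://a.example/list?item=rice",
         "http://a.example/list?item=rice",   # exact duplicate, discarded
         "http://a.example/list?item=tea"]
print(joint_deduplicate(queue, set()))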
2. The method of claim 1, wherein prior to the step of employing a counting bloom filter of link characteristics in combination with multiple hashing to jointly deduplicate the second URL link in the URL queue to be crawled, the method further comprises:
traversing the URL queue to be crawled, performing characteristic analysis on a traversed current second URL link, and extracting a protocol type part, a path part and a query part of the current second URL link;
obtaining an integral characteristic URL link corresponding to the current second URL link according to the protocol type part, the path part and the query part;
and establishing a corresponding relation between the current second URL link and the integral characteristic URL link, and updating the corresponding relation into the URL queue to be crawled.
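For illustration only, and not as part of the claims, the whole-feature URL link of claim 2 could be derived from the protocol type part, path part and query part roughly as follows; the '|' joining rule and all names are the editor's assumptions:

from urllib.parse import urlsplit

def whole_feature_of(second_url_link: str) -> str:
    # Extract the protocol type part, path part and query part and join them
    # into the whole-feature URL link (fragments and the like are dropped).
    p = urlsplit(second_url_link)
    return "|".join([p.scheme, p.netloc + p.path, p.query])

link = "https://market.example/price/list?item=rice&page=2#top"
print(whole_feature_of(link))   # https|market.example/price/list|item=rice&page=2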
3. The method according to claim 2, wherein after the step of recombining the plurality of feature segments according to a preset URL link recombination rule to obtain N recombined URL link segments, the method further comprises:
on the basis of an MD5 algorithm, respectively compressing the obtained N recombined URL link segments to obtain character string ciphertexts corresponding to the N recombined URL link segments;
and replacing the content in the corresponding recombined URL link segment with the character string ciphertext.
4. The method of claim 3, wherein the step of performing multiple hash duplicate checking on the N recombined URL link segments to obtain the duplicate checking result corresponding to the current second URL link comprises:
extracting the character string ciphertexts corresponding to the N recombined URL link segments, and selecting any one character string ciphertext from the N character string ciphertexts to perform hash processing K times to obtain K hash values, wherein K is an integer greater than or equal to 2;
hashing the K hash values to a bit vector space which is constructed in advance to serve as reference hash values, and setting an initial count value for a space variable counter corresponding to each reference hash value;
respectively performing hash processing K times on the remaining N-1 character string ciphertexts to obtain K hash values corresponding to each remaining character string ciphertext;
randomly hashing the K hash values corresponding to each remaining character string ciphertext to the bit vector space, wherein the K hash values are adjacent to any one reference hash value;
inserting a preset character for each hash value newly hashed to the bit vector space before the initial count value corresponding to the adjacent reference hash value by adopting a head insertion method;
and counting the number of preset characters before the initial count value corresponding to each reference hash value, and determining the duplicate checking result corresponding to the current second URL link according to the number of the preset characters.
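As a non-limiting sketch of the duplicate check recited in claim 4, the following Python fragment performs K hash passes over a counter array; the head insertion of preset characters described in the claim is abstracted here into counter increments, and the sizes M and K, the hashing scheme, and all names are assumptions by the editor:

import hashlib

M = 2 ** 16        # size of the pre-constructed bit vector space
K = 4              # number of hash passes per string ciphertext (K >= 2)
counters = [0] * M # one space-variable counter per position in the space

def k_hash_values(ciphertext: str):
    # Derive K positions in the bit vector space from one string ciphertext.
    return [int(hashlib.md5((ciphertext + str(i)).encode()).hexdigest(), 16) % M
            for i in range(K)]

def check_and_insert(ciphertext: str) -> bool:
    # Judge the ciphertext a duplicate when all K counters are already non-zero,
    # then record it by incrementing the counters (standing in for the claim's
    # head-inserted preset characters).
    positions = k_hash_values(ciphertext)
    is_duplicate = all(counters[p] > 0 for p in positions)
    for p in positions:
        counters[p] += 1
    return is_duplicate

print(check_and_insert("5f4dcc3b"))   # False: first time seen
print(check_and_insert("5f4dcc3b"))   # True : judged a duplicate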
5. The method of any of claims 1 to 4, wherein after the step of employing a counting bloom filter of link characteristics in combination with multiple hashing to jointly deduplicate the second URL link in the URL queue to be crawled, the method further comprises:
based on an MD5 algorithm, compressing each second URL link in the de-duplicated URL queue to be crawled to obtain a character string ciphertext corresponding to each second URL link;
and replacing the content in the corresponding second URL link with the character string ciphertext.
6. The method of any of claims 1 to 4, wherein after the step of employing a counting bloom filter of link characteristics in combination with multiple hashing to jointly deduplicate the second URL link in the URL queue to be crawled, the method further comprises:
judging whether an accessed second URL link exists in the URL queue to be crawled after duplication removal;
and if the accessed second URL link exists in the URL queue to be crawled, deleting the accessed second URL link from the URL queue to be crawled.
7. A web crawler-based link deduplication apparatus, the apparatus comprising:
the extraction module is used for extracting, when a data capture request for an agricultural product to be analyzed is received, a first Uniform Resource Locator (URL) link of a platform to be accessed from the data capture request;
the sending module is used for sending an access request to the platform to be accessed according to the first URL link;
the grabbing module is used for grabbing data information in a page corresponding to the first URL link after receiving a response made by the platform to be accessed according to the access request;
the analysis module is used for analyzing the data information to obtain a second URL link embedded in the page and adding the second URL link to a URL queue to be crawled;
the duplication removing module is used for performing joint duplicate removal on the second URL link in the URL queue to be crawled by adopting a counting bloom filter of link characteristics and combining multiple hashing;
the duplication eliminating module is also used for traversing the URL queue to be crawled and acquiring an integral characteristic URL link corresponding to the traversed current second URL link;
the duplication removing module is further used for carrying out integral duplication checking on the integral characteristic URL link by adopting a counting bloom filter of the link characteristic to obtain a duplication checking mark corresponding to the integral characteristic URL link;
the duplication eliminating module is further used for carrying out feature identification on the integral feature URL link according to the duplication checking mark to obtain a plurality of feature segments;
the duplication removing module is further used for recombining the plurality of characteristic segments according to a preset URL link recombination rule to obtain N recombined URL link segments, wherein N is an integer greater than or equal to 1;
the duplicate removal module is further configured to perform multiple hash duplicate checking on the N recombined URL link segments to obtain a duplicate checking result corresponding to the current second URL link;
and the duplication removing module is also used for reserving or discarding the second URL link in the URL queue to be crawled according to the duplication checking result.
8. A web crawler-based link deduplication apparatus, the apparatus comprising: a memory, a processor, and a web crawler-based link deduplication program stored on the memory and executable on the processor, the web crawler-based link deduplication program configured to implement the steps of the web crawler-based link deduplication method as recited in any one of claims 1 through 6.
9. A computer-readable storage medium, wherein a web crawler-based link deduplication program is stored on the computer-readable storage medium, and when executed by a processor, implements the steps of the web crawler-based link deduplication method according to any one of claims 1 through 6.
CN201910670803.0A 2019-07-23 2019-07-23 Link duplicate removal method, device, equipment and storage medium based on web crawler Active CN110399546B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910670803.0A CN110399546B (en) 2019-07-23 2019-07-23 Link duplicate removal method, device, equipment and storage medium based on web crawler

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910670803.0A CN110399546B (en) 2019-07-23 2019-07-23 Link duplicate removal method, device, equipment and storage medium based on web crawler

Publications (2)

Publication Number Publication Date
CN110399546A CN110399546A (en) 2019-11-01
CN110399546B (en) 2022-02-08

Family

ID=68325974

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910670803.0A Active CN110399546B (en) 2019-07-23 2019-07-23 Link duplicate removal method, device, equipment and storage medium based on web crawler

Country Status (1)

Country Link
CN (1) CN110399546B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112948654A (en) * 2019-11-26 2021-06-11 上海哔哩哔哩科技有限公司 Webpage crawling method and device and computer equipment
CN112417240A (en) * 2020-02-21 2021-02-26 上海哔哩哔哩科技有限公司 Website link detection method and device and computer equipment
CN111753162A (en) * 2020-06-29 2020-10-09 平安国际智慧城市科技股份有限公司 Data crawling method, device, server and storage medium
CN111930924A (en) * 2020-07-02 2020-11-13 上海微亿智造科技有限公司 Data duplicate checking system and method based on bloom filter
CN112287201A (en) * 2020-12-31 2021-01-29 北京精准沟通传媒科技股份有限公司 Method, device, medium and electronic equipment for removing duplicate of crawler request
CN113051498B (en) * 2021-03-22 2024-03-12 全球能源互联网研究院有限公司 URL (Uniform resource locator) de-duplication method and system based on multiple bloom filtering

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106407485A (en) * 2016-12-20 2017-02-15 福建六壬网安股份有限公司 URL de-repetition method and system based on similarity comparison
CN107798106A (en) * 2017-10-31 2018-03-13 广东思域信息科技有限公司 A kind of URL De-weight methods in distributed reptile system
CN107885777A (en) * 2017-10-11 2018-04-06 北京智慧星光信息技术有限公司 A kind of control method and system of the crawl web data based on collaborative reptile
CN108121706A (en) * 2016-11-28 2018-06-05 央视国际网络无锡有限公司 A kind of optimization method of distributed reptile
CN108628871A (en) * 2017-03-16 2018-10-09 哈尔滨英赛克信息技术有限公司 A kind of link De-weight method based on chain feature
CN109561163A (en) * 2017-09-27 2019-04-02 阿里巴巴集团控股有限公司 The generation method and device of uniform resource locator rewriting rule
CN110008419A (en) * 2019-03-11 2019-07-12 阿里巴巴集团控股有限公司 Removing duplicate webpages method, device and equipment

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108121706A (en) * 2016-11-28 2018-06-05 央视国际网络无锡有限公司 A kind of optimization method of distributed reptile
CN106407485A (en) * 2016-12-20 2017-02-15 福建六壬网安股份有限公司 URL de-repetition method and system based on similarity comparison
CN108628871A (en) * 2017-03-16 2018-10-09 哈尔滨英赛克信息技术有限公司 A kind of link De-weight method based on chain feature
CN109561163A (en) * 2017-09-27 2019-04-02 阿里巴巴集团控股有限公司 The generation method and device of uniform resource locator rewriting rule
CN107885777A (en) * 2017-10-11 2018-04-06 北京智慧星光信息技术有限公司 A kind of control method and system of the crawl web data based on collaborative reptile
CN107798106A (en) * 2017-10-31 2018-03-13 广东思域信息科技有限公司 A kind of URL De-weight methods in distributed reptile system
CN110008419A (en) * 2019-03-11 2019-07-12 阿里巴巴集团控股有限公司 Removing duplicate webpages method, device and equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
An Improved Bloom Filter in Distributed Crawler; Weipeng Zhou et al.; 2018 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation; 2018-12-06; full text *
Research on Key Technologies of a High-Performance Web Crawler System; Yu Chen; China Master's Theses Full-text Database, Information Science and Technology; 2018-08-15; full text *

Also Published As

Publication number Publication date
CN110399546A (en) 2019-11-01

Similar Documents

Publication Publication Date Title
CN110399546B (en) Link duplicate removal method, device, equipment and storage medium based on web crawler
EP3323053B1 (en) Document capture using client-based delta encoding with server
CN109669795B (en) Crash information processing method and device
US7860971B2 (en) Anti-spam tool for browser
KR102018445B1 (en) Compression of cascading style sheet files
CN108667770B (en) Website vulnerability testing method, server and system
CN105404631B (en) Picture identification method and device
JP2014502753A (en) Web page information detection method and system
CN103716394B (en) Download the management method and device of file
US10049089B2 (en) Methods for compressing web page menus and devices thereof
WO2018092698A1 (en) Communication session log analysis device, method and recording medium
CN110677396A (en) Security policy configuration method and device
CN114650176A (en) Phishing website detection method and device, computer equipment and storage medium
US11870808B1 (en) Mobile device security application for malicious website detection based on representative image
CN113079157A (en) Method and device for acquiring network attacker position and electronic equipment
CN105808605B (en) A search log merging method and system
CN106844091A (en) One kind compression, restoring method and terminal
KR101481910B1 (en) Apparatus and method for monitoring suspicious information in web page
CN110413861B (en) Link extraction method, device, equipment and storage medium based on web crawler
CN110825947B (en) URL deduplication method, device, equipment and computer readable storage medium
CN107341234B (en) Display method, apparatus and computer-readable storage medium of page
CN106897297B (en) Method and device for determining access path between website columns
CN116432190B (en) Method and device for detecting unauthorized access of interface, computer equipment and storage medium
CN108108381A (en) The monitoring method and device of the page
EP3361405B1 (en) Enhancement of intrusion detection systems

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant