CN113821705A

CN113821705A - Webpage content acquisition method, terminal equipment and readable storage medium

Info

Publication number: CN113821705A
Application number: CN202111007979.1A
Authority: CN
Inventors: 蒋林钰
Original assignee: Hunan University
Current assignee: Hunan University
Priority date: 2021-08-30
Filing date: 2021-08-30
Publication date: 2021-12-21
Anticipated expiration: 2041-08-30
Also published as: CN113821705B

Abstract

The invention discloses a method for acquiring webpage content, a terminal device and a readable storage medium. The method for acquiring webpage content includes: binding at least two IP addresses; obtaining a target webpage address from a webpage address queue, wherein the number of webpage addresses to be crawled stored in the webpage address queue is less than or equal to the number of bound IP addresses; obtaining, from the bound IP addresses, a target IP address corresponding to the target webpage address; and crawling the webpage content corresponding to the target webpage address through the target IP address. The invention can improve the acquisition efficiency of webpage content.

Description

Webpage content acquisition method, terminal equipment and readable storage medium

Technical Field

The invention relates to the technical field of web crawlers, in particular to a method for acquiring webpage content, a terminal device and a readable storage medium.

Background

At present, web crawlers are commonly used to acquire web page content, but the most serious problem they face is IP (Internet Protocol) restriction. A website limits the number of requests that a single IP address may make within a certain period of time; if the number of requests exceeds the upper limit, the website identifies the requester as a crawler and rejects the request. IP restriction is not aimed only at web crawlers; it is also a measure for preventing DoS (Denial of Service) attacks. A common asynchronous web crawler uses only a limited number of IP addresses, so it easily reaches the request upper limit and has its requests rejected, which makes the acquisition of web page content inefficient.

Disclosure of Invention

The invention mainly aims to provide a method for acquiring webpage content, a terminal device and a readable storage medium, and aims to improve the acquisition efficiency of the webpage content.

In order to achieve the above object, the present invention provides a method for acquiring web page content, where the method for acquiring web page content includes:

binding at least two IP addresses;

acquiring a target webpage address from a webpage address queue, wherein the number of the webpage addresses to be crawled stored in the webpage address queue is less than or equal to the number of the bound IP addresses;

acquiring a target IP address corresponding to the target webpage address from the bound IP addresses;

and crawling the webpage content corresponding to the target webpage address through the target IP address.

Optionally, the bound IP address is an IPv6 address under the same subnet.

Optionally, before the step of crawling the web content corresponding to the target web address by using the target IP address, the method further includes:

obtaining a browser testing framework (Selenium);

the step of crawling the web page content corresponding to the target web page address through the target IP address comprises the following steps:

and crawling the webpage content corresponding to the target webpage address through the Selenium and the target IP address.

Optionally, after the step of crawling the web content corresponding to the target web address by using the target IP address, the method further includes:

acquiring the context features and the spatial features of the webpage content;

generating a feature vector corresponding to the context feature;

generating an adjacency matrix corresponding to the spatial features;

inputting the feature vectors and the adjacency matrix into a target graph neural network model to determine output values of the target graph neural network model;

and determining the target sleep duration of the current crawling action according to the output value.

Optionally, the step of determining a target sleep duration of the current crawling action according to the output value includes:

determining a correction value of the sleep duration of the current crawling action according to the output value;

acquiring a preset sleep duration;

and correcting the preset sleep duration by using the correction value to obtain the target sleep duration.

Optionally, before the step of binding at least two IP addresses, the method further includes:

acquiring a sample data set, wherein the sample data set comprises a training set and a test set;

training a preset graph neural network model by adopting the training set;

testing the trained preset graph neural network model by using the test set to determine an output value of the trained preset graph neural network model;

and when the output value of the trained preset graph neural network model is in a preset range, determining the trained preset graph neural network model as the target graph neural network model.

Optionally, the step of acquiring the sample data set includes:

obtaining context features and spatial features of labeled webpage content;

and determining the sample data set according to the context features and spatial features of the labeled webpage content.

Optionally, before the step of obtaining the target webpage address from the webpage address queue, the method further includes:

initializing a webpage address queue;

and adding the webpage address to be crawled to the webpage address queue.

In addition, in order to achieve the above object, the present invention further provides a terminal device, where the terminal device includes a memory, a processor, and a program for acquiring web content stored in the memory and executable on the processor, and the program for acquiring web content is executed by the processor to implement any one of the steps of the method for acquiring web content.

In addition, to achieve the above object, the present invention further provides a readable storage medium, in which a program for acquiring web content is stored, and the program for acquiring web content, when executed by a processor, implements the steps of the method for acquiring web content according to any one of the above items.

The invention provides a method for acquiring webpage content, a terminal device and a readable storage medium. The terminal device binds at least two IP addresses and acquires a target webpage address from a webpage address queue, wherein the number of webpage addresses to be crawled stored in the webpage address queue is less than or equal to the number of bound IP addresses; it then acquires the target IP address corresponding to the target webpage address from the bound IP addresses and crawls the webpage content corresponding to the target webpage address through the target IP address. Because the webpage addresses to be crawled are matched to the IP addresses bound to the terminal device, a different IP address can be allocated to each webpage address to be crawled. This avoids requests being rejected because a single IP address has reached the request upper limit and effectively solves the IP restriction problem in web crawlers.

Drawings

FIG. 1 is a schematic hardware architecture diagram of an apparatus for acquiring web page content according to an embodiment of the present invention;

FIG. 2 is a flowchart illustrating a method for acquiring web content according to a first embodiment of the present invention;

FIG. 3 is a flowchart illustrating a method for acquiring web content according to a second embodiment of the present invention;

FIG. 4 is a flowchart illustrating a method for acquiring web page content according to a third embodiment of the present invention.

The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

As an implementation, refer to FIG. 1, which is a schematic diagram of the hardware architecture of an apparatus for acquiring web content according to an embodiment of the present invention. As shown in FIG. 1, the apparatus for acquiring web content may include a processor 101 (for example, a CPU), a memory 102, and a communication bus 103, where the communication bus 103 implements connection and communication between these components.

The memory 102 may be a high-speed RAM memory or a non-volatile memory (e.g., a disk memory). As shown in fig. 1, a memory 102, which is a computer-readable storage medium, may include therein an acquisition program of web page content; and the processor 101 may be configured to call the obtaining program of the web page content stored in the memory 102, and perform the following operations:

binding at least two IP addresses;

Further, the processor 101 may be configured to call the obtaining program of the web page content stored in the memory 102, and perform the following operations:

obtaining a browser testing framework (Selenium);

generating a feature vector corresponding to the context feature;

generating an adjacency matrix corresponding to the spatial features;

acquiring a preset sleep duration;

training a preset graph neural network model by adopting the training set;

initializing a webpage address queue;

and adding the webpage address to be crawled to the webpage address queue.

With the development of the internet, network resources have become carriers of a large amount of information, and how to acquire and use webpage content effectively has become an important question. Crawler technology plays a key role here: it locates information accurately and can crawl and push the most appropriate content according to search requirements. However, web crawlers face a serious IP (Internet Protocol) restriction problem. A website's firewall limits the number of requests that a given IP address may make within a certain period of time: if the number of requests does not exceed the upper limit, data is returned normally; if it exceeds the upper limit, the requests are rejected. IP restriction is often not aimed specifically at web crawlers; in most cases it is a defensive measure against DoS (Denial of Service) attacks taken for website security. Because the number of IP addresses used during crawling is limited, a web crawler easily reaches the request upper limit while crawling web content and has its requests rejected, so the acquisition efficiency of web content is low.

To address these technical problems in the prior art, the invention provides a method for acquiring webpage content in which a plurality of IP addresses under the same subnet are bound to a terminal device. When a web crawler is used to acquire webpage content, the webpage addresses to be crawled stored in a webpage address queue are matched to the IP addresses bound to the terminal device, and a different IP address is allocated to each webpage address to be crawled. This avoids requests being rejected because a single IP address has reached the request upper limit and solves the IP restriction problem in web crawlers. The method for acquiring web page content according to the present invention is further explained below through specific embodiments.

Referring to fig. 2, fig. 2 is a schematic flowchart of a method for acquiring web page content according to a first embodiment of the present invention, where the method for acquiring web page content includes:

step S10, binding at least two IP addresses;

step S20, obtaining a target webpage address from a webpage address queue, wherein the number of the webpage addresses to be crawled stored in the webpage address queue is less than or equal to the number of the bound IP addresses;

step S30, acquiring a target IP address corresponding to the target webpage address from the bound IP addresses;

and step S40, crawling the webpage content corresponding to the target webpage address through the target IP address.

The method for acquiring web content is executed by the terminal device. Optionally, the terminal device may be a fixed terminal such as a desktop computer, or a mobile terminal such as a notebook computer, a tablet, or a mobile phone. In other embodiments, the terminal device may also be any other device that can run a web crawler, which is not limited in this embodiment.

The web page content crawled in this embodiment mainly refers to web page content of a social network; optionally, the social network may be Weibo, WeChat, Zhihu, and the like.

In this embodiment, when a web crawler is used to obtain web page content of a social network, at least two IP addresses are bound to the terminal device, where the bound IP addresses are IPv6 addresses in the same subnet. Specifically, a plurality of IPv6 addresses can be randomly generated by using the random.sample function in the random library. The specific code statement is str = ''.join(random.sample('0123456789abcdef', 4)); each execution yields one 4-digit hexadecimal IPv6 address segment. In practical application, the IPv6 address segments can be joined with ":" to obtain an IPv6 address for use by the web crawler. After the IPv6 addresses are obtained, they are bound to the terminal device.
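The address generation described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the helper names, the use of a fixed 64-bit subnet prefix, and the example prefix 2001:db8:0:1 (taken from the IPv6 documentation range) are all assumptions. Binding the generated addresses to a network interface is operating-system specific and is not shown.

```python
import ipaddress
import random

HEX_DIGITS = "0123456789abcdef"


def random_group(rng: random.Random) -> str:
    # Same idea as the statement in the text: four hex digits,
    # sampled without repetition, form one address segment.
    return "".join(rng.sample(HEX_DIGITS, 4))


def random_ipv6_in_subnet(prefix: str, rng: random.Random) -> str:
    """Return one address inside the /64 subnet given by `prefix`.

    `prefix` holds the first four groups, e.g. "2001:db8:0:1" (hypothetical).
    """
    groups = [random_group(rng) for _ in range(4)]  # 4 groups = 64 host bits
    candidate = prefix + ":" + ":".join(groups)
    # ipaddress validates the string and returns its normalised form.
    return str(ipaddress.IPv6Address(candidate))


def generate_addresses(prefix: str, count: int, seed=None) -> list:
    """Generate `count` distinct IPv6 addresses that share the same prefix."""
    rng = random.Random(seed)
    found = set()
    while len(found) < count:
        found.add(random_ipv6_in_subnet(prefix, rng))
    return sorted(found)
```

For instance, generate_addresses("2001:db8:0:1", 50) would return 50 distinct addresses in the same /64 subnet, one for each web address the queue may hold at a time.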

And after the terminal equipment binds the IP address, acquiring a target webpage address from a webpage address queue, wherein the webpage address queue is used for storing the webpage address to be crawled, and the number of the webpage addresses to be crawled stored in the webpage address queue is less than or equal to the number of the IP addresses bound by the terminal equipment. It should be noted that the web page addresses to be crawled stored in the web page address queue can be continuously updated in the web page crawling process, and as the web page crawling progresses, new web page addresses to be crawled are continuously added into the web page address queue and web page addresses which are crawled are continuously deleted.

After the terminal device obtains the target webpage address, a target IP address corresponding to the target webpage address is obtained from the IP addresses bound by the terminal device, and webpage content corresponding to the target webpage address is crawled through the target IP address.

Optionally, after binding the IP addresses, the terminal device initializes a web address queue and adds a web address to be crawled to it. The terminal device then obtains a target web address from the queue, obtains the corresponding target IP address from the bound IP addresses, and crawls the web content corresponding to the target web address through the target IP address. After the page has been crawled, the new web addresses found in its content become new web addresses to be crawled and are added to the queue, and the web address that has just been crawled is deleted from the queue. This process is repeated until no web address to be crawled remains in the queue, at which point the crawl is finished.

For example, after the terminal device initializes the web address queue, one web address to be crawled is added to it. The terminal device obtains this address from the queue as the target web address, allocates the first bound IP address to it as the target IP address, and crawls the corresponding web content through that IP address. Suppose that 50 new web addresses are found in the crawled content. These 50 addresses become new web addresses to be crawled and are added to the queue, and the previously crawled address is deleted. The terminal device can then obtain the 50 addresses from the queue in sequence as 50 target web addresses, take the first 50 bound IP addresses as target IP addresses, allocate them to the 50 target web addresses respectively, and crawl each target web address through its allocated target IP address. The allocation rule may be to allocate the first IP address to the first web address to be crawled, the second IP address to the second web address to be crawled, and so on. The process is repeated until no web address to be crawled remains in the web address queue, and the crawl is finished.
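The queue handling and positional allocation rule described above might look like the sketch below. The function name crawl, the fetch_links callable, and the pending overflow buffer are hypothetical. The text states that the queue never holds more addresses than there are bound IP addresses but does not say how this is enforced, so the buffer is one possible assumption.

```python
from collections import deque


def crawl(seed_urls, bound_ips, fetch_links):
    """Crawl until the queue is empty and return the (url, ip) assignments.

    fetch_links(url, ip) is a hypothetical callable that fetches `url`
    through source address `ip` and returns the new URLs found on the page.
    """
    queue = deque()              # web address queue, at most len(bound_ips) entries
    pending = deque(seed_urls)   # assumed overflow buffer for discovered addresses
    seen = set(seed_urls)
    assignments = []

    while queue or pending:
        # Refill the queue without exceeding the number of bound IP addresses.
        while pending and len(queue) < len(bound_ips):
            queue.append(pending.popleft())

        batch = list(queue)
        queue.clear()            # crawled addresses are deleted from the queue
        for position, url in enumerate(batch):
            ip = bound_ips[position]  # first URL gets first IP, second gets second, ...
            assignments.append((url, ip))
            for link in fetch_links(url, ip):
                if link not in seen:
                    seen.add(link)
                    pending.append(link)
    return assignments
```

In a real crawler the loop over a batch would run concurrently; it is kept sequential here so that the allocation rule is easy to follow.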

It should be noted that, during IP address allocation, the local_addr parameter of the aiohttp.TCPConnector object may be set to the new IP address, and the resulting object may be stored in a conn variable. Then, when an asynchronous crawler task is constructed, the conn variable is passed to the connector parameter of the aiohttp.ClientSession object to complete the allocation of the IP address.
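A minimal sketch of this allocation step is shown below. It assumes the third-party aiohttp package is installed and that the chosen IPv6 address is already bound to a local interface. The helper names local_addr_for, fetch_from and fetch_all are illustrative and do not come from the patent.

```python
import asyncio
import ipaddress


def local_addr_for(ip: str) -> tuple:
    """Build the (host, port) pair expected by the local_addr parameter.

    Port 0 lets the operating system choose the source port.
    """
    ipaddress.IPv6Address(ip)  # raises ValueError if `ip` is not valid IPv6
    return (ip, 0)


async def fetch_from(ip: str, url: str) -> str:
    """Fetch `url` with outgoing connections bound to source address `ip`."""
    # Imported here so the helper above can be used without aiohttp installed.
    import aiohttp

    conn = aiohttp.TCPConnector(local_addr=local_addr_for(ip))
    # The session takes ownership of the connector and closes it on exit.
    async with aiohttp.ClientSession(connector=conn) as session:
        async with session.get(url) as response:
            return await response.text()


async def fetch_all(pairs):
    """Fetch several (ip, url) pairs concurrently, one connector per pair."""
    return await asyncio.gather(*(fetch_from(ip, url) for ip, url in pairs))
```

Creating one connector per target IP address keeps each request tied to its allocated source address; reusing a single connector would send every request from the same address.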

Optionally, when crawling the web content corresponding to the target web address through the target IP address, the terminal device may obtain the browser testing framework Selenium and crawl the web content through Selenium and the target IP address. In this embodiment, crawling through Selenium loads the target web address directly in a browser, which simulates real user behavior and thereby evades anti-crawling mechanisms.

In the technical scheme provided by this embodiment, the terminal device binds at least two IP addresses and obtains the target webpage address from the webpage address queue, where the number of to-be-crawled webpage addresses stored in the webpage address queue is less than or equal to the number of bound IP addresses; it then obtains the target IP address corresponding to the target webpage address from the bound IP addresses and crawls the webpage content corresponding to the target webpage address through the target IP address. Because the webpage addresses to be crawled are matched to the IP addresses bound to the terminal device, a different IP address can be allocated to each webpage address to be crawled. This avoids requests being rejected because a single IP address has reached the request upper limit and effectively solves the IP restriction problem in web crawlers.

Referring to fig. 3, fig. 3 is a flowchart illustrating a second embodiment of the method for acquiring web page content according to the present invention, and based on the first embodiment, after the step of S40, the method further includes:

step S50, obtaining the context features and the spatial features of the webpage content;

step S60, generating a feature vector corresponding to the context feature;

step S70, generating an adjacency matrix corresponding to the spatial features;

step S80, inputting the feature vector and the adjacency matrix into a target graph neural network model to determine an output value of the target graph neural network model;

and step S90, determining the target sleep duration of the current crawling action according to the output value.

In this embodiment, after crawling the web content of the target web address through the target IP address, the terminal device may obtain a context feature and a spatial feature of the web content, where the context feature of the web content is used to represent whether the web content is easy to read, and the spatial feature of the web content is used to represent a parent-child relationship between pages included in the web content.

After the terminal device obtains the context feature and the spatial feature of the web page content, the context feature of the web page content can be converted into a corresponding feature vector, and the spatial feature of the web page content can be converted into a corresponding adjacency matrix.
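As an illustration of these two conversions, the sketch below builds a small feature vector from character classes and an adjacency matrix from parent-child page relationships. The choice of three character-class ratios and the symmetric (undirected) matrix are assumptions made for the example; the text does not fix the exact encoding.

```python
def char_class_features(text: str) -> list:
    """Return the share of Chinese, English and other characters in `text`."""
    counts = [0, 0, 0]  # [Chinese, English letters, other non-space characters]
    for ch in text:
        if ch.isspace():
            continue
        if "\u4e00" <= ch <= "\u9fff":        # CJK Unified Ideographs block
            counts[0] += 1
        elif ch.isascii() and ch.isalpha():
            counts[1] += 1
        else:
            counts[2] += 1
    total = sum(counts) or 1                  # avoid division by zero on empty text
    return [c / total for c in counts]


def adjacency_matrix(pages, parent_child):
    """Build an adjacency matrix from (parent, child) page pairs."""
    index = {page: i for i, page in enumerate(pages)}
    size = len(pages)
    matrix = [[0] * size for _ in range(size)]
    for parent, child in parent_child:
        i, j = index[parent], index[child]
        matrix[i][j] = 1
        matrix[j][i] = 1  # treated as undirected here, which is an assumption
    return matrix
```

For three pages "home", "post" and "comments" with edges home-post and post-comments, adjacency_matrix returns [[0, 1, 0], [1, 0, 1], [0, 1, 0]].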

After the terminal device obtains the feature vector corresponding to the context feature and the adjacency matrix corresponding to the spatial features, both are input into the target graph neural network model to determine the output value of the target graph neural network model. The target sleep duration of the current crawling action is then determined according to this output value, where the target sleep duration of the current crawling action refers to the time delay of the current crawling action.

Optionally, a corrected value of the sleep duration of the current crawling action is determined according to the output value of the target graph neural network model, meanwhile, a preset sleep duration of the current crawling action is obtained, and the preset sleep duration is corrected by the corrected value to obtain a target sleep duration of the current crawling action. The preset sleep duration may be set according to actual needs, which is not limited in this embodiment.
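The correction step can be sketched as below. The text does not say how the correction value is combined with the preset sleep duration, so the additive form and the lower bound of zero are assumptions, and the function name is illustrative.

```python
def target_sleep_duration(preset_seconds: float,
                          correction_seconds: float,
                          minimum: float = 0.0) -> float:
    """Apply the model-derived correction to the preset sleep duration.

    Assumption: the correction is added to the preset value, and the result
    is clamped so that the crawler never sleeps for a negative time.
    """
    return max(minimum, preset_seconds + correction_seconds)
```

A crawler would then pause for the returned number of seconds (for example with time.sleep or asyncio.sleep) before the next crawling action.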

In the technical scheme provided by this embodiment, a feature vector corresponding to a context feature is generated by obtaining the context feature and a spatial feature of web page content, an adjacency matrix corresponding to the spatial feature is generated, the feature vector and the adjacency matrix are input into a target graph neural network model to determine an output value of the target graph neural network model, and a target sleep duration of a current crawling action is determined according to the output value. According to the scheme, an action time delay can be set for the current crawling action through the target graph neural network model and the characteristics of the crawled webpage content, the reading and searching actions of human beings can be simulated, the anti-crawling detection measures can be effectively avoided, and the acquisition efficiency of the webpage content is further improved.

Referring to fig. 4, fig. 4 is a flowchart illustrating a method for acquiring web content according to a third embodiment of the present invention, where based on the second embodiment, before the step of S10, the method further includes:

step S100, acquiring a sample data set, wherein the sample data set comprises a training set and a test set;

step S200, training a preset graph neural network model by adopting the training set;

step S300, testing the trained preset graph neural network model by using the test set to determine an output value of the trained preset graph neural network model;

and step S400, when the output value of the trained preset graph neural network model is in a preset range, determining the trained preset graph neural network model as the target graph neural network model.

In this embodiment, the terminal device may obtain a sample data set, where the sample data set includes a training set and a test set, and a ratio of the training set to the test set may be selected to be 9:1, that is, 90% of the sample data set is used as the training set and 10% is used as the test set.

Optionally, the terminal device may obtain the context features and the spatial features of labeled web content, and determine the sample data set according to them. Specifically, the terminal device may obtain the labeled web page content, perform word segmentation to split it into words or phrases, and detect and extract the Chinese, English, and special characters in it to obtain the context features of the labeled web page content. Meanwhile, the terminal device can analyze the parent-child relationships of the web pages contained in the labeled web page content to obtain its spatial features. The terminal device can then convert the context features of the labeled web page content into feature vectors, convert the spatial features into adjacency matrices, divide the feature vectors and adjacency matrices into a training set and a test set at a ratio of 9:1, and use the resulting training set and test set as the sample data set.
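The 9:1 division can be sketched as follows. Shuffling before the split and the fixed seed are assumptions added for reproducibility; the text only specifies the ratio. Each sample is assumed to be one (feature vector, adjacency matrix, label) record, but any sequence works.

```python
import random


def split_samples(samples, train_parts: int = 9, test_parts: int = 1, seed: int = 0):
    """Shuffle `samples` and split them into training and test sets (9:1 by default)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding at the cut point.
    cut = len(shuffled) * train_parts // (train_parts + test_parts)
    return shuffled[:cut], shuffled[cut:]
```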

After the terminal device obtains the sample data set, it trains a preset graph neural network model with the training set to obtain the trained graph neural network model. The preset graph neural network model is composed of a convolutional layer, a pooling layer and a fully connected layer, where the number of network layers, layer_gnn, may be set to 3, the activation function may be the ReLU activation function, the fully connected layer may be implemented with a softmax function, the initial learning rate for exponential decay may be set to 1e-3, the learning rate decay rate may be set to 0.5, the number of iterations may be set to 100, and the decay is applied once every 10 rounds.
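With the hyperparameters listed above, the learning-rate schedule can be written out as in the sketch below. It assumes a staircase schedule in which the rate is multiplied by the decay rate once every 10 rounds, which is one reading of the text; the function name is illustrative.

```python
def learning_rate(epoch: int,
                  initial: float = 1e-3,
                  decay_rate: float = 0.5,
                  decay_every: int = 10) -> float:
    """Staircase exponential decay: multiply by `decay_rate` every `decay_every` rounds."""
    return initial * decay_rate ** (epoch // decay_every)
```

Under this reading the rate stays at 1e-3 for rounds 0 to 9, drops to 5e-4 for rounds 10 to 19, and has been halved nine times by the last of the 100 rounds.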

After the terminal device obtains the trained graph neural network model, it tests the model with the test set to determine the model's output value and then judges whether that output value is within a preset range. When the output value is within the preset range, the trained graph neural network model is determined as the target graph neural network model. When the output value is not within the preset range, the training parameters of the preset graph neural network model are adjusted and the model is trained again with the training set; this process is repeated until the target graph neural network model is obtained. The preset range may be set according to actual needs, and this embodiment does not limit this.
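The train, test and adjust cycle described above can be outlined as follows. The callables train, evaluate and adjust are hypothetical placeholders for the actual model code, and max_rounds is an added safeguard that the text does not mention.

```python
def train_until_accepted(train, evaluate, adjust, params, preset_range, max_rounds=10):
    """Repeat training until the test output falls inside `preset_range`.

    train(params) -> model, evaluate(model) -> output value on the test set,
    adjust(params, value) -> new training parameters (all hypothetical).
    """
    low, high = preset_range
    for _ in range(max_rounds):
        model = train(params)
        value = evaluate(model)
        if low <= value <= high:
            return model, value          # accepted as the target model
        params = adjust(params, value)   # otherwise adjust and train again
    raise RuntimeError("no trained model reached the preset range")
```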

In the technical scheme provided by this embodiment, a sample data set comprising a training set and a test set is obtained, the training set is used to train the preset graph neural network model, and the test set is used to test the trained preset graph neural network model so as to determine its output value. When the output value of the trained preset graph neural network model is within a preset range, the trained preset graph neural network model is determined as the target graph neural network model. Obtaining the target graph neural network model through training improves its recognition accuracy, so that the sleep duration of the current crawling action determined through the model is more accurate and the webpage content acquisition efficiency is further improved.

Based on the foregoing embodiments, the present invention further provides an apparatus for acquiring web content, where the apparatus for acquiring web content may include a memory, a processor, and a web content acquisition program stored in the memory and executable on the processor, and when the processor executes the web content acquisition program, the steps of the method for acquiring web content according to any of the foregoing embodiments are implemented.

Based on the foregoing embodiments, the present invention further provides a readable storage medium, on which an acquiring program of web content is stored, where the acquiring program of web content implements the steps of the acquiring method of web content according to any one of the foregoing embodiments when executed by a processor.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a smart tv, a mobile phone, a computer, etc.) to execute the method according to the embodiments of the present invention.

The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims

1. A method for acquiring web content, wherein the method for acquiring web content comprises:

binding at least two IP addresses;

obtaining a target webpage address from a webpage address queue, wherein the number of the webpage addresses to be crawled stored in the webpage address queue is less than or equal to the number of the bound IP addresses;

obtaining the target IP address corresponding to the target webpage address from the bound IP addresses;

and crawling the webpage content corresponding to the target webpage address by using the target IP address.

2. The method for obtaining webpage content according to claim 1, wherein the bound IP address is an IPv6 address under the same subnet.

3. The method for obtaining webpage content as claimed in claim 1, wherein before the step of crawling the webpage content corresponding to the target webpage address by the target IP address, the method further comprises:

obtaining the browser testing framework Selenium;

the step of crawling the webpage content corresponding to the target webpage address through the target IP address includes:

crawling the webpage content corresponding to the target webpage address by using the Selenium and the target IP address.

4. The method for obtaining webpage content according to claim 1, wherein after the step of crawling the webpage content corresponding to the target webpage address by the target IP address, the method further comprises:

obtaining the context features and spatial features of the webpage content;

generating a feature vector corresponding to the context feature;

generating an adjacency matrix corresponding to the spatial feature;

inputting the feature vector and the adjacency matrix into the target graph neural network model to determine the output value of the target graph neural network model;

and determining the target sleep duration of the current crawling action according to the output value.

5. The method for obtaining webpage content according to claim 4, wherein the step of determining the target sleep duration of the current crawling action according to the output value comprises:

determining the correction value of the sleep duration of the current crawling action according to the output value;

obtaining the preset sleep duration;

and using the correction value to correct the preset sleep duration to obtain the target sleep duration.

6. The method for obtaining webpage content according to claim 4, wherein before the step of binding at least two IP addresses, the method further comprises:

obtaining a sample data set, wherein the sample data set includes a training set and a test set;

using the training set to train a preset graph neural network model;

using the test set to test the trained preset graph neural network model to determine the output value of the trained preset graph neural network model;

and when the output value of the trained preset graph neural network model is within a preset range, determining the trained preset graph neural network model as the target graph neural network model.

7. The method for obtaining webpage content according to claim 6, wherein the step of obtaining a sample data set comprises:

obtaining the context features and spatial features of labeled webpage content;

and determining the sample data set according to the context features and spatial features of the labeled webpage content.

8. The method for obtaining webpage content according to claim 1, wherein before the step of obtaining the target webpage address from the webpage address queue, the method further comprises:

initializing the webpage address queue;

and adding the webpage address to be crawled to the webpage address queue.

9. A terminal device, characterized in that the terminal device comprises a memory, a processor, and a program for acquiring web page content that is stored on the memory and can be run on the processor, and the program for acquiring web page content, when executed by the processor, implements the steps of the method for obtaining webpage content according to any one of claims 1-8.

10. A readable storage medium, wherein the readable storage medium stores a program for acquiring web page content, and the program for acquiring web page content, when executed by a processor, implements the steps of the method for acquiring webpage content according to any one of claims 1-8.