Disclosure of Invention
The invention mainly aims to provide a method for acquiring webpage content, a terminal device and a readable storage medium, and aims to improve the acquisition efficiency of the webpage content.
In order to achieve the above object, the present invention provides a method for acquiring web page content, where the method for acquiring web page content includes:
binding at least two IP addresses;
acquiring a target webpage address from a webpage address queue, wherein the number of the webpage addresses to be crawled stored in the webpage address queue is less than or equal to the number of the bound IP addresses;
acquiring a target IP address corresponding to the target webpage address from the bound IP addresses;
and crawling the webpage content corresponding to the target webpage address through the target IP address.
Optionally, the bound IP address is an IPv6 address under the same subnet.
Optionally, before the step of crawling the web content corresponding to the target web address by using the target IP address, the method further includes:
obtaining a browser testing framework (Selenium);
the step of crawling the web page content corresponding to the target web page address through the target IP address comprises the following steps:
and crawling the webpage content corresponding to the target webpage address through the Selenium and the target IP address.
Optionally, after the step of crawling the web content corresponding to the target web address by using the target IP address, the method further includes:
acquiring the context features and the spatial features of the webpage content;
generating a feature vector corresponding to the context feature;
generating an adjacency matrix corresponding to the spatial features;
inputting the feature vectors and the adjacency matrix into a target graph neural network model to determine output values of the target graph neural network model;
and determining the target dormancy duration of the current crawling action according to the output value.
Optionally, the step of determining a target sleep duration of the current crawling action according to the output value includes:
determining a correction value of the dormancy duration of the current crawling action according to the output value;
acquiring preset dormancy duration;
and correcting the preset dormancy duration by adopting the correction value to obtain the target dormancy duration.
Optionally, before the step of binding at least two IP addresses, the method further includes:
acquiring a sample data set, wherein the sample data set comprises a training set and a test set;
training a preset graph neural network model by adopting the training set;
testing the trained preset graph neural network model by using the test set to determine an output value of the trained preset graph neural network model;
and when the output value of the trained preset graph neural network model is in a preset range, determining the trained preset graph neural network model as the target graph neural network model.
Optionally, the step of acquiring the sample data set includes:
obtaining context features and spatial features of the labeled webpage content;
and determining the sample data set according to the context features and spatial features of the labeled webpage content.
Optionally, before the step of obtaining the target webpage address from the webpage address queue, the method further includes:
initializing a webpage address queue;
and adding the webpage address to be crawled to the webpage address queue.
In addition, in order to achieve the above object, the present invention further provides a terminal device, where the terminal device includes a memory, a processor, and a program for acquiring web content stored in the memory and executable on the processor, and the program for acquiring web content is executed by the processor to implement any one of the steps of the method for acquiring web content.
In addition, to achieve the above object, the present invention further provides a readable storage medium, in which a program for acquiring web content is stored, and the program for acquiring web content, when executed by a processor, implements the steps of the method for acquiring web content according to any one of the above items.
The invention provides a method for acquiring webpage content, a terminal device and a readable storage medium, wherein the terminal device acquires a target webpage address from a webpage address queue by binding IP addresses, the number of the webpage addresses to be crawled stored in the webpage address queue is less than or equal to the number of the bound IP addresses, the target IP address corresponding to the target webpage address is acquired from the bound IP addresses, and a webpage corresponding to the target webpage address is crawled through the target IP address. According to the scheme, the webpage addresses to be crawled correspond to the IP addresses bound by the terminal equipment, different IP addresses can be allocated to each webpage address to be crawled for webpage crawling, the situation that a single IP address is rejected when reaching the request upper limit is avoided, and the problem of IP limitation in a network crawler is effectively solved.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As an implementation solution, please refer to fig. 1, fig. 1 is a schematic diagram of a hardware architecture of an apparatus for acquiring web content according to an embodiment of the present invention, and as shown in fig. 1, the apparatus for acquiring web content may include a processor 101, for example, a CPU, a memory 102, and a communication bus 103, where the communication bus 103 is used to implement connection communication between these modules.
The memory 102 may be a high-speed RAM memory or a non-volatile memory (e.g., a disk memory). As shown in fig. 1, a memory 102, which is a computer-readable storage medium, may include therein an acquisition program of web page content; and the processor 101 may be configured to call the obtaining program of the web page content stored in the memory 102, and perform the following operations:
binding at least two IP addresses;
acquiring a target webpage address from a webpage address queue, wherein the number of the webpage addresses to be crawled stored in the webpage address queue is less than or equal to the number of the bound IP addresses;
acquiring a target IP address corresponding to the target webpage address from the bound IP addresses;
and crawling the webpage content corresponding to the target webpage address through the target IP address.
Further, the processor 101 may be configured to call the obtaining program of the web page content stored in the memory 102, and perform the following operations:
obtaining a browser testing framework (Selenium);
the step of crawling the web page content corresponding to the target web page address through the target IP address comprises the following steps:
and crawling the webpage content corresponding to the target webpage address through the Selenium and the target IP address.
Further, the processor 101 may be configured to call the obtaining program of the web page content stored in the memory 102, and perform the following operations:
acquiring the context features and the spatial features of the webpage content;
generating a feature vector corresponding to the context feature;
generating an adjacency matrix corresponding to the spatial features;
inputting the feature vectors and the adjacency matrix into a target graph neural network model to determine output values of the target graph neural network model;
and determining the target dormancy duration of the current crawling action according to the output value.
Further, the processor 101 may be configured to call the obtaining program of the web page content stored in the memory 102, and perform the following operations:
determining a correction value of the dormancy duration of the current crawling action according to the output value;
acquiring preset dormancy duration;
and correcting the preset dormancy duration by adopting the correction value to obtain the target dormancy duration.
Further, the processor 101 may be configured to call the obtaining program of the web page content stored in the memory 102, and perform the following operations:
acquiring a sample data set, wherein the sample data set comprises a training set and a test set;
training a preset graph neural network model by adopting the training set;
testing the trained preset graph neural network model by using the test set to determine an output value of the trained preset graph neural network model;
and when the output value of the trained preset graph neural network model is in a preset range, determining the trained preset graph neural network model as the target graph neural network model.
Further, the processor 101 may be configured to call the obtaining program of the web page content stored in the memory 102, and perform the following operations:
obtaining context features and spatial features of the labeled webpage content;
and determining the sample data set according to the context features and spatial features of the labeled webpage content.
Further, the processor 101 may be configured to call the obtaining program of the web page content stored in the memory 102, and perform the following operations:
initializing a webpage address queue;
and adding the webpage address to be crawled to the webpage address queue.
With the development of the internet, network resources have become carriers of a large amount of information, and how to acquire and use webpage content effectively has become an important question. Crawler technology plays a key role here: it locates information accurately and can crawl the most appropriate content according to search requirements. However, web crawlers face a serious IP (Internet Protocol) restriction problem. A website's firewall limits the number of requests from a given IP address within a given period: while the number of requests does not exceed the upper limit, data is returned normally; once it exceeds the upper limit, requests are rejected. Such IP restrictions are often not aimed at web crawlers specifically; most are defensive measures against DoS (Denial of Service) attacks adopted for website security. Because the number of IP addresses available during background crawling is limited, a web crawler easily reaches the request upper limit while crawling webpage content, its requests are rejected, and the acquisition efficiency of webpage content is low.
Based on the technical problems in the prior art, the invention provides a method for acquiring webpage content, which is characterized in that a plurality of IP addresses under the same subnet are bound to a terminal device, when the webpage content is acquired by using a web crawler, the webpage addresses to be crawled stored in a webpage address queue are corresponding to the IP addresses bound by the terminal device, different IP addresses are distributed to each webpage address to be crawled for crawling the webpage content, the phenomenon that a single IP address is rejected when reaching a request upper limit is avoided, and the problem of IP limitation in the web crawler is solved. The following further explains the method for acquiring web page content according to the present invention by using specific embodiments.
Referring to fig. 2, fig. 2 is a schematic flowchart of a method for acquiring web page content according to a first embodiment of the present invention, where the method for acquiring web page content includes:
step S10, binding at least two IP addresses;
step S20, obtaining a target webpage address from a webpage address queue, wherein the number of the webpage addresses to be crawled stored in the webpage address queue is less than or equal to the number of the bound IP addresses;
step S30, acquiring a target IP address corresponding to the target webpage address from the bound IP addresses;
and step S40, crawling the webpage content corresponding to the target webpage address through the target IP address.
The method for acquiring webpage content is executed by the terminal device. Optionally, the terminal device may be a fixed terminal such as a desktop computer, or a mobile terminal such as a notebook computer, a tablet or a mobile phone. In other embodiments, the terminal device may also be another device capable of running a web crawler, which is not limited in this embodiment.
The webpage content crawled in this embodiment mainly refers to webpage content of a social network; optionally, the social network may be Weibo, WeChat, Zhihu, and the like.
In this embodiment, when a web crawler is used to obtain webpage content of a social network, at least two IP addresses are bound to the terminal device, where the bound IP addresses are IPv6 addresses in the same subnet. Specifically, a plurality of IPv6 addresses can be randomly generated by using the random.sample function in the random library, where a specific code statement is str = ''.join(random.sample('0123456789abcdef', 4)). Each execution of this statement yields one 4-digit hexadecimal IPv6 address segment; in practical application, the segments can be joined with ":" to obtain an IPv6 address used by the web crawler. After the IPv6 addresses are obtained, they are bound to the terminal device.
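The address generation described above can be sketched as follows. This is a minimal illustration, not the embodiment's actual code: the subnet prefix 2001:db8:0:1 (from the IPv6 documentation range), the number of host segments, and the function names are assumptions. Because random.sample draws without replacement, each segment produced this way has four distinct hexadecimal digits.

```python
import ipaddress
import random
from typing import List

HEX_DIGITS = "0123456789abcdef"


def random_segment() -> str:
    """Return one 4-digit hexadecimal IPv6 segment."""
    # random.sample draws without replacement, so the four digits are distinct.
    return "".join(random.sample(HEX_DIGITS, 4))


def random_ipv6(prefix: str = "2001:db8:0:1", host_segments: int = 4) -> str:
    """Join a fixed subnet prefix (assumed) with random segments using ':'."""
    raw = prefix + ":" + ":".join(random_segment() for _ in range(host_segments))
    # IPv6Address validates the string and returns its canonical spelling.
    return str(ipaddress.IPv6Address(raw))


def generate_addresses(count: int, prefix: str = "2001:db8:0:1") -> List[str]:
    """Generate `count` distinct addresses that share the same prefix."""
    addresses = set()
    while len(addresses) < count:
        addresses.add(random_ipv6(prefix))
    return sorted(addresses)
```

Binding the generated addresses to a network interface is operating-system specific and is not shown here.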
And after the terminal equipment binds the IP address, acquiring a target webpage address from a webpage address queue, wherein the webpage address queue is used for storing the webpage address to be crawled, and the number of the webpage addresses to be crawled stored in the webpage address queue is less than or equal to the number of the IP addresses bound by the terminal equipment. It should be noted that the web page addresses to be crawled stored in the web page address queue can be continuously updated in the web page crawling process, and as the web page crawling progresses, new web page addresses to be crawled are continuously added into the web page address queue and web page addresses which are crawled are continuously deleted.
After the terminal device obtains the target webpage address, a target IP address corresponding to the target webpage address is obtained from the IP addresses bound by the terminal device, and webpage content corresponding to the target webpage address is crawled through the target IP address.
Optionally, after the terminal device binds the IP addresses, it initializes a webpage address queue and adds a webpage address to be crawled to the queue. The terminal device then obtains a target webpage address from the queue, obtains the target IP address corresponding to the target webpage address from the bound IP addresses, and crawls the webpage content corresponding to the target webpage address through the target IP address. After the page is crawled, the new webpage addresses found in its content become new webpage addresses to be crawled and are added to the queue, and the webpage address that has been crawled is deleted from the queue. This process is repeated until no webpage address to be crawled remains in the queue, at which point the crawl ends.
For example, after the terminal device initializes the webpage address queue, one webpage address to be crawled is added to the queue. The terminal device obtains this address from the queue as the target webpage address, allocates the first bound IP address to it as the target IP address, and crawls the corresponding webpage content through that IP address. Suppose that 50 new webpage addresses are found in the crawled content. These 50 addresses become new webpage addresses to be crawled and are added to the queue, and the previously crawled address is deleted. The terminal device can then obtain the 50 addresses from the queue in turn as 50 target webpage addresses, take the first 50 bound IP addresses as target IP addresses, allocate them to the 50 target webpage addresses respectively, and crawl each target webpage address through its allocated target IP address.
The rule for allocating IP addresses may be to allocate the first IP address to the first webpage address to be crawled, the second IP address to the second webpage address to be crawled, and so on. The process is repeated until no webpage address to be crawled remains in the webpage address queue, at which point the crawl ends.
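The queue-and-allocation loop described above can be sketched as follows. This is only an illustration under stated assumptions: crawl_all and fetch_links are hypothetical names, fetch_links stands in for the actual crawling step, and the sketch takes at most one webpage address per bound IP address in each round as a simplification, rather than capping the queue size itself.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List


def crawl_all(
    seed_urls: Iterable[str],
    bound_ips: List[str],
    fetch_links: Callable[[str, str], Iterable[str]],
) -> Dict[str, str]:
    """Crawl every reachable URL and return the IP assigned to each URL."""
    queue = deque(seed_urls)   # webpage addresses still to be crawled
    seen = set(queue)          # prevents re-adding an address already queued
    assignments: Dict[str, str] = {}
    while queue:
        # Take at most one URL per bound IP in each round.
        batch = [queue.popleft() for _ in range(min(len(queue), len(bound_ips)))]
        # First URL gets the first IP, second URL the second IP, and so on.
        for url, ip in zip(batch, bound_ips):
            assignments[url] = ip
            for link in fetch_links(url, ip):
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
    return assignments
```

With a small in-memory link graph in place of real crawling, crawl_all(["a"], ["ip1", "ip2"], lambda url, ip: graph[url]) visits each address once and records which bound IP address served it.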
It should be noted that, during IP address allocation, the local_addr parameter of the aiohttp.TCPConnector object may be set to the new IP address, and the resulting object may be stored in a conn variable. Then, when an asynchronous crawler task is constructed, the conn variable is passed to the connector parameter of the aiohttp.ClientSession object to complete the allocation of the IP address.
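A minimal sketch of this allocation step is given below, assuming the third-party aiohttp package is installed and that the source address is already bound to a local interface. The helper names local_addr_for, fetch_from and crawl are hypothetical; local_addr is passed as a (host, port) pair, with port 0 leaving the choice of port to the operating system.

```python
import asyncio
from typing import Tuple


def local_addr_for(source_ip: str) -> Tuple[str, int]:
    """Build the (host, port) pair for local_addr; port 0 lets the OS choose."""
    return (source_ip, 0)


async def fetch_from(url: str, source_ip: str) -> str:
    """Fetch `url` with outgoing connections bound to `source_ip`."""
    import aiohttp  # third-party; imported here so the helper above needs no dependency

    # The connector carries the source address, as described in the text.
    conn = aiohttp.TCPConnector(local_addr=local_addr_for(source_ip))
    # Passing the connector to the session completes the IP allocation.
    async with aiohttp.ClientSession(connector=conn) as session:
        async with session.get(url) as response:
            return await response.text()


def crawl(url: str, source_ip: str) -> str:
    """Synchronous convenience wrapper around fetch_from."""
    return asyncio.run(fetch_from(url, source_ip))
```

Creating the connector inside the coroutine keeps it tied to the running event loop, and closing the session also closes the connector it owns.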
Optionally, when crawling the webpage content corresponding to the target webpage address through the target IP address, the terminal device may obtain the browser testing framework Selenium and crawl the webpage content through Selenium and the target IP address. In this embodiment, crawling through Selenium runs the target webpage address directly in a browser and thereby simulates real user behavior, which helps evade anti-crawling mechanisms.
In the technical scheme provided by this embodiment, the terminal device obtains the target webpage address from the webpage address queue by binding the IP address, where the number of the to-be-crawled webpage addresses stored in the webpage address queue is less than or equal to the number of the bound IP addresses, obtains the target IP address corresponding to the target webpage address from the bound IP addresses, and crawls the webpage corresponding to the target webpage address through the target IP address. According to the scheme, the webpage addresses to be crawled correspond to the IP addresses bound by the terminal equipment, different IP addresses can be allocated to each webpage address to be crawled for webpage crawling, the situation that a single IP address is rejected when reaching the request upper limit is avoided, and the problem of IP limitation in a network crawler is effectively solved.
Referring to fig. 3, fig. 3 is a flowchart illustrating a second embodiment of the method for acquiring web page content according to the present invention, and based on the first embodiment, after the step of S40, the method further includes:
step S50, obtaining the context feature and the space feature of the webpage content;
step S60, generating a feature vector corresponding to the context feature;
step S70, generating an adjacency matrix corresponding to the spatial features;
step S80, inputting the feature vector and the adjacency matrix into a target graph neural network model to determine an output value of the target graph neural network model;
and step S90, determining the target dormancy duration of the current crawling action according to the output value.
In this embodiment, after crawling the web content of the target web address through the target IP address, the terminal device may obtain a context feature and a spatial feature of the web content, where the context feature of the web content is used to represent whether the web content is easy to read, and the spatial feature of the web content is used to represent a parent-child relationship between pages included in the web content.
After the terminal device obtains the context feature and the spatial feature of the web page content, the context feature of the web page content can be converted into a corresponding feature vector, and the spatial feature of the web page content can be converted into a corresponding adjacency matrix.
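The conversion into a feature vector and an adjacency matrix can be illustrated with the sketch below. The specification does not fix an encoding, so both functions are assumptions: context_vector counts Chinese characters, English letters and special characters as a stand-in for the context feature, and build_adjacency treats each parent-child page relationship as an undirected edge.

```python
from typing import List, Sequence, Tuple


def context_vector(text: str) -> List[int]:
    """Count Chinese characters, English letters and special characters."""
    chinese = english = special = 0
    for ch in text:
        if "\u4e00" <= ch <= "\u9fff":      # CJK Unified Ideographs block
            chinese += 1
        elif ch.isascii() and ch.isalpha():
            english += 1
        elif not ch.isalnum() and not ch.isspace():
            special += 1
    return [chinese, english, special]


def build_adjacency(
    pages: Sequence[str], parent_child: Sequence[Tuple[str, str]]
) -> List[List[int]]:
    """Build a symmetric 0/1 adjacency matrix from parent-child page pairs."""
    index = {page: i for i, page in enumerate(pages)}
    matrix = [[0] * len(pages) for _ in pages]
    for parent, child in parent_child:
        i, j = index[parent], index[child]
        matrix[i][j] = 1
        matrix[j][i] = 1  # assumption: the relationship is stored as undirected
    return matrix
```

For a root page with two child pages, build_adjacency(["root", "a", "b"], [("root", "a"), ("root", "b")]) marks the root row and column against both children and leaves the children unconnected to each other.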
After the terminal device obtains the feature vector corresponding to the context feature and the adjacency matrix corresponding to the spatial feature, it inputs them into the target graph neural network model to determine the output value of the target graph neural network model. The target sleep duration of the current crawling action is then determined according to the output value of the target graph neural network model, where the target sleep duration of the current crawling action refers to the action time delay of the current crawling action.
Optionally, a corrected value of the sleep duration of the current crawling action is determined according to the output value of the target graph neural network model, meanwhile, a preset sleep duration of the current crawling action is obtained, and the preset sleep duration is corrected by the corrected value to obtain a target sleep duration of the current crawling action. The preset sleep duration may be set according to actual needs, which is not limited in this embodiment.
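The correction step can be sketched as follows. The specification does not state how the output value maps to a correction value or how the correction is applied, so this sketch assumes a model output in [0, 1] mapped linearly to a signed correction, an additive correction, and a non-negative result; the function names and the max_adjust parameter are hypothetical.

```python
def correction_from_output(output: float, max_adjust: float) -> float:
    """Map a model output in [0, 1] to a correction in [-max_adjust, +max_adjust]."""
    return (output - 0.5) * 2.0 * max_adjust


def target_sleep(preset_seconds: float, correction_seconds: float) -> float:
    """Apply the correction to the preset duration; never return a negative delay."""
    return max(0.0, preset_seconds + correction_seconds)
```

Under these assumptions, a preset duration of 3.0 seconds and a model output of 0.75 with max_adjust set to 2.0 give a correction of 1.0 second and a target sleep duration of 4.0 seconds, which the crawler could pass to time.sleep before its next request.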
In the technical scheme provided by this embodiment, a feature vector corresponding to a context feature is generated by obtaining the context feature and a spatial feature of web page content, an adjacency matrix corresponding to the spatial feature is generated, the feature vector and the adjacency matrix are input into a target graph neural network model to determine an output value of the target graph neural network model, and a target sleep duration of a current crawling action is determined according to the output value. According to the scheme, an action time delay can be set for the current crawling action through the target graph neural network model and the characteristics of the crawled webpage content, the reading and searching actions of human beings can be simulated, the anti-crawling detection measures can be effectively avoided, and the acquisition efficiency of the webpage content is further improved.
Referring to fig. 4, fig. 4 is a flowchart illustrating a method for acquiring web content according to a third embodiment of the present invention, where based on the second embodiment, before the step of S10, the method further includes:
step S100, acquiring a sample data set, wherein the sample data set comprises a training set and a test set;
step S200, training a preset graph neural network model by adopting the training set;
step S300, testing the trained preset graph neural network model by using the test set to determine an output value of the trained preset graph neural network model;
and step S400, when the output value of the trained preset graph neural network model is in a preset range, determining the trained preset graph neural network model as the target graph neural network model.
In this embodiment, the terminal device may obtain a sample data set, where the sample data set includes a training set and a test set, and a ratio of the training set to the test set may be selected to be 9:1, that is, 90% of the sample data set is used as the training set and 10% is used as the test set.
Optionally, the terminal device may obtain the context features and the spatial features of the labeled webpage content, and determine the sample data set according to them. Specifically, the terminal device may obtain the labeled webpage content, perform word segmentation on it to split it into words or phrases, and detect and extract the Chinese, English and special characters therein to obtain the context features of the labeled webpage content; meanwhile, the terminal device can analyze the parent-child relationships of the webpages contained in the labeled webpage content to obtain its spatial features. The terminal device can then convert the context features of the labeled webpage content into feature vectors and the spatial features into adjacency matrices, divide the feature vectors and adjacency matrices into a training set and a test set at a ratio of 9:1, and determine the resulting training set and test set as the sample data set.
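The 9:1 division can be sketched as below. The shuffling and the fixed seed are assumptions added for reproducibility; the specification only states the ratio. The function name split_samples is hypothetical.

```python
import random
from typing import Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def split_samples(
    samples: Iterable[T], train_ratio: float = 0.9, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle the samples and split them into a training set and a test set."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)      # seeded for a repeatable split
    cut = round(len(shuffled) * train_ratio)   # 90% of the samples by default
    return shuffled[:cut], shuffled[cut:]
```

Each sample here would be one (feature vector, adjacency matrix, label) item; with 100 samples the split yields 90 training samples and 10 test samples.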
After the terminal device acquires the sample data set, it trains the preset graph neural network model with the training set to obtain the trained graph neural network model. The preset graph neural network model is composed of a convolutional layer, a pooling layer and a fully connected layer, wherein the number of network layers, layer_gnn, can be selected to be 3, the activation function can be selected to be the ReLU activation function, the fully connected layer can be realized through a softmax function, the initial learning rate, which decays exponentially, can be selected to be 1e-3, the learning rate decay rate can be selected to be 0.5, the number of iterations can be selected to be 100, and the learning rate decays once every 10 rounds.
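The learning-rate settings above can be read as a step-wise exponential schedule, sketched below. This interpretation (the rate is multiplied by 0.5 after every 10 completed rounds, starting from 1e-3) is an assumption, and the function name learning_rate is hypothetical.

```python
def learning_rate(
    epoch: int, initial: float = 1e-3, decay: float = 0.5, step: int = 10
) -> float:
    """Learning rate for a 0-based epoch: halved after every `step` epochs."""
    return initial * decay ** (epoch // step)


# Schedule over the 100 iterations mentioned in the text.
schedule = [learning_rate(epoch) for epoch in range(100)]
```

Under this reading, rounds 0 to 9 use 1e-3, rounds 10 to 19 use 5e-4, and the final ten rounds use 1e-3 multiplied by 0.5 to the ninth power.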
After the terminal device obtains the trained graph neural network model, it tests the model with the test set to determine the output value of the trained graph neural network model, and then judges whether this output value is within a preset range. When the output value is within the preset range, the trained graph neural network model is determined as the target graph neural network model. When the output value is not within the preset range, the training parameters of the preset graph neural network model are adjusted, the preset graph neural network model is trained again with the training set, and the process is repeated until the target graph neural network model is obtained. The preset range may be set according to actual needs, and this embodiment does not limit this.
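The train, test and adjust cycle described above can be sketched as a generic loop. The callables train, evaluate and adjust are hypothetical placeholders for the actual model code, and the max_rounds cap is an added assumption so that the loop always terminates.

```python
from typing import Any, Callable, Tuple


def fit_until_in_range(
    train: Callable[[Any], Any],
    evaluate: Callable[[Any], float],
    adjust: Callable[[Any, float], Any],
    params: Any,
    low: float,
    high: float,
    max_rounds: int = 20,
) -> Tuple[Any, Any]:
    """Retrain with adjusted parameters until the test output is in [low, high]."""
    for _ in range(max_rounds):
        model = train(params)        # train the preset model on the training set
        output = evaluate(model)     # output value measured on the test set
        if low <= output <= high:
            return model, params     # this model becomes the target model
        params = adjust(params, output)
    raise RuntimeError("no model reached the preset range")
```

The returned pair holds the accepted model and the training parameters that produced it, so the parameters can be recorded alongside the target model.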
In the technical scheme provided by this embodiment, a sample data set is obtained, where the sample data set includes a training set and a test set, the training set is used to train the preset graph neural network model, the test set is used to test the trained preset graph neural network model so as to determine an output value of the trained preset graph neural network model, and when the output value of the trained preset graph neural network model is within a preset range, the trained preset graph neural network model is determined as the target graph neural network model. According to the scheme, the target graph neural network model is obtained through training, the identification precision of the target graph neural network model can be improved, the dormancy duration of the current crawling action determined through the target graph neural network model is more accurate, and the webpage content acquisition efficiency is further improved.
Based on the foregoing embodiments, the present invention further provides an apparatus for acquiring web content, where the apparatus for acquiring web content may include a memory, a processor, and a web content acquisition program stored in the memory and executable on the processor, and when the processor executes the web content acquisition program, the steps of the method for acquiring web content according to any of the foregoing embodiments are implemented.
Based on the foregoing embodiments, the present invention further provides a readable storage medium, on which an acquiring program of web content is stored, where the acquiring program of web content implements the steps of the acquiring method of web content according to any one of the foregoing embodiments when executed by a processor.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a smart tv, a mobile phone, a computer, etc.) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.