
CN113821705A - Webpage content acquisition method, terminal equipment and readable storage medium - Google Patents


Info

Publication number
CN113821705A
Authority
CN
China
Prior art keywords
address
webpage
target
content
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111007979.1A
Other languages
Chinese (zh)
Other versions
CN113821705B (en
Inventor
蒋林钰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University filed Critical Hunan University
Priority to CN202111007979.1A priority Critical patent/CN113821705B/en
Publication of CN113821705A publication Critical patent/CN113821705A/en
Application granted granted Critical
Publication of CN113821705B publication Critical patent/CN113821705B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/958Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/958Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/972Access to data in other repository systems, e.g. legacy data or dynamic Web page generation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method for acquiring webpage content, a terminal device and a readable storage medium. The method for acquiring webpage content includes: binding at least two IP addresses; acquiring a target webpage address from a webpage address queue, wherein the number of webpage addresses to be crawled stored in the webpage address queue is less than or equal to the number of bound IP addresses; acquiring, from the bound IP addresses, a target IP address corresponding to the target webpage address; and crawling the webpage content corresponding to the target webpage address through the target IP address. The invention can improve the acquisition efficiency of webpage content.

Figure 202111007979

Description

Webpage content acquisition method, terminal equipment and readable storage medium
Technical Field
The invention relates to the technical field of web crawlers, in particular to a method for acquiring webpage content, a terminal device and a readable storage medium.
Background
At present, web crawlers are commonly used to acquire webpage content, but their most serious problem is IP (Internet Protocol) restriction. A website limits the number of requests a single IP address may make within a given period; once the number of requests exceeds the upper limit, the requests are identified as coming from a crawler and rejected. IP restriction is not aimed only at web crawlers; it is also a measure for preventing DoS (Denial of Service) attacks. A common asynchronous web crawler uses only a limited number of IP addresses, so it easily reaches the request upper limit and has its requests rejected, which makes the acquisition of webpage content inefficient.
Disclosure of Invention
The invention mainly aims to provide a method for acquiring webpage content, a terminal device and a readable storage medium, and aims to improve the acquisition efficiency of the webpage content.
In order to achieve the above object, the present invention provides a method for acquiring web page content, where the method for acquiring web page content includes:
binding at least two IP addresses;
acquiring a target webpage address from a webpage address queue, wherein the number of the webpage addresses to be crawled stored in the webpage address queue is less than or equal to the number of the bound IP addresses;
acquiring a target IP address corresponding to the target webpage address from the bound IP addresses;
and crawling the webpage content corresponding to the target webpage address through the target IP address.
Optionally, the bound IP address is an IPv6 address under the same subnet.
Optionally, before the step of crawling the web content corresponding to the target web address by using the target IP address, the method further includes:
obtaining a browser testing framework (Selenium);
the step of crawling the web page content corresponding to the target web page address through the target IP address comprises the following steps:
and crawling the webpage content corresponding to the target webpage address through the Selenium and the target IP address.
Optionally, after the step of crawling the web content corresponding to the target web address by using the target IP address, the method further includes:
acquiring the context characteristics and the space characteristics of the webpage content;
generating a feature vector corresponding to the context feature;
generating an adjacency matrix corresponding to the spatial features;
inputting the feature vectors and the adjacency matrix into a target graph neural network model to determine output values of the target graph neural network model;
and determining the target dormancy duration of the current crawling action according to the output value.
Optionally, the step of determining a target sleep duration of the current crawling action according to the output value includes:
determining a correction value of the dormancy duration of the current crawling action according to the output value;
acquiring preset dormancy duration;
and correcting the preset dormancy duration by adopting the correction value to obtain the target dormancy duration.
Optionally, before the step of binding at least two IP addresses, the method further includes:
acquiring a sample data set, wherein the sample data set comprises a training set and a test set;
training a preset graph neural network model by adopting the training set;
testing the trained preset graph neural network model by using the test set to determine an output value of the trained preset graph neural network model;
and when the output value of the trained preset graph neural network model is in a preset range, determining the trained preset graph neural network model as the target graph neural network model.
Optionally, the step of acquiring the sample data set includes:
obtaining context characteristics and space characteristics of the marked webpage content;
and determining the sample data set according to the marked contextual characteristics and spatial characteristics of the webpage content.
Optionally, before the step of obtaining the target webpage address from the webpage address queue, the method further includes:
initializing a webpage address queue;
and adding the webpage address to be crawled to the webpage address queue.
In addition, in order to achieve the above object, the present invention further provides a terminal device, where the terminal device includes a memory, a processor, and a program for acquiring web content stored in the memory and executable on the processor, and the program for acquiring web content is executed by the processor to implement any one of the steps of the method for acquiring web content.
In addition, to achieve the above object, the present invention further provides a readable storage medium, in which a program for acquiring web content is stored, and the program for acquiring web content, when executed by a processor, implements the steps of the method for acquiring web content according to any one of the above items.
The invention provides a method for acquiring webpage content, a terminal device and a readable storage medium, wherein the terminal device acquires a target webpage address from a webpage address queue by binding IP addresses, the number of the webpage addresses to be crawled stored in the webpage address queue is less than or equal to the number of the bound IP addresses, the target IP address corresponding to the target webpage address is acquired from the bound IP addresses, and a webpage corresponding to the target webpage address is crawled through the target IP address. According to the scheme, the webpage addresses to be crawled correspond to the IP addresses bound by the terminal equipment, different IP addresses can be allocated to each webpage address to be crawled for webpage crawling, the situation that a single IP address is rejected when reaching the request upper limit is avoided, and the problem of IP limitation in a network crawler is effectively solved.
Drawings
Fig. 1 is a schematic hardware architecture diagram of an apparatus for acquiring web page content according to an embodiment of the present invention;
Fig. 2 is a flowchart illustrating a method for acquiring web content according to a first embodiment of the present invention;
Fig. 3 is a flowchart illustrating a method for acquiring web content according to a second embodiment of the present invention;
Fig. 4 is a flowchart illustrating a method for acquiring web page content according to a third embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As one implementation, referring to Fig. 1, which is a schematic diagram of the hardware architecture of an apparatus for acquiring web content according to an embodiment of the present invention, the apparatus for acquiring web content may include a processor 101 (for example, a CPU), a memory 102, and a communication bus 103, where the communication bus 103 is used to implement connection and communication between these components.
The memory 102 may be a high-speed RAM memory or a non-volatile memory (e.g., a disk memory). As shown in fig. 1, a memory 102, which is a computer-readable storage medium, may include therein an acquisition program of web page content; and the processor 101 may be configured to call the obtaining program of the web page content stored in the memory 102, and perform the following operations:
binding at least two IP addresses;
acquiring a target webpage address from a webpage address queue, wherein the number of the webpage addresses to be crawled stored in the webpage address queue is less than or equal to the number of the bound IP addresses;
acquiring a target IP address corresponding to the target webpage address from the bound IP addresses;
and crawling the webpage content corresponding to the target webpage address through the target IP address.
Further, the processor 101 may be configured to call the obtaining program of the web page content stored in the memory 102, and perform the following operations:
obtaining a browser testing framework (Selenium);
the step of crawling the web page content corresponding to the target web page address through the target IP address comprises the following steps:
and crawling the webpage content corresponding to the target webpage address through the Selenium and the target IP address.
Further, the processor 101 may be configured to call the obtaining program of the web page content stored in the memory 102, and perform the following operations:
acquiring the context characteristics and the space characteristics of the webpage content;
generating a feature vector corresponding to the context feature;
generating an adjacency matrix corresponding to the spatial features;
inputting the feature vectors and the adjacency matrix into a target graph neural network model to determine output values of the target graph neural network model;
and determining the target dormancy duration of the current crawling action according to the output value.
Further, the processor 101 may be configured to call the obtaining program of the web page content stored in the memory 102, and perform the following operations:
determining a correction value of the dormancy duration of the current crawling action according to the output value;
acquiring preset dormancy duration;
and correcting the preset dormancy duration by adopting the correction value to obtain the target dormancy duration.
Further, the processor 101 may be configured to call the obtaining program of the web page content stored in the memory 102, and perform the following operations:
acquiring a sample data set, wherein the sample data set comprises a training set and a test set;
training a preset graph neural network model by adopting the training set;
testing the trained preset graph neural network model by using the test set to determine an output value of the trained preset graph neural network model;
and when the output value of the trained preset graph neural network model is in a preset range, determining the trained preset graph neural network model as the target graph neural network model.
Further, the processor 101 may be configured to call the obtaining program of the web page content stored in the memory 102, and perform the following operations:
obtaining context characteristics and space characteristics of the marked webpage content;
and determining the sample data set according to the marked contextual characteristics and spatial characteristics of the webpage content.
Further, the processor 101 may be configured to call the obtaining program of the web page content stored in the memory 102, and perform the following operations:
initializing a webpage address queue;
and adding the webpage address to be crawled to the webpage address queue.
With the development of the internet, network resources have become carriers of a large amount of information, and how to acquire and use webpage content effectively has become an important question. Crawler technology plays a key role here: it locates information accurately and can crawl and push the most suitable content according to search requirements. However, web crawlers face a serious IP (Internet Protocol) restriction problem. The firewall of a website limits the number of requests a given IP address may make within a given period: if the number of requests does not exceed the upper limit, data is returned normally; if it exceeds the upper limit, the requests are rejected. IP restriction is often not aimed specifically at web crawlers; in most cases it is a defensive measure against DoS (Denial of Service) attacks adopted for website security. Because the number of IP addresses used during background crawling is limited, a web crawler easily reaches the request upper limit while crawling webpage content and has its requests rejected, so the acquisition of webpage content is inefficient.
Based on the technical problems in the prior art, the invention provides a method for acquiring webpage content, which is characterized in that a plurality of IP addresses under the same subnet are bound to a terminal device, when the webpage content is acquired by using a web crawler, the webpage addresses to be crawled stored in a webpage address queue are corresponding to the IP addresses bound by the terminal device, different IP addresses are distributed to each webpage address to be crawled for crawling the webpage content, the phenomenon that a single IP address is rejected when reaching a request upper limit is avoided, and the problem of IP limitation in the web crawler is solved. The following further explains the method for acquiring web page content according to the present invention by using specific embodiments.
Referring to fig. 2, fig. 2 is a schematic flowchart of a method for acquiring web page content according to a first embodiment of the present invention, where the method for acquiring web page content includes:
step S10, binding at least two IP addresses;
step S20, obtaining a target webpage address from a webpage address queue, wherein the number of the webpage addresses to be crawled stored in the webpage address queue is less than or equal to the number of the bound IP addresses;
step S30, acquiring a target IP address corresponding to the target webpage address from the bound IP addresses;
and step S40, crawling the webpage content corresponding to the target webpage address through the target IP address.
The method for acquiring web content is executed by the terminal device. Optionally, the terminal device may be a fixed terminal such as a desktop computer, or a mobile terminal such as a notebook computer, a tablet, or a mobile phone. In other embodiments, the terminal device may also be any other device capable of running a web crawler, which is not limited in this embodiment.
The web page content crawled in this embodiment mainly refers to web page content of social networks; optionally, the social network may be Weibo, WeChat, Zhihu, and the like.
In this embodiment, when a web crawler is used to obtain web page content of a social network, at least two IP addresses are bound to the terminal device, where the bound IP addresses are IPv6 addresses in the same subnet. Specifically, a plurality of IPv6 addresses can be randomly generated using the sample function of the random library, with a code statement of the form str = ''.join(random.sample('0123456789abcdef', 4)). Each execution of this statement yields one 4-digit hexadecimal IPv6 address segment; in practical application, such segments can be joined with ":" to obtain an IPv6 address for use by the web crawler. After the IPv6 addresses are obtained, they are bound to the terminal device.
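The address generation described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the subnet prefix (taken from the 2001:db8::/32 documentation range), the /64 prefix length, and the function names are assumptions, and the operating-system step of binding each address to a network interface is not shown.

```python
import ipaddress
import random

HEX_DIGITS = "0123456789abcdef"

def random_segment() -> str:
    # random.sample draws without replacement, so the 4 hex digits
    # within one segment never repeat.
    return "".join(random.sample(HEX_DIGITS, 4))

def random_ipv6_in_subnet(prefix_segments: list[str]) -> str:
    # Keep the subnet prefix fixed and randomise the remaining segments
    # so that all generated addresses share the same subnet.
    host = [random_segment() for _ in range(8 - len(prefix_segments))]
    return ":".join(prefix_segments + host)

# Hypothetical /64 subnet inside the IPv6 documentation range.
PREFIX = ["2001", "0db8", "0000", "0001"]
subnet = ipaddress.ip_network("2001:db8:0:1::/64")

addresses = [random_ipv6_in_subnet(PREFIX) for _ in range(5)]
for addr in addresses:
    print(addr, ipaddress.ip_address(addr) in subnet)
```

Because each segment is sampled without replacement, this scheme never produces segments with repeated digits such as "0000"; that only narrows the pool of candidate addresses.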
After the terminal device binds the IP addresses, it acquires a target webpage address from a webpage address queue. The webpage address queue stores the webpage addresses to be crawled, and the number of webpage addresses to be crawled stored in it is less than or equal to the number of IP addresses bound to the terminal device. It should be noted that the queue is continuously updated during crawling: as crawling progresses, new webpage addresses to be crawled are added to the queue and webpage addresses that have already been crawled are deleted from it.
After the terminal device obtains the target webpage address, a target IP address corresponding to the target webpage address is obtained from the IP addresses bound by the terminal device, and webpage content corresponding to the target webpage address is crawled through the target IP address.
Optionally, after binding the IP addresses, the terminal device initializes a web address queue and adds a web address to be crawled to it. The terminal device then obtains a target web address from the queue, obtains the corresponding target IP address from its bound IP addresses, and crawls the web content corresponding to the target web address through the target IP address. After the page is crawled, the new web addresses found in the web content become new web addresses to be crawled and are added to the queue, and the web address that has just been crawled is deleted from the queue. This process repeats until no web address to be crawled remains in the queue, at which point the crawl ends. For example, after the terminal device initializes the web address queue, one web address to be crawled is added to it. The terminal device obtains this address from the queue as the target web address, allocates its first bound IP address to it as the target IP address, and crawls the corresponding web content through that IP address. Suppose 50 new web addresses are found in the crawled content. These 50 addresses are added to the queue as new web addresses to be crawled, and the previously crawled address is deleted. The terminal device can then obtain the 50 addresses from the queue in sequence as 50 target web addresses, take its first 50 bound IP addresses as target IP addresses, allocate one to each of the 50 target web addresses, and crawl each target web address through its allocated target IP address. The allocation rule may be to allocate the first IP address to the first web address to be crawled, the second IP address to the second web address to be crawled, and so on. This process repeats until no web address to be crawled remains in the web address queue, and the crawl ends.
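The queue-driven allocation can be sketched as below. This is a simplified, synchronous illustration under stated assumptions: `crawl`, `fake_fetch`, and the toy `SITE` link graph are hypothetical names; the constraint that pending addresses never outnumber bound IP addresses is enforced here by taking at most one batch of that size per round; and the `seen` set is an addition so the loop terminates when pages link to each other.

```python
from collections import deque

def crawl(seed_urls, bound_ips, fetch):
    """Crawl breadth-first, giving the i-th URL of each batch the i-th bound IP."""
    queue = deque(seed_urls)
    seen = set(seed_urls)          # assumption: skip URLs already queued
    results = {}
    while queue:
        # Never take more URLs per round than there are bound IP addresses.
        batch = [queue.popleft() for _ in range(min(len(queue), len(bound_ips)))]
        for ip, url in zip(bound_ips, batch):
            content, links = fetch(url, ip)
            results[url] = (ip, content)
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
    return results

# Toy stand-in for real network access: page -> links found on that page.
SITE = {"a": ["b", "c"], "b": ["c"], "c": []}

def fake_fetch(url, ip):
    return f"content of {url}", SITE[url]

results = crawl(["a"], ["ip1", "ip2"], fake_fetch)
print(results)
```

In this toy run, page "a" is fetched through the first IP address, and the two pages it links to are then fetched through the first and second IP addresses respectively.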
It should be noted that, during IP address allocation, the local_addr parameter of the aiohttp.TCPConnector object may be set to the new IP address, and the resulting object may be stored in a conn variable. Then, when an asynchronous crawler task is constructed, the conn variable is passed to the connector parameter of the aiohttp.ClientSession object to complete the allocation of the IP address.
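A minimal sketch of this step with aiohttp might look like the following. The helper names `local_addr_for` and `fetch_from` are hypothetical, the third-party aiohttp package must be installed, and the source address must already be bound to a local interface for the request to succeed.

```python
def local_addr_for(ip: str) -> tuple[str, int]:
    # aiohttp expects local_addr as a (host, port) tuple;
    # port 0 lets the operating system pick an ephemeral source port.
    return (ip, 0)

async def fetch_from(ip: str, url: str) -> str:
    # Imported here so the helper above stays usable without aiohttp.
    import aiohttp

    # Bind outgoing connections of this session to the chosen source IP.
    conn = aiohttp.TCPConnector(local_addr=local_addr_for(ip))
    async with aiohttp.ClientSession(connector=conn) as session:
        async with session.get(url) as resp:
            return await resp.text()
```

A caller would run one such coroutine per target address, for example with `asyncio.run(fetch_from(ip, url))` or by gathering several of them, each with a different bound address.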
Optionally, when crawling the web content corresponding to the target web address through the target IP address, the terminal device may acquire the browser testing framework Selenium and crawl the web content through Selenium and the target IP address. In this embodiment, crawling through Selenium loads the target web address directly in a browser and thereby simulates real user behavior, which helps evade anti-crawling mechanisms.
In the technical scheme provided by this embodiment, the terminal device obtains the target webpage address from the webpage address queue by binding the IP address, where the number of the to-be-crawled webpage addresses stored in the webpage address queue is less than or equal to the number of the bound IP addresses, obtains the target IP address corresponding to the target webpage address from the bound IP addresses, and crawls the webpage corresponding to the target webpage address through the target IP address. According to the scheme, the webpage addresses to be crawled correspond to the IP addresses bound by the terminal equipment, different IP addresses can be allocated to each webpage address to be crawled for webpage crawling, the situation that a single IP address is rejected when reaching the request upper limit is avoided, and the problem of IP limitation in a network crawler is effectively solved.
Referring to fig. 3, fig. 3 is a flowchart illustrating a second embodiment of the method for acquiring web page content according to the present invention, and based on the first embodiment, after the step of S40, the method further includes:
step S50, obtaining the context feature and the space feature of the webpage content;
step S60, generating a feature vector corresponding to the context feature;
step S70, generating an adjacency matrix corresponding to the spatial features;
step S80, inputting the feature vector and the adjacency matrix into a target graph neural network model to determine an output value of the target graph neural network model;
and step S90, determining the target dormancy duration of the current crawling action according to the output value.
In this embodiment, after crawling the web content of the target web address through the target IP address, the terminal device may obtain a context feature and a spatial feature of the web content, where the context feature of the web content is used to represent whether the web content is easy to read, and the spatial feature of the web content is used to represent a parent-child relationship between pages included in the web content.
After the terminal device obtains the context feature and the spatial feature of the web page content, the context feature of the web page content can be converted into a corresponding feature vector, and the spatial feature of the web page content can be converted into a corresponding adjacency matrix.
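As an illustration of the second conversion, the parent-child relationships between pages can be turned into an adjacency matrix as sketched below. The function name, the page identifiers, and the choice of a symmetric (undirected) matrix are assumptions; the text does not specify the encoding.

```python
def build_adjacency(pages, parent_child):
    """Return an n x n 0/1 matrix with a 1 wherever two pages are parent and child."""
    index = {page: i for i, page in enumerate(pages)}
    n = len(pages)
    adj = [[0] * n for _ in range(n)]
    for parent, child in parent_child:
        i, j = index[parent], index[child]
        adj[i][j] = 1
        adj[j][i] = 1  # assumption: treat the relationship as undirected
    return adj

# Hypothetical page tree: "home" links to "news" and "about"; "news" links to "post".
pages = ["home", "news", "about", "post"]
edges = [("home", "news"), ("home", "about"), ("news", "post")]

adjacency = build_adjacency(pages, edges)
for row in adjacency:
    print(row)
```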
After the terminal device acquires the feature vector corresponding to the context feature and the adjacency matrix corresponding to the spatial feature, it inputs them into the target graph neural network model to determine the output value of the target graph neural network model. The target sleep duration of the current crawling action is then determined according to the output value of the target graph neural network model, where the target sleep duration of the current crawling action refers to the action time delay of the current crawling action.
Optionally, a correction value of the sleep duration of the current crawling action is determined according to the output value of the target graph neural network model; meanwhile, a preset sleep duration of the current crawling action is obtained, and the preset sleep duration is corrected with the correction value to obtain the target sleep duration of the current crawling action. The preset sleep duration may be set according to actual needs, which is not limited in this embodiment.
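The text does not state how the correction is applied. One plausible reading, sketched here purely as an assumption, is an additive correction in seconds that is clamped so the delay never becomes negative; `target_sleep` and its parameters are hypothetical names.

```python
import time

def target_sleep(preset_seconds: float, correction_seconds: float,
                 minimum: float = 0.0) -> float:
    # Assumption: the correction derived from the model output is added
    # to the preset duration, and the result is never below `minimum`.
    return max(minimum, preset_seconds + correction_seconds)

delay = target_sleep(preset_seconds=2.0, correction_seconds=0.5)
print(delay)
# A crawler would then pause before its next action:
# time.sleep(delay)
```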
In the technical scheme provided by this embodiment, a feature vector corresponding to a context feature is generated by obtaining the context feature and a spatial feature of web page content, an adjacency matrix corresponding to the spatial feature is generated, the feature vector and the adjacency matrix are input into a target graph neural network model to determine an output value of the target graph neural network model, and a target sleep duration of a current crawling action is determined according to the output value. According to the scheme, an action time delay can be set for the current crawling action through the target graph neural network model and the characteristics of the crawled webpage content, the reading and searching actions of human beings can be simulated, the anti-crawling detection measures can be effectively avoided, and the acquisition efficiency of the webpage content is further improved.
Referring to fig. 4, fig. 4 is a flowchart illustrating a method for acquiring web content according to a third embodiment of the present invention, where based on the second embodiment, before the step of S10, the method further includes:
step S100, acquiring a sample data set, wherein the sample data set comprises a training set and a test set;
step S200, training a preset graph neural network model by adopting the training set;
step S300, testing the trained preset graph neural network model by using the test set to determine an output value of the trained preset graph neural network model;
and step S400, when the output value of the trained preset graph neural network model is in a preset range, determining the trained preset graph neural network model as the target graph neural network model.
In this embodiment, the terminal device may obtain a sample data set, where the sample data set includes a training set and a test set, and a ratio of the training set to the test set may be selected to be 9:1, that is, 90% of the sample data set is used as the training set and 10% is used as the test set.
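A 9:1 split can be sketched as follows. Shuffling before the split and the fixed seed are assumptions added for reproducibility; `split_samples` is a hypothetical name.

```python
import random

def split_samples(samples, train_ratio=0.9, seed=0):
    """Shuffle the samples and split them into (training set, test set)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)   # assumption: shuffle before splitting
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

train_set, test_set = split_samples(range(100))
print(len(train_set), len(test_set))
```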
Optionally, the terminal device may obtain the contextual characteristics and the spatial characteristics of the labeled web content, and determine the sample data set according to them. Specifically, the terminal device may obtain the labeled web page content, perform word segmentation on it to split it into words or phrases, and detect and extract the Chinese, English, and special characters it contains to obtain the contextual characteristics of the labeled web page content. Meanwhile, the terminal device can analyze the parent-child relationship of the web pages contained in the labeled web page content to obtain its spatial characteristics. The terminal device can then convert the contextual characteristics into a feature vector, convert the spatial characteristics into an adjacency matrix, divide the feature vectors and adjacency matrices into a training set and a test set at a ratio of 9:1, and determine the resulting training set and test set as the sample data set.
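The detection of Chinese, English, and special characters could be turned into a small feature vector as sketched below. This is only one hypothetical encoding (three character counts); the actual features are not specified in the text, and word segmentation, which would normally rely on a dedicated segmenter, is not shown.

```python
def char_class_counts(text: str) -> list[int]:
    """Return [Chinese characters, English letters, special characters] in text."""
    chinese = english = special = 0
    for ch in text:
        if "\u4e00" <= ch <= "\u9fff":          # CJK Unified Ideographs block
            chinese += 1
        elif ch.isascii() and ch.isalpha():
            english += 1
        elif not ch.isalnum() and not ch.isspace():
            special += 1                         # punctuation and symbols
    return [chinese, english, special]

features = char_class_counts("Hi 你好!")
print(features)
```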
After the terminal device acquires the sample data set, it trains the preset graph neural network model with the training set to obtain the trained graph neural network model. The preset graph neural network model consists of a convolutional layer, a pooling layer, and a fully connected layer. The number of network layers, layer_gnn, may be set to 3; the activation function may be the ReLU function; the fully connected layer may be implemented through a softmax function; the learning rate, which decays exponentially, may be set to 1e-3 and the learning-rate decay rate to 0.5; and the number of iterations may be set to 100, with the learning rate decayed once every 10 rounds.
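The learning-rate settings above can be read as a staircase schedule. The sketch below is one such reading in Python; it assumes that 1e-3 is the starting value and that the rate is multiplied by 0.5 at rounds 10, 20, and so on. The function name learning_rate and its parameter names are hypothetical.

```python
def learning_rate(epoch, initial_lr=1e-3, decay_rate=0.5, decay_every=10):
    """Learning rate for a zero-based round index under staircase exponential decay."""
    return initial_lr * decay_rate ** (epoch // decay_every)


# One value per round for the 100 rounds mentioned in the embodiment.
schedule = [learning_rate(epoch) for epoch in range(100)]
```

Under these assumptions, rounds 0 to 9 use 1e-3, rounds 10 to 19 use 5e-4, and the last ten rounds use 1e-3 multiplied by 0.5 to the ninth power.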
After the terminal device obtains the trained graph neural network model, it tests the model with the test set to determine the model's output value and then judges whether that output value is within a preset range. When the output value is within the preset range, the trained graph neural network model is determined as the target graph neural network model. When the output value is not within the preset range, the training parameters of the preset graph neural network model are adjusted and the model is trained again with the training set; this process is repeated until the target graph neural network model is obtained. The preset range may be set according to actual needs, and this embodiment does not limit it.
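The train, test, and adjust cycle described above can be sketched as a simple loop. This Python sketch is illustrative only: the callables train, evaluate, and adjust are hypothetical stand-ins for the training, testing, and parameter-adjustment procedures, and the max_rounds cap is an added assumption, since the embodiment repeats until a target model is obtained.

```python
def fit_until_in_range(train, evaluate, adjust, params, low, high, max_rounds=10):
    """Retrain with adjusted parameters until the test output lies in [low, high]."""
    for _ in range(max_rounds):
        model = train(params)        # train the preset model on the training set
        output = evaluate(model)     # test the trained model on the test set
        if low <= output <= high:
            return model, output     # accepted as the target model
        params = adjust(params)      # otherwise adjust the training parameters
    raise RuntimeError("no trained model reached the preset range")
```

A caller would supply its own train, evaluate, and adjust functions, along with the bounds of the preset range.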
In the technical solution provided by this embodiment, a sample data set comprising a training set and a test set is obtained; the training set is used to train the preset graph neural network model; the test set is used to test the trained preset graph neural network model so as to determine its output value; and when that output value is within a preset range, the trained preset graph neural network model is determined as the target graph neural network model. Because the target graph neural network model is obtained through training, its recognition accuracy can be improved, the sleep duration of the current crawling action determined through the target graph neural network model is more accurate, and the webpage content acquisition efficiency is further improved.
Based on the foregoing embodiments, the present invention further provides an apparatus for acquiring web content, where the apparatus for acquiring web content may include a memory, a processor, and a web content acquisition program stored in the memory and executable on the processor, and when the processor executes the web content acquisition program, the steps of the method for acquiring web content according to any of the foregoing embodiments are implemented.
Based on the foregoing embodiments, the present invention further provides a readable storage medium, on which an acquiring program of web content is stored, where the acquiring program of web content implements the steps of the acquiring method of web content according to any one of the foregoing embodiments when executed by a processor.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a smart tv, a mobile phone, a computer, etc.) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A method for acquiring webpage content, characterized in that the method comprises: binding at least two IP addresses; obtaining a target webpage address from a webpage address queue, wherein the number of webpage addresses to be crawled stored in the webpage address queue is less than or equal to the number of bound IP addresses; obtaining, from the bound IP addresses, a target IP address corresponding to the target webpage address; and crawling the webpage content corresponding to the target webpage address through the target IP address.
2. The method for acquiring webpage content according to claim 1, characterized in that the bound IP addresses are IPv6 addresses under the same subnet.
3. The method for acquiring webpage content according to claim 1, characterized in that, before the step of crawling the webpage content corresponding to the target webpage address through the target IP address, the method further comprises: obtaining the browser testing framework Selenium; and the step of crawling the webpage content corresponding to the target webpage address through the target IP address comprises: crawling the webpage content corresponding to the target webpage address through Selenium and the target IP address.
4. The method for acquiring webpage content according to claim 1, characterized in that, after the step of crawling the webpage content corresponding to the target webpage address through the target IP address, the method further comprises: obtaining contextual features and spatial features of the webpage content; generating a feature vector corresponding to the contextual features; generating an adjacency matrix corresponding to the spatial features; inputting the feature vector and the adjacency matrix into a target graph neural network model to determine an output value of the target graph neural network model; and determining a target sleep duration of the current crawling action according to the output value.
5. The method for acquiring webpage content according to claim 4, characterized in that the step of determining the target sleep duration of the current crawling action according to the output value comprises: determining a correction value for the sleep duration of the current crawling action according to the output value; obtaining a preset sleep duration; and correcting the preset sleep duration with the correction value to obtain the target sleep duration.
6. The method for acquiring webpage content according to claim 4, characterized in that, before the step of binding at least two IP addresses, the method further comprises: obtaining a sample data set, wherein the sample data set comprises a training set and a test set; training a preset graph neural network model with the training set; testing the trained preset graph neural network model with the test set to determine an output value of the trained preset graph neural network model; and, when the output value of the trained preset graph neural network model is within a preset range, determining the trained preset graph neural network model as the target graph neural network model.
7. The method for acquiring webpage content according to claim 6, characterized in that the step of obtaining a sample data set comprises: obtaining contextual features and spatial features of labeled webpage content; and determining the sample data set according to the contextual features and spatial features of the labeled webpage content.
8. The method for acquiring webpage content according to claim 1, characterized in that, before the step of obtaining a target webpage address from the webpage address queue, the method further comprises: initializing the webpage address queue; and adding the webpage addresses to be crawled to the webpage address queue.
9. A terminal device, characterized in that the terminal device comprises a memory, a processor, and a webpage content acquisition program that is stored in the memory and executable on the processor, wherein the webpage content acquisition program, when executed by the processor, implements the steps of the method for acquiring webpage content according to any one of claims 1-8.
10. A readable storage medium, characterized in that a webpage content acquisition program is stored on the readable storage medium, wherein the webpage content acquisition program, when executed by a processor, implements the steps of the method for acquiring webpage content according to any one of claims 1-8.
CN202111007979.1A 2021-08-30 2021-08-30 Webpage content acquisition method, terminal equipment and readable storage medium Active CN113821705B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111007979.1A CN113821705B (en) 2021-08-30 2021-08-30 Webpage content acquisition method, terminal equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN113821705A true CN113821705A (en) 2021-12-21
CN113821705B CN113821705B (en) 2024-02-20

Family

ID=78923557

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111007979.1A Active CN113821705B (en) 2021-08-30 2021-08-30 Webpage content acquisition method, terminal equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN113821705B (en)

Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008046098A2 (en) * 2006-10-13 2008-04-17 Move, Inc. Multi-tiered cascading crawling system
US20090204610A1 (en) * 2008-02-11 2009-08-13 Hellstrom Benjamin J Deep web miner
US7987172B1 (en) * 2004-08-30 2011-07-26 Google Inc. Minimizing visibility of stale content in web searching including revising web crawl intervals of documents
CN103389983A (en) * 2012-05-08 2013-11-13 阿里巴巴集团控股有限公司 Webpage content grabbing method and device applied to network crawler system
US20140067854A1 (en) * 2012-08-28 2014-03-06 International Business Machines Corporation Crawling of generated server-side content
US20160275190A1 (en) * 2013-10-21 2016-09-22 Convida Wireless, Llc Crawling of m2m devices
CN107105071A (en) * 2017-05-05 2017-08-29 北京京东金融科技控股有限公司 IP call methods and device, storage medium, electronic equipment
CN108897788A (en) * 2018-06-11 2018-11-27 平安科技(深圳)有限公司 Data crawling method, device, computer equipment and storage medium
CN109063216A (en) * 2018-10-17 2018-12-21 珠海市智图数研信息技术有限公司 A kind of distributed vertical service search crawler frame
CN109413050A (en) * 2018-10-05 2019-03-01 国网湖南省电力有限公司 A kind of internet vulnerability information acquisition method that access rate is adaptive and system
CN110175278A (en) * 2019-05-24 2019-08-27 新华三信息安全技术有限公司 The detection method and device of web crawlers
CN111064745A (en) * 2019-12-30 2020-04-24 厦门市美亚柏科信息股份有限公司 Self-adaptive back-climbing method and system based on abnormal behavior detection
CN111104578A (en) * 2019-12-18 2020-05-05 北京阿尔山区块链联盟科技有限公司 Crawler system, method and server
CN111651656A (en) * 2020-06-02 2020-09-11 重庆邮电大学 A dynamic web crawler method and system based on foundry mode
CN111858929A (en) * 2020-06-22 2020-10-30 网宿科技股份有限公司 A network crawler detection method, system and device based on graph neural network
CN112100472A (en) * 2020-09-11 2020-12-18 深圳市科盾科技有限公司 Crawler scheduling method and device, terminal equipment and readable storage medium
CN112199567A (en) * 2020-09-27 2021-01-08 深圳市伊欧乐科技有限公司 Distributed data acquisition method, system, server and storage medium
CN112989158A (en) * 2019-12-16 2021-06-18 顺丰科技有限公司 Method, device and storage medium for identifying webpage crawler behavior

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
兰秋军;: "互联网金融数据抓取方法研究", 计算机工程与设计, no. 05 *
唐雪峰;宋俊德;宋美娜;: "基于改进的慢开始算法的网络机器人爬取策略的研究", 新型工业化, no. 11, 20 November 2012 (2012-11-20) *
高晖: "面向Web2.0社区的爬虫关键技术研究", 中国优秀硕士学位论文全文数据库, 29 July 2011 (2011-07-29) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116719986A (en) * 2023-08-10 2023-09-08 深圳传趣网络技术有限公司 Python-based data grabbing method, device, equipment and storage medium
CN116719986B (en) * 2023-08-10 2023-12-26 深圳传趣网络技术有限公司 Python-based data grabbing method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN113821705B (en) 2024-02-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant