CN106982268B - Information processing method and server - Google Patents
- Publication number
- CN106982268B (application CN201610031134.9A)
- Authority
- CN
- China
- Prior art keywords
- task
- information
- domain name
- token
- pool
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L61/00—Network arrangements, protocols or services for addressing or naming
- H04L61/45—Network directories; Name-to-address mapping
- H04L61/4505—Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols
- H04L61/4511—Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols using domain name system [DNS]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L47/00—Traffic control in data switching networks
- H04L47/50—Queue scheduling
- H04L47/52—Queue scheduling by attributing bandwidth to queues
- H04L47/527—Quantum based scheduling, e.g. credit or deficit based scheduling or token bank
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L61/00—Network arrangements, protocols or services for addressing or naming
- H04L61/50—Address allocation
- H04L61/5061—Pools of addresses
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
Abstract
The embodiment of the invention discloses an information processing method and a server, which comprise the following steps: adding task information to be subjected to data capture into a task pool; extracting domain name information of the task information, generating a token according to the domain name information, and adding the token into a token pool; determining a capturing rate corresponding to the domain name information according to the schedulable task information quantity under the domain name information in the token pool; scanning the token pool, determining the number of tokens corresponding to the domain name information according to the capturing rate, and sending the tokens of corresponding number to a scheduling queue when the number of tokens meets a preset condition; obtaining a token from the scheduling queue, selecting first task information meeting a second preset condition in the task pool according to domain name information corresponding to the token, and adding the first task information to a processing queue; and extracting the first task information from the processing queue, and capturing corresponding data according to the first task information.
Description
Technical Field
The invention relates to an information processing technology, in particular to an information processing method and a server.
Background
With the rapid development of internet technology, web pages become carriers of massive information. At present, one of the ways to extract information from a web page is a web crawler, and specifically, a script configured for extracting web page data is used to extract specified web page content.
In the process of implementing the technical solution of the embodiment of the present application, the inventor of the present application finds at least the following technical problems in the related art:
1. The capture rate and the capture frequency of web page data are pre-configured in the script program and cannot be adjusted automatically according to the number of tasks during data capture. 2. The capture rate is set per configured project; because different projects may be configured with the same domain name, the overall capture rate of the system under that domain name may become too high, causing the proxy Internet Protocol (IP) address to be blocked. No effective solution to these problems exists in the related art.
Disclosure of Invention
In order to solve the existing technical problems, embodiments of the present invention provide an information processing method and a server, which can automatically adjust the capture rate and prevent the proxy IP from being blocked.
In order to achieve the above purpose, the technical solution of the embodiment of the present invention is realized as follows:
the embodiment of the invention provides an information processing method, which comprises the following steps:
adding task information to be subjected to data capture into a task pool;
extracting domain name information of the task information, generating a token according to the domain name information, and adding the token into a token pool;
determining a capturing rate corresponding to the domain name information according to the schedulable task information quantity under the domain name information in the token pool;
scanning the token pool, determining the number of tokens corresponding to the domain name information according to the capturing rate, and sending the tokens of corresponding number to a scheduling queue when the number of tokens meets a preset condition;
obtaining a token from the scheduling queue, selecting first task information meeting a second preset condition in the task pool according to domain name information corresponding to the token, and adding the first task information to a processing queue;
and extracting the first task information from the processing queue, and capturing corresponding data according to the first task information.
In the foregoing solution, the determining, according to the number of schedulable task information under the domain name information in the token pool, a capture rate corresponding to the domain name information includes:
according to the domain name information of the tokens in the token pool, counting the quantity of task information which corresponds to the domain name information and meets a third preset condition in the task pool; wherein the third preset condition comprises: the state of the task information is schedulable, and the sum of the last scheduling time of the task and the grabbing frequency of the task is less than the current time;
the capture rate corresponding to the domain name information satisfies the following expression:
rate = (n + X) / Y
wherein rate represents the capture rate; n represents the number of task information; X and Y are both positive integers.
In the above scheme, after the task information to be subjected to data capture is added to the task pool, the method further includes:
extracting second task information from the task pool; the second task information represents a list page task;
counting the number of subtasks within a preset time range based on the second task information, and determining the grabbing frequency corresponding to the second task information based on the number of the subtasks;
updating the initial grabbing frequency based on the grabbing frequency.
In the above scheme, the capture frequency age satisfies the following expression:
wherein n represents the number of subtasks within a preset time range counted based on the second task information; T1, T2 and T3 are all positive integers.
In the foregoing scheme, the determining the number of tokens corresponding to the domain name information according to the capture rate includes:
setting rate to represent the corresponding capture rate of the domain name information, lasttime to represent the last token generation time, nowtime to represent the current time, and remainder to represent the number of the tokens left when the tokens are generated; the token number N corresponding to the domain name information satisfies the following expression:
N = rate × (nowtime - lasttime) + remainder.
In the foregoing scheme, when the number of tokens satisfies a preset condition, sending a corresponding number of tokens to a scheduling queue includes:
when the number N of the tokens is more than or equal to 1, obtaining an integer part of the number N of the tokens, and sending the tokens meeting the number of the integer part to a scheduling queue; assigning a fractional part of the token number N to the remainder;
and when the token number N is less than 1, directly assigning the token number N to the remainder.
In the above scheme, the number of the scheduling queues is multiple; sending the corresponding number of tokens to a scheduling queue includes:
processing the domain name information according to a preset processing mode; sending tokens with the number corresponding to the domain name information to a first scheduling queue corresponding to a processing result; the first scheduling queue is one of a plurality of scheduling queues.
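The routing of tokens to one of several scheduling queues can be illustrated with a minimal sketch. The source does not say what the "preset processing mode" is; the sketch assumes a stable hash of the domain name taken modulo the number of queues, and the names pick_queue_index and dispatch_tokens are hypothetical.

```python
import zlib


def pick_queue_index(domain: str, queue_count: int) -> int:
    """Map a domain name to one of `queue_count` scheduling queues.

    Assumption: the "preset processing mode" is a stable hash modulo the
    number of queues. The source does not specify the actual mode.
    """
    if queue_count <= 0:
        raise ValueError("queue_count must be positive")
    # crc32 is deterministic across runs, unlike the built-in hash() for str.
    return zlib.crc32(domain.encode("utf-8")) % queue_count


def dispatch_tokens(domain: str, count: int, queues: list[list[str]]) -> int:
    """Append `count` tokens for `domain` to its queue and return the queue index."""
    index = pick_queue_index(domain, len(queues))
    queues[index].extend([domain] * count)
    return index
```

With a mapping of this kind, all tokens of one domain name land in the same queue, so a single scheduler sees every token for that domain.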
In the above scheme, the token includes: domain name information and domain name Internet Protocol (IP); the method further comprises the following steps: analyzing the domain name information in the token pool according to a preset period to obtain a first domain name IP corresponding to the domain name information;
comparing the first domain name IP obtained by analysis with the domain name IPs in the token pool;
when the first domain name IP is in the token pool, updating the resolution time of the domain name IP corresponding to the first domain name IP in the token pool;
when the first domain name IP is not in the token pool, adding the first domain name IP to the token pool;
and when the second domain name IP in the token pool is not in the first domain name IP obtained by analysis, deleting the second domain name IP from the token pool.
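The three comparison rules above (refresh, add, delete) amount to reconciling the stored IP set with the freshly resolved one. A minimal sketch follows, assuming the token keeps a mapping from each domain name IP to its last resolution time; the function name and the data layout are illustrative and not taken from the source.

```python
def reconcile_domain_ips(
    stored: dict[str, float], resolved: set[str], now: float
) -> tuple[set[str], set[str]]:
    """Update `stored` ({ip: last_resolution_time}) in place.

    Returns the sets of IPs that were added and removed.
    """
    known = set(stored)
    added = resolved - known      # resolved now, not yet in the token pool
    removed = known - resolved    # in the token pool, no longer resolved
    for ip in resolved:
        # Existing IPs get a refreshed resolution time; new IPs are added.
        stored[ip] = now
    for ip in removed:
        del stored[ip]
    return added, removed
```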
In the foregoing scheme, the selecting, in the task pool, first task information that meets a second preset condition according to domain name information corresponding to the token includes:
selecting first task information meeting the following conditions in the task pool according to the domain name information corresponding to the token:
matching the domain name corresponding to the task information with the domain name information of the token;
and the status of the task information is schedulable;
and the sum of the last scheduling time of the task and the grabbing frequency of the task is less than the current time.
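The three selection conditions can be expressed as a single predicate. The sketch below assumes times are in seconds and treats the grabbing frequency as a minimum interval between two schedulings of the same task; the Task fields and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    task_id: str
    domain: str
    schedulable: bool
    last_scheduled: float  # last scheduling time, in seconds (assumed unit)
    age: float             # grabbing frequency, assumed to be an interval in seconds


def is_eligible(task: Task, token_domain: str, now: float) -> bool:
    """Apply the three conditions: domain match, schedulable state, interval elapsed."""
    return (
        task.domain == token_domain
        and task.schedulable
        and task.last_scheduled + task.age < now
    )


def select_tasks(pool: list[Task], token_domain: str, now: float) -> list[Task]:
    """Return the tasks in the pool that a token for `token_domain` may schedule."""
    return [task for task in pool if is_eligible(task, token_domain, now)]
```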
In the foregoing solution, the adding the first task information to a processing queue includes:
and sorting the first task information meeting the second preset condition according to priority and/or hierarchy, and preferentially adding the task information with high priority and/or low hierarchy to the processing queue.
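One way to realise this ordering is a sort with a composite key. The sketch assumes that a larger priority value means higher priority and that the hierarchy is a numeric level where a smaller value is handled first, with the level used as a tie-breaker; the source leaves the exact combination ("and/or") open, and the dictionary keys are illustrative.

```python
def order_for_processing(tasks: list[dict]) -> list[dict]:
    """Sort tasks so that high priority comes first, then low hierarchy level.

    Assumptions: larger "priority" means more urgent, smaller "level" means
    a lower hierarchy. Both conventions are illustrative.
    """
    return sorted(tasks, key=lambda task: (-task["priority"], task["level"]))
```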
In the above scheme, adding the task information to be subjected to data capture to the task pool includes:
configuring a capture project, generating corresponding project information based on the configured capture project, and sending the project information to a project pool for storage; the project information comprises a project script program and the status of the project; the status of the project comprises an installed status and an uninstalled status;
scanning project information in an uninstalled state in the project pool, and detecting whether a task in the project information is in the task pool;
and when the task is not in the task pool, adding the task information of the task into the task pool.
An embodiment of the present invention further provides a server, where the server includes: a project launcher, a token pool, a task pool, a rate controller, a token generator, a scheduling queue, a scheduler, a processing queue and a processor; wherein,
the project launcher is used for adding task information to be subjected to data capture to a task pool; extracting domain name information of the task information, generating a token according to the domain name information, and adding the token into a token pool;
the task pool is used for storing task information;
the token pool is used for storing tokens;
the rate controller is used for determining the capturing rate corresponding to the domain name information according to the schedulable task information quantity under the domain name information in the token pool;
the token generator is used for scanning the token pool, determining the number of tokens corresponding to the domain name information according to the capture rate determined by the rate controller, and sending the tokens of corresponding number to a scheduling queue when the number of tokens meets a preset condition;
the scheduler is used for obtaining a token from the scheduling queue, selecting first task information meeting a second preset condition in the task pool according to domain name information corresponding to the token, and adding the first task information to a processing queue;
the processor is used for extracting the first task information from the processing queue and executing the capture of corresponding data according to the first task information.
In the above scheme, the rate controller is configured to count, according to domain name information of tokens in the token pool, the number of task information that corresponds to the domain name information and satisfies a third preset condition in the task pool; wherein the third preset condition comprises: the state of the task information is schedulable, and the sum of the last scheduling time of the task and the grabbing frequency of the task is less than the current time;
the capture rate corresponding to the domain name information satisfies the following expression:
rate = (n + X) / Y
wherein rate represents the capture rate; n represents the number of task information; X and Y are both positive integers.
In the above solution, the server further includes a frequency controller, configured to extract second task information from the task pool; the second task information represents a list page task; counting the number of subtasks within a preset time range based on the second task information, and determining the grabbing frequency corresponding to the second task information based on the number of the subtasks; updating the initial grabbing frequency based on the grabbing frequency.
In the above scheme, the capture frequency age corresponding to the second task information determined by the frequency controller satisfies the following expression:
wherein n represents the number of subtasks within a preset time range counted based on the second task information; T1, T2 and T3 are all positive integers.
In the above scheme, the determining, by the token generator, the number of tokens corresponding to the domain name information according to the capture rate includes: setting rate to represent the capture rate corresponding to the domain name information, lasttime to represent the last token generation time, nowtime to represent the current time, and remainder to represent the number of tokens left over from the previous token generation; the token number N corresponding to the domain name information satisfies the following expression: N = rate × (nowtime - lasttime) + remainder.
In the above scheme, the token generator is configured to, when the number N of tokens is greater than or equal to 1, obtain an integer part of the number N of tokens, and send tokens satisfying the number of the integer part to a scheduling queue; assigning a fractional part of the token number N to the remainder; and when the token number N is less than 1, directly assigning the token number N to the remainder.
In the above scheme, the number of the scheduling queues is multiple; the number of the schedulers is multiple; the plurality of schedulers correspond to the plurality of scheduling queues one to one;
the token generator is used for processing the domain name information according to a preset processing mode; sending the token corresponding to the domain name information to a first scheduling queue corresponding to a processing result; the first scheduling queue is one of a plurality of scheduling queues.
In the above scheme, the token includes: domain name information and domain name IP; the server further comprises a domain name resolver, which is used for resolving the domain name information in the token pool according to a preset period to obtain a first domain name IP corresponding to the domain name information; comparing the first domain name IP obtained by analysis with the domain name IPs in the token pool; when the first domain name IP is in the token pool, updating the resolution time of the domain name IP corresponding to the first domain name IP in the token pool; when the first domain name IP is not in the token pool, adding the first domain name IP to the token pool; and when the second domain name IP in the token pool is not in the first domain name IP obtained by analysis, deleting the second domain name IP from the token pool.
In the foregoing solution, the scheduler is configured to select, in the task pool, first task information that satisfies the following conditions according to domain name information corresponding to the token: matching the domain name corresponding to the task information with the domain name information of the token; and the status of the task information is schedulable; and the sum of the last scheduling time of the task and the grabbing frequency of the task is less than the current time.
In the foregoing solution, the scheduler is configured to sort the plurality of first task information meeting the second preset condition according to priority and/or hierarchy, and preferentially add the task information meeting the high priority and/or the low hierarchy to the processing queue.
In the above scheme, the server further includes a configuration unit and a project pool; wherein,
the configuration unit is used for configuring a capture project, generating corresponding project information based on the configured capture project, and sending the project information to the project pool;
the project pool is used for storing project information; the project information comprises a project script program and the status of the project; the status of the project comprises an installed status and an uninstalled status;
the project launcher is used for scanning project information in an uninstalled state in the project pool and detecting whether a task in the project information is in the task pool; and when the task is not in the task pool, adding the task information of the task into the task pool.
The embodiment of the invention provides an information processing method and a server, wherein the method comprises the following steps: adding task information to be subjected to data capture into a task pool, the task information including address information of the task, a task identifier and an initial grabbing frequency; extracting domain name information of the task information, generating a token according to the domain name information, and adding the token into a token pool, the token including domain name information, an initial capture rate, a last token generation time, a domain name IP and a proxy IP; determining a capture rate corresponding to the domain name information according to the number of schedulable task information under the domain name information in the token pool; scanning the token pool, determining the number of tokens corresponding to the domain name information according to the capture rate, and sending the corresponding number of tokens to a scheduling queue when the number of tokens meets a preset condition; obtaining a token from the scheduling queue, selecting first task information meeting a second preset condition in the task pool according to the domain name information corresponding to the token, and adding the first task information to a processing queue; and extracting the first task information from the processing queue, and capturing corresponding data according to the first task information. Therefore, with the technical scheme of the embodiment of the invention, the capture rate corresponding to the domain name information is determined according to the number of schedulable task information in the token pool, so that the capture rate is adjusted automatically, the proxy IP is effectively prevented from being blocked, the data capture efficiency is improved, and the labor cost of manually configuring the capture rate is reduced.
Drawings
Fig. 1 is a schematic diagram of a first component structure of a server according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a second component structure of a server according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of an implementation scenario of a server according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of a third component structure of a server according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of a fourth component structure of a server according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of a fifth component structure of a server according to an embodiment of the present invention;
Fig. 7 is a first schematic flowchart of an information processing method according to an embodiment of the present invention;
Fig. 8 is a second schematic flowchart of an information processing method according to an embodiment of the present invention;
Fig. 9 is a third schematic flowchart of an information processing method according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
Example one
The embodiment of the invention provides an information processing method. Fig. 1 is a schematic structural diagram of a server according to a first embodiment of the present invention; as shown in fig. 1, the server includes: a project launcher 101, a token pool 102, a task pool 103, a rate controller 104, a token generator 105, a scheduling queue 106, a scheduler 107, a processing queue 108, and a processor 109; wherein,
the project launcher 101 is configured to add task information to be subjected to data capture to the task pool 103, where the task information includes: address information of the task, a task identifier, and an initial grabbing frequency; and to extract domain name information of the task information, generate a token according to the domain name information, and add the token to the token pool 102, where the token includes: domain name information, an initial capture rate, a last token generation time, a domain name IP, and a proxy IP;
the task pool 103 is used for storing task information;
the token pool 102 is used for storing tokens;
the rate controller 104 is configured to determine a capturing rate corresponding to the domain name information according to the number of schedulable task information under the domain name information in the token pool 102;
the token generator 105 is configured to scan the token pool 102, determine the number of tokens corresponding to the domain name information according to the capture rate determined by the rate controller 104, and send the corresponding number of tokens to the scheduling queue 106 when the number of tokens meets a preset condition;
the scheduler 107 is configured to obtain a token from the scheduling queue 106, select, in the task pool 103, first task information that meets a second preset condition according to domain name information corresponding to the token, and add the first task information to the processing queue 108;
the processor 109 is configured to extract the first task information from the processing queue 108, and execute fetching of corresponding data according to the first task information.
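The two records that flow between these components can be sketched as plain data classes. The field list follows the task information and token contents named above; the class names, field names, types and default values are assumptions made for illustration, and the remainder field anticipates the leftover token count used by the token generator.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TaskInfo:
    url: str                      # address information of the task
    task_id: str                  # task identifier
    age: float                    # initial grabbing frequency (assumed: seconds between grabs)
    schedulable: bool = True      # task state consulted by the rate controller and scheduler
    last_scheduled: float = 0.0   # last scheduling time


@dataclass
class Token:
    domain: str                   # domain name information, used as the key
    rate: float                   # initial capture rate, later updated by the rate controller
    last_time: float              # last token generation time
    domain_ips: list[str] = field(default_factory=list)  # resolved domain name IPs
    proxy_ip: Optional[str] = None                        # proxy IP from the project script
    remainder: float = 0.0        # fractional tokens left over from the previous generation
```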
FIG. 2 is a diagram illustrating a second structure of a server according to an embodiment of the present invention; as shown in fig. 2, in this embodiment, the server further includes a configuration unit 110 and an item pool 111; wherein,
the configuration unit 110 is configured to configure a capture project, generate corresponding project information based on the configured capture project, and send the project information to the project pool 111;
the project pool 111 is used for storing project information; the project information comprises a project script program and the status of the project; the status of the project comprises an installed status and an uninstalled status;
the project launcher 101 is configured to scan project information in an uninstalled state in the project pool 111, and detect whether a task in the project information is in the task pool 103; when the task is not in the task pool 103, adding task information of the task to the task pool 103.
Specifically, the operator configures the project content through the configuration unit 110, where the configured project content includes a project script program; the project script program comprises the address information of the webpage to be captured; the address information is, for example, a Uniform Resource Locator (URL); and generating project information based on the configured project content, and sending the project information to the project pool 111. Fig. 3 is a schematic diagram of an implementation scenario of a server according to an embodiment of the present invention; as shown in FIG. 3, an operator may configure data capture rules in the configuration interface shown in FIG. 3; and after the configuration is completed, generating a project script program.
The project launcher 101 is specifically configured to scan the project information in an uninstalled state stored in the project pool 111, and detect whether a task in the project information is stored in the task pool 103; when the task is not stored in the task pool 103, the corresponding task information is added to the task pool 103; the task information includes: the address information of the task, the task identifier and the initial grabbing frequency. The task information may further include a callback function, a priority, and the like.
Specifically, the project launcher 101 executes the Onstart() function of the project script program in the project information and, while executing it, uses the address information (specifically, a URL) configured in the project script program to detect whether that address information already exists in the task pool 103; when it is determined that the address information does not exist in the task pool 103, a task identifier (i.e., a task ID) corresponding to the address information is automatically generated, and the address information, the task identifier, and the preconfigured initial grabbing frequency are added to the task pool 103 as task information.
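The check-then-add step can be sketched as follows. The sketch assumes the task pool is a dictionary keyed by URL and that the task ID is a random UUID; the source only says the identifier is generated automatically, so both choices and the function name are illustrative.

```python
import uuid


def add_task_if_absent(task_pool: dict[str, dict], url: str, initial_age: float) -> bool:
    """Add task information for `url` unless the URL is already in the pool.

    Returns True when a new task was added, False when it already existed.
    """
    if url in task_pool:
        return False
    task_pool[url] = {
        "task_id": uuid.uuid4().hex,  # assumed ID scheme, not specified in the source
        "url": url,
        "age": initial_age,           # preconfigured initial grabbing frequency
    }
    return True
```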
Further, the project launcher 101 is further configured to extract domain name information of the task information; detecting whether the domain name information is in the token pool 102; when the domain name information is not in the token pool 102, generating token information according to the domain name information, and adding the token information to the token pool 102; and the token information is generated by taking the domain name information as a keyword and taking the initial capture rate, the last token generation time, the domain name IP and the proxy IP as keyword values.
In this embodiment of the present invention, the token may be denoted as Token and represents a data structure; accordingly, the token generator 105 may also be referred to as a Token generator. Specifically, the project launcher 101 extracts the domain name information corresponding to the task information (specifically, from the URL), and determines whether the domain name information exists in the token pool 102; when the domain name information is not in the token pool 102, a token is generated with the domain name as the keyword (key) and with the preconfigured initial capture rate, the last token generation time, the domain name IP, and the proxy IP as the keyword values (values); the domain name IP can be obtained by performing Domain Name System (DNS) resolution on the domain name information; the proxy IP is preset in the project script program. Further, the status of the project information in the project pool 111 is updated to the installed status.
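A minimal sketch of this key/value token creation is given below, assuming the token pool is a dictionary keyed by domain name. The function name ensure_token and the value field names are hypothetical, and DNS resolution is left to the caller so that the sketch runs without network access.

```python
from urllib.parse import urlparse


def ensure_token(
    token_pool: dict[str, dict],
    url: str,
    initial_rate: float,
    now: float,
    domain_ips: list[str],
    proxy_ip: str,
) -> str:
    """Create a token for the URL's domain if none exists; return the domain."""
    domain = urlparse(url).hostname
    if domain is None:
        raise ValueError(f"no domain name in URL: {url!r}")
    if domain not in token_pool:
        # Key: the domain name. Values: the fields listed in the text above.
        token_pool[domain] = {
            "rate": initial_rate,
            "lasttime": now,
            "domain_ips": list(domain_ips),
            "proxy_ip": proxy_ip,
        }
    return domain
```

Two tasks under the same domain name therefore share one token, which is what lets the capture rate be controlled per domain rather than per project.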
In this embodiment, the rate controller 104 is specifically configured to count, according to domain name information of the tokens in the token pool 102, the number of pieces of task information that correspond to the domain name information and satisfy a third preset condition in the task pool 103; wherein the third preset condition comprises: the state of the task information is schedulable, and the sum of the last scheduling time of the task and the grabbing frequency of the task is less than the current time;
the capture rate corresponding to the domain name information satisfies the following formula (1):
rate = (n + X) / Y (1)
wherein rate represents the capture rate; n represents the number of task information; X and Y are both positive integers.
Specifically, the rate controller 104 extracts domain name information from the token pool 102, and counts the number of tasks whose task states are schedulable and whose sum of the last scheduling time of the task and the capturing frequency of the task is less than the current time under the domain name information, where the number of tasks is denoted as n; calculating the corresponding capture rate of the domain name information according to the formula (1); the capture rate represents the capture rate of the webpage data corresponding to the domain name information.
Preferably, X is 360 and Y is 3600, i.e. formula (1) can be expressed as:
rate=(n+360)/3600 (2)
In formula (2), 3600 is the number of seconds in one hour, so the capture rate equals the number of tasks to be processed per second (n/3600) plus 0.1. Of course, formula (2) is only one example of the capture rate; X and Y may also be other positive integers, and this embodiment is not particularly limited in this respect.
Further, after the rate controller 104 determines the capture rate, the initial capture rate of the corresponding domain name information in the token pool 102 is updated according to the capture rate.
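The counting and rate update performed by the rate controller can be sketched as follows. The sketch assumes the general form rate = (n + X) / Y, which for the preferred values X = 360 and Y = 3600 gives n/3600 + 0.1 as described above; the function names and the task and token field names are hypothetical.

```python
def count_due_tasks(tasks, domain, now):
    """Count tasks under `domain` that meet the third preset condition."""
    return sum(
        1
        for t in tasks
        if t["domain"] == domain
        and t["status"] == "schedulable"
        and t["last_scheduled"] + t["frequency"] < now
    )


def capture_rate(n, x=360, y=3600):
    """Capture rate in tasks per second (assumed form (n + X) / Y)."""
    return (n + x) / y


def update_rates(token_pool, tasks, now):
    """Overwrite the stored capture rate of every domain in the token pool."""
    for domain, token in token_pool.items():
        token["rate"] = capture_rate(count_due_tasks(tasks, domain, now))
```

With no due tasks the rate falls back to 0.1 tasks per second, so a domain is never starved of tokens entirely.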
In this embodiment, the token generator 105 is configured to scan the token pool 102 and determine the number of tokens corresponding to the domain name information according to the capture rate determined by the rate controller 104, where the number of tokens N satisfies the following formula (3):
N=rate×(nowtime-lasttime)+remainder (3)
wherein rate represents the capture rate corresponding to the domain name information, lasttime represents the last token generation time, nowtime represents the current time, and remainder represents the fractional number of tokens left over from the previous token generation.
Further, the token generator 105 is configured to, when the number N of tokens is greater than or equal to 1, obtain an integer part of the number N of tokens, and send tokens that satisfy the number of the integer part to the scheduling queue 106; assigning a fractional part of the token number N to the remainder; and when the token number N is less than 1, directly assigning the token number N to the remainder.
Specifically, the token generator 105 scans the token pool 102, and calculates the number of tokens correspondingly generated under each domain name information according to formula (3). When the obtained number N of tokens is greater than or equal to 1, obtaining an integer part of the number N of tokens, for example, the integer part is M, and sending M tokens to the scheduling queue 106; assigning the decimal part of the token number N to a remainder; and when the obtained token number N is less than 1, directly assigning the token number N to the remainder. After completion, nowtime is assigned to lasttime, and the last token generation time is updated.
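The token calculation of formula (3), including the carry-over of the fractional part, can be sketched in Python as below. The function name `generate_tokens` and the dictionary layout of a token are assumptions made for illustration.

```python
import math


def generate_tokens(token, nowtime):
    """Apply formula (3) to one token record and return how many tokens to emit.

    token is a dict with the hypothetical keys "rate", "lasttime", "remainder".
    """
    n = token["rate"] * (nowtime - token["lasttime"]) + token["remainder"]
    # N >= 1: emit the integer part; N < 1: emit nothing and carry all of N over.
    whole = math.floor(n) if n >= 1 else 0
    token["remainder"] = n - whole
    token["lasttime"] = nowtime  # nowtime is assigned to lasttime
    return whole
```

The caller would push the returned number of tokens for that domain onto the scheduling queue. Carrying the fraction forward means that a rate below one token per scan still produces tokens over time instead of being rounded away.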
In this embodiment, the scheduler 107 is configured to select, in the task pool 103, first task information that satisfies a second preset condition according to the domain name information corresponding to the token, the second preset condition being: the domain name corresponding to the task information matches the domain name information of the token; the status of the task information is schedulable; and the sum of the last scheduling time of the task and the grabbing frequency of the task is less than the current time.
Further, the scheduler 107 is configured to sort the plurality of first task information meeting the second preset condition according to priority and/or hierarchy, and preferentially add the task information meeting the high priority and/or the low hierarchy to the processing queue 108.
Specifically, the scheduler 107 extracts a token from the scheduling queue 106 and, according to the domain name information carried in the token, selects one piece of task information from the task pool 103 for scheduling. The selected task information simultaneously satisfies the following three conditions: the domain name corresponding to the task information matches the domain name information of the token; the status of the task information is schedulable; and the sum of the last scheduling time of the task and the grabbing frequency of the task is less than the current time. The task information obtained by this screening is the first task information. Further, the first task information is sorted according to the priority information and/or hierarchy information it carries, for example from high to low by priority and from low to high by hierarchy, and the task with the highest priority and/or the lowest hierarchy is preferentially added to the processing queue 108; specifically, the task identifier (for example, the task ID) of the selected task and the proxy IP carried in the token corresponding to the task information are added to the processing queue 108. The hierarchy information can be understood as follows: a task of capturing the home page of a website may be recorded as level 1; a task of capturing a sub-page of the home page is recorded as level 2; and so on.
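The selection step can be sketched as follows. The text allows sorting by priority and/or hierarchy; this example assumes one particular combination (priority first, hierarchy level as the tie-breaker), and the function name `pick_task` and all field names are hypothetical.

```python
def pick_task(tasks, token, now):
    """Return (task_id, proxy_ip) for the best due task of the token's domain, or None."""
    due = [
        t
        for t in tasks
        if t["domain"] == token["domain"]                 # domain names match
        and t["status"] == "schedulable"                  # status is schedulable
        and t["last_scheduled"] + t["frequency"] < now    # the task is due
    ]
    if not due:
        return None
    # Highest priority first; among equal priorities, the lowest level wins.
    best = min(due, key=lambda t: (-t["priority"], t["level"]))
    return (best["id"], token["proxy_ip"])
```

The returned pair corresponds to what is placed on the processing queue: the task identifier together with the proxy IP carried in the token.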
In this embodiment, specifically, the processor 109 extracts a task identifier (for example, a task ID) and a proxy IP from the processing queue 108 and, according to the task identifier, obtains from the task pool 103 the information associated with it: the address information (for example, a URL), the callback function, the source address information (referer) of the address information, the user agent information (user agent), the cookie, and the like. The referer identifies the address of the previous page from which the address information was reached, and can be understood as the source address of the address information; the user agent information may specifically be browser information, including the hardware platform, system software, and application software. Further, the processor 109 captures the corresponding web page data according to this information, parses the captured web page data with the callback function, stores the parsed specific content (such as news or articles) into a result pool, and stores any new tasks obtained by parsing into the task pool 103. When the task processed by the processor 109 is a detail page task, data capture yields specific content, and the state of the task is accordingly updated to processed, so that the task cannot be scheduled again; when the task processed by the processor 109 is a list page task, data capture yields new tasks, and the state of the task is accordingly updated to waiting for scheduling, so that the task is scheduled again when its time is up.
A list page task is a task type for which the web page obtained by data capture is a navigation page; a detail page task is a task type for which the web page obtained by data capture is a specific-content page (such as news or articles). Further, the processing time of the task is updated to the current time.
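The state handling after a capture can be sketched as below. The function name `finish_task`, the field names, and the use of the string "schedulable" for the waiting-for-scheduling state are assumptions made for this example; the actual fetching and parsing are left out.

```python
def finish_task(task, task_pool, result_pool, content, new_tasks, now):
    """Record the outcome of one processed task.

    task_pool:   dict mapping task id -> task dict (hypothetical layout).
    result_pool: list collecting parsed content.
    """
    if content is not None:
        result_pool.append(content)                  # parsed specific content
    for new_task in new_tasks:
        task_pool.setdefault(new_task["id"], new_task)  # keep tasks already known
    if task["type"] == "detail":
        task["status"] = "processed"                 # never scheduled again
    else:
        task["status"] = "schedulable"               # list page: wait for the next round
    task["processed_at"] = now                       # processing time := current time
```

Only list page tasks return to the schedulable state, which is why the capture frequency matters for them and not for detail pages.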
By adopting the technical scheme of the embodiment of the invention, the rate controller determines the capture rate corresponding to the domain name information according to the number of pieces of schedulable task information under that domain name information in the token pool, so that the capture rate is adjusted automatically, the proxy IP is effectively prevented from being blocked, the data capture efficiency is improved, and the labor cost of manually configuring the capture rate is reduced.
Example two
The embodiment of the invention also provides a server. FIG. 4 is a diagram illustrating a third component structure of a server according to an embodiment of the present invention; as shown in FIG. 4, the server includes: a configuration unit 110, a project pool 111, a project launcher 101, a token pool 102, a task pool 103, a rate controller 104, a frequency controller 112, a token generator 105, a scheduling queue 106, a scheduler 107, a processing queue 108, and a processor 109; wherein,
the configuration unit 110 is configured to configure a capture project, generate corresponding project information based on the configured capture project, and send the project information to the project pool 111;
the project pool 111 is used for storing project information; the project information comprises a project script program and the status of the project; the status of the project comprises an installed status and an uninstalled status;
the project launcher 101 is configured to scan project information in the uninstalled status in the project pool 111, and detect whether a task in the project information is in the task pool 103; when the task is not in the task pool 103, add task information of the task to the task pool 103, the task information including the address information, the task identifier and the initial grabbing frequency of the task; and is further configured to detect whether the domain name information is in the token pool 102; when the domain name information is not in the token pool 102, generate a token according to the domain name information and add the token to the token pool 102; the token comprises: domain name information, an initial capture rate, a last token generation time, a domain name IP and a proxy IP;
the task pool 103 is used for storing task information;
the token pool 102 is used for storing tokens;
the rate controller 104 is configured to determine a capturing rate corresponding to the domain name information according to the number of schedulable task information under the domain name information in the token pool 102;
the frequency controller 112 is configured to extract second task information from the task pool 103; the second task information represents a list page task; counting the number of subtasks within a preset time range based on the second task information, and determining the grabbing frequency corresponding to the second task information based on the number of the subtasks; updating the initial grabbing frequency based on the grabbing frequency;
the token generator 105 is configured to scan the token pool 102, determine the number of tokens corresponding to the domain name information according to the capture rate determined by the rate controller 104, and send the corresponding number of tokens to the scheduling queue 106 when the number of tokens meets a preset condition;
the scheduler 107 is configured to obtain a token from the scheduling queue 106, select, in the task pool 103, first task information that meets a second preset condition according to domain name information corresponding to the token, and add the first task information to the processing queue 108;
the processor 109 is configured to extract the first task information from the processing queue 108, and execute fetching of corresponding data according to the first task information.
Specifically, the operator configures the project content through the configuration unit 110; the configured project content includes a project script program, and the project script program contains the address information (for example, a URL) of the web page to be captured. Project information is generated based on the configured project content and sent to the project pool 111. As shown in FIG. 3, the operator may configure data capture rules in the configuration interface; after the configuration is completed, the project script program is generated. Further, the project launcher 101 executes the Onstart() function of the project script program in the project information and, while executing it, detects whether the address information (specifically, the URL) configured in the project script program exists in the task pool 103; when it is determined that the address information does not exist in the task pool 103, a task identifier (i.e., a task ID) corresponding to the address information is automatically generated, and the address information, the task identifier, and a preconfigured initial grabbing frequency are added to the task pool 103 as task information. The task information may further include a callback function, a priority, and the like.
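The task registration performed by the project launcher can be sketched as follows. Keying the task pool by URL, generating the task ID with `uuid`, and the function name `register_task` are assumptions made for this example; the embodiment does not specify how the task identifier is produced.

```python
import uuid


def register_task(task_pool, url, initial_frequency, callback=None, priority=0):
    """Add a task for `url` unless the task pool already holds one; return the task."""
    if url in task_pool:
        return task_pool[url]
    task = {
        "id": uuid.uuid4().hex,          # automatically generated task identifier
        "url": url,                      # address information
        "frequency": initial_frequency,  # preconfigured initial grabbing frequency (seconds)
        "callback": callback,            # optional parsing callback
        "priority": priority,
        "status": "schedulable",
    }
    task_pool[url] = task
    return task
```

Registering the same URL twice leaves the first entry untouched, so restarting a project does not reset the state of tasks that are already in the pool.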
In this embodiment of the present invention, the token (also written as Token) is a data structure. Specifically, the project launcher 101 extracts the domain name information from the address information (specifically, the URL) in the task information, and determines whether that domain name information exists in the token pool 102. When the domain name information is not in the token pool 102, a token is generated with the domain name as the key and with a preconfigured initial capture rate, the last token generation time, the domain name IP, and the proxy IP as the value. The domain name IP can be obtained by performing DNS resolution on the domain name information; the proxy IP is preset in the project script program. Further, the status of the project information in the project pool 111 is updated to the installed status.
In this embodiment, the rate controller 104 determines the capturing rate corresponding to the domain name information according to the description in the first embodiment, which is not repeated herein. Further, after the rate controller 104 determines the capture rate, the initial capture rate of the corresponding domain name information in the token pool 102 is updated according to the capture rate.
In this embodiment, the token generator 105 scans the token pool 102 and calculates the number of tokens to be generated under each piece of domain name information according to formula (3) in the first embodiment. When the obtained token number N is greater than or equal to 1, the integer part of N, for example M, is taken and M tokens are sent to the scheduling queue 106, and the fractional part of N is assigned to remainder; when the obtained token number N is less than 1, N is directly assigned to remainder. After completion, nowtime is assigned to lasttime, thereby updating the last token generation time.
In this embodiment, the frequency controller 112 extracts second task information from the task pool 103, where the second task information represents a list page task; the list page task is a task type, and a webpage obtained after data capturing of the type is executed is a navigation page; that is, the second task information includes a plurality of subtasks. The frequency controller 112 counts the number of subtasks within a preset time range based on the second task information, and if the number of subtasks is n, the capture frequency age corresponding to the second task information determined by the frequency controller 112 satisfies the following formula (4):
wherein n represents the number of subtasks within the preset time range counted based on the second task information; T1, T2 and T3 are all positive integers. The unit of the obtained capture frequency age is seconds; it represents the time interval at which the scheduler 107 selects, in the task pool 103, the first task information meeting the second preset condition according to the domain name information corresponding to the token.
Preferably, T1 is 86400, T2 is 7200, and T3 is 60; formula (4) can then be expressed as formula (5):
Of course, formula (5) is only one example of the grabbing frequency; T1, T2 and T3 may take other values, and this embodiment is not particularly limited in this respect.
Further, the scheduler 107 is configured to obtain a token from the scheduling queue 106, and select, in the task pool 103, first task information that meets a second preset condition according to domain name information corresponding to the token. Wherein the second preset condition is: matching the domain name corresponding to the task information with the domain name information of the token; and the status of the task information is schedulable; and the sum of the last scheduling time of the task and the grabbing frequency of the task is less than the current time.
Further, the scheduler 107 is configured to sort the plurality of first task information meeting the second preset condition according to priority and/or hierarchy, and preferentially add the task information meeting the high priority and/or the low hierarchy to the processing queue 108.
Specifically, when the time interval indicated by the grabbing frequency has elapsed, the scheduler 107 extracts a token from the scheduling queue 106 and, according to the domain name information carried in the token, selects one piece of task information from the task pool 103 for scheduling. The selected task information simultaneously satisfies the following three conditions: the domain name corresponding to the task information matches the domain name information of the token; the status of the task information is schedulable; and the sum of the last scheduling time of the task and the grabbing frequency of the task is less than the current time. The task information obtained by this screening is the first task information. Further, the first task information is sorted according to the priority information and/or hierarchy information it carries, for example from high to low by priority and from low to high by hierarchy, and the task with the highest priority and/or the lowest hierarchy is preferentially added to the processing queue 108; specifically, the task identifier (for example, the task ID) of the selected task and the proxy IP carried in the token corresponding to the task information are added to the processing queue 108. The hierarchy information can be understood as follows: a task of capturing the home page of a website may be recorded as level 1; a task of capturing a sub-page of the home page is recorded as level 2; and so on.
Further, the processor 109 extracts a task identifier (for example, a task ID) and a proxy IP from the processing queue 108 and, according to the task identifier, obtains from the task pool 103 the information associated with it: the address information (for example, a URL), the callback function, the source address information (referer) of the address information, the user agent information (user agent), the cookie, and the like. The referer identifies the address of the previous page from which the address information was reached, and can be understood as the source address of the address information; the user agent information may specifically be browser information, including the hardware platform, system software, and application software. Further, the processor 109 captures the corresponding web page data according to this information, parses the captured web page data with the callback function, stores the parsed specific content (such as news or articles) into a result pool, and stores any new tasks obtained by parsing into the task pool 103. When the task processed by the processor 109 is a detail page task, data capture yields specific content, and the state of the task is accordingly updated to processed, so that the task cannot be scheduled again; when the task processed by the processor 109 is a list page task, data capture yields new tasks, and the state of the task is accordingly updated to waiting for scheduling, so that the task is scheduled again when its time is up. A list page task is a task type for which the web page obtained by data capture is a navigation page; a detail page task is a task type for which the web page obtained by data capture is a specific-content page (such as news or articles).
Further, the processing time of the task is updated to the current time.
By adopting the technical scheme of the embodiment of the invention, on one hand, the rate controller determines the capture rate corresponding to the domain name information according to the number of pieces of schedulable task information under that domain name information in the token pool, so that the capture rate is adjusted automatically, the proxy IP is effectively prevented from being blocked, the data capture efficiency is improved, and the labor cost of manually configuring the capture rate is reduced; on the other hand, the frequency controller determines the grabbing frequency by counting the number of subtasks within the preset time range in the task pool, so that the grabbing frequency of list page tasks is adjusted automatically and accurately, and the delay between the publication of web page data and its capture is reduced.
Example three
The embodiment of the invention also provides a server. FIG. 5 is a diagram illustrating a fourth component structure of a server according to an embodiment of the present invention; as shown in FIG. 5, the server includes: a configuration unit 110, a project pool 111, a project launcher 101, a token pool 102, a task pool 103, a rate controller 104, a frequency controller 112, a token generator 105, a scheduling queue 106, a scheduler 107, a processing queue 108, a processor 109, and a domain name resolver 113; wherein,
the configuration unit 110 is configured to configure a capture project, generate corresponding project information based on the configured capture project, and send the project information to the project pool 111;
the project pool 111 is used for storing project information; the project information comprises a project script program and the status of the project; the status of the project comprises an installed status and an uninstalled status;
the project launcher 101 is configured to scan project information in the uninstalled status in the project pool 111, and detect whether a task in the project information is in the task pool 103; when the task is not in the task pool 103, add task information of the task to the task pool 103, the task information including the address information, the task identifier and the initial grabbing frequency of the task; and is further configured to detect whether the domain name information is in the token pool 102; when the domain name information is not in the token pool 102, generate a token according to the domain name information and add the token to the token pool 102; the token comprises: domain name information, an initial capture rate, a last token generation time, a domain name IP and a proxy IP;
the task pool 103 is used for storing task information;
the token pool 102 is used for storing tokens;
the rate controller 104 is configured to determine a capturing rate corresponding to the domain name information according to the number of schedulable task information under the domain name information in the token pool 102;
the frequency controller 112 is configured to extract second task information from the task pool 103; the second task information represents a list page task; counting the number of subtasks within a preset time range based on the second task information, and determining the grabbing frequency corresponding to the second task information based on the number of the subtasks; updating the initial grabbing frequency based on the grabbing frequency;
the token generator 105 is configured to scan the token pool 102, determine the number of tokens corresponding to the domain name information according to the capture rate determined by the rate controller 104, and send the corresponding number of tokens to the scheduling queue 106 when the number of tokens meets a preset condition;
the scheduler 107 is configured to obtain a token from the scheduling queue 106, select, in the task pool 103, first task information that meets a second preset condition according to domain name information corresponding to the token, and add the first task information to the processing queue 108;
the processor 109 is configured to extract the first task information from the processing queue 108, and execute fetching of corresponding data according to the first task information;
the domain name resolver 113 is configured to resolve the domain name information in the token pool 102 according to a preset period, and obtain a first domain name IP corresponding to the domain name information; comparing the first domain name IP obtained by the resolution with the domain name IPs in the token pool 102; when the first domain name IP is in the token pool 102, updating the resolution time of the domain name IP corresponding to the first domain name IP in the token pool 102; adding the first domain name IP to the token pool 102 when the first domain name IP is not in the token pool 102; and when the second domain name IP in the token pool 102 is not in the first domain name IP obtained by resolution, deleting the second domain name IP from the token pool 102.
Specifically, the operator configures the project content through the configuration unit 110; the configured project content includes a project script program, and the project script program contains the address information (for example, a URL) of the web page to be captured. Project information is generated based on the configured project content and sent to the project pool 111. As shown in FIG. 3, the operator may configure data capture rules in the configuration interface; after the configuration is completed, the project script program is generated. Further, the project launcher 101 executes the Onstart() function of the project script program in the project information and, while executing it, detects whether the address information (specifically, the URL) configured in the project script program exists in the task pool 103; when it is determined that the address information does not exist in the task pool 103, a task identifier (i.e., a task ID) corresponding to the address information is automatically generated, and the address information, the task identifier, and a preconfigured initial grabbing frequency are added to the task pool 103 as task information. The task information may further include a callback function, a priority, and the like.
In this embodiment of the present invention, the token (also written as Token) is a data structure. Specifically, the project launcher 101 extracts the domain name information from the address information (specifically, the URL) in the task information, and determines whether that domain name information exists in the token pool 102. When the domain name information is not in the token pool 102, a token is generated with the domain name as the key and with a preconfigured initial capture rate, the last token generation time, the domain name IP, and the proxy IP as the value. The domain name IP can be obtained by performing DNS resolution on the domain name information; the proxy IP is preset in the project script program. Further, the status of the project information in the project pool 111 is updated to the installed status.
In this embodiment, the rate controller 104 determines the capturing rate corresponding to the domain name information according to the description in the first embodiment, which is not repeated herein. Further, after the rate controller 104 determines the capture rate, the initial capture rate of the corresponding domain name information in the token pool 102 is updated according to the capture rate.
In this embodiment, the token generator 105 scans the token pool 102 and calculates the number of tokens to be generated under each piece of domain name information according to formula (3) in the first embodiment. When the obtained token number N is greater than or equal to 1, the integer part of N, for example M, is taken and M tokens are sent to the scheduling queue 106, and the fractional part of N is assigned to remainder; when the obtained token number N is less than 1, N is directly assigned to remainder. After completion, nowtime is assigned to lasttime, thereby updating the last token generation time.
In this embodiment, the frequency controller 112 extracts second task information from the task pool 103, where the second task information represents a list page task; the list page task is a task type, and a webpage obtained after data capturing of the type is executed is a navigation page; that is, the second task information includes a plurality of subtasks. The frequency controller 112 counts the number of subtasks within a preset time range based on the second task information, and if the number of subtasks is n, the capture frequency age corresponding to the second task information determined by the frequency controller 112 satisfies formula (4) shown in embodiment two, which is not described herein again.
Further, the scheduler 107 is configured to obtain a token from the scheduling queue 106, and select, in the task pool 103, first task information that meets a second preset condition according to domain name information corresponding to the token. Wherein the second preset condition is: matching the domain name corresponding to the task information with the domain name information of the token; and the status of the task information is schedulable; and the sum of the last scheduling time of the task and the grabbing frequency of the task is less than the current time.
Further, the scheduler 107 is configured to sort the plurality of first task information meeting the second preset condition according to priority and/or hierarchy, and preferentially add the task information meeting the high priority and/or the low hierarchy to the processing queue 108.
Specifically, when the time interval indicated by the grabbing frequency has elapsed, the scheduler 107 extracts a token from the scheduling queue 106 and, according to the domain name information carried in the token, selects one piece of task information from the task pool 103 for scheduling. The selected task information simultaneously satisfies the following three conditions: the domain name corresponding to the task information matches the domain name information of the token; the status of the task information is schedulable; and the sum of the last scheduling time of the task and the grabbing frequency of the task is less than the current time. The task information obtained by this screening is the first task information. Further, the first task information is sorted according to the priority information and/or hierarchy information it carries, for example from high to low by priority and from low to high by hierarchy, and the task with the highest priority and/or the lowest hierarchy is preferentially added to the processing queue 108; specifically, the task identifier (for example, the task ID) of the selected task and the proxy IP carried in the token corresponding to the task information are added to the processing queue 108. The hierarchy information can be understood as follows: a task of capturing the home page of a website may be recorded as level 1; a task of capturing a sub-page of the home page is recorded as level 2; and so on.
Further, specifically, the processor 109 extracts a task identifier (for example, a task ID) and an agent IP from the processing queue 108, and obtains address information (for example, a URL), a callback function, source address information (refer) of the address information, user agent information (user agent), a cookie, and other information associated with the task identifier from the task pool 103 according to the task identifier (for example, the task ID). The refer identifies the address information of the last page corresponding to the address information, and can be understood as the source address information of the address information; the user agent information (user agent) may specifically be browser information, including information such as a hardware platform, system software, and application software. Further, the processor 109 captures corresponding web page data according to the information, analyzes the captured web page data according to the callback function, and stores the analyzed specific content (the specific content, such as news, articles, and the like) into a result pool; storing the analyzed new task into the task pool 103; when the task processed by the processor 109 is a detail page task, data capture is performed to obtain specific content, and correspondingly, the state of the task is updated to be processed, so that the corresponding task cannot be scheduled again; when the task processed by the processor 109 is a list page task, a new task is obtained after data capture, and correspondingly, the state of the task is updated to be in waiting scheduling, so that the corresponding task is scheduled again when the time is up. The list page task is a task type, and a webpage obtained after data capturing of the type is executed is a navigation page; the detail page task is a task type, and the web page obtained after the data capture of the type is executed is a specific content (such as news, articles and other contents) web page. 
Further, the processing time of the task is updated to the current time.
In this embodiment, the domain name resolver 113 periodically performs DNS resolution on the domain name information in the token pool 102 to obtain the domain name IP (denoted as the first domain name IP) corresponding to the domain name information, and compares the first domain name IP obtained through resolution with the domain name IPs in the token pool 102. The comparison determines whether a domain name IP in the token pool 102 has become invalid (in which case it is deleted), so that tasks are not scheduled again against an invalid domain name IP, or whether a domain name IP has been deleted by mistake or is otherwise missing from the token pool 102 (in which case it is added), and so on.
With the technical solution of this embodiment of the invention, in a first aspect, the rate controller determines the capture rate corresponding to the domain name information according to the number of schedulable task information items under that domain name information in the token pool, so that the capture rate is adjusted automatically, the proxy IP is effectively prevented from being blocked, data capture efficiency is improved, and the labor cost of manually configuring the capture rate is reduced; in a second aspect, the frequency controller determines the grabbing frequency by counting the number of subtasks within a preset time range in the task pool, so that the grabbing frequency of list page tasks is adjusted automatically and accurately, and the delay between publication of web page data and its capture is reduced; in a third aspect, the domain name resolver updates the domain name IP corresponding to the domain name information in the token pool, which avoids the waste of system resources caused by repeatedly resolving the same domain name, improves the processing speed of the system, and increases the amount of web page data captured.
Example four
The embodiment of the invention also provides a server. FIG. 6 is a diagram illustrating a fourth component structure of the server according to the embodiment of the present invention; as shown in fig. 6, the server includes: a configuration unit 110, an item pool 111, an item launcher 101, a token pool 102, a task pool 103, a rate controller 104, a frequency controller 112, a token generator 105, a plurality of scheduling queues 106, a plurality of schedulers 107, a processing queue 108, a processor 109, and a domain name resolver 113; wherein,
the configuration unit 110 is configured to configure a grab item, generate corresponding item information based on the configured grab item, and send the item information to the item pool 111;
the project pool 111 is used for storing project information; the project information comprises a project script program and the state of a project; the status of the item comprises an installed status and an uninstalled status;
the project launcher 101 is configured to scan the project pool 111 for project information in the uninstalled state, and detect whether a task in the project information is in the task pool 103; when the task is not in the task pool 103, add task information of the task to the task pool 103; the task information includes: the address information of the task, the task identifier, and the initial grabbing frequency; the project launcher 101 is further configured to detect whether the domain name information is in the token pool 102; when the domain name information is not in the token pool 102, generate a token according to the domain name information and add the token to the token pool 102; the token includes: domain name information, initial capture rate, last token generation time, domain name IP, and proxy IP;
the task pool 103 is used for storing task information;
the token pool 102 is used for storing tokens;
the rate controller 104 is configured to determine a capturing rate corresponding to the domain name information according to the number of schedulable task information under the domain name information in the token pool 102;
the frequency controller 112 is configured to extract second task information from the task pool 103; the second task information represents a list page task; counting the number of subtasks within a preset time range based on the second task information, and determining the grabbing frequency corresponding to the second task information based on the number of the subtasks; updating the initial grabbing frequency based on the grabbing frequency;
the token generator 105 is configured to scan the token pool 102, determine the number of tokens corresponding to the domain name information according to the capture rate determined by the rate controller 104, and send the corresponding number of tokens to the scheduling queue 106 when the number of tokens meets a preset condition;
the scheduler 107 is configured to obtain a token from the scheduling queue 106, select, in the task pool 103, first task information that meets a second preset condition according to domain name information corresponding to the token, and add the first task information to the processing queue 108;
the processor 109 is configured to extract the first task information from the processing queue 108, and execute fetching of corresponding data according to the first task information;
the domain name resolver 113 is configured to resolve the domain name information in the token pool 102 according to a preset period, and obtain a first domain name IP corresponding to the domain name information; comparing the first domain name IP obtained by the resolution with the domain name IPs in the token pool 102; when the first domain name IP is in the token pool 102, updating the resolution time of the domain name IP corresponding to the first domain name IP in the token pool 102; adding the first domain name IP to the token pool 102 when the first domain name IP is not in the token pool 102; when the second domain name IP in the token pool 102 is not in the first domain name IP obtained by resolution, deleting the second domain name IP from the token pool 102;
wherein, the schedulers 107 are in one-to-one correspondence with the scheduling queues 106;
the token generator 105 is configured to process the domain name information according to a preset processing manner; sending the token corresponding to the domain name information to a first scheduling queue 106 corresponding to a processing result; wherein the first scheduling queue 106 is one of a plurality of scheduling queues 106;
correspondingly, the scheduler 107 is configured to obtain a token from the corresponding scheduling queue 106, select, in the task pool 103, first task information that meets a second preset condition according to domain name information corresponding to the token, and add the first task information to the processing queue 108.
Specifically, the operator configures the project content through the configuration unit 110; the configured project content includes a project script program, which contains the address information (for example, a URL) of the web page to be captured; project information is generated based on the configured project content and sent to the project pool 111. As shown in FIG. 3, the operator may configure data capture rules in the configuration interface; after the configuration is completed, the project script program is generated. Further, the project launcher 101 executes the Onstart() function of the project script program in the project information and, while executing it, uses the address information (specifically, the URL) configured in the project script program to detect whether that address information exists in the task pool 103; when the address information does not exist in the task pool 103, a task identifier (i.e., a task ID) corresponding to the address information is generated automatically, and the address information, the task identifier, and the preconfigured initial grabbing frequency are added to the task pool 103 as task information. The task information may further include a callback function, a priority, and the like.
In this embodiment of the present invention, the Token may be specifically denoted as Token; which characterizes a data structure; accordingly, the Token generator 105 may also be referred to as a Token generator. Specifically, the project launcher 101 extracts domain name information corresponding to the task information (specifically, URL), and determines whether the domain name information exists in the token pool 102; when the domain name information is not in the token pool 102, generating a token by using the domain name as a key (the key may be represented as a key), and using a preconfigured initial capture rate, a last token generation time, a domain name IP, and a proxy IP as key values (the key values may be represented as values); the domain name IP can be obtained by performing DNS analysis on the domain name information; the proxy IP is preset in the project script program. Further, the status of the project information in the project pool 111 is updated to an installed status.
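As a rough illustration of the key/value token described above, the following Python sketch adds a token for a domain name only when the pool does not already hold one. The `ensure_token` helper, the field names, and the use of a plain dict as the token pool are assumptions made for illustration; the DNS lookup is not performed here, and its result is passed in as an argument.

```python
from urllib.parse import urlparse


def ensure_token(token_pool, url, initial_rate, domain_ips, proxy_ip, now):
    """Add a token for the URL's domain name if the pool has none yet.

    token_pool is a plain dict standing in for the token pool;
    every field name below is hypothetical.
    """
    domain = urlparse(url).hostname
    if domain is None:
        raise ValueError(f"no domain name in URL: {url!r}")
    if domain not in token_pool:
        token_pool[domain] = {               # key: domain name, value: token fields
            "rate": initial_rate,            # preconfigured initial capture rate
            "lasttime": now,                 # last token generation time
            "domain_ips": list(domain_ips),  # would come from DNS resolution
            "proxy_ip": proxy_ip,            # preset in the project script program
        }
    return domain
```

Calling the helper a second time for the same domain name leaves the existing token untouched, which mirrors the "when the domain name information is not in the token pool" condition.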
In this embodiment, the rate controller 104 determines the capturing rate corresponding to the domain name information according to the description in the first embodiment, which is not repeated herein. Further, after the rate controller 104 determines the capture rate, the initial capture rate of the corresponding domain name information in the token pool 102 is updated according to the capture rate.
In this embodiment, the token generator 105 scans the token pool 102 to calculate the number of tokens correspondingly generated under each domain name information according to formula (3) in the first embodiment. When the obtained number N of tokens is greater than or equal to 1, obtaining an integer part of the number N of tokens, for example, the integer part is M, and sending M tokens to the scheduling queue 106; assigning the decimal part of the token number N to a remainder; and when the obtained token number N is less than 1, directly assigning the token number N to the remainder. After completion, nowtime is assigned to lasttime, and the last token generation time is updated.
The token generator 105 processes the domain name information according to a preset processing mode (the preset processing mode may specifically be a cyclic redundancy check (CRC32) algorithm), and obtains a first scheduling queue 106 corresponding to a processing result; it can be understood that tokens corresponding to the same domain name information are uniformly sent to one scheduling queue 106, so as to avoid the problem of contention caused by simultaneous operation of tokens corresponding to the same domain name information by different subsequent schedulers 107.
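The queue selection can be sketched as follows. The text only states that the domain name information is processed with CRC32 and that the result corresponds to one scheduling queue; taking the checksum modulo the number of queues is an assumption, and `pick_queue_index` is a hypothetical name.

```python
import zlib


def pick_queue_index(domain: str, queue_count: int) -> int:
    """Map a domain name to a scheduling-queue index using CRC32.

    The modulo step is an assumed way of turning the checksum into
    a queue index; the source does not specify it.
    """
    if queue_count <= 0:
        raise ValueError("queue_count must be positive")
    return zlib.crc32(domain.encode("utf-8")) % queue_count
```

Because the checksum depends only on the domain name, every token for the same domain name lands in the same queue, so no two schedulers operate on the same domain name at once.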
In this embodiment, the frequency controller 112 extracts second task information from the task pool 103, where the second task information represents a list page task; the list page task is a task type, and a webpage obtained after data capturing of the type is executed is a navigation page; that is, the second task information includes a plurality of subtasks. The frequency controller 112 counts the number of subtasks within a preset time range based on the second task information, and if the number of subtasks is n, the capture frequency age corresponding to the second task information determined by the frequency controller 112 satisfies formula (4) shown in embodiment two, which is not described herein again.
Further, the scheduler 107 is configured to obtain a token from the scheduling queue 106, and select, in the task pool 103, first task information that meets a second preset condition according to domain name information corresponding to the token. Wherein the second preset condition is: matching the domain name corresponding to the task information with the domain name information of the token; and the status of the task information is schedulable; and the sum of the last scheduling time of the task and the grabbing frequency of the task is less than the current time.
Further, the scheduler 107 is configured to sort the plurality of first task information meeting the second preset condition according to priority and/or hierarchy, and preferentially add the task information meeting the high priority and/or the low hierarchy to the processing queue 108.
Specifically, in this embodiment, when the time indicated by the grabbing frequency is reached, the scheduler 107 extracts a token from the scheduling queue 106 corresponding to the scheduler 107 and, according to the domain name information carried in the token, selects one piece of task information from the task pool 103 for scheduling. The selected task information (the first task information) satisfies the following three conditions simultaneously: the domain name corresponding to the task information matches the domain name information of the token; the status of the task information is schedulable; and the sum of the last scheduling time of the task and the grabbing frequency of the task is less than the current time. Further, the first task information obtained by this screening is sorted according to the priority information and/or hierarchy information it carries, for example from high to low by priority and from low to high by hierarchy, and the task with the highest priority and/or the lowest hierarchy is added to the processing queue 108 first; specifically, the identifier of the selected task (for example, the task ID) and the proxy IP carried in the token corresponding to the task information are added to the processing queue 108. The hierarchy information can be understood as follows: a task that captures the home page of a website may be recorded as level 1; a task that captures a sub-page of the home page is recorded as level 2; and so on.
Further, the processor 109 extracts a task identifier (for example, a task ID) and a proxy IP from the processing queue 108 and, according to the task identifier, obtains from the task pool 103 the address information (for example, a URL), the callback function, the source address information (referer) of the address information, the user agent information (user agent), the cookie, and other information associated with the task identifier. The referer identifies the address of the previous page from which the address information was reached, and can be understood as the source address information of the address information; the user agent information may specifically be browser information, including the hardware platform, system software, and application software. Further, the processor 109 captures the corresponding web page data according to this information, parses the captured web page data with the callback function, stores the parsed specific content (such as news and articles) in a result pool, and stores any new tasks obtained by parsing in the task pool 103. When the task processed by the processor 109 is a detail page task, specific content is obtained after data capture, and the state of the task is updated to processed, so that the task is not scheduled again; when the task processed is a list page task, new tasks are obtained after data capture, and the state of the task is updated to waiting for scheduling, so that the task is scheduled again when its time arrives. A list page task is a task type for which the web page obtained by data capture is a navigation page; a detail page task is a task type for which the web page obtained is a page of specific content (such as news or articles).
Further, the processing time of the task is updated to the current time.
In this embodiment, the domain name resolver 113 periodically performs DNS resolution on the domain name information in the token pool 102 to obtain the domain name IP (denoted as the first domain name IP) corresponding to the domain name information, and compares the first domain name IP obtained through resolution with the domain name IPs in the token pool 102. The comparison determines whether a domain name IP in the token pool 102 has become invalid (in which case it is deleted), so that tasks are not scheduled again against an invalid domain name IP, or whether a domain name IP has been deleted by mistake or is otherwise missing from the token pool 102 (in which case it is added), and so on.
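The comparison performed by the domain name resolver can be illustrated with a small Python sketch. It assumes the stored domain name IPs for one domain name are kept as a dict mapping each IP to its last resolution time; that layout and the `reconcile_domain_ips` name are hypothetical, and the freshly resolved IPs are passed in rather than looked up.

```python
from typing import Dict, Iterable, Set, Tuple


def reconcile_domain_ips(
    stored: Dict[str, float], resolved: Iterable[str], now: float
) -> Tuple[Set[str], Set[str]]:
    """Bring the stored domain IPs in line with a fresh DNS result.

    stored maps domain IP -> last resolution time (assumed layout)
    and is modified in place. Returns (added, removed).
    """
    resolved_set = set(resolved)
    added = resolved_set - set(stored)
    removed = set(stored) - resolved_set
    for ip in removed:          # stored but no longer resolved: delete
        del stored[ip]
    for ip in resolved_set:     # already present: refresh time; new: add
        stored[ip] = now
    return added, removed
```

After the call, the stored set contains exactly the resolved IPs, each stamped with the current resolution time.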
With the technical solution of this embodiment of the invention, in a first aspect, the rate controller determines the capture rate corresponding to the domain name information according to the number of schedulable task information items under that domain name information in the token pool, so that the capture rate is adjusted automatically, the proxy IP is effectively prevented from being blocked, data capture efficiency is improved, and the labor cost of manually configuring the capture rate is reduced; in a second aspect, the frequency controller determines the grabbing frequency by counting the number of subtasks within a preset time range in the task pool, so that the grabbing frequency of list page tasks is adjusted automatically and accurately, and the delay between publication of web page data and its capture is reduced; in a third aspect, the domain name resolver updates the domain name IP corresponding to the domain name information in the token pool, which avoids the waste of system resources caused by repeatedly resolving the same domain name, improves the processing speed of the system, and increases the amount of web page data captured; in a fourth aspect, this embodiment adopts an architecture with multiple schedulers (each corresponding to one of multiple scheduling queues), so that schedulers can be added when the task volume is large; the architecture is therefore highly scalable and greatly improves the load capacity of the server.
In the first to fourth embodiments, the configuration unit 110, the project pool 111, the project launcher 101, the token pool 102, the task pool 103, the rate controller 104, the frequency controller 112, the token generator 105, the scheduling queue 106, the scheduler 107, the processing queue 108, the processor 109, and the domain name resolver 113 in the server may, in practical applications, be implemented by a Central Processing Unit (CPU), a Digital Signal Processor (DSP), or a Field-Programmable Gate Array (FPGA) in the server.
The server in these embodiments adopts a multi-node distributed design, so that when a fault occurs it can be located and handled quickly and accurately, which greatly saves human resources and reduces maintenance cost.
EXAMPLE five
Based on the first embodiment, the embodiment of the invention also provides an information processing method. FIG. 7 is a flowchart illustrating a first information processing method according to an embodiment of the present invention; as shown in fig. 7, the information processing method includes:
Step 201: adding task information to be subjected to data capture into a task pool; the task information includes: the address information of the task, the task identifier, and the initial grabbing frequency.
Here, the adding task information to be subjected to data capture to the task pool includes: configuring a grabbing item, generating corresponding item information based on the configured grabbing item, and sending the item information to an item pool for storage; the project information comprises a project script program and the state of a project; the status of the item comprises an installed status and an uninstalled status; scanning project information in an uninstalled state in the project pool, and detecting whether a task in the project information is in the task pool; and when the task is not in the task pool, adding the task information of the task into the task pool.
Specifically, the operator may configure the project content, where the configured project content includes a project script program; the project script program comprises the address information of the webpage to be captured; the address information such as a URL; and generating project information based on the configured project content, and sending the project information to a project pool. Referring to fig. 3, an operator may configure a data capture rule in the configuration interface shown in fig. 3; and after the configuration is completed, generating a project script program.
Further, the Onstart() function of the project script program in the project information is executed and, while it executes, the address information (specifically, the URL) configured in the project script program is used to detect whether that address information exists in the task pool; when the address information does not exist in the task pool, a task identifier (i.e., a task ID) corresponding to the address information is generated automatically, and the address information, the task identifier, and the preconfigured initial grabbing frequency are added to the task pool as task information.
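A minimal sketch of this check-then-add behaviour is given below. The text says only that the task identifier is generated automatically; deriving it from a SHA-256 digest of the URL is an assumption, as are the `add_task_if_absent` name, the dict-based task pool, and the status label.

```python
import hashlib


def add_task_if_absent(task_pool, url, initial_frequency):
    """Add task information for url unless the pool already holds it.

    task_pool is a dict keyed by task ID, standing in for the task pool.
    Deriving the ID from a digest of the URL is an assumed scheme.
    Returns True when a new task was added.
    """
    task_id = hashlib.sha256(url.encode("utf-8")).hexdigest()
    if task_id in task_pool:
        return False
    task_pool[task_id] = {
        "url": url,                      # address information
        "task_id": task_id,              # task identifier
        "frequency": initial_frequency,  # initial grabbing frequency
        "status": "schedulable",         # hypothetical status label
    }
    return True
```

Keying the pool by an identifier derived from the URL makes the existence check a single lookup.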
Step 202: extracting domain name information of the task information, generating a token according to the domain name information, and adding the token into a token pool; the token comprises: domain name information, initial capture rate, last token generation time, domain name IP, and proxy IP.
In this embodiment, the token may be denoted as Token, which is a data structure. Specifically, the domain name information corresponding to the task information is extracted (specifically, from the URL), and it is determined whether the domain name information exists in the token pool; when the domain name information is not in the token pool, a token is generated with the domain name as the key and the preconfigured initial capture rate, the last token generation time, the domain name IP, and the proxy IP as the value; the domain name IP can be obtained by performing Domain Name System (DNS) resolution on the domain name information; the proxy IP is preset in the project script program. Further, the state of the project information in the project pool is updated to the installed state.
Step 203: determining the capture rate corresponding to the domain name information according to the number of schedulable task information items under the domain name information in the token pool.
Here, the determining, according to the number of schedulable task information under the domain name information in the token pool, a capture rate corresponding to the domain name information includes:
according to the domain name information of the tokens in the token pool, counting the quantity of task information which corresponds to the domain name information and meets a third preset condition in the task pool; wherein the third preset condition comprises: the state of the task information is schedulable, and the sum of the last scheduling time of the task and the grabbing frequency of the task is less than the current time;
the capture rate corresponding to the domain name information satisfies the following formula (6):
rate=(n+X)/Y (6)
wherein rate represents the capture rate; n represents the number of task information items; X and Y are both positive integers.
Specifically, domain name information is extracted from the token pool, and the number of tasks under that domain name information whose state is schedulable and for which the sum of the last scheduling time and the grabbing frequency is less than the current time is counted and recorded as n; the capture rate corresponding to the domain name information is then calculated according to formula (6); the capture rate represents the rate at which the web page data corresponding to the domain name information is captured.
Preferably, X is 360 and Y is 3600, i.e. formula (6) can be expressed as:
rate=(n+360)/3600 (7)
In formula (7), 3600 is the number of seconds in 1 hour, so the capture rate equals the number of tasks to be processed per second (n/3600) plus 0.1. Of course, formula (7) is only one example of the capture rate; X and Y may be any positive integers, and this embodiment is not particularly limited in this respect.
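The rate calculation can be sketched in Python under one assumption: that formula (6) has the form rate = (n + X) / Y. That form is inferred from the stated values X = 360 and Y = 3600 and from the description of the result as the per-second task amount plus 0.1; the function name is hypothetical.

```python
def capture_rate(n: int, x: int = 360, y: int = 3600) -> float:
    """Capture rate for a domain name with n schedulable tasks.

    Assumes formula (6) is rate = (n + X) / Y, which with X = 360 and
    Y = 3600 gives n / 3600 + 0.1.
    """
    if n < 0 or x <= 0 or y <= 0:
        raise ValueError("n must be non-negative; x and y must be positive")
    return (n + x) / y
```

With no schedulable tasks the rate stays at the floor of 0.1, so a domain name is never starved of tokens; with 3600 schedulable tasks it rises to 1.1.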
Further, after the capture rate is determined, the initial capture rate of the corresponding domain name information in the token pool is updated according to the capture rate.
Step 204: scanning the token pool, determining the number of tokens corresponding to the domain name information according to the capture rate, and sending the corresponding number of tokens to a scheduling queue when the number of tokens meets a preset condition.
Here, the determining the number of tokens corresponding to the domain name information according to the fetching rate includes:
setting rate to represent the corresponding capture rate of the domain name information, lasttime to represent the last token generation time, nowtime to represent the current time, and remainder to represent the number of the tokens left when the tokens are generated; the token number N corresponding to the domain name information satisfies formula (8):
N=rate×(nowtime-lasttime)+remainder (8)
further, when the number of tokens meets a preset condition, sending a corresponding number of tokens to a scheduling queue includes: when the number N of the tokens is more than or equal to 1, obtaining an integer part of the number N of the tokens, and sending the tokens meeting the number of the integer part to a scheduling queue; assigning a fractional part of the token number N to the remainder; and when the token number N is less than 1, directly assigning the token number N to the remainder.
Specifically, the token pool is scanned, and the number of tokens correspondingly generated under each domain name information is calculated according to a formula (8). When the obtained number N of tokens is greater than or equal to 1, obtaining an integer part of the number N of tokens, for example, the integer part is M, and sending M tokens to a scheduling queue; assigning the decimal part of the token number N to a remainder; and when the obtained token number N is less than 1, directly assigning the token number N to the remainder. After completion, nowtime is assigned to lasttime, and the last token generation time is updated.
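Formula (8) and the remainder handling can be sketched as follows. The `Token` dataclass and the `generate_tokens` function are hypothetical names; only the arithmetic follows the text.

```python
import math
from dataclasses import dataclass


@dataclass
class Token:
    domain: str
    rate: float             # capture rate for the domain name
    lasttime: float         # last token generation time, in seconds
    remainder: float = 0.0  # fractional tokens carried over


def generate_tokens(token: Token, nowtime: float) -> int:
    """Return how many tokens to send to the scheduling queue now."""
    # Formula (8): N = rate x (nowtime - lasttime) + remainder
    n = token.rate * (nowtime - token.lasttime) + token.remainder
    if n >= 1:
        whole = math.floor(n)        # integer part M is sent
        token.remainder = n - whole  # fractional part is carried over
    else:
        whole = 0
        token.remainder = n          # N < 1: all of N is carried over
    token.lasttime = nowtime         # nowtime is assigned to lasttime
    return whole
```

Carrying the fractional part forward means a low rate such as 0.5 tokens per second still yields one token every two seconds instead of being rounded away on every scan.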
Step 205: obtaining a token from the scheduling queue, selecting first task information meeting a second preset condition in the task pool according to domain name information corresponding to the token, and adding the first task information to a processing queue.
Here, the selecting, in the task pool, first task information that satisfies a second preset condition according to domain name information corresponding to the token includes: selecting first task information meeting the following conditions in the task pool according to the domain name information corresponding to the token: matching the domain name corresponding to the task information with the domain name information of the token; and the status of the task information is schedulable; and the sum of the last scheduling time of the task and the grabbing frequency of the task is less than the current time.
Further, the adding the first task information to a processing queue includes: and sequencing the first task information meeting the second preset condition according to the priority and/or the hierarchy, and preferentially adding the task information meeting the high priority and/or the low hierarchy into a processing queue.
Specifically, a token is extracted from the scheduling queue and, according to the domain name information carried in the token, one piece of task information is selected from the task pool for scheduling. The selected task information (the first task information) satisfies the following three conditions simultaneously: the domain name corresponding to the task information matches the domain name information of the token; the status of the task information is schedulable; and the sum of the last scheduling time of the task and the grabbing frequency of the task is less than the current time. Further, the first task information obtained by this screening is sorted according to the priority information and/or hierarchy information it carries, for example from high to low by priority and from low to high by hierarchy, and the task with the highest priority and/or the lowest hierarchy is added to the processing queue first; specifically, the identifier of the selected task (for example, the task ID) and the proxy IP carried in the token corresponding to the task information are added to the processing queue. The hierarchy information can be understood as follows: a task that captures the home page of a website may be recorded as level 1; a task that captures a sub-page of the home page is recorded as level 2; and so on.
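The selection step can be sketched in Python as a filter on the three conditions followed by an ordering. The `Task` fields, the `"schedulable"` status label, and the choice to compare priority first and hierarchy second are assumptions; the text allows priority and/or hierarchy without fixing their order.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Task:
    task_id: str
    domain: str
    status: str            # "schedulable" is a hypothetical label
    last_scheduled: float  # last scheduling time, in seconds
    frequency: float       # grabbing frequency, in seconds
    priority: int = 0      # larger value means higher priority
    level: int = 1         # 1 = home page, 2 = sub-page, and so on


def select_task(tasks: Iterable[Task], token_domain: str, now: float) -> Optional[Task]:
    """Pick the task to schedule for a token, or None if none qualifies."""
    candidates = [
        t for t in tasks
        if t.domain == token_domain                # domain name matches the token
        and t.status == "schedulable"              # status is schedulable
        and t.last_scheduled + t.frequency < now   # the task is due
    ]
    if not candidates:
        return None
    # Highest priority first, then lowest hierarchy (assumed tie-break order).
    return min(candidates, key=lambda t: (-t.priority, t.level))
```

The identifier of the returned task, together with the proxy IP from the token, is what would then be placed on the processing queue.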
Step 206: extracting the first task information from the processing queue, and capturing the corresponding data according to the first task information.
In this embodiment, specifically, a task identifier (for example, a task ID) and a proxy IP are extracted from the processing queue and, according to the task identifier, the address information (for example, a URL), the callback function, the source address information (referer) of the address information, the user agent information (user agent), the cookie, and other information associated with the task identifier are obtained from the task pool. The referer identifies the address of the previous page from which the address information was reached, and can be understood as the source address information of the address information; the user agent information may specifically be browser information, including the hardware platform, system software, and application software. Further, the corresponding web page data is captured according to this information, the captured web page data is parsed with the callback function, the parsed specific content (such as news and articles) is stored in a result pool, and any new tasks obtained by parsing are stored in the task pool. When the processed task is a detail page task, specific content is obtained after data capture, and the state of the task is updated to processed, so that the task is not scheduled again; when the processed task is a list page task, new tasks are obtained after data capture, and the state of the task is updated to waiting for scheduling, so that the task is scheduled again when its time arrives. A list page task is a task type for which the web page obtained by data capture is a navigation page; a detail page task is a task type for which the web page obtained is a page of specific content (such as news or articles). Further, the processing time of the task is updated to the current time.
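The processing step can be sketched as below. The dict layout of a task, the status labels, and the convention that a callback returns a pair (parsed contents, new tasks) are all assumptions made for illustration; the fetch function is injected so that the sketch performs no network access.

```python
from typing import Callable, Dict, List


def process_task(
    task: dict,
    fetch: Callable[[dict], str],
    task_pool: Dict[str, dict],
    result_pool: List[dict],
    now: float,
) -> None:
    """Fetch one task, parse it with its callback, and update its state."""
    page = fetch(task)                             # capture the web page data
    contents, new_tasks = task["callback"](page)   # assumed callback contract
    result_pool.extend(contents)                   # parsed content -> result pool
    for new in new_tasks:                          # parsed new tasks -> task pool
        task_pool.setdefault(new["task_id"], new)
    if task["type"] == "list":
        task["status"] = "waiting"                 # list page: scheduled again later
    else:
        task["status"] = "processed"               # detail page: never rescheduled
    task["processed_at"] = now                     # processing time = current time
```

A list page task therefore keeps feeding new detail page tasks into the pool on each pass, while each detail page task is fetched once.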
By adopting the technical solution of this embodiment of the invention, the capture rate corresponding to the domain name information is determined according to the number of schedulable pieces of task information in the token pool, so that the capture rate is adjusted automatically; this effectively prevents the proxy IP from being blocked, improves data capture efficiency, and reduces the labor cost of manually configuring the capture rate.
EXAMPLE six
Based on the second embodiment, the embodiment of the invention also provides an information processing method. FIG. 8 is a flowchart illustrating a second information processing method according to an embodiment of the present invention; as shown in fig. 8, the information processing method includes:
Step 301: configuring a grabbing project, generating corresponding project information based on the configured grabbing project, and sending the project information to a project pool for storage; the project information comprises a project script program and the state of the project; the state of the project is either installed or uninstalled.
Step 302: scanning the project pool for project information in the uninstalled state, and detecting whether each task in the project information is in the task pool; when a task is not in the task pool, adding task information of the task to the task pool; the task information includes: the address information of the task, the task identifier and the initial grabbing frequency.
Step 303: extracting domain name information from the task information, generating a token according to the domain name information, and adding the token to a token pool; the token comprises: domain name information, initial capture rate, last token generation time, domain name IP, and proxy IP.
Step 304: determining the capture rate corresponding to the domain name information according to the number of schedulable pieces of task information under the domain name information in the token pool.
Step 305: scanning the token pool, determining the number of tokens corresponding to the domain name information according to the capture rate, and, when the number of tokens meets a preset condition, sending the corresponding number of tokens to a scheduling queue.
Step 306: extracting second task information from the task pool; counting the number of subtasks within a preset time range based on the second task information, and determining the grabbing frequency corresponding to the second task information based on the number of subtasks; the second task information represents a list page task.
Step 307: obtaining a token from the scheduling queue, selecting, in the task pool, first task information that meets a second preset condition according to the domain name information corresponding to the token, and adding the first task information to a processing queue.
Step 308: extracting the first task information from the processing queue, and capturing the corresponding data according to the first task information.
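The token accounting of step 305 can be sketched as follows, using the expression N=rate×(nowtime-lasttime)+remainder and the N ≥ 1 condition stated in the claims. This is a minimal sketch rather than the implementation of the embodiment: the `Token` class and the `generate_tokens` function are hypothetical names, the scheduling queue is modeled as an in-memory `deque`, and refreshing `lasttime` on every scan is an assumption.

```python
import math
from collections import deque
from dataclasses import dataclass


@dataclass
class Token:
    domain: str
    rate: float             # capture rate, in tokens per second
    lasttime: float         # last token generation time, in seconds
    remainder: float = 0.0  # tokens left over from the previous generation


def generate_tokens(token, nowtime, scheduling_queue):
    """Apply N = rate * (nowtime - lasttime) + remainder for one domain."""
    n = token.rate * (nowtime - token.lasttime) + token.remainder
    # When N >= 1 the integer part is sent; when N < 1 nothing is sent.
    whole = math.floor(n) if n >= 1 else 0
    for _ in range(whole):
        scheduling_queue.append(token.domain)
    token.remainder = n - whole   # fractional part, or all of N when N < 1
    token.lasttime = nowtime      # assumption: refreshed on every scan
    return whole


queue = deque()
tok = Token(domain="example.com", rate=0.5, lasttime=0.0)
first = generate_tokens(tok, 1.0, queue)    # N = 0.5, so nothing is sent
second = generate_tokens(tok, 4.0, queue)   # N = 0.5 * 3 + 0.5 = 2.0
```

Carrying the fractional part forward in `remainder` means that low rates still produce tokens over time instead of being rounded down to zero on every scan.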
The contents of this embodiment that are the same as in the preceding embodiment are not described again here; reference may be made to the corresponding descriptions in that embodiment.
The difference from the preceding embodiment is that, in step 306 of this embodiment, second task information is extracted from the task pool, where the second task information represents a list page task; a list page task is a task type for which the webpage obtained by data capture is a navigation page; that is, the second task information includes a plurality of subtasks. The frequency controller counts the number of subtasks within a preset time range based on the second task information; if the number of subtasks is n, the capture frequency age corresponding to the second task information, as determined by the frequency controller, satisfies the following formula (9):
wherein n represents the number of subtasks within a preset time range counted based on the second task information; T1, T2 and T3 are all positive integers. The unit of the obtained capture frequency age is seconds; it represents the time interval at which the scheduler selects, in the task pool, the first task information meeting the second preset condition according to the domain name information corresponding to the token.
Preferably, T1 is 86400, T2 is 7200, and T3 is 60; formula (9) can then be expressed as formula (10):
Of course, formula (10) is only one example of the grabbing frequency; T1, T2 and T3 may take other values, and this embodiment is not particularly limited in this respect.
In addition, the technical solution described in step 306 of this embodiment may be executed at any point after step 302, that is, after the task information has been added to the task pool; this embodiment does not otherwise limit when step 306 is executed.
Further, in step 307, a token is obtained from the scheduling queue, and first task information meeting a second preset condition is selected from the task pool according to the domain name information corresponding to the token. The second preset condition is that: the domain name corresponding to the task information matches the domain name information of the token; the state of the task information is schedulable; and the sum of the last scheduling time of the task and the grabbing frequency of the task is less than the current time.
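The three parts of the second preset condition can be expressed as a simple filter. The sketch below is illustrative only: `TaskInfo`, its field names, and `select_tasks` are hypothetical, and the task pool is modeled as a plain list rather than whatever storage the server actually uses.

```python
from dataclasses import dataclass


@dataclass
class TaskInfo:
    task_id: str
    domain: str
    schedulable: bool
    last_scheduled: float   # last scheduling time, in seconds
    age: float              # grabbing frequency: interval between runs, in seconds


def select_tasks(task_pool, token_domain, now):
    """Return the tasks that satisfy the second preset condition."""
    return [
        t for t in task_pool
        if t.domain == token_domain            # domain matches the token
        and t.schedulable                      # state is schedulable
        and t.last_scheduled + t.age < now     # interval has elapsed
    ]


pool = [
    TaskInfo("a", "example.com", True, last_scheduled=0.0, age=60.0),
    TaskInfo("b", "example.com", True, last_scheduled=90.0, age=60.0),   # interval not elapsed
    TaskInfo("c", "example.com", False, last_scheduled=0.0, age=60.0),   # not schedulable
    TaskInfo("d", "example.org", True, last_scheduled=0.0, age=60.0),    # other domain
]
selected = select_tasks(pool, "example.com", now=100.0)
```

Only task `a` passes all three checks here; each of the other tasks fails exactly one of them.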
By adopting the technical solution of this embodiment of the invention, on the one hand, the capture rate corresponding to the domain name information is determined according to the number of schedulable pieces of task information in the token pool, so that the capture rate is adjusted automatically; this effectively prevents the proxy IP from being blocked, improves data capture efficiency, and reduces the labor cost of manually configuring the capture rate. On the other hand, the grabbing frequency is determined by counting the number of subtasks within the preset time range in the task pool, so that the grabbing frequency of list page tasks is adjusted automatically and accurately, which reduces the delay between the publication of webpage data and its capture.
EXAMPLE seven
Based on the third embodiment, the embodiment of the invention also provides an information processing method. FIG. 9 is a flowchart illustrating a third information processing method according to an embodiment of the present invention; as shown in fig. 9, the information processing method includes:
Step 401: configuring a grabbing project, generating corresponding project information based on the configured grabbing project, and sending the project information to a project pool for storage; the project information comprises a project script program and the state of the project; the state of the project is either installed or uninstalled.
Step 402: scanning the project pool for project information in the uninstalled state, and detecting whether each task in the project information is in the task pool; when a task is not in the task pool, adding task information of the task to the task pool; the task information includes: the address information of the task, the task identifier and the initial grabbing frequency.
Step 403: extracting domain name information from the task information, generating a token according to the domain name information, and adding the token to a token pool; the token comprises: domain name information, initial capture rate, last token generation time, domain name IP, and proxy IP.
Step 404: determining the capture rate corresponding to the domain name information according to the number of schedulable pieces of task information under the domain name information in the token pool.
Step 405: scanning the token pool, determining the number of tokens corresponding to the domain name information according to the capture rate, and, when the number of tokens meets a preset condition, sending the corresponding number of tokens to a scheduling queue.
Step 406: extracting second task information from the task pool; counting the number of subtasks within a preset time range based on the second task information, and determining the grabbing frequency corresponding to the second task information based on the number of subtasks; the second task information represents a list page task.
Step 407: obtaining a token from the scheduling queue, selecting, in the task pool, first task information that meets a second preset condition according to the domain name information corresponding to the token, and adding the first task information to a processing queue.
Step 408: extracting the first task information from the processing queue, and capturing the corresponding data according to the first task information.
Step 409: resolving the domain name information in the token pool according to a preset period to obtain a first domain name IP corresponding to the domain name information; comparing the first domain name IP obtained by resolution with the domain name IPs in the token pool, and processing the domain name IPs in the token pool according to the comparison result.
The contents of this embodiment that are the same as in the preceding embodiment are not described again here; reference may be made to the corresponding descriptions in that embodiment.
The difference from the preceding embodiment is that, in step 409 of this embodiment, DNS resolution is periodically performed on the domain name information in the token pool to obtain the domain name IP corresponding to the domain name information (denoted as the first domain name IP), and the first domain name IP obtained by resolution is compared with the domain name IPs in the token pool. This determines whether a domain name IP in the token pool has become invalid (an invalid domain name IP is deleted), so that tasks corresponding to an invalid domain name IP are not scheduled again, or whether a domain name IP has been deleted from the token pool by mistake (a mistakenly deleted domain name IP is added back), and so on. Specifically, when the first domain name IP is in the token pool, the resolution time of the corresponding domain name IP in the token pool is updated; when the first domain name IP is not in the token pool, the first domain name IP is added to the token pool; and when a second domain name IP in the token pool is not among the first domain name IPs obtained by resolution, the second domain name IP is deleted from the token pool.
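The three comparison cases of step 409 can be sketched as a reconciliation between the stored IPs and a fresh resolution result. This is a minimal sketch under stated assumptions: `reconcile_domain_ips` is a hypothetical helper, the token pool entry for one domain is modeled as a dictionary mapping each IP to its last resolution time, and the resolved IP list is passed in directly rather than obtained from a resolver (in Python it could come from a call such as `socket.gethostbyname_ex`).

```python
def reconcile_domain_ips(pool_ips, resolved_ips, now):
    """Reconcile the stored domain name IPs with a fresh DNS result.

    pool_ips maps each IP of one domain to its last resolution time.
    """
    resolved = set(resolved_ips)
    for ip in resolved:
        # Already in the pool: refresh the resolution time.
        # Not yet in the pool: add it with the current time.
        pool_ips[ip] = now
    for ip in list(pool_ips):
        if ip not in resolved:
            # Stored IP no longer returned by resolution: delete it.
            del pool_ips[ip]
    return pool_ips


ips = {"192.0.2.1": 10.0, "192.0.2.2": 10.0}
reconcile_domain_ips(ips, ["192.0.2.2", "192.0.2.3"], now=50.0)
```

After the call, `192.0.2.1` has been removed as stale, `192.0.2.2` has a refreshed resolution time, and `192.0.2.3` has been added.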
In addition, the technical solution described in step 409 of this embodiment may be executed at any point after step 403, that is, after the token has been added to the token pool; this embodiment does not otherwise limit when step 409 is executed.
By adopting the technical solution of this embodiment of the invention, in the first aspect, the capture rate corresponding to the domain name information is determined according to the number of schedulable pieces of task information in the token pool, so that the capture rate is adjusted automatically; this effectively prevents the proxy IP from being blocked, improves data capture efficiency, and reduces the labor cost of manually configuring the capture rate. In the second aspect, the grabbing frequency is determined by counting the number of subtasks within the preset time range in the task pool, so that the grabbing frequency of list page tasks is adjusted automatically and accurately, which reduces the delay between the publication of webpage data and its capture. In the third aspect, the domain name IPs corresponding to the domain name information are kept up to date in the token pool, which avoids the waste of system resources caused by repeatedly resolving the same domain name, increases the processing speed of the system, and increases the amount of webpage data captured.
As another embodiment, there are a plurality of scheduling queues; sending the corresponding number of tokens to a scheduling queue includes: processing the domain name information according to a preset processing mode; and sending the token corresponding to the domain name information to a first scheduling queue corresponding to the processing result; the first scheduling queue is one of the plurality of scheduling queues.
Specifically, the domain name information is processed according to the preset processing mode (which may specifically be the cyclic redundancy check (CRC32) algorithm) to obtain the first scheduling queue corresponding to the processing result. It can be understood that tokens corresponding to the same domain name information are always sent to the same scheduling queue, which avoids contention caused by different schedulers subsequently operating on tokens of the same domain name information at the same time.
On this basis, this embodiment adopts a multi-scheduler architecture (corresponding to the multiple scheduling queues); when the task volume is large, additional schedulers can be configured, so that the architecture scales well and the load capacity of the server is greatly improved.
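The mapping from a domain name to a scheduling queue can be sketched with Python's `zlib.crc32`. The text names CRC32 as a possible processing mode but does not say how the checksum is turned into a queue; taking it modulo the number of queues is an assumption of this sketch, and `queue_index` is a hypothetical helper name.

```python
import zlib


def queue_index(domain, num_queues):
    """Map a domain name to one of num_queues scheduling queues via CRC32.

    Reducing the checksum modulo the queue count is an assumption.
    """
    return zlib.crc32(domain.encode("utf-8")) % num_queues


queues = [[] for _ in range(4)]
for domain in ["example.com", "example.org", "example.com"]:
    queues[queue_index(domain, len(queues))].append(domain)
```

Because the checksum depends only on the domain name, both tokens for `example.com` land in the same queue, so only one scheduler handles that domain. Note that with a plain modulo, changing the number of queues remaps most domains.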
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: a mobile storage device, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Alternatively, the integrated unit of the present invention may be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. Based on such understanding, the technical solutions of the embodiments of the present invention may be essentially implemented or a part contributing to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present invention. And the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.
Claims (21)
1. An information processing method, characterized in that the method comprises:
adding task information to be subjected to data capture into a task pool; extracting domain name information of the task information, generating a token according to the domain name information, and adding the token into a token pool;
determining a capture rate corresponding to the domain name information according to the schedulable task information quantity in the token pool under the domain name information, wherein the capture rate corresponding to the domain name information satisfies the following expression:
wherein n represents the number of task information; x and Y are both positive integers;
scanning the token pool, and determining the schedulable token number corresponding to the domain name information according to the capturing rate;
obtaining the number of tokens left when the domain name information corresponds to the token generation, and determining the number of tokens N corresponding to the domain name information by combining the schedulable token number corresponding to the domain name information and the number of tokens left when the tokens are generated;
the token number N corresponding to the domain name information satisfies the following expression:
N=rate×(nowtime-lasttime)+remainder;
wherein, rate represents the capture rate corresponding to the domain name information, lasttime represents the last token generation time, nowtime represents the current time, and remainder represents the number of tokens left when the tokens are generated;
when the number N of tokens corresponding to the domain name information meets a preset condition, sending the tokens with the corresponding number to a scheduling queue;
obtaining a token from the scheduling queue, selecting first task information meeting a second preset condition in the task pool according to domain name information corresponding to the token, and adding the first task information to a processing queue;
and extracting the first task information from the processing queue, and capturing corresponding data according to the first task information.
2. The method according to claim 1, wherein the determining a capture rate corresponding to the domain name information according to the number of schedulable task information under the domain name information in the token pool comprises:
according to the domain name information of the tokens in the token pool, counting the quantity of task information which corresponds to the domain name information and meets a third preset condition in the task pool; wherein the third preset condition comprises: the state of the task information is schedulable, and the sum of the last scheduling time of the task and the grabbing frequency time interval of the task is smaller than the current time.
3. The method according to claim 1, wherein after adding task information to be subjected to data grabbing to a task pool, the method further comprises:
extracting second task information from the task pool; the second task information represents a list page task;
counting the number of subtasks within a preset time range based on the second task information, and determining the grabbing frequency corresponding to the second task information based on the number of the subtasks;
updating an initial grabbing frequency based on the grabbing frequency.
5. The method according to claim 1, wherein when the number of tokens satisfies a preset condition, sending a corresponding number of tokens to a scheduling queue comprises:
when the number N of the tokens is more than or equal to 1, obtaining an integer part of the number N of the tokens, and sending the tokens meeting the number of the integer part to a scheduling queue; assigning a fractional part of the token number N to the remainder;
and when the token number N is less than 1, directly assigning the token number N to the remainder.
6. The method of claim 1, wherein the number of scheduling queues is plural; sending the corresponding number of tokens to a scheduling queue includes:
processing the domain name information according to a preset processing mode; sending tokens with the number corresponding to the domain name information to a first scheduling queue corresponding to a processing result; the first scheduling queue is one of a plurality of scheduling queues.
7. The method of claim 1, wherein the token comprises: domain name information and a domain name Internet Protocol (IP); the method further comprises: resolving the domain name information in the token pool according to a preset period to obtain a first domain name IP corresponding to the domain name information;
comparing the first domain name IP obtained by resolution with the domain name IPs in the token pool;
when the first domain name IP is in the token pool, updating the resolution time of the domain name IP corresponding to the first domain name IP in the token pool;
when the first domain name IP is not in the token pool, adding the first domain name IP to the token pool;
and when the second domain name IP in the token pool is not in the first domain name IP obtained by resolution, deleting the second domain name IP from the token pool.
8. The method according to claim 1, wherein the selecting, in the task pool, first task information that satisfies a second preset condition according to domain name information corresponding to the token includes:
selecting first task information meeting the following conditions in the task pool according to the domain name information corresponding to the token:
matching the domain name corresponding to the task information with the domain name information of the token;
and the status of the task information is schedulable;
and the sum of the last scheduling time of the task and the grabbing frequency time interval of the task is less than the current time.
9. The method of claim 8, wherein adding the first task information to a processing queue comprises:
sorting the first task information meeting the second preset condition according to priority and/or hierarchy, and preferentially adding task information with a high priority and/or a low hierarchy to the processing queue.
10. The method according to claim 1, wherein the adding task information to be subjected to data grabbing to a task pool comprises:
configuring a grabbing project, generating corresponding project information based on the configured grabbing project, and sending the project information to a project pool for storage; the project information comprises a project script program and the state of a project; the state of the project comprises an installed state and an uninstalled state;
scanning project information in an uninstalled state in the project pool, and detecting whether a task in the project information is in the task pool;
and when the task is not in the task pool, adding the task information of the task into the task pool.
11. A server, characterized in that the server comprises: a project starter, a token pool, a task pool, a rate controller, a token generator, a scheduling queue, a scheduler, a processing queue and a processor; wherein,
the project starter is used for adding task information to be subjected to data capture to a task pool; extracting domain name information of the task information, generating a token according to the domain name information, and adding the token into a token pool;
the task pool is used for storing task information;
the token pool is used for storing tokens;
the rate controller is configured to determine a capture rate corresponding to the domain name information according to the number of schedulable task information in the token pool, where the capture rate corresponding to the domain name information satisfies the following expression:
wherein n represents the number of task information; x and Y are both positive integers;
the token generator is configured to scan the token pool and determine the schedulable token number corresponding to the domain name information according to the capture rate determined by the rate controller;
obtaining the number of tokens left when the domain name information corresponds to the token generation, and determining the number of tokens N corresponding to the domain name information by combining the schedulable token number corresponding to the domain name information and the number of tokens left when the tokens are generated;
the token number N corresponding to the domain name information satisfies the following expression:
N=rate×(nowtime-lasttime)+remainder;
wherein, rate represents the capture rate corresponding to the domain name information, lasttime represents the last token generation time, nowtime represents the current time, and remainder represents the number of tokens left when the tokens are generated;
when the number N of tokens corresponding to the domain name information meets a preset condition, sending the tokens with the corresponding number to a scheduling queue;
the scheduler is used for obtaining a token from the scheduling queue, selecting first task information meeting a second preset condition in the task pool according to domain name information corresponding to the token, and adding the first task information to a processing queue;
the processor is used for extracting the first task information from the processing queue and executing the capture of corresponding data according to the first task information.
12. The server according to claim 11, wherein the rate controller is configured to count, according to domain name information of tokens in the token pool, a number of pieces of task information that correspond to the domain name information and satisfy a third preset condition in the task pool; wherein the third preset condition comprises: the state of the task information is schedulable, and the sum of the last scheduling time of the task and the grabbing frequency time interval of the task is smaller than the current time.
13. The server according to claim 11, wherein the server further comprises a frequency controller for extracting second task information from the task pool; the second task information represents a list page task; counting the number of subtasks within a preset time range based on the second task information, and determining the grabbing frequency corresponding to the second task information based on the number of the subtasks; updating an initial grabbing frequency based on the grabbing frequency.
14. The server according to claim 13, wherein the capture frequency age corresponding to the second task information determined by the frequency controller satisfies the following expression:
wherein n represents the number of subtasks within a preset time range counted based on the second task information; T1, T2 and T3 are all positive integers.
15. The server according to claim 11, wherein the token generator is configured to, when the number N of tokens is greater than or equal to 1, obtain an integer part of the number N of tokens, and send tokens satisfying the number of the integer part to a scheduling queue; assigning a fractional part of the token number N to the remainder; and when the token number N is less than 1, directly assigning the token number N to the remainder.
16. The server according to claim 11, wherein the number of the scheduling queue is plural; the number of the schedulers is multiple; the plurality of schedulers correspond to the plurality of scheduling queues one to one;
the token generator is used for processing the domain name information according to a preset processing mode; sending the token corresponding to the domain name information to a first scheduling queue corresponding to a processing result; the first scheduling queue is one of a plurality of scheduling queues.
17. The server of claim 11, wherein the token comprises: domain name information and a domain name Internet Protocol (IP); the server further comprises a domain name resolver, which is used for resolving the domain name information in the token pool according to a preset period to obtain a first domain name IP corresponding to the domain name information; comparing the first domain name IP obtained by resolution with the domain name IPs in the token pool; when the first domain name IP is in the token pool, updating the resolution time of the domain name IP corresponding to the first domain name IP in the token pool; when the first domain name IP is not in the token pool, adding the first domain name IP to the token pool; and when the second domain name IP in the token pool is not in the first domain name IP obtained by resolution, deleting the second domain name IP from the token pool.
18. The server according to claim 11, wherein the scheduler is configured to select, in the task pool, first task information that satisfies the following condition according to domain name information corresponding to the token: matching the domain name corresponding to the task information with the domain name information of the token; and the status of the task information is schedulable; and the sum of the last scheduling time of the task and the grabbing frequency time interval of the task is less than the current time.
19. The server according to claim 18, wherein the scheduler is configured to sort the first task information satisfying the second preset condition according to priority and/or hierarchy, and to add task information satisfying a high priority and/or a low hierarchy to the processing queue preferentially.
20. The server according to claim 11, wherein the server further comprises a configuration unit and a project pool; wherein,
the configuration unit is used for configuring a grabbing project, generating corresponding project information based on the configured grabbing project, and sending the project information to the project pool;
the project pool is used for storing project information; the project information comprises a project script program and the state of a project; the state of the project comprises an installed state and an uninstalled state;
the project starter is used for scanning project information in an uninstalled state in the project pool and detecting whether a task in the project information is in the task pool; and when the task is not in the task pool, adding the task information of the task into the task pool.
21. A computer-readable storage medium, characterized in that it stores program instructions which, when executed, enable the method according to any one of claims 1 to 10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610031134.9A CN106982268B (en) | 2016-01-18 | 2016-01-18 | Information processing method and server |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610031134.9A CN106982268B (en) | 2016-01-18 | 2016-01-18 | Information processing method and server |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106982268A (en) | 2017-07-25 |
CN106982268B (en) | 2020-09-11 |
Family
ID=59340051
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610031134.9A Active CN106982268B (en) | 2016-01-18 | 2016-01-18 | Information processing method and server |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106982268B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110096344B (en) * | 2018-01-29 | 2024-09-20 | 北京京东尚科信息技术有限公司 | Task management method, system, server cluster and computer readable medium |
CN110058941A (en) * | 2019-03-16 | 2019-07-26 | 平安城市建设科技(深圳)有限公司 | Task scheduling and managing method, device, equipment and storage medium |
CN114595457B (en) * | 2020-12-04 | 2024-12-31 | 腾讯科技(深圳)有限公司 | Task processing method, device, computer equipment and storage medium |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7552109B2 (en) * | 2003-10-15 | 2009-06-23 | International Business Machines Corporation | System, method, and service for collaborative focused crawling of documents on a network |
US7701944B2 (en) * | 2007-01-19 | 2010-04-20 | International Business Machines Corporation | System and method for crawl policy management utilizing IP address and IP address range |
CN102377641A (en) * | 2010-08-11 | 2012-03-14 | 高通创锐讯通讯科技(上海)有限公司 | Realization method for token bucket algorithm |
US8681630B1 (en) * | 2010-09-21 | 2014-03-25 | Google Inc. | Configurable rate limiting using static token buckets, and applications thereof |
CN103092999B (en) * | 2013-02-22 | 2016-06-29 | 人民搜索网络股份公司 | A kind of webpage capture period modulation method and apparatus |
CN104967698B (en) * | 2015-02-13 | 2018-11-23 | 腾讯科技(深圳)有限公司 | A kind of method and apparatus crawling network data |
Also Published As
Publication number | Publication date |
---|---|
CN106982268A (en) | 2017-07-25 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||