CN104252530A

CN104252530A - Single-computer crawler grabbing method and system

Info

Publication number: CN104252530A
Application number: CN201410458191.6A
Authority: CN
Inventors: 廖耀华
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Priority date: 2014-09-10
Filing date: 2014-09-10
Publication date: 2014-12-31
Anticipated expiration: 2034-09-10
Also published as: CN104252530B

Abstract

The invention discloses a single-computer crawler grabbing method and system. The method includes: acquiring at least one seed comprising a URL (uniform resource locator), a website number and a type, taking the URL of the seed as the current URL, the website number of the seed as the current website number, and the type of the seed as the current type; acquiring at least one strategy, and determining at least one crawler grabbing parameter according to the strategy; acquiring the rule corresponding to the current type; and grabbing web page data from the current URL according to the crawler grabbing parameter, and parsing the web page data according to the rule to obtain parsed data. Because the crawler grabbing parameters are determined by strategies, problems that arise during grabbing can be overcome promptly, so that working efficiency is improved, the time for which the crawler can keep grabbing is extended, and the method and system suit websites of various types.

Description

A single-computer crawler grabbing method and system

Technical field

The present invention relates to web crawler technology, and in particular to a single-computer crawler grabbing method and system.

Background art

The Internet holds massive amounts of data and information. Converting this data and information into what one actually wants, and then processing and analyzing it, is a rather thorny matter. The emergence of web crawlers solves these problems.

Most current crawlers merely implement the basic function of crawling web pages, and handle poorly such aspects as repeated crawling, falling into infinite-loop traps, and formulating strategies against anti-crawler measures (to extend the grabbing time). In addition, current single-computer crawlers have poor compatibility and cannot meet the need to grab multiple websites at the same time.

Summary of the invention

In view of this, and addressing the technical problems that existing single-computer web crawler grabbing mechanisms have low working efficiency, a short grabbing time, and cannot grab multiple types of websites at the same time, it is necessary to provide a single-computer crawler grabbing method and system.

A single-computer crawler grabbing method, comprising:

acquiring at least one seed comprising a URL, a website number and a type, taking the URL of the seed as the current URL, the website number of the seed as the current website number, and the type of the seed as the current type;

acquiring at least one strategy, and determining at least one crawler grabbing parameter according to the strategy;

acquiring, according to the current type, the rule corresponding to the current type;

grabbing web page data from the current URL according to the crawler grabbing parameter, and parsing the web page data according to the rule to obtain parsed data.

A single-computer crawler grabbing system, comprising:

a seed receiving module, for acquiring at least one seed comprising a URL, a website number and a type, taking the URL of the seed as the current URL, the website number of the seed as the current website number, and the type of the seed as the current type;

a strategy module, for acquiring at least one strategy and determining at least one crawler grabbing parameter according to the strategy;

a rule module, for acquiring, according to the current type, the rule corresponding to the current type;

a parsing module, for grabbing web page data from the current URL according to the crawler grabbing parameter, and parsing the web page data according to the rule to obtain parsed data.

The present invention determines the crawler grabbing parameters by means of strategies, so that problems arising during grabbing are overcome promptly, thereby improving working efficiency, extending the grabbing time, and adapting to multiple types of websites.

Brief description of the drawings

Fig. 1 is a workflow diagram of a single-computer crawler grabbing method of the present invention;

Fig. 2 is a structural module diagram of a single-computer crawler grabbing system of the present invention;

Fig. 3 is a structural module diagram of a preferred embodiment of a single-computer crawler grabbing system of the present invention.

Detailed description of the embodiments

The present invention is described in further detail below with reference to the drawings and specific embodiments.

Fig. 1 shows the workflow diagram of a single-computer crawler grabbing method of the present invention, comprising:

Step 11, acquiring at least one seed comprising a URL, a website number and a type, taking the URL of the seed as the current URL, the website number of the seed as the current website number, and the type of the seed as the current type;

Step 12, acquiring at least one strategy, and determining at least one crawler grabbing parameter according to the strategy;

Step 13, acquiring, according to the current type, the rule corresponding to the current type;

Step 14, grabbing web page data from the current URL according to the crawler grabbing parameter, and parsing the web page data according to the rule to obtain parsed data.

The strategy in step 12 is used to determine the crawler grabbing parameters: different strategies yield different crawler grabbing parameters, and step 14 grabs web page data using the parameters determined in step 12. Because the crawler grabbing parameters are determined by the strategies of step 12, different strategies can be set to meet different grabbing needs, thereby improving working efficiency, extending the grabbing time, and adapting to multiple types of websites.
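The flow of steps 11 to 14 can be sketched roughly as follows. This is only an illustrative outline, not the patented implementation: the names `Seed`, `crawl_one`, `fixed_user_agent` and `fake_fetch` are hypothetical, a strategy is assumed to be a function that returns a dictionary of grabbing parameters, and the network fetch is replaced by a stand-in function so that the example runs offline.

```python
from dataclasses import dataclass


@dataclass
class Seed:
    url: str       # page to grab
    site_id: int   # website number
    type: str      # e.g. "home", "paging" or "detail"


def crawl_one(seed, strategies, rules, fetch):
    # Step 11: the seed supplies the current URL and current type
    # (the website number is carried on the seed but unused in this sketch).
    current_url, current_type = seed.url, seed.type
    # Step 12: every strategy contributes crawler grabbing parameters.
    params = {}
    for strategy in strategies:
        params.update(strategy(seed))
    # Step 13: look up the rule that corresponds to the current type.
    rule = rules[current_type]
    # Step 14: grab the page with those parameters, then parse it with the rule.
    page = fetch(current_url, params)
    return rule(page)


# Hypothetical strategy, fetcher and rule, for illustration only.
def fixed_user_agent(seed):
    return {"user_agent": "ExampleBrowser/1.0"}


def fake_fetch(url, params):
    return "<title>" + url + " via " + params["user_agent"] + "</title>"


rules = {"detail": lambda page: {"length": len(page)}}
seed = Seed("http://example.com/item/1", 1, "detail")
result = crawl_one(seed, [fixed_user_agent], rules, fake_fetch)
```

Keeping the strategies as a plain list means a new strategy can be added without touching the fetch or parse code, which is one way to read the claim that different strategies serve different grabbing needs.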

In one embodiment, in step 14, if an exception occurs while grabbing the web page data or while parsing the web page data, the exception is saved.

By monitoring exceptions that occur while grabbing or parsing the web page data, exceptions can be fed back to the user promptly, preventing wasted resources.

In one embodiment, the strategy comprises: a seed infinite-loop handling strategy, a browser identifier switching strategy, a cookie dynamic update strategy and/or a proxy IP switching strategy.

Among the strategies of this embodiment, the seed infinite-loop handling strategy prevents repeated crawling and falling into infinite-loop traps, while the browser identifier switching strategy, the cookie dynamic update strategy and/or the proxy IP switching strategy extend the grabbing time.

In one embodiment:

The seed infinite-loop handling strategy is specifically: the crawler grabbing parameter is whether to allow or refuse grabbing web page data from the current URL; if the exception is that the current URL has fallen into an infinite loop, the crawler grabbing parameter is set to refuse grabbing web page data from the current URL; otherwise it is set to allow grabbing web page data from the current URL;

The browser identifier switching strategy is specifically: the crawler grabbing parameter is the browser identifier used when grabbing web page data from the current URL; if the exception is a failure to grab web page data from the current URL, or a preset time is reached, the browser identifier used when grabbing web page data from the current URL is updated to another browser identifier; otherwise the browser identifier is not updated;

The cookie dynamic update strategy is specifically: the crawler grabbing parameter is whether to allow or refuse updating the cookie when grabbing web page data from the current URL; if a preset time is reached, the crawler grabbing parameter is set to allow updating the cookie; otherwise it is set to refuse updating the cookie;

The proxy IP switching strategy is specifically: the crawler grabbing parameter is whether to allow or refuse updating the proxy IP when grabbing web page data from the current URL; if the exception is a failure to grab web page data from the current URL, or a preset time is reached, the crawler grabbing parameter is set to allow updating the proxy IP; otherwise it is set to refuse updating the proxy IP.

This embodiment further describes the seed infinite-loop handling strategy, the browser identifier switching strategy, the cookie dynamic update strategy and the proxy IP switching strategy. The seed infinite-loop handling strategy, the browser identifier switching strategy and the proxy IP switching strategy adjust the crawler grabbing parameters according to exceptions (the latter two may also be triggered when a preset time is reached), whereas the cookie dynamic update strategy adjusts the crawler grabbing parameters purely by timed updates.

Specifically, the seed infinite-loop handling strategy mainly addresses infinite-loop traps on websites. After the crawler grabs web page data from a URL, it extracts new URLs from that data and then grabs new web page data from those URLs. Some websites, however, set infinite-loop traps: the new URL extracted from the web page data is a URL that already exists, so the crawler falls into an infinite loop and grabbing is disrupted. Under the seed infinite-loop handling strategy, when the exception that the current URL has fallen into an infinite loop is detected, grabbing web page data from the current URL is refused, which avoids the infinite loop.
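A minimal sketch of such an allow/refuse decision is given below. The text does not say how the infinite loop is detected, so this example assumes the simplest possible test, namely that a URL has already been grabbed a configurable number of times; the class name `LoopTrapStrategy` and the `max_visits` parameter are hypothetical.

```python
class LoopTrapStrategy:
    """Decides whether grabbing from a URL is allowed or refused."""

    def __init__(self, max_visits=1):
        self.max_visits = max_visits  # assumed threshold, not from the source
        self.visits = {}              # URL -> number of times it was allowed

    def allow(self, url):
        count = self.visits.get(url, 0)
        if count >= self.max_visits:
            # The URL keeps coming back: treat it as a loop trap and refuse.
            return False
        self.visits[url] = count + 1
        return True


strategy = LoopTrapStrategy()
first = strategy.allow("http://example.com/a")    # allowed on first sight
second = strategy.allow("http://example.com/a")   # refused when seen again
other = strategy.allow("http://example.com/b")    # a different URL is allowed
```

A real crawler would probably normalize URLs before counting them, since a trap can also generate endless variants of the same page; that refinement is outside what the text describes.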

Specifically, the browser identifier switching strategy imitates user behavior as closely as possible. Different users use different browsers, so imitating user behavior requires changing the browser type or version. The browser type and version are indicated by a browser identifier (for example, the User-Agent). When crawling, the crawler simulates a virtual browser that is distinguished by its User-Agent; the User-Agent value is determined by the browser type and version number, so changing the User-Agent value is equivalent to switching browsers. Therefore, when the exception of failing to grab web page data from the current URL is detected, or a preset time is reached, the browser identifier is changed so as to extend the crawler's grabbing time.
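The switching condition (grab failure, or a preset time reached) might be sketched as below. The class name `UserAgentSwitcher`, the round-robin order and the placeholder identifiers `UA-1` to `UA-3` are assumptions made for illustration; the clock is injectable only so that the example can be exercised without waiting.

```python
import itertools
import time


class UserAgentSwitcher:
    """Rotates the browser identifier on failure or after a preset interval."""

    def __init__(self, user_agents, interval_seconds, clock=time.monotonic):
        self._cycle = itertools.cycle(user_agents)
        self.current = next(self._cycle)
        self.interval = interval_seconds
        self._clock = clock
        self._last_switch = clock()

    def update(self, fetch_failed):
        now = self._clock()
        if fetch_failed or now - self._last_switch >= self.interval:
            # Either trigger from the strategy fired: move to the next identifier.
            self.current = next(self._cycle)
            self._last_switch = now
        return self.current


fake_time = [0.0]  # controllable clock for the demonstration
switcher = UserAgentSwitcher(["UA-1", "UA-2", "UA-3"], 10, clock=lambda: fake_time[0])
unchanged = switcher.update(fetch_failed=False)   # no trigger, identifier kept
after_failure = switcher.update(fetch_failed=True)
fake_time[0] = 10.0
after_timeout = switcher.update(fetch_failed=False)
```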

Specifically, the cookie dynamic update strategy is implemented mainly by timed updates: when the preset time is reached, the cookie is updated. Updating the cookie is equivalent to establishing a new session with the website being crawled, which extends the grabbing time.
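One way to picture the timed update is the sketch below, which uses the standard-library `http.cookiejar.CookieJar` and simply empties it when the interval has elapsed, so that the next request would start a fresh session. Treating "update" as "discard the stored cookies" is an assumption, as are the class name `CookieRefresher` and its `interval_seconds` parameter.

```python
import http.cookiejar
import time


class CookieRefresher:
    """Clears stored cookies once a preset interval has elapsed."""

    def __init__(self, interval_seconds, clock=time.monotonic):
        self.jar = http.cookiejar.CookieJar()
        self.interval = interval_seconds
        self._clock = clock
        self._last_refresh = clock()

    def maybe_refresh(self):
        now = self._clock()
        if now - self._last_refresh >= self.interval:
            # Dropping all cookies forces a new session on the next request.
            self.jar.clear()
            self._last_refresh = now
            return True
        return False


fake_time = [0.0]  # controllable clock for the demonstration
refresher = CookieRefresher(5, clock=lambda: fake_time[0])
too_early = refresher.maybe_refresh()
fake_time[0] = 5.0
on_time = refresher.maybe_refresh()
```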

Specifically, the proxy IP switching strategy mainly addresses the situation in which a website blocks an IP (a network address, for example an IPv4 or IPv6 address; usually an IPv4 address of the form XXX.XXX.XXX.XXX) that has been grabbing web page data for a long time. A single computer generally has only one IP, so single-computer crawling is carried out through proxy IPs. Under the proxy IP switching strategy, when the exception of failing to grab web page data from the current URL is detected, or a preset time is reached, the proxy IP is changed to avoid being blocked.
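The following sketch shows how a chosen proxy and browser identifier could be applied to requests with Python's standard `urllib.request`, together with a small function that picks the next proxy when either trigger fires. The helper names `next_proxy` and `build_opener_for` are hypothetical, and the proxy addresses are taken from the 192.0.2.0/24 documentation range, so they do not point at real servers. No request is sent in this example.

```python
import urllib.request


def next_proxy(proxies, current_index, fetch_failed, timer_expired):
    """Return the index of the proxy to use for the next request."""
    if fetch_failed or timer_expired:
        # Switch to the next proxy, wrapping around at the end of the list.
        return (current_index + 1) % len(proxies)
    return current_index


def build_opener_for(proxy_url, user_agent):
    """Build an opener that routes traffic through proxy_url."""
    proxy_handler = urllib.request.ProxyHandler(
        {"http": proxy_url, "https": proxy_url}
    )
    opener = urllib.request.build_opener(proxy_handler)
    # Replace the default headers so requests carry the chosen identifier.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener


proxies = ["http://192.0.2.10:8080", "http://192.0.2.11:8080"]
index = next_proxy(proxies, 0, fetch_failed=True, timer_expired=False)
opener = build_opener_for(proxies[index], "UA-1")
```

Calling `opener.open(url)` would then send the request through the selected proxy; it is left out here because it needs a reachable proxy.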

In one embodiment, in step 14, parsing the web page data according to the rule to obtain parsed data specifically comprises:

if the current type is home page URL, parsing the web page data according to the rule corresponding to home page URLs to obtain parsed data, and, if the parsed data contains URLs, storing the URLs in the parsed data as seeds of type paging URL;

if the current type is paging URL, parsing the web page data according to the rule corresponding to paging URLs to obtain parsed data, and, if the parsed data contains URLs, storing the URLs in the parsed data as seeds of type detail page URL;

if the current type is detail page URL, parsing the web page data according to the rule corresponding to detail page URLs to obtain parsed data, and saving the web page content in the parsed data.

This embodiment classifies URLs so that URLs of different types can be parsed with different rules, thereby obtaining more accurate parsing results.
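The three-level dispatch described above can be illustrated as follows. The regular expressions standing in for the per-type rules, the type labels `home`, `paging` and `detail`, and the function name `parse` are all assumptions; a real rule file would be specific to each website.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Seed:
    url: str
    site_id: int
    type: str  # "home", "paging" or "detail"


# Hypothetical per-type rules: which links to pull out of each page type.
LINK_RULES = {
    "home": re.compile(r'href="(/list\?page=\d+)"'),
    "paging": re.compile(r'href="(/item/\d+)"'),
}
# The type given to seeds discovered on each page type.
NEXT_TYPE = {"home": "paging", "paging": "detail"}


def parse(seed, page, seed_store, content_store):
    if seed.type in LINK_RULES:
        # Home and paging pages yield URLs, stored as seeds of the next type.
        for url in LINK_RULES[seed.type].findall(page):
            seed_store.append(Seed(url, seed.site_id, NEXT_TYPE[seed.type]))
    elif seed.type == "detail":
        # Detail pages yield the content itself.
        content_store.append(page)
    else:
        raise ValueError("unknown seed type: " + seed.type)


seeds, contents = [], []
home_page = '<a href="/list?page=1">1</a><a href="/about">About</a>'
parse(Seed("http://example.com/", 7, "home"), home_page, seeds, contents)
parse(seeds[0], '<a href="/item/42">Item</a>', seeds, contents)
parse(seeds[1], "<h1>Item 42</h1>", seeds, contents)
```

Note how the `/about` link on the home page is ignored because it does not match the home page rule; this is the sense in which type-specific rules give more accurate results.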

Fig. 2 shows the structural module diagram of a single-computer crawler grabbing system of the present invention, comprising:

a seed receiving module 201, for acquiring at least one seed comprising a URL, a website number and a type, taking the URL of the seed as the current URL, the website number of the seed as the current website number, and the type of the seed as the current type;

a strategy module 202, for acquiring at least one strategy and determining at least one crawler grabbing parameter according to the strategy;

a rule module 203, for acquiring, according to the current type, the rule corresponding to the current type;

a parsing module 204, for grabbing web page data from the current URL according to the crawler grabbing parameter, and parsing the web page data according to the rule to obtain parsed data.

In one embodiment, in the parsing module 204, if an exception occurs while grabbing the web page data or while parsing the web page data, the exception is saved.

In one embodiment:

In one embodiment, in the parsing module 204, parsing the web page data according to the rule to obtain parsed data specifically comprises:

Fig. 3 shows the structural module diagram of a preferred embodiment of a single-computer crawler grabbing system of the present invention, comprising a seed generation module 310, a grabbing module 320 and a data storage module 330.

The main role of the seed generation module 310 is to provide seeds to the grabbing module. A seed can be the URL of a website or the SKU of a commodity. Seeds can be stored in a text file or a database, and the grabbing module can fetch seeds from the text file or database in batches.

Each seed must have a website virtual number and a type. The website virtual number tells the grabbing module which rule file to call for parsing documents, and the type field indicates what type the seed belongs to: detail page URL, paging URL or home page URL.
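As an illustration of seeds kept in a text file and read in batches, the sketch below assumes one seed per line with three tab-separated fields (URL, website virtual number, type). That file layout, the function name `read_seed_batches` and the type labels are hypothetical; the source does not specify a format. An in-memory `io.StringIO` stands in for the file.

```python
import csv
import io
from dataclasses import dataclass
from itertools import islice


@dataclass
class Seed:
    url: str
    site_id: int   # website virtual number, selects the rule file
    type: str      # "home", "paging" or "detail"


def read_seed_batches(lines, batch_size):
    """Yield lists of at most batch_size seeds from tab-separated lines."""
    reader = csv.reader(lines, delimiter="\t")
    while True:
        batch = [
            Seed(url, int(site_id), seed_type)
            for url, site_id, seed_type in islice(reader, batch_size)
        ]
        if not batch:
            return
        yield batch


seed_file = io.StringIO(
    "http://example.com/\t1\thome\n"
    "http://example.com/list?page=1\t1\tpaging\n"
    "http://example.org/item/9\t2\tdetail\n"
)
batches = list(read_seed_batches(seed_file, batch_size=2))
```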

The grabbing module 320 is the core of the whole single-computer crawler. It consists of a rule file management submodule 321, a document parsing submodule 322, a strategy management submodule 323 and an exception reporting submodule 324. The main role of the rule file management submodule 321 is to manage the document parsing rules of all kinds of websites and to provide parsing rules to the document parsing submodule 322. The document parsing submodule 322 obtains the rules of each website from the rule file management submodule 321 and uses them to parse documents, obtaining the information the user is interested in. The strategy management submodule 323, as the optimization submodule of the grabbing module, can consist of a series of strategy management chains; by analyzing the grabbing flow of the grabbing module, it can be used to prevent repeated crawling and falling into infinite-loop traps, to extend the grabbing time, and so on. The exception reporting submodule 324 reports the various problems the grabbing module 320 encounters during grabbing and feeds them back to the user promptly, preventing wasted resources.

After the grabbing module 320 obtains a seed, it analyzes the information reported by the exception reporting submodule, calls the corresponding strategy management submodule, and requests the web page. The strategy management submodule 323 contains a series of predefined strategies, stored in multiple strategy chains, for example: the seed infinite-loop handling strategy, the browser identifier switching strategy, the cookie dynamic update strategy, and the proxy IP switching strategy. These strategies help the grabbing module request web pages more efficiently.

After the web page information is obtained, the rule file management submodule 321 is called with the seed's virtual number to obtain the corresponding rule file, and the document is parsed by the document parsing submodule 322. Each seed has a type field that tells the document parsing submodule 322 what to parse: for a home page URL, paging URLs are generally parsed out; for a paging URL, detail page URLs need to be parsed out; for a detail page URL, the content can be parsed out directly. The parsed results are saved separately: if what is parsed out is still a URL, it must be tagged with a type and saved separately for subsequent use by the grabbing module 320; if what is parsed out is content, it can be saved in a database or text file for direct use by the user.

Exception reporting runs through the whole grabbing flow and distinguishes two kinds of errors: system-level errors and user-level errors. System-level errors are reported to the grabbing module 320; once the grabbing module receives an error of this kind, it calls the corresponding strategy management submodule 323 to optimize the grabbing process. User-level errors are those the system cannot handle and must be fed back to the user, for example when the parsing module fails to parse content.
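The two-way routing of errors might look like the sketch below. The names `Level`, `ExceptionReport` and `ExceptionReporter`, and the use of two in-memory lists as destinations, are assumptions for illustration; the source only states that system-level errors go to the grabbing module and user-level errors go to the user.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    SYSTEM = "system"  # can be handled by a strategy, e.g. a failed grab
    USER = "user"      # cannot be handled automatically, e.g. a parse error


@dataclass
class ExceptionReport:
    level: Level
    url: str
    message: str


class ExceptionReporter:
    def __init__(self):
        self.for_strategies = []  # consumed by the strategy management submodule
        self.for_user = []        # fed back to the user

    def report(self, report):
        if report.level is Level.SYSTEM:
            self.for_strategies.append(report)
        else:
            self.for_user.append(report)


reporter = ExceptionReporter()
reporter.report(ExceptionReport(Level.SYSTEM, "http://example.com/a", "grab failed"))
reporter.report(ExceptionReport(Level.USER, "http://example.com/b", "parse failed"))
```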

The data storage module 330 stores the various data obtained from the grabbing module. The data can be saved in a database or in documents, and this module can also supply data to the grabbing module 320.

Some data needs to be saved for reuse by the system, and some data can be used directly by the user.

The present invention divides the grabbing module of the single-computer crawler into submodules, which gives good extensibility, and adds a strategy management submodule and an exception reporting submodule, which greatly optimize the whole grabbing flow.

The embodiments described above express only several implementations of the present invention. Their description is relatively specific and detailed, but it should not therefore be construed as limiting the scope of the patent. It should be pointed out that a person of ordinary skill in the art can make a number of variations and improvements without departing from the inventive concept, and these all fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be determined by the appended claims.

Claims

1. A single-computer crawler grabbing method, characterized by comprising:

Step (11), acquiring at least one seed comprising a URL, a website number and a type, taking the URL of the seed as the current URL, the website number of the seed as the current website number, and the type of the seed as the current type;

Step (12), acquiring at least one strategy, and determining at least one crawler grabbing parameter according to the strategy;

Step (13), acquiring, according to the current type, the rule corresponding to the current type;

Step (14), grabbing web page data from the current URL according to the crawler grabbing parameter, and parsing the web page data according to the rule to obtain parsed data.

2. The single-computer crawler grabbing method according to claim 1, characterized in that, in step (14), if an exception occurs while grabbing the web page data or while parsing the web page data, the exception is saved.

3. The single-computer crawler grabbing method according to claim 2, characterized in that the strategy comprises: a seed infinite-loop handling strategy, a browser identifier switching strategy, a cookie dynamic update strategy and/or a proxy IP switching strategy.

4. The single-computer crawler grabbing method according to claim 3, characterized in that:

5. The single-computer crawler grabbing method according to claim 1, characterized in that, in step (14), parsing the web page data according to the rule to obtain parsed data specifically comprises:

6. A single-computer crawler grabbing system, characterized by comprising:

7. The single-computer crawler grabbing system according to claim 6, characterized in that, in the parsing module, if an exception occurs while grabbing the web page data or while parsing the web page data, the exception is saved.

8. The single-computer crawler grabbing system according to claim 7, characterized in that the strategy comprises: a seed infinite-loop handling strategy, a browser identifier switching strategy, a cookie dynamic update strategy and/or a proxy IP switching strategy.

9. The single-computer crawler grabbing system according to claim 8, characterized in that:

10. The single-computer crawler grabbing system according to claim 6, characterized in that, in the parsing module, parsing the web page data according to the rule to obtain parsed data specifically comprises: