CN104252530A - Single-computer crawler grabbing method and system - Google Patents
Single-computer crawler grabbing method and system
- Publication number
- CN104252530A (application number CN201410458191.6A)
- Authority
- CN
- China
- Prior art keywords
- url
- web data
- capturing
- data
- described current
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9566—URL specific, e.g. using aliases, detecting broken or misspelled links
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a single-computer crawler grabbing method and system. The method includes: acquiring at least one seed including a URL (uniform resource locator), a website number and a type, taking the URL of the seed as the current URL, the website number of the seed as the current website number, and the type of the seed as the current type; acquiring at least one strategy, and determining at least one crawler grabbing parameter according to the strategy; acquiring the rule corresponding to the current type; and grabbing web data from the current URL according to the crawler grabbing parameter, and parsing the web data according to the rule to obtain parsed data. Because the crawler grabbing parameters are determined by strategies, problems arising during grabbing can be dealt with in time, so that working efficiency is improved, the time for which grabbing can continue is extended, and the method and system suit websites of various types.
Description
Technical field
The present invention relates to web crawler technology, and in particular to a single-computer crawler grabbing method and system.
Background art
The Internet holds a massive amount of data and information. Converting that data and information into what one actually needs, and then processing and analyzing it, is a thorny problem. The emergence of web crawlers addresses these problems.
Most current crawlers simply implement the function of crawling web pages, and handle poorly such aspects as repeated crawling, falling into infinite-loop traps, and formulating strategies against anti-crawling measures (to extend the grabbing time). In addition, current single-computer crawlers have poor network compatibility and cannot meet the demand of grabbing multiple websites at the same time.
Summary of the invention
In view of this, to address the technical problems that existing single-computer web crawler grabbing mechanisms have low working efficiency, a short grabbing time, and cannot grab multiple types of websites at the same time, it is necessary to provide a single-computer crawler grabbing method and system.
A single-computer crawler grabbing method, comprising:
acquiring at least one seed comprising a URL, a website number and a type, taking the URL of the seed as the current URL, the website number of the seed as the current website number, and the type of the seed as the current type;
acquiring at least one strategy, and determining at least one crawler grabbing parameter according to the strategy;
acquiring, according to the current type, the rule corresponding to the current type;
grabbing web data from the current URL according to the crawler grabbing parameter, and parsing the web data according to the rule to obtain parsed data.
A single-computer crawler grabbing system, comprising:
a seed receiving module, configured to acquire at least one seed comprising a URL, a website number and a type, taking the URL of the seed as the current URL, the website number of the seed as the current website number, and the type of the seed as the current type;
a strategy module, configured to acquire at least one strategy and determine at least one crawler grabbing parameter according to the strategy;
a rule module, configured to acquire, according to the current type, the rule corresponding to the current type;
a parsing module, configured to grab web data from the current URL according to the crawler grabbing parameter, and to parse the web data according to the rule to obtain parsed data.
The present invention determines the crawler grabbing parameters by means of strategies, so that problems arising during grabbing are overcome in time, thereby improving working efficiency, extending the grabbing time, and adapting to multiple types of websites.
Brief description of the drawings
Fig. 1 is a workflow diagram of the single-computer crawler grabbing method of the present invention;
Fig. 2 is a structural block diagram of the single-computer crawler grabbing system of the present invention;
Fig. 3 is a structural block diagram of the preferred embodiment of the single-computer crawler grabbing system of the present invention.
Detailed description
The present invention is described in further detail below with reference to the drawings and specific embodiments.
Fig. 1 shows the workflow of the single-computer crawler grabbing method of the present invention, which comprises:
Step 11, acquiring at least one seed comprising a URL, a website number and a type, taking the URL of the seed as the current URL, the website number of the seed as the current website number, and the type of the seed as the current type;
Step 12, acquiring at least one strategy, and determining at least one crawler grabbing parameter according to the strategy;
Step 13, acquiring, according to the current type, the rule corresponding to the current type;
Step 14, grabbing web data from the current URL according to the crawler grabbing parameter, and parsing the web data according to the rule to obtain parsed data.
The strategy in step 12 is used to determine the crawler grabbing parameters: different strategies determine different parameters, and step 14 then grabs web data using the parameters determined in step 12. Because the crawler grabbing parameters are determined by the strategies of step 12, different strategies can be set to meet different grabbing demands, thereby improving working efficiency, extending the grabbing time, and adapting to multiple types of websites.
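Steps 11 to 14 can be illustrated with a minimal Python sketch. It is not taken from the patent: the names Seed, crawl_once, fake_fetch and the "user_agent" parameter key are hypothetical, and the fetcher is a stand-in so that nothing touches the network.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Seed:
    url: str
    site_id: int  # the "website number"; carried on the seed, unused in this sketch
    type: str     # assumed labels: "homepage", "paging", "detail"


Strategy = Callable[[Seed], Dict[str, object]]
Rule = Callable[[str], Dict[str, object]]
Fetcher = Callable[[str, Dict[str, object]], str]


def crawl_once(seed: Seed, strategies: List[Strategy],
               rules: Dict[str, Rule], fetch: Fetcher) -> Dict[str, object]:
    # Step 11: the seed supplies the current URL and current type.
    current_url, current_type = seed.url, seed.type
    # Step 12: each strategy contributes one or more grabbing parameters.
    params: Dict[str, object] = {}
    for strategy in strategies:
        params.update(strategy(seed))
    # Step 13: look up the rule that corresponds to the current type.
    rule = rules[current_type]
    # Step 14: grab with the parameters, then parse with the rule.
    page = fetch(current_url, params)
    return rule(page)


def fake_fetch(url: str, params: Dict[str, object]) -> str:
    # Hypothetical fetcher that fabricates a page instead of issuing a request.
    return f"<html>{url} via {params['user_agent']}</html>"


result = crawl_once(
    Seed("http://example.com/", 1, "homepage"),
    strategies=[lambda seed: {"user_agent": "ExampleBrowser/1.0"}],
    rules={"homepage": lambda page: {"length": len(page)}},
    fetch=fake_fetch,
)
```

Passing the strategies and rules in as plain callables keeps the loop itself unchanged when a new strategy or a new page type is added, which is one way to read the flexibility the paragraph above describes.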
In one embodiment, in step 14, if an abnormal condition occurs while grabbing the web data or while parsing the web data, the abnormal condition is saved.
By monitoring abnormal conditions that occur while grabbing or parsing the web data, abnormal conditions can be fed back to the user in time, preventing wasted resources.
In one embodiment, the strategy comprises: a seed infinite-loop handling strategy, a browser identifier switching strategy, a cookie dynamic update strategy and/or a proxy IP switching strategy.
Among the strategies of this embodiment, the seed infinite-loop handling strategy prevents repeated crawling and falling into infinite-loop traps, while the browser identifier switching strategy, the cookie dynamic update strategy and/or the proxy IP switching strategy can extend the grabbing time.
In one embodiment:
the seed infinite-loop handling strategy is specifically: the crawler grabbing parameter is whether to allow or refuse grabbing web data from the current URL; if the abnormal condition is that the current URL has fallen into an infinite loop, the crawler grabbing parameter is set to refuse grabbing web data from the current URL; otherwise, the crawler grabbing parameter is set to allow grabbing web data from the current URL;
the browser identifier switching strategy is specifically: the crawler grabbing parameter is the browser identifier used when grabbing web data from the current URL; if the abnormal condition is that grabbing web data from the current URL has failed, or a preset time has been reached, the browser identifier used when grabbing web data from the current URL is updated to another browser identifier; otherwise, the browser identifier used when grabbing web data from the current URL is not updated;
the cookie dynamic update strategy is specifically: the crawler grabbing parameter is whether to allow or refuse updating the cookie when grabbing web data from the current URL; if a preset time has been reached, the crawler grabbing parameter is set to allow updating the cookie when grabbing web data from the current URL; otherwise, the crawler grabbing parameter is set to refuse updating the cookie when grabbing web data from the current URL;
the proxy IP switching strategy is specifically: the crawler grabbing parameter is whether to allow or refuse updating the proxy IP when grabbing web data from the current URL; if the abnormal condition is that grabbing web data from the current URL has failed, or a preset time has been reached, the crawler grabbing parameter is set to allow updating the proxy IP when grabbing web data from the current URL; otherwise, the crawler grabbing parameter is set to refuse updating the proxy IP when grabbing web data from the current URL.
This embodiment further specifies the seed infinite-loop handling strategy, the browser identifier switching strategy, the cookie dynamic update strategy and the proxy IP switching strategy. The seed infinite-loop handling strategy, the browser identifier switching strategy and the proxy IP switching strategy adjust the crawler grabbing parameters according to abnormal conditions, while the cookie dynamic update strategy adjusts them by timed updates.
Specifically, the seed infinite-loop handling strategy mainly addresses infinite-loop traps set by websites. After the crawler grabs web data from a URL, it extracts new URLs from that web data and then grabs new web data from those new URLs. Some websites, however, set infinite-loop traps: the new URL extracted from the web data is a URL that already exists, so the crawler falls into an infinite loop and grabbing is disrupted. Under the seed infinite-loop handling strategy, when the abnormal condition that the current URL has fallen into an infinite loop is detected, grabbing web data from the current URL is refused, thereby avoiding the infinite loop.
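One simple way to detect such a trap is to count visits per URL, as in the Python sketch below. The text does not say how the loop is detected, so the visit counter, the class name LoopGuard and the max_visits threshold are all assumptions made for illustration.

```python
from typing import Dict


class LoopGuard:
    """Allow or refuse grabbing a URL based on how often it was already seen."""

    def __init__(self, max_visits: int = 1) -> None:
        self.max_visits = max_visits        # assumed threshold, not from the text
        self.visits: Dict[str, int] = {}

    def allow(self, url: str) -> bool:
        count = self.visits.get(url, 0)
        if count >= self.max_visits:
            # The URL came back again: treat it as an infinite-loop trap and refuse.
            return False
        self.visits[url] = count + 1
        return True


guard = LoopGuard()
first = guard.allow("http://example.com/a")   # new URL, allowed
second = guard.allow("http://example.com/a")  # same URL again, refused
other = guard.allow("http://example.com/b")   # different URL, allowed
```

The boolean returned by allow plays the role of the allow-or-refuse crawler grabbing parameter described above.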
Specifically, the browser identifier switching strategy imitates user behavior as closely as possible. Different users may use different browsers, so imitating user behavior requires changing the browser type or version. The type and version of a browser are identified by a browser identifier (for example, user-agent). When crawling, the crawler simulates a virtual browser that is distinguished by its user-agent; the value of user-agent is determined by the browser type and version number, so changing the value of user-agent is equivalent to switching browsers. Therefore, when the abnormal condition that grabbing web data from the current URL has failed is detected, or when the preset time is reached, the browser identifier is changed to extend the crawler's grabbing time.
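The switching rule (change the identifier on a failed grab or when the preset time is reached) might be sketched in Python as follows. The class name UserAgentSwitcher, the placeholder identifier strings and the injectable clock are assumptions for illustration; a real crawler would send the current value in the User-Agent request header.

```python
import itertools
import time
from typing import Callable, Sequence


class UserAgentSwitcher:
    """Rotate the browser identifier on failure or after a preset interval."""

    def __init__(self, agents: Sequence[str], interval_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._cycle = itertools.cycle(agents)
        self.current = next(self._cycle)
        self.interval_s = interval_s
        self._clock = clock
        self._last_switch = clock()

    def update(self, fetch_failed: bool) -> str:
        now = self._clock()
        if fetch_failed or now - self._last_switch >= self.interval_s:
            # Either trigger from the text switches to another identifier.
            self.current = next(self._cycle)
            self._last_switch = now
        return self.current


# Usage with a fake clock so the timing is deterministic.
fake_now = [0.0]
switcher = UserAgentSwitcher(["UA-A", "UA-B", "UA-C"], interval_s=10,
                             clock=lambda: fake_now[0])
ua_initial = switcher.update(fetch_failed=False)       # no trigger
ua_after_failure = switcher.update(fetch_failed=True)  # failure trigger
fake_now[0] = 10.0
ua_after_timer = switcher.update(fetch_failed=False)   # timer trigger
```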
Specifically, the cookie dynamic update strategy is implemented mainly by timed updates: when the preset time is reached, the cookie is updated. Updating the cookie is equivalent to establishing a new session with the website whose web data is being grabbed, which can extend the grabbing time.
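A timed cookie update could look like the Python sketch below. It assumes that "updating" means discarding the stored cookies so that the next request starts a fresh session, which is only one possible reading of the text; the class name CookieRefresher and the interval are hypothetical. The cookie jar is the standard-library http.cookiejar.CookieJar.

```python
import http.cookiejar
import time
from typing import Callable


class CookieRefresher:
    """Clear the cookie jar once a preset interval has elapsed."""

    def __init__(self, interval_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.jar = http.cookiejar.CookieJar()
        self.interval_s = interval_s
        self._clock = clock
        self._last_refresh = clock()

    def maybe_refresh(self) -> bool:
        """Return True if the cookie update was allowed (and performed)."""
        now = self._clock()
        if now - self._last_refresh >= self.interval_s:
            # Dropping all cookies means the site sees the next request as a new session.
            self.jar.clear()
            self._last_refresh = now
            return True
        return False


# Usage with a fake clock so the timing is deterministic.
cookie_now = [0.0]
refresher = CookieRefresher(interval_s=60, clock=lambda: cookie_now[0])
before_timer = refresher.maybe_refresh()   # preset time not reached
cookie_now[0] = 60.0
at_timer = refresher.maybe_refresh()       # preset time reached
```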
Specifically, the proxy IP switching strategy mainly addresses the situation where a website blocks an IP (a network address, for example an IPv4 or IPv6 address, usually an IPv4 address of the form XXX.XXX.XXX.XXX) that has been grabbing web data for a long time. Because a single computer generally has only one IP, single-computer crawling is carried out through proxy IPs. Under the proxy IP switching strategy, when the abnormal condition that grabbing web data from the current URL has failed is detected, or when the preset time is reached, the proxy IP is changed to avoid being blocked.
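A round-robin proxy pool is one way to realize the switch, sketched here in Python with the standard-library urllib.request. The class name ProxyPool is hypothetical and the proxy addresses are documentation placeholders (192.0.2.0/24), not real proxies; the sketch builds an opener but sends no request.

```python
import urllib.request
from typing import Sequence


class ProxyPool:
    """Hold several proxy addresses and move to the next one on demand."""

    def __init__(self, proxies: Sequence[str]) -> None:
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._proxies = list(proxies)
        self._index = 0

    @property
    def current(self) -> str:
        return self._proxies[self._index]

    def switch(self) -> str:
        # Called when a grab fails or the preset time is reached.
        self._index = (self._index + 1) % len(self._proxies)
        return self.current

    def opener(self) -> urllib.request.OpenerDirector:
        # Route both HTTP and HTTPS traffic through the current proxy.
        handler = urllib.request.ProxyHandler(
            {"http": self.current, "https": self.current})
        return urllib.request.build_opener(handler)


pool = ProxyPool(["http://192.0.2.10:8080", "http://192.0.2.11:8080"])
proxy_before = pool.current
proxy_after = pool.switch()
proxy_wrapped = pool.switch()   # wraps around to the first proxy
proxy_opener = pool.opener()
```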
In one embodiment, in step 14, parsing the web data according to the rule to obtain parsed data specifically comprises:
if the current type is homepage URL, parsing the web data according to the rule corresponding to the homepage URL to obtain parsed data, and, if the parsed data contains a URL, storing the URL in the parsed data as a seed whose type is paging URL;
if the current type is paging URL, parsing the web data according to the rule corresponding to the paging URL to obtain parsed data, and, if the parsed data contains a URL, storing the URL in the parsed data as a seed whose type is detail-page URL;
if the current type is detail-page URL, parsing the web data according to the rule corresponding to the detail-page URL to obtain parsed data, and saving the web page content in the parsed data.
This embodiment classifies URLs so that different types of URL can be parsed with different rules, yielding more accurate parsing results.
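The type-driven parsing can be sketched in Python as a small dispatch table. The regular expressions standing in for the per-type rules, the URL patterns (/list/, /item/), the type labels and the function name parse are all assumptions made for illustration; the patent does not specify the rule format.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Seed:
    url: str
    site_id: int
    type: str  # assumed labels: "homepage", "paging", "detail"


# Hypothetical per-type rules: each type has its own extraction pattern.
RULES = {
    "homepage": re.compile(r'href="(/list/[^"]+)"'),
    "paging": re.compile(r'href="(/item/[^"]+)"'),
    "detail": re.compile(r"<h1>(.*?)</h1>"),
}
# URLs found on a homepage become paging seeds; those on a paging page become detail seeds.
NEXT_TYPE = {"homepage": "paging", "paging": "detail"}


def parse(seed: Seed, page: str) -> Tuple[List[Seed], Optional[List[str]]]:
    matches = RULES[seed.type].findall(page)
    if seed.type in NEXT_TYPE:
        new_seeds = [Seed(url, seed.site_id, NEXT_TYPE[seed.type]) for url in matches]
        return new_seeds, None
    # Detail page: the matches are the content to save, and no new seeds are produced.
    return [], matches


home_seeds, home_content = parse(
    Seed("/", 1, "homepage"),
    '<a href="/list/1">page 1</a><a href="/about">about</a>')
detail_seeds, detail_content = parse(
    Seed("/item/9", 1, "detail"), "<h1>Widget</h1>")
```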
Fig. 2 shows a structural block diagram of the single-computer crawler grabbing system of the present invention, which comprises:
a seed receiving module 201, configured to acquire at least one seed comprising a URL, a website number and a type, taking the URL of the seed as the current URL, the website number of the seed as the current website number, and the type of the seed as the current type;
a strategy module 202, configured to acquire at least one strategy and determine at least one crawler grabbing parameter according to the strategy;
a rule module 203, configured to acquire, according to the current type, the rule corresponding to the current type;
a parsing module 204, configured to grab web data from the current URL according to the crawler grabbing parameter, and to parse the web data according to the rule to obtain parsed data.
In one embodiment, in the parsing module 204, if an abnormal condition occurs while grabbing the web data or while parsing the web data, the abnormal condition is saved.
In one embodiment, the strategy comprises: a seed infinite-loop handling strategy, a browser identifier switching strategy, a cookie dynamic update strategy and/or a proxy IP switching strategy.
In one embodiment:
the seed infinite-loop handling strategy is specifically: the crawler grabbing parameter is whether to allow or refuse grabbing web data from the current URL; if the abnormal condition is that the current URL has fallen into an infinite loop, the crawler grabbing parameter is set to refuse grabbing web data from the current URL; otherwise, the crawler grabbing parameter is set to allow grabbing web data from the current URL;
the browser identifier switching strategy is specifically: the crawler grabbing parameter is the browser identifier used when grabbing web data from the current URL; if the abnormal condition is that grabbing web data from the current URL has failed, or a preset time has been reached, the browser identifier used when grabbing web data from the current URL is updated to another browser identifier; otherwise, the browser identifier used when grabbing web data from the current URL is not updated;
the cookie dynamic update strategy is specifically: the crawler grabbing parameter is whether to allow or refuse updating the cookie when grabbing web data from the current URL; if a preset time has been reached, the crawler grabbing parameter is set to allow updating the cookie when grabbing web data from the current URL; otherwise, the crawler grabbing parameter is set to refuse updating the cookie when grabbing web data from the current URL;
the proxy IP switching strategy is specifically: the crawler grabbing parameter is whether to allow or refuse updating the proxy IP when grabbing web data from the current URL; if the abnormal condition is that grabbing web data from the current URL has failed, or a preset time has been reached, the crawler grabbing parameter is set to allow updating the proxy IP when grabbing web data from the current URL; otherwise, the crawler grabbing parameter is set to refuse updating the proxy IP when grabbing web data from the current URL.
In one embodiment, in the parsing module 204, parsing the web data according to the rule to obtain parsed data specifically comprises:
if the current type is homepage URL, parsing the web data according to the rule corresponding to the homepage URL to obtain parsed data, and, if the parsed data contains a URL, storing the URL in the parsed data as a seed whose type is paging URL;
if the current type is paging URL, parsing the web data according to the rule corresponding to the paging URL to obtain parsed data, and, if the parsed data contains a URL, storing the URL in the parsed data as a seed whose type is detail-page URL;
if the current type is detail-page URL, parsing the web data according to the rule corresponding to the detail-page URL to obtain parsed data, and saving the web page content in the parsed data.
Fig. 3 shows a structural block diagram of the preferred embodiment of the single-computer crawler grabbing system of the present invention, which comprises a seed generation module 310, a grabbing module 320 and a data storage module 330.
The main role of the seed generation module 310 is to provide seeds to the grabbing module. A seed can be the URL of a website or the SKU of a commodity. Seeds can be kept in a text file or a database, and the grabbing module can acquire seeds in batches from the text file or database.
Each seed must have a virtual website number and a type. The virtual website number tells the grabbing module which rule file to call to parse the document, and the type field indicates what type the seed is: a detail-page URL, a paging URL or a homepage URL.
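Loading seeds in a batch from a text file might look like the Python sketch below. The file layout (one seed per line, tab-separated URL, website number and type), the type labels and the function name load_seeds are assumptions, since the text does not define a seed file format.

```python
import csv
import io
from typing import Dict, Iterable, List

ALLOWED_TYPES = {"homepage", "paging", "detail"}  # assumed labels for the three types


def load_seeds(text_file: Iterable[str]) -> List[Dict[str, object]]:
    """Read tab-separated lines of the form: url, website number, type."""
    seeds: List[Dict[str, object]] = []
    for row in csv.reader(text_file, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        url, site_id, seed_type = row
        if seed_type not in ALLOWED_TYPES:
            raise ValueError(f"unknown seed type: {seed_type!r}")
        seeds.append({"url": url, "site_id": int(site_id), "type": seed_type})
    return seeds


# An in-memory file stands in for the seed text file.
sample = io.StringIO(
    "http://example.com/\t7\thomepage\n"
    "http://example.com/list?p=2\t7\tpaging\n"
)
seeds = load_seeds(sample)
```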
The grabbing module 320 is the core of the whole single-computer crawler. It consists of a rule file management submodule 321, a document parsing submodule 322, a strategy management submodule 323 and an exception reporting submodule 324. The main role of the rule file management submodule 321 is to manage the document parsing rules of all kinds of websites and to provide parsing rules to the document parsing submodule 322. The document parsing submodule 322 obtains the rules for each website from the rule file management submodule 321 and parses documents with these rules to obtain the information the user is interested in. The strategy management submodule 323, as the optimization submodule of the grabbing module, can consist of a series of strategy chains; by analyzing the grabbing flow of the grabbing module, it can be used to prevent repeated crawling and falling into infinite-loop traps, and to extend the grabbing time. The exception reporting submodule 324 reports the various problems the grabbing module 320 encounters during grabbing and feeds them back to the user in time, preventing wasted resources.
After the grabbing module 320 obtains a seed, it analyzes the information reported by the exception reporting submodule, calls the corresponding strategy management submodule, and requests the web page. The strategy management submodule 323 contains a series of predefined strategies, kept in multiple strategy chains, for example: the seed infinite-loop handling strategy, the browser identifier switching strategy, the cookie dynamic update strategy, and the proxy IP switching strategy. These strategies help ensure that the grabbing module requests web pages more efficiently.
After the web page information is obtained, the rule file management submodule 321 is called with the seed's virtual website number to obtain the corresponding rule file, and the document parsing submodule 322 parses the document. Each seed has a type field that tells the document parsing submodule 322 what to parse: for a homepage URL, paging URLs can generally be parsed out; for a paging URL, detail-page URLs need to be parsed out; for a detail-page URL, the content can be parsed out directly. Parsed results are saved separately: if what is parsed out is still a URL, it is tagged with a type, saved separately, and used subsequently by the grabbing module 320; if what is parsed out is content, it can be saved in a database or a text file for direct use by the user.
Exception reporting runs through the whole grabbing flow and covers two kinds of error: system-level errors and user-level errors. System-level errors are reported to the grabbing module 320; once the grabbing module receives an error of this type, it calls the corresponding strategy management submodule 323 to optimize the grabbing process. User-level errors are those the system cannot handle and must be fed back to the user, for example when the parsing module fails to parse content.
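The two-level routing of error reports can be sketched in Python as follows. The names Level, ErrorReport and route, the error kinds, and the choice to forward a system-level error to the user when no strategy handles it are assumptions made for this illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class Level(Enum):
    SYSTEM = "system"  # can be handled by a strategy
    USER = "user"      # must be fed back to the user


@dataclass
class ErrorReport:
    level: Level
    kind: str   # hypothetical error kinds, e.g. "fetch_failed", "parse_error"
    url: str


def route(report: ErrorReport,
          strategy_handlers: Dict[str, Callable[[ErrorReport], None]],
          user_inbox: List[ErrorReport]) -> str:
    if report.level is Level.SYSTEM and report.kind in strategy_handlers:
        # System-level error: let the matching strategy adjust the grabbing process.
        strategy_handlers[report.kind](report)
        return "handled"
    # User-level error (or, by assumption, an unhandled system error): tell the user.
    user_inbox.append(report)
    return "reported"


switched_for: List[str] = []
inbox: List[ErrorReport] = []
handlers = {"fetch_failed": lambda report: switched_for.append(report.url)}

system_outcome = route(
    ErrorReport(Level.SYSTEM, "fetch_failed", "http://example.com/a"), handlers, inbox)
user_outcome = route(
    ErrorReport(Level.USER, "parse_error", "http://example.com/b"), handlers, inbox)
```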
The data storage module 330 stores the various data obtained from the grabbing module. The data can be kept in a database or in documents, and this module can also provide data to the grabbing module 320.
Some data needs to be saved for reuse by the system, and some data can be used directly by the user.
The present invention divides the grabbing module of the single-computer crawler into submodules, which makes it highly extensible, and adds a strategy management submodule and an exception reporting submodule, which greatly optimize the whole grabbing flow.
The embodiments above express only several implementations of the present invention. Their description is relatively specific and detailed, but it should not therefore be construed as limiting the scope of the patent. It should be noted that a person of ordinary skill in the art can make several variations and improvements without departing from the concept of the present invention, and these all fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be determined by the appended claims.
Claims (10)
1. A single-computer crawler grabbing method, characterized by comprising:
Step (11), acquiring at least one seed comprising a URL, a website number and a type, taking the URL of the seed as the current URL, the website number of the seed as the current website number, and the type of the seed as the current type;
Step (12), acquiring at least one strategy, and determining at least one crawler grabbing parameter according to the strategy;
Step (13), acquiring, according to the current type, the rule corresponding to the current type;
Step (14), grabbing web data from the current URL according to the crawler grabbing parameter, and parsing the web data according to the rule to obtain parsed data.
2. The single-computer crawler grabbing method according to claim 1, characterized in that, in step (14), if an abnormal condition occurs while grabbing the web data or while parsing the web data, the abnormal condition is saved.
3. The single-computer crawler grabbing method according to claim 2, characterized in that the strategy comprises: a seed infinite-loop handling strategy, a browser identifier switching strategy, a cookie dynamic update strategy and/or a proxy IP switching strategy.
4. The single-computer crawler grabbing method according to claim 3, characterized in that:
the seed infinite-loop handling strategy is specifically: the crawler grabbing parameter is whether to allow or refuse grabbing web data from the current URL; if the abnormal condition is that the current URL has fallen into an infinite loop, the crawler grabbing parameter is set to refuse grabbing web data from the current URL; otherwise, the crawler grabbing parameter is set to allow grabbing web data from the current URL;
the browser identifier switching strategy is specifically: the crawler grabbing parameter is the browser identifier used when grabbing web data from the current URL; if the abnormal condition is that grabbing web data from the current URL has failed, or a preset time has been reached, the browser identifier used when grabbing web data from the current URL is updated to another browser identifier; otherwise, the browser identifier used when grabbing web data from the current URL is not updated;
the cookie dynamic update strategy is specifically: the crawler grabbing parameter is whether to allow or refuse updating the cookie when grabbing web data from the current URL; if a preset time has been reached, the crawler grabbing parameter is set to allow updating the cookie when grabbing web data from the current URL; otherwise, the crawler grabbing parameter is set to refuse updating the cookie when grabbing web data from the current URL;
the proxy IP switching strategy is specifically: the crawler grabbing parameter is whether to allow or refuse updating the proxy IP when grabbing web data from the current URL; if the abnormal condition is that grabbing web data from the current URL has failed, or a preset time has been reached, the crawler grabbing parameter is set to allow updating the proxy IP when grabbing web data from the current URL; otherwise, the crawler grabbing parameter is set to refuse updating the proxy IP when grabbing web data from the current URL.
5. The single-computer crawler grabbing method according to claim 1, characterized in that, in step (14), parsing the web data according to the rule to obtain parsed data specifically comprises:
if the current type is homepage URL, parsing the web data according to the rule corresponding to the homepage URL to obtain parsed data, and, if the parsed data contains a URL, storing the URL in the parsed data as a seed whose type is paging URL;
if the current type is paging URL, parsing the web data according to the rule corresponding to the paging URL to obtain parsed data, and, if the parsed data contains a URL, storing the URL in the parsed data as a seed whose type is detail-page URL;
if the current type is detail-page URL, parsing the web data according to the rule corresponding to the detail-page URL to obtain parsed data, and saving the web page content in the parsed data.
6. a unit crawler capturing system, is characterized in that, comprising:
Seed receiver module, for obtaining the seed that at least one comprises URL, website numbering and type, using the URL of described seed as current URL, using the website of described seed numbering as current site numbering, using the type of described seed as current type;
Policy module, for obtaining at least one strategy, determines at least one crawler capturing parameter according to described strategy;
Rule module, for obtaining the rule corresponding with described current type according to described current type;
Parsing module, for capturing web data according to described crawler capturing parameter from described current URL, carrying out parsing according to described rule to described web data and obtaining resolution data.
7. unit crawler capturing system according to claim 6, is characterized in that, in described parsing module, if capturing described web data or occurring abnormal conditions in analyzing described web data, then preserve described abnormal conditions.
8. unit crawler capturing system according to claim 7, it is characterized in that, described strategy comprises: seed is absorbed in endless loop processing policy, browser mark switchover policy, cookie dynamically update strategy and/or Agent IP switchover policy.
9. The single-computer crawler capturing system according to claim 8, characterized in that:
the seed infinite-loop handling strategy is specifically: the crawler capturing parameter is whether capturing web data from the current URL is allowed or refused; if the abnormal condition is that the current URL has fallen into an infinite loop, the crawler capturing parameter is set to refuse capturing web data from the current URL; otherwise the crawler capturing parameter is set to allow capturing web data from the current URL;
the browser identifier switching strategy is specifically: the crawler capturing parameter is the browser identifier used when capturing web data from the current URL; if the abnormal condition is a failure to capture web data from the current URL, or a preset time is reached, the browser identifier used when capturing web data from the current URL is changed to another browser identifier; otherwise the browser identifier used when capturing web data from the current URL is not changed;
the cookie dynamic update strategy is specifically: the crawler capturing parameter is whether updating the cookie is allowed or refused when capturing web data from the current URL; if a preset time is reached, the crawler capturing parameter is set to allow updating the cookie when capturing web data from the current URL; otherwise the crawler capturing parameter is set to refuse updating the cookie when capturing web data from the current URL;
the proxy IP switching strategy is specifically: the crawler capturing parameter is whether updating the proxy IP is allowed or refused when capturing web data from the current URL; if the abnormal condition is a failure to capture web data from the current URL, or a preset time is reached, the crawler capturing parameter is set to allow updating the proxy IP when capturing web data from the current URL; otherwise the crawler capturing parameter is set to refuse updating the proxy IP when capturing web data from the current URL.
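The four strategies of claim 9 each map an abnormal condition or a timer event to one crawler capturing parameter. The sketch below is illustrative only and makes several assumptions that are not in the patent: the browser identifier is modelled as a User-Agent string, a single `timer_due` flag stands in for every "preset time", the identifier pool holds at least two entries, and the names `Abnormal`, `CrawlParams`, `next_user_agent` and `apply_strategies` are hypothetical.

```python
from dataclasses import dataclass, replace
from enum import Enum, auto
from typing import Optional, Sequence


class Abnormal(Enum):
    INFINITE_LOOP = auto()   # the current URL has fallen into an infinite loop
    FETCH_FAILED = auto()    # capturing web data from the current URL failed


@dataclass(frozen=True)
class CrawlParams:
    allow_fetch: bool = True      # seed infinite-loop handling strategy
    user_agent: str = ""          # browser identifier switching strategy
    update_cookie: bool = False   # cookie dynamic update strategy
    update_proxy: bool = False    # proxy IP switching strategy


def next_user_agent(current: str, pool: Sequence[str]) -> str:
    """Return the pool entry after `current`, wrapping around at the end."""
    if current not in pool:
        return pool[0]
    return pool[(pool.index(current) + 1) % len(pool)]


def apply_strategies(params: CrawlParams, abnormal: Optional[Abnormal],
                     timer_due: bool, ua_pool: Sequence[str]) -> CrawlParams:
    """Derive the next crawler capturing parameters from the four strategies."""
    failed_or_due = abnormal is Abnormal.FETCH_FAILED or timer_due
    return replace(
        params,
        # Refuse capture only when the URL is stuck in an infinite loop.
        allow_fetch=abnormal is not Abnormal.INFINITE_LOOP,
        # Switch the browser identifier on failure or when the timer fires.
        user_agent=(next_user_agent(params.user_agent, ua_pool)
                    if failed_or_due else params.user_agent),
        # Cookies are refreshed on the timer only.
        update_cookie=timer_due,
        # The proxy IP is refreshed on failure or when the timer fires.
        update_proxy=failed_or_due,
    )
```

Returning a new immutable `CrawlParams` rather than mutating the old one makes each decision easy to log and compare between crawl attempts.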
10. The single-computer crawler capturing system according to claim 6, characterized in that, in the parsing module, parsing the web data according to the rule to obtain parsed data specifically comprises:
if the current type is homepage URL, the web data is parsed according to the rule corresponding to the homepage URL to obtain parsed data, and if the parsed data contains a URL, the URL in the parsed data is stored as a seed whose type is pagination URL;
if the current type is pagination URL, the web data is parsed according to the rule corresponding to the pagination URL to obtain parsed data, and if the parsed data contains a URL, the URL in the parsed data is stored as a seed whose type is detail-page URL;
if the current type is detail-page URL, the web data is parsed according to the rule corresponding to the homepage URL to obtain parsed data, and the web page content in the parsed data is saved.
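The type-dependent handling in claim 10 amounts to promoting discovered URLs one level at a time: homepage to pagination, pagination to detail page, with content saved only at the last level. The following is a rough sketch under assumptions: `handle_parsed`, the type labels, and the `"urls"` and `"content"` keys of the parsed data are hypothetical, and the sketch looks up one rule per seed type for all three cases, an assumption drawn from the rule module of claim 6.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

HOME, PAGE, DETAIL = "home", "page", "detail"
# Type assigned to URLs discovered while parsing a page of the given type.
NEXT_TYPE = {HOME: PAGE, PAGE: DETAIL}


@dataclass(frozen=True)
class Seed:
    url: str
    site_id: int
    seed_type: str


Rule = Callable[[str], dict]


def handle_parsed(seed: Seed, web_data: str, rules: Dict[str, Rule],
                  seed_store: List[Seed], content_store: List[str]) -> dict:
    """Parse web data with the rule for the seed's type, then store the result."""
    parsed = rules[seed.seed_type](web_data)
    if seed.seed_type in NEXT_TYPE:
        # Homepage or pagination page: every URL found becomes a new seed
        # one level deeper, keeping the website number of the parent seed.
        for url in parsed.get("urls", []):
            seed_store.append(Seed(url, seed.site_id, NEXT_TYPE[seed.seed_type]))
    else:
        # Detail page: save the web page content from the parsed data.
        content_store.append(parsed.get("content", ""))
    return parsed
```

A caller would typically drain `seed_store` in a loop, feeding each new seed back through the same function until only detail pages remain.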
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410458191.6A (CN104252530B) | 2014-09-10 | 2014-09-10 | Single-computer crawler grabbing method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104252530A | 2014-12-31 |
CN104252530B | 2017-09-15 |
Family
ID=52187420
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410458191.6A (Active, CN104252530B) | Single-computer crawler grabbing method and system | 2014-09-10 | 2014-09-10 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104252530B (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101178736A (en) * | 2007-12-11 | 2008-05-14 | 腾讯科技(深圳)有限公司 | Web page collecting method and web page collecting server |
CN102347930A (en) * | 2010-07-26 | 2012-02-08 | 中国电信股份有限公司 | Method and system for obtaining webpage content |
CN102254014A (en) * | 2011-07-21 | 2011-11-23 | 华中科技大学 | Adaptive information extraction method for webpage characteristics |
CN103279507A (en) * | 2013-05-16 | 2013-09-04 | 北京尚友通达信息技术有限公司 | Webpage spider operational method and system |
CN103942309A (en) * | 2014-04-18 | 2014-07-23 | 乐得科技有限公司 | Network data acquisition device and method and implementation method of acquisition process |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105989151B (en) * | 2015-03-02 | 2019-09-06 | 阿里巴巴集团控股有限公司 | Webpage capture method and device |
CN105989151A (en) * | 2015-03-02 | 2016-10-05 | 阿里巴巴集团控股有限公司 | Webpage crawling method and apparatus |
CN106021257A (en) * | 2015-12-31 | 2016-10-12 | 广州华多网络科技有限公司 | Method, device, and system for crawler to capture data supporting online programming |
CN106021257B (en) * | 2015-12-31 | 2019-10-18 | 广州华多网络科技有限公司 | A kind of crawler capturing data method, apparatus and system for supporting online programming |
CN107045507B (en) * | 2016-02-05 | 2020-08-21 | 北京国双科技有限公司 | Webpage crawling method and device |
CN107045507A (en) * | 2016-02-05 | 2017-08-15 | 北京国双科技有限公司 | Web page crawl method and device |
CN105956175A (en) * | 2016-05-24 | 2016-09-21 | 考拉征信服务有限公司 | Webpage content crawling method and device |
CN107451046A (en) * | 2016-05-30 | 2017-12-08 | 腾讯科技(深圳)有限公司 | Method and terminal for detecting threads |
CN107451046B (en) * | 2016-05-30 | 2020-11-17 | 腾讯科技(深圳)有限公司 | Method and terminal for detecting threads |
CN107957939A (en) * | 2016-10-14 | 2018-04-24 | 北京京东尚科信息技术有限公司 | Webpage interactive interface test method and system |
CN107957939B (en) * | 2016-10-14 | 2021-02-26 | 北京京东尚科信息技术有限公司 | Webpage interaction interface testing method and system |
CN106599270A (en) * | 2016-12-23 | 2017-04-26 | 浙江省公众信息产业有限公司 | Network data capturing method and crawler |
CN108536788A (en) * | 2018-03-29 | 2018-09-14 | 合肥俊刚机械科技有限公司 | Data capture method and system based on a distributed crawler |
CN109213912A (en) * | 2018-08-16 | 2019-01-15 | 北京神州泰岳软件股份有限公司 | Method for crawling network data and network data crawl scheduling device |
CN111881337A (en) * | 2020-08-06 | 2020-11-03 | 成都信息工程大学 | A data acquisition method, system and storage medium based on Scrapy framework |
CN112541106A (en) * | 2020-12-19 | 2021-03-23 | 广州市创乐信息技术有限公司 | Network data acquisition method and device, computer equipment and storage medium |
CN112528120A (en) * | 2020-12-21 | 2021-03-19 | 北京中安智达科技有限公司 | Method for web data crawler to use browser to divide body and proxy |
Also Published As
Publication number | Publication date |
---|---|
CN104252530B (en) | 2017-09-15 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||