CN105243159A

CN105243159A - Visual script editor-based distributed web crawler system

Info

Publication number: CN105243159A
Application number: CN201510713985.7A
Authority: CN
Inventors: 倪时龙; 苏江文; 王秋琳; 陈予言
Original assignee: Fujian Yirong Information Technology Co Ltd
Current assignee: Fujian Yirong Information Technology Co Ltd
Priority date: 2015-10-28
Filing date: 2015-10-28
Publication date: 2016-01-13
Anticipated expiration: 2035-10-28
Also published as: CN105243159B

Abstract

The invention provides a distributed web crawler system based on a visual script editor. The system comprises a visual script editor, a distributed message queue, a task scheduling module, a web page capture module, a content processing module and a result storage module. Based on input that the user provides through a visual interface, the system automatically generates a metadata extraction script, so that it can identify the structure of a target site and capture specific data efficiently. The task scheduling module creates and assigns tasks; the web page capture module fetches the pages; the content processing module invokes the corresponding script to convert each page into a metadata set; and, after final unified processing, the metadata set is stored by the result storage module. The system can greatly improve crawling efficiency for data on specific sites, reduce the user's workload and save system resources. It also has good expandability and flexibility and is suitable for all types of Internet sites.

Description

A distributed web crawler system based on a visual script editor

Technical field

The present invention relates to the technical field of network communication, and in particular to a distributed web crawler system based on a visual script editor.

Background art

Since the Internet emerged at the end of the 20th century, the amount of information on it has grown explosively, and it has become a huge, widely distributed, highly heterogeneous, semi-structured and highly dynamic information repository. Web crawlers were created to extract and collect the data people are interested in from this information. Crawler technology has developed rapidly ever since and has served as the cornerstone of search engine giants at home and abroad, such as Baidu and Google, opening a window onto information for the general public.

Today, Internet information is provided mainly in the form of websites and web services. A website is made up of many kinds of web pages, and the data it provides is presented mostly as unstructured, static HTML (HyperText Markup Language). Because information analysis systems cannot use HTML directly, secondary processing is usually needed before useful information can be extracted. A web service, by contrast, is a relatively well-specified data interface from which data can be obtained by calling it with specific parameters; a web service can exist independently or be combined with a website. How to obtain specific information efficiently and accurately from a large number of specific websites or web services is receiving more and more attention. As a result, web crawler technology, which is responsible for collecting network information, faces a huge challenge.

Web crawlers have developed over many generations, and several system models have taken shape. Mature crawler designs exist at home and abroad and have been put into use, but most of them only provide a general-purpose service to public users. They cannot be customized for specific data on specific sites and cannot accommodate the varied needs of each user.

In the Internet field, the following mainstream crawler designs currently exist:

1. Traditional crawler system

A traditional crawler system requires professional software developers to analyze the page organization of the target website, its data interfaces and the JavaScript logic on its pages, and then to write program code or scripts that filter out specific data according to certain rules. The clear advantage of this approach is that the required data can be extracted accurately from the target site.

This approach has serious drawbacks, however, and is generally adopted only when the number of target sites is very limited. The reason is that the HTML used by Internet sites follows no fixed authoring convention, so a separate script must be written for every target site; in addition, more and more websites load content dynamically, which makes the scripts much harder to write. When a website is redesigned, the script must be adjusted promptly and the crawler redeployed. This greatly increases the labor cost of development and maintenance. Moreover, because of its complexity, this model has poor extensibility and scalability and is ill-suited to large-scale distributed deployment.

2. General-purpose distributed crawler system

A general-purpose distributed crawler system is built mainly from three basic parts: scheduling (control), capture and content processing. Most current Internet search engines work this way. For example, the prior art discloses a "topic-related distributed web crawler system" (see Chinese patent publication CN102646129A, published 2012-08-22). That system comprises: a topic link store, which stores hyperlinks the system has not yet crawled; a control node, which extracts hyperlinks from the topic link store, removes those the system has already crawled, distributes the remaining hyperlinks to crawl nodes, and controls whether the system stops running; crawl nodes, which receive the hyperlinks distributed by the control node, download the web pages those hyperlinks identify, and store the pages in a web page database; a web page database, which holds the pages captured by the crawl nodes; and a page analyzer, which periodically reads the latest pages downloaded by the crawl nodes from the web page database, analyzes their content, calculates the topic relevance of each page and of the hyperlinks it contains, stores the relevant hyperlinks in the topic link store according to their topic relevance, and stores the topic relevance of each page in the web page database. That invention follows exactly this model. Crawler systems of this kind concentrate on URL filtering and page topic analysis, and their content processing part almost always relies on a body text extraction module.

Body text extraction modules fall roughly into four groups: (1) text extraction algorithms based on tag density; (2) text extraction based on judging tag usage; (3) web page text extraction based on machine learning; and (4) text extraction based on visual web page block analysis. Whichever algorithm is used, it can only extract the main data of a page, such as its body text, and cannot guarantee the accuracy of the extracted data. These methods work well in distributed crawler systems, but they are limited by the algorithms they rely on: they are suited only to broad, coarse-grained crawling across many sites and are inherently deficient for crawling specific data. To achieve maximum generality they sacrifice the ability to customize; they can extract text from a page but cannot separate metadata of a particular type from that text, for example the price of a product on an e-commerce page or the specification of a drug on an online pharmacy page. In addition, most text extraction algorithms are relatively complex; when used at scale they consume more system resources than customized scripts and degrade the performance of the crawler system.

Summary of the invention

The technical problem to be solved by the present invention is to provide a distributed web crawler system based on a visual script editor that can perform efficient, customized crawling of a large number of specific sites while remaining compatible with crawling of general websites, thereby overcoming the defects of the prior art, reducing the user's workload and saving system resources.

The present invention is implemented as follows: a distributed web crawler system based on a visual script editor, the system comprising a visual script editor, a distributed message queue, a task scheduling module, a web page capture module, a content processing module and a result storage module.

The visual script editor is used to view the target website and to select the data capture regions of the target website. It converts the user's input into an execution chain, generates the corresponding script from the execution chain, and stores the script in a database. This script is the script corresponding to the target website.

The distributed message queue decouples the task scheduling module, the web page capture module, the content processing module and the result storage module. The distributed message queue comprises a scheduling queue, a crawl queue, a processing queue and a result queue.

The task scheduling module coordinates the operation of the whole system. It reads the initial URL links of the target website and the user's input information, packages them into tasks and puts them into the scheduling queue; it then takes task objects from the scheduling queue, filters out duplicate tasks, and sends the remaining tasks to the crawl queue.

The web page capture module takes URL links from the crawl queue, automatically detects the encoding of the website, converts the captured content to UTF-8, packages the UTF-8 content together with related information about the website, and forwards the package to the processing queue.

The content processing module takes the web page content of a website from the processing queue and matches the URL link of the page against the URL matching rules generated by the visual script editor. If a match is found, it calls the script corresponding to that URL matching rule to parse the web page content, and puts the parsed result into the result queue.

The result storage module takes result data from the result queue, performs unified processing and screening of the result data according to the system's predefined configuration, and stores it in the database.

Further, the system also comprises a monitoring module. The monitoring module monitors in real time whether any of the four queues in the distributed message queue (the scheduling queue, the crawl queue, the processing queue and the result queue) has failed. When an abnormality is found, it promptly pushes a message to the user interface of the system, prompting the user to check the cause of the error and to decide whether to re-enter the script.

Further, the system also comprises a text extraction module. When the domain name of a web page does not match any script in the database, the text extraction module is called to extract the corresponding content of the page. The text extraction module uses a text extraction method based on visual web page block analysis.

Further, calling the script to perform parsing proceeds as follows: if parsing produces new URL links, the new URL links are put into the scheduling queue and the task scheduling module is executed again; if parsing produces result data, the parsed result data is put into the result queue.

Further, the task scheduling module comprises a URL filtering module and a rate manager. The URL filtering module uses a Bloom filter to de-duplicate URL links and so prevents the same URL link from being crawled repeatedly; a Bloom filter consists of a binary vector and a series of random mapping functions and is used to test whether an element is in a set. The rate manager uses a token bucket algorithm to prevent network congestion: it limits the traffic flowing out to the network so that traffic is sent at a uniform rate, which ensures the stability of the system.
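The rate manager's token bucket can be illustrated with a minimal sketch. The patent names the algorithm but gives no parameters, so the class name `TokenBucket`, the `rate` and `capacity` arguments and the injectable `clock` are assumptions made here for illustration only.

```python
import time


class TokenBucket:
    """Token-bucket rate limiter: tokens refill at a fixed rate up to a capacity.

    Hypothetical helper; names and parameters are not taken from the patent.
    """

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)          # tokens added per second
        self.capacity = float(capacity)  # largest burst the bucket allows
        self.tokens = float(capacity)    # start with a full bucket
        self.clock = clock               # injectable clock, useful for testing
        self.last = clock()

    def _refill(self):
        # Add the tokens earned since the last call, capped at capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_acquire(self, cost=1.0):
        """Consume `cost` tokens and return True if available, else return False."""
        self._refill()
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A scheduler could call `try_acquire()` before moving a task to the crawl queue and leave the task queued when it returns `False`, which keeps the outgoing request rate close to `rate` per second on average.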

Further, the web page capture module comprises a proxy access module and a browser simulation module. The proxy access module accesses the specified URL links through preset IP proxies according to the user configuration information, which prevents the IP address of the server hosting the web page capture module from being blocked by the target website because of excessive traffic. The browser simulation module uses the open-source WebKit browser engine to parse the target website; it can execute the JavaScript code on the page and generate the complete page of the target website.

Further, the execution chain comprises several sub-parameters, each of which can take one of several forms: a lower-layer URL link selection rule, a metadata selection marker, or script code executable by the system.

Further, the visual script editor works as follows:

Step 1: the user enters the URL link address of the target website in the visual script editor interface;

Step 2: the visual script editor displays the web page content at the target website URL link;

Step 3: if there is no need to enter a lower-layer URL link of this page, go to step 5; if a lower-layer URL link must be entered, go to step 4;

Step 4: the user selects the blocks containing lower-layer URL links; the visual script editor records the positions of these blocks and stores them in an execution chain, with all position information expressed in CSS or XPath syntax; return to step 3;

Step 5: the user selects the blocks whose content is to be captured, and these are stored in an execution chain;

Step 6: the user confirms that editing is complete;

Step 7: the visual script editor passes the recorded execution chain to a script generator, which produces the crawl script for the target website; for advanced users an additional interface is provided, through which code written to be compatible with the system can be embedded directly in the crawl script;

Step 8: the script is stored in the database.
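The steps above can be sketched roughly as follows. The patent does not disclose the format of the execution chain or of the generated script, so everything here is an assumption: the chain is modelled as a list of dictionaries with hypothetical keys `kind`, `xpath` and `field`, and `generate_script` is a hypothetical stand-in for the script generator. The sketch uses the standard library's `xml.etree.ElementTree`, which only accepts well-formed markup and a limited XPath subset; real pages would need a proper HTML parser.

```python
import xml.etree.ElementTree as ET


def generate_script(chain):
    """Turn an execution chain (a list of step dicts) into a callable crawl script.

    Hypothetical format: each step has a "kind" ("follow" or "extract") and an
    "xpath"; "extract" steps also name the metadata "field" they fill.
    """
    def script(page_source):
        root = ET.fromstring(page_source)
        links, record = [], {}
        for step in chain:
            nodes = root.findall(step["xpath"])
            if step["kind"] == "follow":
                # Lower-layer URL links recorded in step 4.
                links.extend(n.get("href") for n in nodes if n.get("href"))
            elif step["kind"] == "extract":
                # Content blocks recorded in step 5.
                record[step["field"]] = [(n.text or "").strip() for n in nodes]
        return {"links": links, "record": record}

    return script


# Example chain: follow the links in a listing block, extract the page title.
chain = [
    {"kind": "follow", "xpath": ".//div[@class='list']/a"},
    {"kind": "extract", "field": "title", "xpath": ".//h1"},
]
page = (
    "<html><body><h1> Sale </h1><div class='list'>"
    "<a href='/p/1'>one</a><a href='/p/2'>two</a></div></body></html>"
)
result = generate_script(chain)(page)
```

Note that this sketch applies the whole chain to a single page, whereas the editor described above records link selections and content selections level by level; a fuller version would attach each step to the page depth at which it was recorded.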

The present invention has the following advantages. The visual script editor lets non-professional users select the capture regions for the data they want on a target website in an intuitive way, and the user's operations are automatically converted into specific processing scripts. At run time, each distributed processing unit in the crawler system dynamically gives priority to executing these processing scripts. This greatly reduces the labor cost of customizing a crawler and at the same time improves the operating efficiency of the crawler system. The system captures data with high accuracy and has high extensibility and scalability.

Brief description of the drawings

Fig. 1 is a structural diagram of the system of the present invention.

Fig. 2 is a workflow diagram of the system of the present invention.

Fig. 3 is a schematic diagram of the execution structure of the visual script editor of the present invention.

Fig. 4 is a schematic diagram of the workflow of the visual script editor of the present invention.

Fig. 5 is a flowchart of how the execution chain of the present invention operates.

Fig. 6 is a workflow diagram of the content processing module and scripts of the present invention.

Fig. 7 is a structural diagram of one embodiment of the system of the present invention.

Detailed description of the embodiments

Referring to Fig. 1 to Fig. 7, the present invention is a distributed web crawler system based on a visual script editor, the system comprising a visual script editor, a distributed message queue, a task scheduling module, a web page capture module, a content processing module and a result storage module.

The visual script editor is used to view the content of the target website visually and to select the data capture regions of the target website. It converts the user's input (all user operations, from entering the URL link of the target site to completing the edit) into an execution chain plus other optional parameters (for example, whether to use text extraction and whether to simulate a browser), generates the corresponding script from the execution chain, and stores the script in a database. With the visual script editor, users need no programming skills and can view the target website just as they would when browsing the web normally. This script is the script corresponding to the target website.

The configuration management module provides a web interface in which the user can configure the websites to be crawled and, for one website or a series of websites, configure the scheduling policy (for example priority, scheduled crawling and re-crawl interval), the capture policy (retry on error, enable proxy, enable browser simulation, etc.) and other configuration parameters. Together these form the user configuration information.

The distributed message queue decouples the task scheduling module, the web page capture module, the content processing module and the result storage module, which makes a highly distributed deployment possible. The distributed message queue comprises a scheduling queue, a crawl queue, a processing queue and a result queue.

The task scheduling module coordinates the operation of the whole system. It reads the initial URL links of the target website (that is, the website to be processed) and the user's input information, packages them into tasks and puts them into the scheduling queue; it then takes task objects from the scheduling queue, filters out duplicate tasks, and sends the remaining tasks to the crawl queue. The task scheduling module comprises a URL filtering module and a rate manager. The URL filtering module uses a Bloom filter to de-duplicate URL links and so prevents the same URL link from being crawled repeatedly. A Bloom filter consists of a very long binary vector and a series of random mapping functions and is used to test whether an element is in a set. Its advantage is that its space efficiency and query time are far better than those of general algorithms; its drawbacks are a certain false positive rate and the difficulty of deleting elements. Using a Bloom filter greatly improves system efficiency, and its drawbacks do not affect a crawler system, so it is well suited to this use. The rate manager uses a token bucket algorithm to prevent network congestion: it limits the traffic flowing out to the network so that traffic is sent at a uniform rate, which ensures the stability of the system.
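A minimal Bloom filter for URL de-duplication might look like the sketch below. The patent does not specify the vector size or the mapping functions, so `num_bits`, `num_hashes` and the use of SHA-256 with double hashing are illustrative assumptions, not the system's actual parameters.

```python
import hashlib


class BloomFilter:
    """Bit vector plus k hash-derived positions per item (illustrative sketch)."""

    def __init__(self, num_bits=1 << 20, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # the "binary vector"

    def _positions(self, item):
        # Derive k positions from two 64-bit slices of one SHA-256 digest
        # (double hashing); this choice is an assumption, not from the patent.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly seen" (false positives can occur);
        # False means "definitely not seen".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )
```

In a crawler, a false positive only means that an unseen URL is occasionally skipped, and URLs never need to be removed from the set, which is consistent with the paragraph's point that the structure's drawbacks matter little here.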

The web page capture module takes URL links from the crawl queue, automatically detects the encoding of the website, converts the captured content to UTF-8, packages the UTF-8 content together with related information about the website, and forwards the package to the processing queue. The web page capture module comprises a proxy access module and a browser simulation module. As network technology has developed, more and more websites have adopted dynamic page techniques and use large amounts of JavaScript to generate page content. Conventional page capture can only obtain the source code of a page and cannot execute JavaScript, so the complete page of the target site cannot be obtained and data extraction becomes much harder. The proxy access module of the present invention accesses the specified URL links through preset IP proxies according to the user configuration information, which prevents the IP address of the server hosting the web page capture module from being blocked by the target website because of excessive traffic. The browser simulation module uses the open-source WebKit browser engine to parse the target website; it can execute the JavaScript code on the page and generate the complete page of the target website.
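The encoding step could be approximated as in the following sketch. The patent does not say how the encoding is detected, so the approach here (try the charset from the HTTP `Content-Type` header, then a `<meta>` charset declaration, then a list of fallbacks) and the names `to_utf8`, `META_CHARSET` and `fallbacks` are assumptions for illustration.

```python
import re

# Finds charset declarations such as <meta charset="gbk"> or
# <meta http-equiv="Content-Type" content="text/html; charset=gb2312">.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset=["\']?\s*([A-Za-z0-9_\-]+)', re.IGNORECASE
)


def to_utf8(raw, content_type=None, fallbacks=("utf-8", "gb18030")):
    """Decode fetched bytes with the best available charset hint; return UTF-8 bytes."""
    candidates = []
    if content_type:
        m = re.search(r"charset=([\w\-]+)", content_type, re.IGNORECASE)
        if m:
            candidates.append(m.group(1))
    m = META_CHARSET.search(raw[:4096])  # declarations sit near the top of the page
    if m:
        candidates.append(m.group(1).decode("ascii"))
    candidates.extend(fallbacks)
    for name in candidates:
        try:
            text = raw.decode(name)
        except (LookupError, UnicodeDecodeError):
            continue  # unknown codec name or wrong guess: try the next candidate
        return text.encode("utf-8")
    # Last resort: keep the page, replacing undecodable bytes.
    return raw.decode("utf-8", errors="replace").encode("utf-8")
```

Normalizing to UTF-8 at capture time means the scripts run by the content processing module never have to deal with per-site encodings.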

The content processing module takes the web page content of a website from the processing queue. If the URL link of the page matches a predefined URL matching rule (when the user enters the URL link of the target site in the visual script editor, the editor generates a URL matching rule from the conditions the user sets), the script corresponding to the matched rule is called to parse the web page content, and the parsed result is put into the result queue. The result storage module takes result data from the result queue, performs unified processing and screening of the result data according to the system's predefined configuration, and stores it in the database.
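The rule-matching dispatch described here can be sketched as a small registry. The patent does not state the form of a URL matching rule, so regular expressions are assumed; `ScriptRegistry`, `product_script` and `generic_text` are hypothetical names, and `generic_text` is only a crude placeholder for the text extraction module used when no rule matches.

```python
import re


class ScriptRegistry:
    """Maps URL matching rules (assumed to be regexes) to parsing scripts."""

    def __init__(self, fallback):
        self._rules = []           # list of (compiled pattern, script callable)
        self._fallback = fallback  # used when no rule matches the URL

    def register(self, url_pattern, script):
        self._rules.append((re.compile(url_pattern), script))

    def process(self, url, html):
        for pattern, script in self._rules:
            if pattern.search(url):
                return script(url, html)
        return self._fallback(url, html)


def product_script(url, html):
    # Hypothetical generated script: pull prices out of a known page layout.
    prices = re.findall(r'class="price">([^<]+)<', html)
    return {"links": [], "items": [{"url": url, "price": p} for p in prices]}


def generic_text(url, html):
    # Placeholder fallback: strip tags and collapse whitespace.
    text = re.sub(r"<[^>]+>", " ", html)
    return {"links": [], "items": [{"url": url, "text": " ".join(text.split())}]}
```

Returning both `links` and `items` from every script lets the caller route new URL links back to the scheduling queue and result data on to the result queue.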

The system also comprises a monitoring module and a text extraction module. The monitoring module monitors in real time whether any of the four queues in the distributed message queue (the scheduling queue, the crawl queue, the processing queue and the result queue) has failed. When an abnormality is found, it promptly pushes a message to the user interface of the system, prompting the user to check the cause of the error and to decide whether to re-enter the script.

When the domain name of a web page does not match any script in the database, the text extraction module is called to extract the corresponding content of the page. The text extraction module uses a text extraction method based on visual web page block analysis.

In the present invention, calling the script to perform parsing proceeds as follows: if parsing produces new URL links, the new URL links are put into the scheduling queue and the task scheduling module is executed again; if parsing produces result data, the parsed result data is put into the result queue.

The execution chain comprises several sub-parameters, each of which can take one of several forms: a lower-layer URL link selection rule, a metadata selection marker (in a form such as a CSS or XPath selector), or script code executable by the system.

As shown in Figs. 3, 4 and 5, the visual script editor works as follows:

Step 4: the user selects the blocks containing lower-layer URL links; the visual script editor records the positions of these blocks and stores them in an execution chain (for the specific operations, see Fig. 5), with all position information expressed in CSS or XPath syntax; return to step 3;

Step 6: the user confirms that editing is complete;

Step 8: the script is stored in the database.

Fig. 2 is the workflow diagram of the system of the present invention. The workflow is as follows:

(1) The task scheduling module accesses the configuration management module, reads the initial URL links and the user configuration information, packages them into tasks and puts them into the scheduling queue.

(2) The task scheduling module takes a task object from the scheduling queue and queries the URL filtering module. If the URL link of the task has not been visited, the task is sent directly to the crawl queue. If it has been visited, the parameters set by the user (such as the re-crawl interval) are checked; if a repeat visit is allowed, the task is also sent to the crawl queue, otherwise the task is discarded. In this way only de-duplicated tasks reach the crawl queue.

(3) The web page capture module takes a URL link from the crawl queue, performs the capture, automatically detects the encoding of the website, converts the captured content to UTF-8, packages it with related information about the website, and forwards the package to the processing queue.

(4) The content processing module takes web page content from the processing queue. If information such as the domain name of the page matches a script predefined by the user (that is, a script in the database), that script is called to parse the page. If processing produces new URL links, these links are put into the scheduling queue and step (2) is entered again; if it produces result data, the data is put into the result queue.

(5) The result storage module takes results from the result queue, performs final unified processing according to the predefined configuration, and stores them in the database.

(6) Steps (2) to (5) are repeated until the system receives a stop command.
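The workflow in steps (2) to (5) can be simulated in a single process to show how tasks move between the four queues. This is only a sketch under simplifying assumptions: in-memory `deque` objects stand in for the distributed message queue, a plain `set` stands in for the URL filtering module, re-crawl checks are omitted, and `fetch` and `parse` are caller-supplied stand-ins for the web page capture module and the generated scripts. The function name `run_pipeline` and the `max_tasks` limit are hypothetical.

```python
from collections import deque


def run_pipeline(seed_urls, fetch, parse, max_tasks=100):
    """Move tasks through scheduling, crawl, processing and result queues."""
    scheduling, crawl = deque(seed_urls), deque()
    processing, results = deque(), deque()
    seen, stored = set(), []   # `stored` stands in for the database
    handled = 0
    while (scheduling or crawl or processing or results) and handled < max_tasks:
        if scheduling:                       # step (2): de-duplicate, then schedule
            url = scheduling.popleft()
            if url not in seen:
                seen.add(url)
                crawl.append(url)
        if crawl:                            # step (3): capture the page
            url = crawl.popleft()
            processing.append((url, fetch(url)))
            handled += 1
        if processing:                       # step (4): parse with the script
            url, page = processing.popleft()
            links, items = parse(url, page)
            scheduling.extend(links)         # new URL links go back to scheduling
            results.extend(items)            # result data goes to the result queue
        while results:                       # step (5): store results
            stored.append(results.popleft())
    return stored


# Toy site: each "page" is already a (links, items) pair, so parse is trivial.
site = {"/": (["/a", "/b", "/a"], []), "/a": (["/"], ["item-a"]), "/b": ([], ["item-b"])}
stored = run_pipeline(["/"], fetch=lambda url: site[url], parse=lambda url, page: page)
```

In a real deployment each `if` branch would be a separate worker process reading from and writing to a message broker, which is what allows the modules to be scaled independently.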

Fig. 7 is a structural diagram of one embodiment of the system of the present invention. Each module of the present invention can be deployed as multiple instances on a single machine, as a single instance on each of multiple machines, or as multiple instances on multiple machines. In other words, the system of the present invention supports distributed deployment.

In addition, the data objects that the system sends to the message queues are collectively called task objects. A task object comprises: (1) content (a URL link, web page content, result data, etc., depending on the message queue); (2) configuration parameters; (3) a status identifier.

In practice, a module always takes a task object from a message queue first and then takes the relevant information from the task object.
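A task object with these three parts could be modelled as below. The patent lists the parts but not their types or the set of status values, so the field names, the `Status` values and the `use_proxy` configuration key are all hypothetical, and a standard-library `queue.Queue` stands in for one of the message queues.

```python
import queue
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict


class Status(Enum):
    # Hypothetical status values; the patent only says a status identifier exists.
    PENDING = "pending"
    FETCHED = "fetched"
    PARSED = "parsed"
    FAILED = "failed"


@dataclass
class TaskObject:
    content: Any                                            # URL, page content or result data
    config: Dict[str, Any] = field(default_factory=dict)    # configuration parameters
    status: Status = Status.PENDING                         # status identifier


# A module first takes the task object from the queue, then reads its fields.
crawl_queue = queue.Queue()
crawl_queue.put(TaskObject(content="http://example.com/", config={"use_proxy": True}))
task = crawl_queue.get()
url, use_proxy = task.content, task.config.get("use_proxy", False)
```

Carrying the configuration inside each task object means a worker needs no shared state beyond the queue itself to decide how to handle a task.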

It should be noted that, in the present invention, the task scheduling module, the web page capture module, the content processing module and the result storage module can each run as multiple instances on multiple servers. Because they are decoupled through the message queues, instances of any type can be stopped or added at any time. This design maximizes the extensibility and scalability of the system.

In summary, based on input that the user provides through a visual interface, the system of the present invention automatically generates a metadata extraction script, so that it can identify the structure of a target site and capture specific data efficiently. The task scheduling module creates and assigns tasks, the web page capture module fetches the pages, the content processing module invokes the corresponding script to convert each page into a metadata set, and, after final unified processing, the result is stored by the result storage module. The present invention can significantly improve crawling efficiency for data on specific sites, reduce the user's workload and save system resources. It also has good extensibility and scalability and is suitable for all types of Internet sites.

The foregoing is only a preferred embodiment of the present invention. All equivalent changes and modifications made within the scope of the claims of the present application shall fall within the scope of the present invention.

Claims

1. A distributed web crawler system based on a visual script editor, characterized in that the system comprises: a visual script editor, a distributed message queue, a task scheduling module, a web page capture module, a content processing module and a result storage module;

the content processing module is configured to take the web page content of a website from the processing queue and, if the URL link of the page matches a predefined URL matching rule, to call the script corresponding to the matched URL matching rule to parse the web page content of the website, and to put the parsed result into the result queue;

2. The distributed web crawler system based on a visual script editor according to claim 1, characterized in that the system further comprises a monitoring module; the monitoring module monitors in real time whether any of the four queues in the distributed message queue (the scheduling queue, the crawl queue, the processing queue and the result queue) has failed and, when an abnormality is found, promptly pushes a message to the user interface of the system, prompting the user to check the cause of the error and to decide whether to re-enter the script.

3. The distributed web crawler system based on a visual script editor according to claim 1, characterized in that the system further comprises a text extraction module; when the domain name of a web page does not match any script in the database, the text extraction module is called to extract the corresponding content of the page, and the text extraction module uses a text extraction method based on visual web page block analysis.

4. The distributed web crawler system based on a visual script editor according to claim 1, characterized in that calling the script to perform parsing comprises: if parsing produces new URL links, putting the new URL links into the scheduling queue and executing the task scheduling module again; if parsing produces result data, putting the parsed result data into the result queue.

5. The distributed web crawler system based on a visual script editor according to claim 1, characterized in that the task scheduling module comprises a URL filtering module and a rate manager; the URL filtering module uses a Bloom filter to de-duplicate URL links and so prevents the same URL link from being crawled repeatedly, the Bloom filter consisting of a binary vector and a series of random mapping functions and being used to test whether an element is in a set; the rate manager uses a token bucket algorithm to prevent network congestion, limiting the traffic flowing out to the network so that traffic is sent at a uniform rate, which ensures the stability of the system.

6. The distributed web crawler system based on a visual script editor according to claim 1, characterized in that the web page capture module comprises a proxy access module and a browser simulation module; the proxy access module accesses the specified URL links through preset IP proxies according to the user configuration information, which prevents the IP address of the server hosting the web page capture module from being blocked by the target website because of excessive traffic; the browser simulation module uses the open-source WebKit browser engine to parse the target website, can execute the JavaScript code on the page, and generates the complete page of the target website.

7. The distributed web crawler system based on a visual script editor according to claim 1, characterized in that the execution chain comprises several sub-parameters, each of which can take one of several forms: a lower-layer URL link selection rule, a metadata selection marker, or script code executable by the system.

8. The distributed web crawler system based on a visual script editor according to claim 1, characterized in that the visual script editor works as follows:

Step 6: the user confirms that editing is complete;

Step 8: the script is stored in the database.