CN105243159A - Visual script editor-based distributed web crawler system - Google Patents
- Publication number
- CN105243159A (application number CN201510713985.7A)
- Authority
- CN
- China
- Prior art keywords
- module
- queue
- script
- webpage
- url link
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
- G06F16/986—Document structures and storage, e.g. HTML extensions
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention provides a distributed web crawler system based on a visual script editor. The system comprises a visual script editor, a distributed message queue, a task scheduling module, a web page fetching module, a content processing module, and a result storage module. Based on input the user makes through a visual interface, the system automatically generates a metadata extraction script, so that it can recognize the structure of a target site and fetch specific data efficiently. The task scheduling module creates and dispatches tasks; the web page fetching module fetches the pages; the content processing module invokes the corresponding script to convert each page into a metadata set; and, after unified processing, the metadata set is stored by the result storage module. The system can greatly improve crawling efficiency for the data of specific sites, reduce the user's workload, and save system resources; it has good extensibility and scalability and is applicable to all types of Internet sites.
Description
Technical field
The present invention relates to the technical field of network communication, and in particular to a distributed web crawler system based on a visual script editor.
Background
Since the birth of the Internet at the end of the 20th century, Internet information has grown explosively, and the Internet has become a huge, widely distributed, highly heterogeneous, semi-structured, and highly dynamic information repository. Web crawlers emerged to extract and collect the data people are interested in from this information. Crawler technology has since developed rapidly and became the foundation of search engine giants at home and abroad, such as Baidu and Google, opening a window onto information for the general public.
Today, Internet information is provided mainly in the form of websites and web services. A website consists of many kinds of web pages, and the data it provides is mostly presented as unstructured static HTML (HyperText Markup Language). Because an information analysis system cannot use HTML directly, secondary processing is usually needed to extract useful information. A web service, by contrast, is a relatively standardized data interface from which data can be obtained by access with specific parameters; a web service can exist independently or be combined with a website. How to obtain specific information efficiently and accurately from a large number of specific websites or web services is attracting increasing attention. This poses a great challenge for the web crawler technology responsible for collecting network information.
Web crawlers have gone through many generations of development, and several system models have taken shape. Mature crawler designs exist at home and abroad and are in use, but most of them offer only a general-purpose service to public users; they cannot be customized for specific data on specific sites and cannot accommodate the varied needs of individual users.
In the Internet field, the following mainstream crawler designs currently exist:
1. Traditional crawler system
A traditional crawler system requires professional software programmers to analyze the target website's page organization, its data interfaces, and the JavaScript logic on its pages, and then to write program code or scripts that filter out specific data according to certain rules. The clear advantage of this method is that the required data can be extracted accurately from the target site.
However, this method has a serious drawback and is generally adopted only when the number of target sites is very limited. The reason is that the HTML used by Internet sites follows no fixed authoring convention, so a script must be written for every target site; moreover, more and more websites load content dynamically, which makes the scripts much harder to write. When a site redesign is detected, the script must be adjusted promptly and the crawler redeployed. This greatly raises the labor cost of development and maintenance. In addition, because of its complexity, this model has poor extensibility and scalability and is ill-suited to large-scale distributed deployment.
2. General-purpose distributed crawler system
A general-purpose distributed crawler system consists mainly of three basic parts: scheduling (control), fetching, and content processing. Most current Internet search engines work this way. For example, the prior art discloses a "topic-related distributed web crawler system"; see Chinese patent publication CN102646129A, published 2012-08-22. That system comprises: a topic link store, for storing hyperlinks the system has not yet fetched; a control node, for extracting hyperlinks from the topic link store, removing those the system has already fetched, distributing the remaining hyperlinks to crawl nodes, and controlling whether the system stops running; crawl nodes, for receiving the hyperlinks distributed by the control node, downloading the web pages they identify, and storing the pages in a web page database; a web page database, for holding the pages fetched by the crawl nodes; and a page analyzer, for periodically reading from the web page database the latest pages downloaded by the crawl nodes, analyzing their content, computing the topic relevance of each page and of the hyperlinks it contains, storing the relevant hyperlinks in the topic link store according to topic relevance, and storing the topic relevance of each page in the web page database. That invention follows exactly this model. Crawler systems of this kind focus mainly on URL filtering and on analysis of page topics, and their content processing part essentially always uses a body-text extraction module.
Body-text extraction modules fall roughly into four categories: (1) extraction algorithms based on tag density; (2) extraction algorithms based on tag usage; (3) body-text extraction methods based on machine learning; and (4) body-text extraction based on vision-based page block analysis. Whichever algorithm is used, however, it can only extract the main content of a page, such as its body text, and cannot guarantee the accuracy of the extracted data. These methods suit distributed crawler systems well, but they are limited by the algorithms they depend on: they are suited only to broad, large-scale crawling of loosely specified data and are inherently deficient when crawling specific data. To maximize generality they sacrifice customizability, so they can extract text from a page but cannot isolate metadata of a particular type from that text, for example the price of a product on an e-commerce page or the drug specification on an online pharmacy page. In addition, most body-text extraction algorithms are relatively complex; used at scale, they consume more system resources than customized scripts and degrade crawler performance.
Summary of the invention
The technical problem to be solved by the present invention is to provide a distributed web crawler system based on a visual script editor that can perform efficient, customized crawling of a large number of specific sites while remaining compatible with crawling of general websites, thereby remedying the defects of the prior art, reducing the user's workload, and saving system resources.
The present invention is realized as follows: a distributed web crawler system based on a visual script editor, the system comprising a visual script editor, a distributed message queue, a task scheduling module, a web page fetching module, a content processing module, and a result storage module;
The visual script editor is used to view a target website and to select the data capture regions of the target website; it converts the user's input into an execution chain, generates a corresponding script from the execution chain, and stores the script in a database; this script is the script corresponding to the target website;
The distributed message queue is used to decouple the task scheduling module, the web page fetching module, the content processing module, and the result storage module; it comprises a scheduling queue, a fetch queue, a processing queue, and a result queue;
The task scheduling module is responsible for coordinating the operation of the whole system: it reads the initial URL of the target website and the user's input, packages them into a task, and puts the task into the scheduling queue; it then takes task objects from the scheduling queue, filters out duplicate tasks, and sends the remaining tasks to the fetch queue;
The web page fetching module takes a URL from the fetch queue, automatically detects the site's character encoding, converts the fetched content to UTF-8, packages the UTF-8 content together with related site information, and forwards the package to the processing queue;
The content processing module takes web page content from the processing queue and matches the page's URL against the URL matching rules generated by the visual script editor; if a match is found, it calls the script corresponding to the matching rule to parse the page content, and puts the parsed result into the result queue;
The result storage module takes result data from the result queue, applies unified processing and screening to it according to the system's predefined configuration, and then stores it in the database.
Further, the system also comprises a monitoring module, which monitors in real time whether any of the four queues of the distributed message queue (the scheduling queue, fetch queue, processing queue, and result queue) has encountered an error; when an anomaly is found, it promptly pushes a message to the system's user interface, prompting the user to check the cause of the error and to decide whether to re-enter the script.
Further, the system also comprises a body-text extraction module: when the domain name of a web page matches no script in the database, the body-text extraction module is called to extract the corresponding body text of the page, using a body-text extraction method based on vision-based page block analysis.
Further, when the script is called for parsing: if parsing yields new URLs, the new URLs are put into the scheduling queue and handled again by the task scheduling module; if parsing yields result data, the result data is put into the result queue.
Further, the task scheduling module comprises a URL filter module and a rate manager. The URL filter module uses a Bloom filter to de-duplicate URLs and so prevents the same URL from being crawled repeatedly; a Bloom filter consists of a binary vector and a series of random mapping functions and is used to test whether an element is in a set. The rate manager uses a token bucket algorithm to prevent network congestion: it limits the traffic flowing out to the network so that traffic is sent at an even rate, ensuring the stability of the system.
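By way of illustration only, the URL de-duplication performed by the URL filter module can be sketched in Python as follows. The names `BloomFilter` and `filter_new_urls`, the bit-array size, the number of hash functions, and the use of SHA-256 with double hashing are all assumptions made for this sketch; the patent does not specify them.

```python
import hashlib


class BloomFilter:
    """Bit array plus k hash positions; answers 'possibly seen' or 'definitely not seen'."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5) -> None:
        self.num_bits = num_bits          # assumed size, tune for the expected URL count
        self.num_hashes = num_hashes      # assumed number of hash functions
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k bit positions from two halves of one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def filter_new_urls(urls, seen: BloomFilter):
    """Return only the URLs not yet recorded in the filter, recording each as it passes."""
    fresh = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh
```

Because a Bloom filter can report a false positive, a small fraction of unseen URLs may be skipped; it never reports a URL that was added as unseen.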
Further, the web page fetching module comprises a proxy access module and a browser simulation module. The proxy access module accesses a specified URL through preset IP proxies according to the user configuration information, preventing the IP address of the server hosting the web page fetching module from being blocked by the target website because of excessive traffic. The browser simulation module uses the open-source WebKit browser engine to render the target website; it can execute the JavaScript code on a page and so generate the complete page of the target website.
Further, the execution chain comprises several sub-parameters, each of which may take one of several forms: a lower-level URL selection rule, a metadata selection marker, or script code executable by the system.
Further, the visual script editor operates as follows:
Step 1: enter the URL of the target website in the visual script editor interface;
Step 2: the visual script editor displays the web page content at the target URL;
Step 3: if the lower-level URLs of this page do not need to be entered, go to step 5; if they do, go to step 4;
Step 4: select the blocks containing the lower-level URLs; the visual script editor records the positions of these blocks and stores them in an execution chain, with all position information expressed in CSS or XPath syntax; return to step 3;
Step 5: select the blocks whose content is to be captured, and store them in the execution chain;
Step 6: the user confirms that editing is complete;
Step 7: the visual script editor passes the recorded execution chain to a script generator, which produces the crawl script for the target website; for advanced users an additional interface is provided, through which code written to be compatible with the system can be embedded directly in the crawl script;
Step 8: store the script in the database.
The present invention has the following advantages: the system's visual script editor lets non-professional users select the capture regions for a target website's data intuitively; the user's operations are automatically converted into specific processing scripts, which the distributed processing units of the crawler system load dynamically at run time and execute with priority. This greatly reduces the labor cost of customizing a crawler and at the same time improves the crawler system's operating efficiency. The system fetches data with high accuracy and has high extensibility and scalability.
Brief description of the drawings
Fig. 1 is a structural diagram of the system of the present invention.
Fig. 2 is a workflow diagram of the system of the present invention.
Fig. 3 is a schematic diagram of the execution structure of the visual script editor of the present invention.
Fig. 4 is a workflow diagram of the visual script editor of the present invention.
Fig. 5 is a flow chart of how the execution chain of the present invention operates.
Fig. 6 is a workflow diagram of the content processing module and scripts of the present invention.
Fig. 7 is a structural diagram of one embodiment of the system of the present invention.
Detailed description of the embodiments
Referring to Figs. 1 to 7, a distributed web crawler system based on a visual script editor according to the present invention comprises a visual script editor, a distributed message queue, a task scheduling module, a web page fetching module, a content processing module, and a result storage module;
The visual script editor is used to view the content of a target website visually and to select the data capture regions of the target website. It converts the user's input (all user operations from entering the target site's URL through completing the edit) into an execution chain plus other optional parameters (for example, whether to use body-text extraction and whether to simulate a browser), generates a corresponding script from the execution chain, and stores the script in a database. With the visual script editor, the user needs no programming skill and can view the target website just as in ordinary web browsing. This script is the script corresponding to the target website;
A configuration management module provides a web interface in which the user can configure the websites to be crawled and, for one website or a series of websites, set a scheduling policy (e.g., priority, scheduled crawling, re-crawl interval), a fetch policy (retry on error, enabling proxies, enabling access-device simulation, etc.), and other configuration parameters, which together form the user configuration information.
The distributed message queue is used to decouple the task scheduling module, the web page fetching module, the content processing module, and the result storage module, which gives the system a high capacity for distributed deployment. It comprises a scheduling queue, a fetch queue, a processing queue, and a result queue;
The task scheduling module is responsible for coordinating the operation of the whole system: it reads the initial URL of the target website (that is, the website to be processed) and the user's input, packages them into a task, and puts the task into the scheduling queue; it then takes task objects from the scheduling queue, filters out duplicate tasks, and sends the remaining tasks to the fetch queue. The task scheduling module comprises a URL filter module and a rate manager. The URL filter module uses a Bloom filter to de-duplicate URLs and so prevents the same URL from being crawled repeatedly. A Bloom filter consists of a very long binary vector and a series of random mapping functions and is used to test whether an element is in a set. Its advantage is that its space efficiency and query time far surpass those of ordinary algorithms; its drawbacks are a certain false-positive rate and the difficulty of deleting elements. Using a Bloom filter greatly improves system efficiency, and its drawbacks have little practical impact on a crawler system, which makes it well suited to crawlers. The rate manager uses a token bucket algorithm to prevent network congestion: it limits the traffic flowing out to the network so that traffic is sent at an even rate, ensuring the stability of the system.
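As a non-limiting illustration, the rate manager's token bucket can be sketched in Python as follows. The class name `TokenBucket`, its parameters, and the injectable `clock` are assumptions made for this sketch. Note that a token bucket bounds the average sending rate while still permitting bursts of up to `capacity` requests; setting `capacity` to 1 approximates evenly spaced sending.

```python
import time


class TokenBucket:
    """Tokens refill at `rate` per second up to `capacity`; each request spends tokens."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic) -> None:
        self.rate = rate              # tokens added per second
        self.capacity = capacity      # maximum burst size
        self.clock = clock            # injectable so the bucket can be tested without sleeping
        self.tokens = capacity        # start with a full bucket
        self.last = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Return True and spend `cost` tokens if enough are available, else False."""
        now = self.clock()
        # Refill in proportion to the elapsed time, never exceeding the capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A fetch worker would call `try_acquire()` before each outgoing request and, on `False`, wait or return the task to its queue.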
The web page fetching module takes a URL from the fetch queue, automatically detects the site's character encoding, converts the fetched content to UTF-8, packages the UTF-8 content together with related site information, and forwards the package to the processing queue. The web page fetching module comprises a proxy access module and a browser simulation module. With the development of network technology, more and more websites use dynamic page techniques and rely on large amounts of JavaScript to generate page content, whereas conventional fetching obtains only the page source and cannot execute JavaScript; the complete page of the target site therefore cannot be obtained, and data extraction becomes far more difficult. The proxy access module of the present invention accesses a specified URL through preset IP proxies according to the user configuration information, preventing the IP address of the server hosting the web page fetching module from being blocked by the target website because of excessive traffic. The browser simulation module uses the open-source WebKit browser engine to render the target website; it can execute the JavaScript code on a page and so generate the complete page of the target website.
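The encoding step of the web page fetching module can be sketched as follows, purely as an example. The patent does not state how the encoding is detected; this sketch assumes the charset is taken from the HTTP `Content-Type` header or from a `<meta>` declaration near the top of the page, and falls back to UTF-8 with replacement characters. The function names `detect_encoding` and `to_utf8` are hypothetical.

```python
import re

# Matches both <meta charset="gbk"> and <meta http-equiv=... content="...; charset=gbk">.
_META_CHARSET = re.compile(rb"""<meta[^>]+charset=["']?([\w-]+)""", re.IGNORECASE)


def detect_encoding(raw: bytes, content_type: str = "") -> str:
    """Guess the page encoding from the Content-Type header, then from a <meta> tag."""
    match = re.search(r"charset=([\w-]+)", content_type, re.IGNORECASE)
    if match:
        return match.group(1)
    match = _META_CHARSET.search(raw[:2048])   # declarations appear early in the document
    if match:
        return match.group(1).decode("ascii")
    return "utf-8"                             # assumed default when nothing is declared


def to_utf8(raw: bytes, content_type: str = "") -> bytes:
    """Decode the fetched bytes with the detected encoding and re-encode them as UTF-8."""
    encoding = detect_encoding(raw, content_type)
    try:
        text = raw.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name or wrong declaration: keep going with replacement characters.
        text = raw.decode("utf-8", errors="replace")
    return text.encode("utf-8")
```

A production fetcher would likely add statistical detection for pages that declare no charset or declare the wrong one.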
The content processing module takes web page content from the processing queue; if the page's URL matches a predefined URL matching rule (the rule is generated by the visual editor from the target site URL that the user entered beforehand in the visual script editor and from the conditions the user set), it calls the script associated with the matching rule to parse the page content, and puts the parsed result into the result queue. The result storage module takes result data from the result queue, applies unified processing and screening to it according to the system's predefined configuration, and then stores it in the database.
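One possible way to realize the rule matching of the content processing module is sketched below. It assumes that each URL matching rule is a regular expression and that each script is a Python callable taking the page content; neither assumption comes from the patent. The `fallback` argument stands in for the body-text extraction module that handles pages for which no script matches.

```python
import re
from typing import Callable, Optional

Script = Callable[[str], dict]   # a script maps page content to extracted data


class ScriptRegistry:
    """Holds (URL matching rule, script) pairs, checked in registration order."""

    def __init__(self) -> None:
        self._rules = []

    def register(self, url_pattern: str, script: Script) -> None:
        self._rules.append((re.compile(url_pattern), script))

    def find(self, url: str) -> Optional[Script]:
        for pattern, script in self._rules:
            if pattern.search(url):
                return script
        return None


def process_page(registry: ScriptRegistry, url: str, html: str, fallback: Script) -> dict:
    """Run the script whose rule matches the URL, or the fallback extractor if none does."""
    script = registry.find(url) or fallback
    return script(html)
```

In a deployment the registry would be loaded from the script database rather than populated in code.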
The system also comprises a monitoring module and a body-text extraction module. The monitoring module monitors in real time whether any of the four queues of the distributed message queue (the scheduling queue, fetch queue, processing queue, and result queue) has encountered an error; when an anomaly is found, it promptly pushes a message to the system's user interface, prompting the user to check the cause of the error and to decide whether to re-enter the script.
When the domain name of a web page matches no script in the database, the body-text extraction module is called to extract the corresponding body text of the page, using a body-text extraction method based on vision-based page block analysis.
In the present invention, when the script is called for parsing: if parsing yields new URLs, the new URLs are put into the scheduling queue and handled again by the task scheduling module; if parsing yields result data, the result data is put into the result queue.
The execution chain comprises several sub-parameters, each of which may take one of several forms: a lower-level URL selection rule, a metadata selection marker (in a form such as a CSS or XPath selector), or script code executable by the system.
As shown in Figs. 3, 4, and 5, the visual script editor operates as follows:
Step 1: enter the URL of the target website in the visual script editor interface;
Step 2: the visual script editor displays the web page content at the target URL;
Step 3: if the lower-level URLs of this page do not need to be entered, go to step 5; if they do, go to step 4;
Step 4: select the blocks containing the lower-level URLs; the visual script editor records the positions of these blocks and stores them in an execution chain (see Fig. 5 for the specific operations), with all position information expressed in CSS or XPath syntax; return to step 3;
Step 5: select the blocks whose content is to be captured, and store them in the execution chain;
Step 6: the user confirms that editing is complete;
Step 7: the visual script editor passes the recorded execution chain to a script generator, which produces the crawl script for the target website; for advanced users an additional interface is provided, through which code written to be compatible with the system can be embedded directly in the crawl script;
Step 8: store the script in the database.
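Steps 4 to 8 above can be illustrated with the following minimal sketch. It assumes that the generated script is a JSON document and that block positions are recorded as XPath expressions; the names `ChainStep`, `generate_script`, and `run_script` are hypothetical. For self-containment the sketch evaluates the expressions with Python's `xml.etree.ElementTree`, which accepts only well-formed markup and a limited XPath subset, so a real system would use a full HTML parser instead.

```python
import json
import xml.etree.ElementTree as ET
from dataclasses import asdict, dataclass


@dataclass
class ChainStep:
    kind: str    # "follow" = block of lower-level links, "extract" = block of content
    name: str    # label chosen by the user for this block
    xpath: str   # block position as recorded by the editor


def generate_script(url_rule: str, chain: list) -> str:
    """Serialize the recorded execution chain into a storable script (here, JSON)."""
    return json.dumps({"url_rule": url_rule, "steps": [asdict(step) for step in chain]})


def run_script(script: str, page: str) -> dict:
    """Interpret a generated script against one well-formed page."""
    root = ET.fromstring(page)
    links, fields = [], {}
    for step in json.loads(script)["steps"]:
        nodes = root.findall(step["xpath"])
        if step["kind"] == "follow":
            # Lower-level URLs go back to the scheduler.
            links += [node.get("href") for node in nodes if node.get("href")]
        else:
            # Captured content becomes one metadata field.
            fields[step["name"]] = [(node.text or "").strip() for node in nodes]
    return {"links": links, "fields": fields}
```

Storing the script as data rather than as generated source code is a design choice of this sketch; it keeps the stored script easy to inspect, while the patent additionally allows advanced users to embed executable code.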
Fig. 2 shows the workflow of the system of the present invention, which is as follows:
(1) The task scheduling module accesses the configuration management module, reads the initial URL and the user configuration information, packages them into a task, and puts the task into the scheduling queue.
(2) The task scheduling module takes a task object from the scheduling queue and queries the URL filter module. If the task's URL has not been visited, the task is sent directly to the fetch queue. If it has been visited, the parameters set by the user (re-crawl interval, etc.) are checked; if a revisit is allowed, the task is likewise sent to the fetch queue, and otherwise it is discarded. In this way only de-duplicated tasks reach the fetch queue.
(3) The web page fetching module takes a URL from the fetch queue, performs the fetch, automatically detects the site's encoding, converts the fetched content to UTF-8, packages it with related site information, and forwards it to the processing queue.
(4) The content processing module takes web page content from the processing queue. If information such as the page's domain name matches a script predefined by the user (i.e., a script in the database), that script is called to parse the page. If processing yields new URLs, these are put into the scheduling queue and re-enter step (2); if it yields result data, the data is put into the result queue.
(5) The result storage module takes results from the result queue, performs final unified processing according to the predefined configuration, and stores them in the database.
(6) Steps (2) to (5) are repeated until the system receives a stop command.
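The workflow of steps (1) to (6) can be sketched in a single process as follows. This is only an illustration under stated assumptions: in-memory `queue.Queue` objects stand in for the distributed message queue, a dictionary stands in for the URL filter module, a list stands in for the database, and `fetch` and `parse` are caller-supplied placeholders for the fetching module and the site scripts. In the system described here each stage would run as separate, independently scalable instances.

```python
import queue
import time


def run_pipeline(seeds, fetch, parse, recrawl_interval=None, now=time.time):
    """Drive seed URLs through the scheduling, fetch, processing, and result queues."""
    scheduling, fetching, processing, results = (queue.Queue() for _ in range(4))
    last_seen = {}   # URL -> time of last visit (stands in for the URL filter module)
    stored = []      # stands in for the result database

    for url in seeds:                                    # step (1): package seed tasks
        scheduling.put({"url": url})

    while not scheduling.empty():                        # step (6): loop until drained
        task = scheduling.get()                          # step (2): filter duplicates
        url = task["url"]
        seen_at = last_seen.get(url)
        if seen_at is not None and (
            recrawl_interval is None or now() - seen_at < recrawl_interval
        ):
            continue                                     # visited and revisit not allowed
        last_seen[url] = now()
        fetching.put(task)

        task = fetching.get()                            # step (3): fetch the page
        task["content"] = fetch(task["url"])
        processing.put(task)

        task = processing.get()                          # step (4): parse with a script
        new_urls, records = parse(task["url"], task["content"])
        for new_url in new_urls:
            scheduling.put({"url": new_url})             # new links re-enter step (2)
        for record in records:
            results.put(record)

        while not results.empty():                       # step (5): store the results
            stored.append(results.get())
    return stored
```

Here the loop ends when the scheduling queue is empty, whereas the described system runs until it receives a stop command.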
Fig. 7 is a structural diagram of one embodiment of the system of the present invention. Each module of the present invention can be deployed as multiple instances on a single machine, single instances on multiple machines, or multiple instances on multiple machines; that is, the system can be deployed in a distributed manner.
In addition, the data objects that the system sends to the message queues are collectively called task objects. A task object comprises: (1) content (a URL, web page content, result data, etc., depending on the message queue); (2) configuration parameters; and (3) a status identifier.
In practice, a module always first takes a task object from a message queue and then takes the relevant information from the task object.
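A task object of this kind might be represented as in the following sketch. The field names, the `"pending"` default status, and the JSON wire format are assumptions made for illustration; the patent specifies only the three parts (content, configuration parameters, status identifier).

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class TaskObject:
    content: str                                  # URL, page content, or result data
    config: dict = field(default_factory=dict)    # configuration parameters
    status: str = "pending"                       # status identifier (assumed values)

    def to_message(self) -> bytes:
        """Serialize the task for transport over a message queue."""
        return json.dumps(asdict(self), ensure_ascii=False).encode("utf-8")

    @classmethod
    def from_message(cls, message: bytes) -> "TaskObject":
        """Rebuild a task object from a message taken off a queue."""
        return cls(**json.loads(message.decode("utf-8")))
```

Carrying the configuration inside each task lets any module instance process any message without consulting shared state.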
It should be noted that, in the present invention, the task scheduling module, the web page fetching module, the content processing module, and the result storage module can each run multiple instances on multiple servers. Because they are decoupled by the message queue, instances of any type can be stopped or added at any time. This design maximizes the system's extensibility and scalability.
In summary, based on input the user makes through a visual interface, the system of the present invention automatically generates a metadata extraction script, so that it can recognize the structure of a target site and fetch specific data efficiently. The task scheduling module creates and dispatches tasks, the web page fetching module fetches the pages, the content processing module invokes the corresponding script to convert each page into a metadata set, and, after final unified processing, the result storage module stores the result. The present invention can greatly improve crawling efficiency for the data of specific sites, reduce the user's workload, and save system resources; it has good extensibility and scalability and is applicable to all types of Internet sites.
The foregoing describes only preferred embodiments of the present invention; all equivalent changes and modifications made within the scope of the claims of the present application shall fall within the scope of the present invention.
Claims (8)
1. A distributed web crawler system based on a visual script editor, characterized in that the system comprises: a visual script editor, a distributed message queue, a task scheduling module, a web page fetching module, a content processing module, and a result storage module;
the visual script editor being used to view a target website and to select the data capture regions of the target website, to convert the user's input into an execution chain, to generate a corresponding script from the execution chain, and to store the script in a database, this script being the script corresponding to the target website;
the distributed message queue being used to decouple the task scheduling module, the web page fetching module, the content processing module, and the result storage module, and comprising a scheduling queue, a fetch queue, a processing queue, and a result queue;
the task scheduling module being responsible for coordinating the operation of the whole system, reading the initial URL of the target website and the user's input, packaging them into a task, and putting the task into the scheduling queue, and taking task objects from the scheduling queue, filtering out duplicate tasks, and sending the remaining tasks to the fetch queue;
the web page fetching module being used to take a URL from the fetch queue, automatically detect the site's character encoding, convert the fetched content to UTF-8, package the UTF-8 content together with related site information, and forward the package to the processing queue;
the content processing module being used to take web page content from the processing queue and, if the page's URL matches a predefined URL matching rule, to call the script corresponding to the matching rule to parse the page content and to put the parsed result into the result queue;
the result storage module being used to take result data from the result queue, to apply unified processing and screening to the result data according to the system's predefined configuration, and then to store it in the database.
2. a kind of distributed network crawler system based on visualization script editing machine according to claim 1, it is characterized in that: described system also comprises monitoring module, scheduling queue in the real-time monitoring distributed message queue of described monitoring module, capture queue, processing queue, whether result queue's four queues make mistakes, when an abnormality is discovered, timely PUSH message to the user interface of system, reminding user inspection make mistakes reason and whether re-start script input.
3. a kind of distributed network crawler system based on visualization script editing machine according to claim 1, it is characterized in that: described system also comprises text extraction module, when the domain name coupling of webpage is less than during with script in described database, call described text extraction module, carry out the corresponding script of extraction webpage, described text extraction module uses the text extracting mode of view-based access control model web page blocks analytical technology to extract.
4. The distributed web crawler system based on a visual script editor according to claim 1, characterized in that, when the script is called to parse the web page content: if parsing produces a new URL link, the new URL link is passed to the scheduling queue and the task scheduling module is executed again; if parsing produces result data, the parsed result data is passed to the result queue.
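The two branches of claim 4 amount to a routing decision on each item the parsing script returns. A minimal sketch follows, assuming two hypothetical wrapper types, `NewLink` and `Record`, to tell the cases apart. The patent does not say how a new URL link is distinguished from result data.

```python
import queue
from dataclasses import dataclass


@dataclass
class NewLink:
    """A URL discovered while parsing (hypothetical type)."""
    url: str


@dataclass
class Record:
    """Extracted metadata (hypothetical type)."""
    data: dict


def dispatch(parsed, scheduling_q, result_q):
    """Route each parsed item to the queue named in the claim."""
    for item in parsed:
        if isinstance(item, NewLink):
            scheduling_q.put({"url": item.url})   # back to task scheduling
        elif isinstance(item, Record):
            result_q.put(item.data)               # on to result storage
        else:
            raise TypeError(f"unexpected parser output: {item!r}")
```

Feeding new links back through the scheduling queue, rather than straight to the crawl queue, means they pass through the same duplicate filtering as the initial URL.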
5. The distributed web crawler system based on a visual script editor according to claim 1, characterized in that: the task scheduling module comprises a URL filtering module and a rate manager; the URL filtering module uses a Bloom filter to deduplicate URL links and prevent the same URL link from being crawled repeatedly, the Bloom filter consisting of a binary vector and a series of random mapping functions and being used to test whether an element is in a set; the rate manager uses a token bucket algorithm to prevent network congestion by limiting the traffic flowing out to the network so that it is sent at a uniform rate, ensuring the stability of the system.
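Both techniques named in claim 5 are standard and fit in a few lines. The sketch below is illustrative only: the class names, the bit-vector size, the use of salted SHA-256 as the "random mapping functions", and the injectable `clock` parameter are all assumptions made for the example, not details from the patent.

```python
import hashlib
import time


class BloomFilter:
    """Bit vector plus k hash functions: false positives possible, false negatives not."""

    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive k bit positions by salting one hash function (illustrative choice).
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


class TokenBucket:
    """Permits bursts up to `capacity`, then a steady `rate` of requests per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.last = clock()

    def try_acquire(self):
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A scheduler would check `url in bloom` before enqueuing and call `bucket.try_acquire()` before each outgoing request. With `capacity=1` the bucket allows no bursts, so requests leave at close to a uniform rate, which is the behaviour the claim describes.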
6. The distributed web crawler system based on a visual script editor according to claim 1, characterized in that: the webpage capture module comprises a proxy access module and a browser simulation module; the proxy access module accesses the specified URL link through a preset IP proxy according to the user's configuration information, preventing the IP of the server hosting the webpage capture module from being blocked by the target website because of excessive traffic; the browser simulation module uses the open-source WebKit browser engine to render the target website, can execute the JavaScript code on the page, and generates the complete page of the target website.
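As an illustration of the proxy access module only (the browser simulation part is not sketched here), the following uses Python's standard `urllib.request.ProxyHandler`. The `ProxyPool` class and its round-robin rotation are assumptions: the patent states that preset IP proxies are used according to the user's configuration, not how one is chosen.

```python
import itertools
import urllib.request


class ProxyPool:
    """Cycles through user-configured proxies (hypothetical helper)."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = itertools.cycle(proxies)

    def next_proxy(self):
        """Return the next proxy address in round-robin order."""
        return next(self._cycle)

    def build_opener(self):
        """Return a urllib opener that routes HTTP and HTTPS through the next proxy."""
        proxy = self.next_proxy()
        handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        return urllib.request.build_opener(handler)
```

A capture worker would then call `pool.build_opener().open(url, timeout=10)` so that successive requests leave through different proxy addresses.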
7. The distributed web crawler system based on a visual script editor according to claim 1, characterized in that: the execution chain comprises several sub-parameters, each of which may be one of the following: a lower-level URL link selection rule, a metadata selection marker, or script code executable by the system.
8. The distributed web crawler system based on a visual script editor according to claim 1, characterized in that the visual script editor is implemented by the following procedure:
Step 1: enter the URL link address of the target website in the visual script editor interface;
Step 2: the visual script editor displays the web page content of the target website's URL link;
Step 3: if there is no need to enter a lower-level URL link of this web page, go to step 5; if a lower-level URL link must be entered, go to step 4;
Step 4: select the blocks containing the lower-level URL links; the visual script editor records the positions of these blocks and stores them in an execution chain, with all position information expressed in CSS or XPath syntax; return to step 3;
Step 5: select the blocks whose content is to be captured and store them in an execution chain;
Step 6: the user confirms that editing is complete;
Step 7: the visual script editor passes the recorded execution chain to the script generator, which produces a crawl script for the target website; for advanced users an additional interface is provided, through which user-written code compatible with the system can be embedded directly in the crawl script;
Step 8: store the script in the database.
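Steps 4 to 7 describe recording selections in an execution chain and turning that chain into a crawl script. One possible shape for this is sketched below. The `Step` fields, the three `kind` values and the JSON output format are hypothetical, since the patent does not specify the script format, and the selectors and URL are made-up examples.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Step:
    """One entry in an execution chain (all field names are hypothetical)."""
    kind: str        # "follow_links", "extract" or "custom_code"
    selector: str    # CSS or XPath expression; unused for custom code
    name: str = ""   # metadata field name for "extract" steps
    code: str = ""   # user-supplied source for "custom_code" steps


def generate_script(start_url, chain):
    """Serialize the recorded chain into a crawl script (here, a JSON document)."""
    allowed = {"follow_links", "extract", "custom_code"}
    for step in chain:
        if step.kind not in allowed:
            raise ValueError(f"unknown step kind: {step.kind}")
    return json.dumps(
        {"start_url": start_url, "steps": [asdict(s) for s in chain]},
        ensure_ascii=False,
        indent=2,
    )


chain = [
    Step("follow_links", "ul.list a"),              # step 4: lower-level links
    Step("extract", "//h1/text()", name="title"),   # step 5: content block
    Step("extract", "div.price", name="price"),
]
script = generate_script("http://example.com/list", chain)
```

Storing the script as data rather than generated source code keeps the database entry easy to validate, and the `custom_code` kind leaves room for the advanced-user interface mentioned in step 7.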
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510713985.7A CN105243159B (en) | 2015-10-28 | 2015-10-28 | Visual script editor-based distributed web crawler system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105243159A (en) | 2016-01-13 |
CN105243159B (en) | 2019-06-25 |
Family
ID=55040807
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510713985.7A Active CN105243159B (en) | 2015-10-28 | 2015-10-28 | Visual script editor-based distributed web crawler system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105243159B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2001033345A1 (en) * | 1999-11-02 | 2001-05-10 | Alta Vista Company | System and method for enforcing politeness while scheduling downloads in a web crawler |
CN101089856A (en) * | 2007-07-20 | 2007-12-19 | 李沫南 | Method for abstracting network data and web reptile system |
CN102982161A (en) * | 2012-12-05 | 2013-03-20 | 北京奇虎科技有限公司 | Method and device for acquiring webpage information |
CN103970788A (en) * | 2013-02-01 | 2014-08-06 | 北京英富森信息技术有限公司 | Webpage-crawling-based crawler technology |
CN104765592A (en) * | 2014-01-03 | 2015-07-08 | 任子行网络技术股份有限公司 | Plugin management method and device facing web page acquisition task |
Non-Patent Citations (1)
Title |
---|
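高龙 (Gao Long): "搜索引擎中通用爬虫系统的研究与设计" [Research and Design of a General-Purpose Crawler System in Search Engines], 《中国优秀硕士学位论文全文数据库信息科技辑》 [China Master's Theses Full-text Database, Information Science and Technology] * |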
Cited By (68)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108351941B (en) * | 2015-11-02 | 2021-10-26 | 日本电信电话株式会社 | Analysis device, analysis method, and computer-readable storage medium |
CN108351941A (en) * | 2015-11-02 | 2018-07-31 | 日本电信电话株式会社 | Analysis device, analysis method and analysis program |
CN107180050A (en) * | 2016-03-11 | 2017-09-19 | 精硕科技(北京)股份有限公司 | Data capture system and method |
WO2018010573A1 (en) * | 2016-07-13 | 2018-01-18 | 阿里巴巴集团控股有限公司 | Method and device for generating script |
CN106886547A (en) * | 2016-07-13 | 2017-06-23 | 阿里巴巴集团控股有限公司 | Script generation method and device |
TWI683225B (en) * | 2016-07-13 | 2020-01-21 | 香港商阿里巴巴集團服務有限公司 | Script generation method and device |
CN106168985A (en) * | 2016-08-26 | 2016-11-30 | 南京车易淘网络信息技术有限公司 | Crawler method capable of rapid distributed deployment |
CN108228614B (en) * | 2016-12-14 | 2022-03-18 | 北京国双科技有限公司 | Method and device for detecting webpage broken link |
CN108228614A (en) * | 2016-12-14 | 2018-06-29 | 北京国双科技有限公司 | Method and device for detecting webpage broken link |
CN106933973A (en) * | 2017-02-14 | 2017-07-07 | 广州优亿信息科技有限公司 | Visual web crawler method |
CN106980687A (en) * | 2017-03-31 | 2017-07-25 | 北京奇艺世纪科技有限公司 | Resource downloading system, method and crawler downloading system |
CN106980687B (en) * | 2017-03-31 | 2020-05-22 | 北京奇艺世纪科技有限公司 | Resource downloading system, method and crawler downloading system |
CN107103242A (en) * | 2017-05-11 | 2017-08-29 | 北京安赛创想科技有限公司 | Data acquisition method and device |
CN107317724A (en) * | 2017-06-06 | 2017-11-03 | 中证信用增进股份有限公司 | Data collection system and method based on cloud computing technology |
CN110020066A (en) * | 2017-07-31 | 2019-07-16 | 北京国双科技有限公司 | Method and device for annotating tasks to crawler platform |
CN110020066B (en) * | 2017-07-31 | 2021-09-07 | 北京国双科技有限公司 | Method and device for annotating tasks to crawler platform |
CN107870965A (en) * | 2017-08-11 | 2018-04-03 | 成都萌想科技有限责任公司 | Visual data collection system |
CN107577788A (en) * | 2017-09-15 | 2018-01-12 | 广东技术师范学院 | E-commerce website topic crawler method for automatically structuring data |
CN107577788B (en) * | 2017-09-15 | 2021-12-31 | 广东技术师范大学 | E-commerce website topic crawler method for automatically structuring data |
CN108108440A (en) * | 2017-12-21 | 2018-06-01 | 北京慧数科技有限公司 | Proxy server and internet data acquisition method |
CN108549678A (en) * | 2018-04-02 | 2018-09-18 | 北京今朝在线科技有限公司 | Information acquisition system |
CN108549678B (en) * | 2018-04-02 | 2020-06-19 | 北京今朝在线科技有限公司 | Information acquisition system |
CN109285046A (en) * | 2018-08-10 | 2019-01-29 | 浙江工业大学 | An e-commerce big data collection system based on business plug-in |
CN108875091B (en) * | 2018-08-14 | 2020-06-02 | 杭州费尔斯通科技有限公司 | Distributed web crawler system with unified management |
CN108875091A (en) * | 2018-08-14 | 2018-11-23 | 杭州费尔斯通科技有限公司 | Distributed web crawler system with unified management |
CN109101636A (en) * | 2018-08-16 | 2018-12-28 | 成都市映潮科技股份有限公司 | Method, device and system for cloud data acquisition through visual configuration |
CN109284430A (en) * | 2018-09-07 | 2019-01-29 | 杭州艾塔科技有限公司 | System and method for crawling visual topic web page content based on a distributed architecture |
CN109522466A (en) * | 2018-10-20 | 2019-03-26 | 河南工程学院 | Distributed crawler system |
CN109783715A (en) * | 2019-01-08 | 2019-05-21 | 鑫涌算力信息科技(上海)有限公司 | Network crawler system and method |
CN109614539A (en) * | 2019-01-16 | 2019-04-12 | 重庆金融资产交易所有限责任公司 | Data grab method, device and computer readable storage medium |
CN109948026A (en) * | 2019-03-28 | 2019-06-28 | 深信服科技股份有限公司 | Web data crawling method, device, equipment and medium |
CN110807137A (en) * | 2019-04-11 | 2020-02-18 | 上海丛云信息科技有限公司 | Distributed big data acquisition implementation method |
CN110020062B (en) * | 2019-04-12 | 2021-09-24 | 北京邮电大学 | A customizable web crawler method and system |
CN110020062A (en) * | 2019-04-12 | 2019-07-16 | 北京邮电大学 | Customizable web crawler method and system |
CN110297962A (en) * | 2019-06-28 | 2019-10-01 | 北京金山安全软件有限公司 | Website resource crawling method, device, system and computer equipment |
CN110297962B (en) * | 2019-06-28 | 2021-08-24 | 北京金山安全软件有限公司 | Website resource crawling method, device, system and computer equipment |
CN110457556B (en) * | 2019-07-04 | 2023-11-14 | 重庆金融资产交易所有限责任公司 | Distributed crawler system architecture, method for crawling data and computer equipment |
CN110457556A (en) * | 2019-07-04 | 2019-11-15 | 重庆金融资产交易所有限责任公司 | Distributed crawler system architecture, method for crawling data and computer equipment |
CN110413276B (en) * | 2019-07-31 | 2024-04-09 | 网易(杭州)网络有限公司 | Parameter editing method and device, electronic equipment and storage medium |
CN110413276A (en) * | 2019-07-31 | 2019-11-05 | 网易(杭州)网络有限公司 | Parameter editing method and device, electronic equipment and storage medium |
CN110851681A (en) * | 2019-10-12 | 2020-02-28 | 平安科技(深圳)有限公司 | Crawler processing method and device, server and computer readable storage medium |
CN112783615B (en) * | 2019-11-08 | 2024-03-01 | 北京沃东天骏信息技术有限公司 | Data processing task cleaning method and device |
CN112783615A (en) * | 2019-11-08 | 2021-05-11 | 北京沃东天骏信息技术有限公司 | Method and device for cleaning data processing task |
CN111045659A (en) * | 2019-11-11 | 2020-04-21 | 国家计算机网络与信息安全管理中心 | Method and system for collecting project list of Internet financial webpage |
CN111178057B (en) * | 2020-01-02 | 2024-01-30 | 大汉软件股份有限公司 | Content analysis and extraction system for government electronic documents |
CN111178057A (en) * | 2020-01-02 | 2020-05-19 | 大汉软件股份有限公司 | Content analysis and extraction system of government affair electronic document |
CN111310002B (en) * | 2020-04-17 | 2023-04-07 | 西安热工研究院有限公司 | General crawler system based on distributor and configuration table combination |
CN111310002A (en) * | 2020-04-17 | 2020-06-19 | 西安热工研究院有限公司 | General crawler system based on distributor and configuration table combination |
CN111651656A (en) * | 2020-06-02 | 2020-09-11 | 重庆邮电大学 | A dynamic web crawler method and system based on foundry mode |
CN112100061A (en) * | 2020-08-28 | 2020-12-18 | 广州探迹科技有限公司 | Visual crawler code compiling and debugging method |
CN112256636A (en) * | 2020-11-10 | 2021-01-22 | 国网湖南省电力有限公司 | Data acquisition system for mobile application APP |
CN112364226A (en) * | 2020-11-12 | 2021-02-12 | 江苏易启策网络科技有限公司 | Interactive information acquisition method and system based on dynamic content analysis |
CN112434205A (en) * | 2020-11-30 | 2021-03-02 | 北京秒针人工智能科技有限公司 | Data integration capturing method and system based on data site and computer equipment |
CN112667873A (en) * | 2020-12-16 | 2021-04-16 | 北京华如慧云数据科技有限公司 | Crawler system and method suitable for general data acquisition of most websites |
CN112487269B (en) * | 2020-12-22 | 2023-10-24 | 安徽商信政通信息技术股份有限公司 | Method and device for detecting automation script of crawler |
CN112487269A (en) * | 2020-12-22 | 2021-03-12 | 安徽商信政通信息技术股份有限公司 | Crawler automation script detection method and device |
CN112328238A (en) * | 2021-01-05 | 2021-02-05 | 深圳点猫科技有限公司 | Building block code execution control method, system and storage medium |
CN112818201A (en) * | 2021-02-07 | 2021-05-18 | 四川封面传媒有限责任公司 | Network data acquisition method and device, computer equipment and storage medium |
CN113742550A (en) * | 2021-08-20 | 2021-12-03 | 广州市易工品科技有限公司 | Data acquisition method, device and system based on browser |
CN113742550B (en) * | 2021-08-20 | 2024-04-19 | 广州市易工品科技有限公司 | Browser-based data acquisition method, device and system |
CN113656674B (en) * | 2021-08-30 | 2023-06-27 | 山谷网安科技股份有限公司 | Automatic processing method and device for click type hyperlink in website crawler |
CN113656674A (en) * | 2021-08-30 | 2021-11-16 | 山谷网安科技股份有限公司 | Automatic processing method and device for click type hyperlink in website crawler |
CN113934912A (en) * | 2021-11-11 | 2022-01-14 | 北京搜房科技发展有限公司 | Data crawling method and device, storage medium and electronic equipment |
CN113918793A (en) * | 2021-12-10 | 2022-01-11 | 江苏宝和数据股份有限公司 | Multi-source scientific and creative resource data acquisition method |
CN114861101A (en) * | 2022-01-25 | 2022-08-05 | 浙江浩瀚能源科技有限公司 | Method, apparatus, device and medium for detecting abnormal hyperlinks on a portal website |
CN117633324A (en) * | 2023-11-03 | 2024-03-01 | 北京东方通网信科技有限公司 | Custom visual crawler configuration method |
CN117633324B (en) * | 2023-11-03 | 2024-07-30 | 北京东方通网信科技有限公司 | Custom visual crawler configuration method |
CN118349719A (en) * | 2024-05-10 | 2024-07-16 | 南昌卓蓝科技有限公司 | Cloud big data acquisition crawler system |
Also Published As
Publication number | Publication date |
---|---|
CN105243159B (en) | 2019-06-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105243159A (en) | Visual script editor-based distributed web crawler system | |
CN107895009B (en) | Distributed internet data acquisition method and system | |
CN106021257B (en) | Crawler data capture method, device and system supporting online programming | |
CN109829096B (en) | Data acquisition method and device, electronic equipment and storage medium | |
CN107391775A (en) | General web crawler model implementation method and system | |
CN105260388A (en) | Optimization method of distributed vertical crawler service system | |
CN102521232B (en) | Distributed acquisition and processing system and method of internet metadata | |
WO2010114913A1 (en) | Method and system of retrieving ajax web page content | |
CN102880607A (en) | network dynamic content capturing method and network dynamic content crawler system | |
CN102982161A (en) | Method and device for acquiring webpage information | |
CN106776983B (en) | Search engine optimization device and method | |
CN107145556B (en) | Universal distributed acquisition system | |
CN104182482B (en) | News list page determination method and news list page screening method | |
CN102262635A (en) | Page crawler system and page crawler method | |
CN111859076B (en) | Data crawling method, device, computer equipment and computer readable storage medium | |
CN102982162A (en) | System for acquiring webpage information | |
CN110263070A (en) | Event report method and device | |
Nigam et al. | Web scraping: from tools to related legislation and implementation using python | |
CN109600385A (en) | Access control method and device | |
KR20190131778A (en) | Web Crawler System for Collecting a Structured and Unstructured Data in Hidden URL | |
CN109597952A (en) | Web information processing method, system, electronic equipment and storage medium | |
CN109446441B (en) | General credible distributed acquisition and storage system for network community | |
CN109766488B (en) | Data acquisition method based on Scapy | |
CN115422427A (en) | Employment skill requirement analysis system | |
CN110472126A (en) | Page data acquisition method, device and equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |