CN104102740A - Distribution type information acquisition system and method - Google Patents
- Publication number
- CN104102740A
- Authority
- CN
- China
- Prior art keywords
- task
- information
- data analysis
- devices
- queue
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Debugging And Monitoring (AREA)
Abstract
The invention discloses a distributed information acquisition system and method. The system comprises one or more request generation devices, one or more task management devices, a plurality of task execution devices, an information collection device, one or more data analysis devices and a data storage device. The request generation devices generate task requests for obtaining information; the task management devices determine task priorities according to the task requests and allocate the tasks; the task execution devices execute the tasks as allocated by the task management devices to obtain information; the information collection device collects the information obtained by the task execution devices; the data analysis devices perform data analysis on the information; and the data storage device stores the analyzed information. The system captures information on the Internet using low-cost task execution devices.
Description
Technical field
The present invention relates to distributed systems, and in particular to a distributed information acquisition system and method.
Background art
With the development of information technology, the amount of information on the network keeps growing, and searching for information through web search engines has become an important source of information in daily life. Web search engines mainly rely on network information acquisition devices to obtain information on the network.
At present, known network information acquisition devices (for example, web crawlers) use a single PC (generally x86-based) or a single server to collect data and then store the collected data in a data storage module. Because this approach relies on a single capture device, information acquisition is inefficient and time-consuming. In addition, the equipment used (such as PCs and servers) is expensive, so the cost is high, which hinders information acquisition work.
Summary of the invention
To address the above problems, the invention provides a distributed information acquisition system, comprising: one or more request generation devices, for generating task requests for obtaining information; one or more task management devices, for determining task priorities according to the task requests and allocating the tasks; a plurality of task execution devices, for executing the tasks as allocated by the task management devices to obtain information; an information collection device, for collecting the information obtained by the plurality of task execution devices; one or more data analysis devices, for performing data analysis on the information; and a data storage device, for storing the information that has undergone data analysis.
Preferably, the task management device further comprises a task persistence device, for storing each task request in the data storage device.
Preferably, the system further comprises a message queue device, for forming a queue from the information collected by the information collection device and submitting the information to the data analysis devices in queue order.
Preferably, the information collection device sends the collected information to the message queue device asynchronously.
Preferably, the plurality of task execution devices also feed back the execution state of each task execution device to the task management device, so that the task management device can allocate tasks that have not yet been executed.
According to another aspect of the invention, a distributed information acquisition method is also provided, comprising: generating task requests for obtaining information; determining task priorities according to the task requests and allocating tasks to a plurality of task execution devices; executing the tasks on the plurality of task execution devices to obtain information; collecting the obtained information; performing data analysis on the information; and storing the information that has undergone data analysis.
Preferably, determining task priorities according to the task requests and allocating tasks to the plurality of task execution devices further comprises: storing each task request in a data storage device.
Preferably, after collecting the obtained information and before performing data analysis on the information, the method further comprises: forming a queue from the collected information and performing data analysis in queue order.
Preferably, forming a queue from the collected information and performing data analysis in queue order comprises: updating the queue with the collected information asynchronously.
Preferably, the method further comprises: allocating unexecuted tasks to the plurality of task execution devices according to the execution state of the tasks allocated to them.
In the distributed information acquisition method of the invention, the task management devices allocate information acquisition tasks to a plurality of task execution devices according to task priority. After the task execution devices obtain the information, the information collection device gathers it asynchronously and forms a queue of updated information for the data analysis devices to analyze, and the results are stored in the data storage device. The task execution devices do not need to be x86 systems, which reduces cost. The invention thus captures information on the network using many low-cost task execution devices managed by the distributed system described above.
Brief description of the drawings
The above and other features and advantages of the invention will become more apparent from the detailed description of its example embodiments with reference to the accompanying drawings.
Fig. 1 is a schematic structural diagram of the distributed information acquisition system according to the first embodiment of the invention; and
Fig. 2 is a flowchart of the distributed information acquisition method according to the first embodiment of the invention.
Detailed description of the embodiments
Example embodiments are now described more fully with reference to the accompanying drawings. The example embodiments can, however, be implemented in a variety of forms and should not be understood as limited to the embodiments set forth herein; rather, these embodiments are provided so that the invention is thorough and complete and so that the concept of the example embodiments is fully conveyed to those skilled in the art.
Fig. 1 is a schematic structural diagram of the distributed information acquisition system according to the first embodiment of the invention. The system comprises a plurality of request generation devices 102, a plurality of task management devices 104, a plurality of task execution devices 106, an information collection device 108, a plurality of data analysis devices 112 and a data storage device 114. The request generation devices 102 communicate with the task management devices 104. The task management devices 104 communicate with the task execution devices 106. The task execution devices 106 communicate with the information collection device 108. The information collection device 108 communicates with the data analysis devices 112. The data analysis devices 112 communicate with the data storage device 114. The connections between the devices may be wired or wireless; wireless connections include infrared, Bluetooth, local area network and Internet connections. Communication between the devices preferably uses the HTTP protocol.
Specifically, in this embodiment the task management devices 104 correspond to the task execution devices 106, and the number of task management devices 104 equals the number of task execution devices 106. Preferably, the task execution devices 106 are divided into different execution modules according to region or routing distance, and each execution module is managed by a corresponding number of task management devices 104. In a variation, the numbers of task management devices 104 and task execution devices 106 differ; usually, there are fewer task management devices 104 than task execution devices 106. In that case, one task management device 104 either manages a predetermined set of task execution devices 106 or selects the task execution devices 106 it manages in real time according to factors such as network bandwidth and task definition. Furthermore, there may be only one request generation device 102, one task management device 104 or one data analysis device 112.
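The grouping of task execution devices into execution modules can be illustrated with a minimal Python sketch. It assumes that each execution device is described by a hypothetical `(device_id, region)` pair and that managers are reused round-robin when there are fewer managers than modules; the function names are illustrative and do not come from the patent.

```python
from collections import defaultdict


def group_executors_by_region(executors):
    """Group execution device ids into execution modules keyed by region."""
    modules = defaultdict(list)
    for executor_id, region in executors:
        modules[region].append(executor_id)
    return dict(modules)


def assign_managers(modules, managers):
    """Assign one manager per module, cycling through managers if they are fewer."""
    assignment = {}
    for index, region in enumerate(sorted(modules)):
        assignment[region] = managers[index % len(managers)]
    return assignment


# Hypothetical devices: two in the "north" region, one in the "south" region.
executors = [("exec-1", "north"), ("exec-2", "south"), ("exec-3", "north")]
modules = group_executors_by_region(executors)
managers_for_modules = assign_managers(modules, ["mgr-1"])
```

With a single manager, both modules are handled by `mgr-1`, which corresponds to the variation in which there are fewer task management devices than task execution devices.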
The request generation device 102 generates task requests for obtaining information. Preferably, the request generation device 102 obtains the task requests by interacting with a user. In a variation, the request generation device 102 generates task requests in response to event triggers. One request generation device 102 can generate multiple task requests. This figure shows a plurality of request generation devices 102, which send the list of generated task requests to the task management devices 104. The task management devices 104 determine task priorities according to the task requests and allocate tasks to the task execution devices 106. Preferably, the task priority is determined by the user and sent to the task management device 104 together with the list of task requests. In a variation, the task management device 104 determines task priority according to the content of the task and the state of the task execution devices 106, or according to a predetermined task policy. For example, if the predetermined policy is to complete tasks with a larger workload first, the task management device 104 orders tasks by workload. As another example, if the predetermined policy is to complete tasks with less traffic first, the task management device 104 orders tasks by traffic. After determining the task priorities, the task management device 104 allocates different tasks to different task execution devices 106. The task execution devices 106 execute the tasks according to task priority to obtain information. The information collection device 108 collects the information obtained by the task execution devices 106 and sends it to the data analysis devices 112. The data analysis devices 112 perform data analysis on the obtained information and store the analyzed data in the data storage device 114. The task execution devices 106 also feed back their execution state to the task management devices 104, for example whether a task is currently being executed and the network environment of the execution.
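The two example policies (larger workload first, less traffic first) can be sketched as a sort followed by a simple allocation. The `Task` fields `workload` and `traffic`, the policy names and the round-robin allocation are all assumptions made for illustration; the patent does not specify how these quantities are measured or how tasks are spread over devices.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    workload: int  # hypothetical estimate of the amount of work
    traffic: int   # hypothetical estimate of the network traffic


def order_tasks(tasks, policy):
    """Order tasks according to a predetermined policy."""
    if policy == "largest_workload_first":
        return sorted(tasks, key=lambda t: t.workload, reverse=True)
    if policy == "smallest_traffic_first":
        return sorted(tasks, key=lambda t: t.traffic)
    raise ValueError(f"unknown policy: {policy}")


def distribute(ordered_tasks, executor_ids):
    """Allocate tasks to execution devices round-robin, preserving priority order."""
    plan = {executor_id: [] for executor_id in executor_ids}
    for index, task in enumerate(ordered_tasks):
        plan[executor_ids[index % len(executor_ids)]].append(task.name)
    return plan


tasks = [Task("a", 5, 30), Task("b", 9, 10), Task("c", 1, 20)]
by_workload = order_tasks(tasks, "largest_workload_first")
plan = distribute(by_workload, ["exec-1", "exec-2"])
```

Because each device receives its tasks in priority order, a device that works through its list front to back executes higher-priority tasks first.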
This figure also shows a task persistence device 116, which stores the task list obtained by the task management device 104 in the data storage device 114 in order to improve the stability of the whole system.
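One way to picture task persistence is to write the task list to a durable store before dispatching it, so that pending tasks can be reloaded after a restart. The sketch below uses Python's built-in `sqlite3` module as a stand-in for the data storage device; the table layout and function names are assumptions, not details from the patent.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open the (stand-in) data store and make sure the task table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tasks ("
        "id INTEGER PRIMARY KEY, payload TEXT NOT NULL, "
        "done INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def persist_tasks(conn, task_list):
    """Store every task of the list in one transaction."""
    with conn:
        conn.executemany(
            "INSERT INTO tasks (payload) VALUES (?)",
            [(json.dumps(task),) for task in task_list],
        )


def load_pending(conn):
    """Reload the tasks that have not been marked as done, in insertion order."""
    rows = conn.execute(
        "SELECT payload FROM tasks WHERE done = 0 ORDER BY id"
    ).fetchall()
    return [json.loads(payload) for (payload,) in rows]


conn = open_store()
persist_tasks(conn, [{"url": "http://example.com/a"}, {"url": "http://example.com/b"}])
pending = load_pending(conn)
```

Passing a file path instead of `":memory:"` would make the stored task list survive a process restart.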
This figure also shows a message queue device 110, which forms a queue from the information collected by the information collection device 108 and submits the information to the data analysis devices 112 in queue order. The information collection device 108 sends the collected information to the message queue device 110 asynchronously.
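The asynchronous hand-off through a queue can be sketched with Python's `asyncio.Queue`: the collector puts items on the queue without waiting for analysis to finish, and the analyzer consumes them in first-in, first-out order. The coroutine names, the `None` sentinel and the upper-casing "analysis" are placeholders chosen for this sketch.

```python
import asyncio


async def collector(queue, items):
    """Put collected items on the queue, then signal the end with a sentinel."""
    for item in items:
        await queue.put(item)
    await queue.put(None)


async def analyzer(queue, results):
    """Consume items in queue order until the sentinel arrives."""
    while True:
        item = await queue.get()
        if item is None:
            break
        results.append(item.upper())  # placeholder for real data analysis


async def run_pipeline(items):
    """Run collector and analyzer concurrently and return the analysis results."""
    queue = asyncio.Queue()
    results = []
    await asyncio.gather(collector(queue, items), analyzer(queue, results))
    return results
```

A caller would start the pipeline with `asyncio.run(run_pipeline(["page-1", "page-2"]))`. The queue decouples the rate at which information is collected from the rate at which it is analyzed.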
Preferably, the distributed information acquisition system provided by the invention is applied to a crawler system. The request generation device 102 generates a task request for obtaining information according to keywords entered by the user, and splits this task request into multiple tasks that form a task list. The request generation device 102 sends the generated task list to the task management devices 104. The task management devices 104 determine task priorities according to the task request and allocate tasks to the task execution devices 106. The task execution devices 106 capture information according to task priority. The information collection device 108 collects the information captured by the task execution devices 106 and sends it to the data analysis devices 112. The data analysis devices 112 perform data analysis on the captured information to obtain the result URLs, and store the parsed URLs in the data storage device 114.
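For the crawler use case, the data analysis step that extracts result URLs from a captured page could look like the following sketch. It uses the standard-library `html.parser` module and assumes, purely for illustration, that result URLs are the `href` values of `<a>` tags; the patent does not describe how URLs are extracted.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.urls.append(value)


def extract_urls(html):
    """Parse a captured page and return the URLs it links to, in document order."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.urls


page = '<p><a href="http://example.com/1">x</a> <a name="n">y</a> <a href="/2">z</a></p>'
found = extract_urls(page)
```

Anchors without an `href` attribute are skipped, and relative URLs such as `/2` are returned unchanged; resolving them against the page address is left out of this sketch.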
Fig. 2 is a flowchart of the distributed information acquisition method according to the first embodiment of the invention. The distributed information acquisition system comprises a plurality of request generation devices, a plurality of task management devices, a task persistence device, a plurality of task execution devices, an information collection device, a message queue device, a plurality of data analysis devices and a data storage device. The request generation devices communicate with the task management devices. The task management devices communicate with the task persistence device and the task execution devices. The task execution devices communicate with the information collection device. The information collection device communicates with the data analysis devices through the message queue device. The data analysis devices communicate with the data storage device. Specifically, this figure shows 8 steps.
Step S101: a task request for obtaining information is generated.
Specifically, the task request is generated by a request generation device. Preferably, the request generation device obtains the task request by interacting with a user. In a variation, the task request is generated in response to an event trigger. One task request can be split into a task request list to be executed by a plurality of task execution devices. In another variation, multiple task requests form one task request list that is executed by a plurality of task execution devices.
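Both variations (splitting one request into a list, and merging several requests into one list) can be sketched in a few lines of Python. The request fields `keyword` and `sources` are hypothetical; the patent does not define the structure of a task request.

```python
def split_request(keyword, sources):
    """Split one task request into one task per source to be searched."""
    return [{"keyword": keyword, "source": source} for source in sources]


def merge_requests(requests):
    """Combine several task requests into a single task list."""
    task_list = []
    for request in requests:
        task_list.extend(split_request(request["keyword"], request["sources"]))
    return task_list


single = split_request("patent", ["site-a", "site-b"])
merged = merge_requests([
    {"keyword": "patent", "sources": ["site-a"]},
    {"keyword": "crawler", "sources": ["site-a", "site-b"]},
])
```

Each entry of the resulting list is small enough to be handed to a single task execution device.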
Step S102: task priorities are determined according to the task requests, and tasks are allocated to the task execution devices.
This step is performed by the task management device. Preferably, the task priority is determined by the user and sent to the task management device together with the list of task requests. In a variation, the task management device determines task priority according to the content of the task and the state of the task execution devices, or according to a predetermined task policy. For example, if the predetermined policy is to complete tasks with a larger workload first, the task management device orders tasks by workload. As another example, if the predetermined policy is to complete tasks with less traffic first, the task management device orders tasks by traffic. After determining the task priorities, the task management device allocates different tasks to different task execution devices.
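When the priority arrives with the task list, as in the preferred case of step S102, a priority queue is one natural structure for the task management device. The sketch below uses Python's `heapq` module and assumes that a lower number means a higher priority and that tasks of equal priority are served in arrival order; both conventions are assumptions of this example.

```python
import heapq
import itertools


class TaskQueue:
    """Priority queue of tasks; lower priority numbers are served first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker that keeps arrival order

    def push(self, task, priority):
        heapq.heappush(self._heap, (priority, next(self._counter), task))

    def pop(self):
        _priority, _order, task = heapq.heappop(self._heap)
        return task

    def __len__(self):
        return len(self._heap)


queue = TaskQueue()
queue.push("fetch-b", priority=2)
queue.push("fetch-a", priority=1)
queue.push("fetch-c", priority=2)
order = [queue.pop() for _ in range(len(queue))]
```

The counter prevents the heap from ever comparing two task objects directly, which keeps the queue usable for tasks that are not themselves orderable.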
Step S103: the task persistence device stores the task list obtained by the task management device in the data storage device, in order to improve the stability of the whole system.
Step S104: the task execution devices execute the tasks according to task priority to obtain information.
Step S105: the information collection device collects the information obtained by the task execution devices.
Step S106: the information collection device sends the obtained information to the message queue device asynchronously; the message queue device forms or updates a queue with the obtained information and sends the information to the data analysis devices in queue order.
Step S107: the data analysis devices perform data analysis on the obtained information and store the analyzed information in the data storage device.
Step S108: the task execution devices also feed back their execution state to the task management device, for example whether a task is currently being executed and the network environment of the execution. This step is not limited to occurring after step S107; it can be performed at any time during the method.
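The use of execution-state feedback to reallocate unexecuted tasks can be sketched as follows. The status record with an `alive` flag and a `done` set, and the "fewest remaining tasks" rule, are assumptions introduced for this illustration; the patent only states that execution state is fed back and that unexecuted tasks are allocated.

```python
def reassign_unexecuted(assignments, status):
    """Move unexecuted tasks of unavailable devices to the least-loaded healthy devices.

    assignments: device id -> list of task names allocated to that device
    status:      device id -> {"alive": bool, "done": set of finished task names}
    """
    healthy = [device for device in assignments if status[device]["alive"]]
    if not healthy:
        raise RuntimeError("no task execution device is available")

    # Healthy devices keep only the tasks they have not finished yet.
    new_plan = {
        device: [t for t in assignments[device] if t not in status[device]["done"]]
        for device in healthy
    }
    # Unfinished tasks of unavailable devices must be allocated again.
    orphaned = [
        task
        for device in assignments
        if not status[device]["alive"]
        for task in assignments[device]
        if task not in status[device]["done"]
    ]
    for task in orphaned:
        target = min(healthy, key=lambda device: len(new_plan[device]))
        new_plan[target].append(task)
    return new_plan


assignments = {"e1": ["t1", "t2"], "e2": ["t3", "t4"], "e3": ["t5"]}
status = {
    "e1": {"alive": True, "done": {"t1"}},
    "e2": {"alive": False, "done": {"t3"}},
    "e3": {"alive": True, "done": set()},
}
new_plan = reassign_unexecuted(assignments, status)
```

In this example device `e2` has stopped responding after finishing `t3`, so its remaining task `t4` is moved to a healthy device.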
Although the invention has been disclosed above by way of preferred embodiments, they are not intended to limit the invention. Those skilled in the art may make various changes and modifications without departing from the spirit and scope of the invention. The scope of protection of the invention is therefore defined by the claims.
Claims (10)
1. A distributed information acquisition system, characterized by comprising:
one or more request generation devices, for generating task requests for obtaining information;
one or more task management devices, for determining task priorities according to the task requests and allocating the tasks;
a plurality of task execution devices, for executing the tasks as allocated by the task management devices to obtain information;
an information collection device, for collecting the information obtained by the plurality of task execution devices;
one or more data analysis devices, for performing data analysis on the information; and
a data storage device, for storing the information that has undergone data analysis.
2. The distributed information acquisition system according to claim 1, characterized in that the task management device further comprises:
a task persistence device, for storing each task request in the data storage device.
3. The distributed information acquisition system according to claim 1, characterized by further comprising:
a message queue device, for forming a queue from the information collected by the information collection device and submitting the information to the data analysis devices in queue order.
4. The distributed information acquisition system according to claim 3, characterized in that the information collection device sends the collected information to the message queue device asynchronously.
5. The distributed information acquisition system according to claim 1, characterized in that the plurality of task execution devices also feed back the execution state of each task execution device to the task management device, so that the task management device can allocate unexecuted tasks.
6. A distributed information acquisition method, characterized by comprising:
generating task requests for obtaining information;
determining task priorities according to the task requests and allocating tasks to a plurality of task execution devices;
executing the tasks on the plurality of task execution devices to obtain information;
collecting the obtained information;
performing data analysis on the information; and
storing the information that has undergone data analysis.
7. The distributed information acquisition method according to claim 6, characterized in that determining task priorities according to the task requests and allocating tasks to the plurality of task execution devices further comprises:
storing each task request in a data storage device.
8. The distributed information acquisition method according to claim 6, characterized in that, after collecting the obtained information and before performing data analysis on the information, the method further comprises:
forming a queue from the collected information and performing data analysis in queue order.
9. The distributed information acquisition method according to claim 8, characterized in that forming a queue from the collected information and performing data analysis in queue order comprises:
updating the queue with the collected information asynchronously.
10. The distributed information acquisition method according to claim 6, characterized by further comprising:
allocating unexecuted tasks to the plurality of task execution devices according to the execution state of the tasks allocated to the plurality of task execution devices.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410371132.5A CN104102740A (en) | 2014-07-30 | 2014-07-30 | Distribution type information acquisition system and method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410371132.5A CN104102740A (en) | 2014-07-30 | 2014-07-30 | Distribution type information acquisition system and method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104102740A true CN104102740A (en) | 2014-10-15 |
Family
ID=51670893
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410371132.5A Pending CN104102740A (en) | 2014-07-30 | 2014-07-30 | Distribution type information acquisition system and method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104102740A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105930246A (en) * | 2016-04-08 | 2016-09-07 | 天翼阅读文化传播有限公司 | High available database monitoring method capable of intelligently distributing tasks |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030187907A1 (en) * | 2002-03-29 | 2003-10-02 | Takafumi Ito | Distributed control method and apparatus |
CN101741885A (en) * | 2008-11-19 | 2010-06-16 | 珠海市西山居软件有限公司 | Distributed system and method for processing task flow thereof |
CN102567086A (en) * | 2010-12-30 | 2012-07-11 | 中国移动通信集团公司 | Task scheduling method, equipment and system |
CN102592067A (en) * | 2011-01-17 | 2012-07-18 | 腾讯科技(深圳)有限公司 | Webpage recognition method, device and system |
CN102902785A (en) * | 2012-09-29 | 2013-01-30 | 合一网络技术(北京)有限公司 | Webpage information acquisition system and method |
CN102915254A (en) * | 2011-08-02 | 2013-02-06 | 中兴通讯股份有限公司 | Task management method and device |
WO2013030630A1 (en) * | 2011-09-02 | 2013-03-07 | Freescale Semiconductor, Inc. | Data processing system and method for task scheduling in a data processing system |
CN103294531A (en) * | 2012-03-05 | 2013-09-11 | 阿里巴巴集团控股有限公司 | Method and system for task distribution |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107463582B (en) | Distributed Hadoop cluster deployment method and device | |
TWI547817B (en) | Method, system and apparatus of planning resources for cluster computing architecture | |
CN107729564A (en) | A kind of distributed focused web crawler web page crawl method and system | |
CN106790718A (en) | Service call link analysis method and system | |
CN107070890A (en) | Flow data processing device and communication network major clique system in a kind of communication network major clique system | |
CN102147809B (en) | Parallel file system and management method thereof | |
CN104915259A (en) | Task scheduling method applied to distributed acquisition system | |
CN101741885A (en) | Distributed system and method for processing task flow thereof | |
CN107145556B (en) | Universal distributed acquisition system | |
CN108197486A (en) | Big data desensitization method, system, computer-readable medium and equipment | |
CN102222112B (en) | Resource management device and resource management method | |
CN102737065A (en) | Method and device for acquiring data | |
CN104699736A (en) | Distributed massive data acquisition system and method based on mobile devices | |
CN105491078B (en) | Data processing method and device, SOA system in SOA system | |
CN108121511A (en) | Data processing method, device and equipment in a kind of distributed edge storage system | |
CN102866424A (en) | Seismic data remote processing system based on cloud computing | |
CN117751567A (en) | Dynamic process distribution for utility communication networks | |
Wu et al. | Towards collaborative storage scheduling using alternating direction method of multipliers for mobile edge cloud | |
CN106302742B (en) | A kind of electrical power services resource information interactive system and method | |
CN103763353A (en) | Water conservation data exchange model and method | |
CN103023990A (en) | Image file upgrade system and method in stack system | |
CN101495978B (en) | Reduction of message flow between bus-connected consumers and producers | |
CN110879753A (en) | GPU-accelerated performance optimization method and system based on automated cluster resource management | |
CN104102740A (en) | Distribution type information acquisition system and method | |
CN112711522A (en) | Docker-based cloud testing method and system and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information | Address after: 100041 Beijing, Shijingshan District Xing Xing street, building 30, room 3, building 9, room 9014. Applicant after: Jing Shuo Technology (Beijing) Limited by Share Ltd. Address before: 100010, 1007, 9 floor, 1 Hutong, South bamboo alley, Beijing, Dongcheng District. Applicant before: JINGSHUO CENTURY TECHNOLOGY (BEIJING) CO., LTD. |
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20141015 |