CN104102740A - Distribution type information acquisition system and method - Google Patents

Distribution type information acquisition system and method

Info

Publication number
CN104102740A
Authority
CN
China
Prior art keywords
task
information
data analysis
devices
queue
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410371132.5A
Other languages
Chinese (zh)
Inventor
洪倍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
JINGSHUO CENTURY TECHNOLOGY (BEIJING) Co Ltd
Original Assignee
JINGSHUO CENTURY TECHNOLOGY (BEIJING) Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by JINGSHUO CENTURY TECHNOLOGY (BEIJING) Co Ltd
Priority to CN201410371132.5A
Publication of CN104102740A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46: Multiprogramming arrangements
    • G06F9/48: Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806: Task transfer initiation or dispatching
    • G06F9/4843: Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881: Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a distributed information acquisition system and method. The distributed information acquisition system comprises one or more request generation devices, one or more task management devices, a plurality of task execution devices, an information collection device, one or more data analysis devices, and a data storage device. The request generation devices generate task requests for obtaining information; the task management devices determine the priority of each task according to the task request and assign the tasks; the task execution devices execute the tasks assigned by the task management devices to obtain information; the information collection device collects the information obtained by the task execution devices; the data analysis devices perform data analysis on the information; and the data storage device stores the analyzed information. The system captures information on the Internet using low-cost task execution devices.

Description

Distributed information acquisition system and method
Technical field
The present invention relates to distributed systems, and in particular to a distributed information acquisition system and method.
Background
With the development of information technology, the amount of information on the network keeps growing, and finding information through network search engines has become an important source of information in people's lives. Network search engines mainly rely on network information acquisition devices to obtain information from the network.
At present, known network information acquisition devices (for example, web crawlers) use a single PC (generally x86-based) or server to collect data and then store the collected data in a data storage module. Because this approach relies on a single crawling machine, information acquisition is inefficient and time-consuming. In addition, the equipment used (such as PCs and servers) is expensive, and the resulting high cost makes information acquisition inconvenient.
Summary of the invention
To address the above problems, the invention provides a distributed information acquisition system, comprising: one or more request generation devices for generating task requests for obtaining information; one or more task management devices for determining task priorities according to the task requests and assigning tasks; a plurality of task execution devices for executing the tasks, as assigned by the task management devices, to obtain information; an information collection device for collecting the information obtained by the plurality of task execution devices; one or more data analysis devices for performing data analysis on the information; and a data storage device for storing the analyzed information.
Preferably, the task management device further comprises a task persistence device for storing each task request in the data storage device.
Preferably, the system further comprises a message queue device for arranging the information collected by the information collection device into a queue and submitting the information to the data analysis devices in queue order.
Preferably, the information collection device sends the collected information to the message queue device asynchronously.
Preferably, the plurality of task execution devices also report the execution state of each task execution device to the task management device, so that the task management device can assign tasks that have not yet been executed.
According to another aspect of the invention, a distributed information acquisition method is also provided, comprising: generating task requests for obtaining information; determining task priorities according to the task requests and assigning tasks to a plurality of task execution devices; executing the tasks on the plurality of task execution devices to obtain information; collecting the obtained information; performing data analysis on the information; and storing the analyzed information.
Preferably, the step of determining task priorities according to the task requests and assigning tasks to a plurality of task execution devices further comprises: storing each task request in a data storage device.
Preferably, after collecting the obtained information and before performing data analysis on the information, the method further comprises: arranging the collected information into a queue and performing data analysis in queue order.
Preferably, arranging the collected information into a queue and performing data analysis in queue order comprises: updating the queue with the collected information asynchronously.
Preferably, the method further comprises: assigning unexecuted tasks to the plurality of task execution devices according to the execution state of the tasks assigned to them.
In the distributed information acquisition method of the invention, the task management device assigns information acquisition tasks to a plurality of task execution devices according to task priority. After the task execution devices obtain information, the information collection device gathers it asynchronously and forms a queue of updated information for the data analysis devices to analyze, and the results are stored in the data storage device. The task execution devices do not need to be x86 systems, which reduces cost. The invention thus captures information on the network using many low-cost task execution devices managed by the distributed system described above.
Brief description of the drawings
The above and other features and advantages of the invention will become more apparent from the detailed description of example embodiments with reference to the accompanying drawings.
Fig. 1 is a schematic structural diagram of the distributed information acquisition system according to a first embodiment of the invention; and
Fig. 2 is a flowchart of the distributed information acquisition method according to the first embodiment of the invention.
Detailed description
Example embodiments are now described more fully with reference to the accompanying drawings. Example embodiments can, however, be implemented in many forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete and will fully convey the concept of the example embodiments to those skilled in the art.
Fig. 1 is a schematic structural diagram of the distributed information acquisition system according to the first embodiment of the invention. The system comprises a plurality of request generation devices 102, a plurality of task management devices 104, a plurality of task execution devices 106, an information collection device 108, a plurality of data analysis devices 112, and a data storage device 114. The request generation devices 102 are communicatively connected to the task management devices 104. The task management devices 104 are communicatively connected to the task execution devices 106. The task execution devices 106 are communicatively connected to the information collection device 108. The information collection device 108 is communicatively connected to the data analysis devices 112. The data analysis devices 112 are communicatively connected to the data storage device 114. The connections between devices may be wired or wireless; wireless connections include infrared, Bluetooth, local area network (LAN), and Internet connections. Communication between the devices preferably uses the HTTP protocol.
Specifically, in this embodiment the plurality of task management devices 104 correspond to the plurality of task execution devices 106, and the number of task management devices 104 equals the number of task execution devices 106. Preferably, the task execution devices 106 are divided into different execution modules according to region or routing distance, and each execution module is managed by a corresponding number of task management devices 104. In a variant, the number of task management devices 104 differs from the number of task execution devices 106; usually there are fewer task management devices 104 than task execution devices 106. A task management device 104 then either manages a predetermined set of task execution devices 106 or selects the task execution devices 106 it manages in real time according to factors such as network bandwidth and task content. Further, there may be only one each of the request generation device 102, the task management device 104, and the data analysis device 112.
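The grouping of task execution devices into execution modules can be sketched as follows. This is a minimal illustration only: the device identifiers, the region labels, and the rule of cycling through managers when there are fewer managers than modules are assumptions, not details given in the patent.

```python
from collections import defaultdict


def group_by_region(executors):
    """Group (executor_id, region) pairs into execution modules keyed by region."""
    modules = defaultdict(list)
    for executor_id, region in executors:
        modules[region].append(executor_id)
    return dict(modules)


def assign_managers(modules, managers):
    """Map each execution module to a manager.

    Assumption: when there are fewer managers than modules, managers are
    reused in rotation, so one manager may handle several modules.
    """
    return {
        region: managers[i % len(managers)]
        for i, region in enumerate(sorted(modules))
    }


# Hypothetical devices: three executors in two regions, one manager.
executors = [("exec-1", "north"), ("exec-2", "south"), ("exec-3", "north")]
modules = group_by_region(executors)
assignment = assign_managers(modules, ["mgr-a"])
```

With a single manager, both modules map to it, which corresponds to the variant where the managers are fewer than the executors.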
The request generation device 102 generates task requests for obtaining information; preferably, it obtains the task requests by interacting with a user. In a variant, the request generation device 102 generates task requests in response to event triggers. A single request generation device 102 can generate multiple task requests. The figure shows a plurality of request generation devices 102, which send lists of the generated task requests to the plurality of task management devices 104. The task management devices 104 determine task priorities according to the task requests and assign tasks to the task execution devices 106. The task priority is preferably determined by the user and sent to the task management device 104 together with the list of task requests. In a variant, the task management device 104 determines task priorities according to the content of the tasks and the state of the task execution devices 106, or according to a predetermined task strategy. For example, if the predetermined strategy is to complete tasks with a large workload first, the task management device 104 orders task priorities by workload. As another example, if the predetermined strategy is to complete tasks with less traffic first, the task management device 104 orders task priorities by traffic. After determining the task priorities, the task management device 104 assigns different tasks to different task execution devices 106. The task execution devices 106 execute the tasks according to task priority to obtain information. The information collection device 108 collects the information obtained by the task execution devices 106 and sends it to the data analysis devices 112. The data analysis devices 112 perform data analysis on the obtained information and store the analyzed data in the data storage device 114. The task execution devices 106 also report their execution state to the task management device 104, for example whether a task is currently being executed and the network environment in which it runs.
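The two example strategies (large workload first, low traffic first) can be sketched with a priority queue. The `Task` fields, the strategy names, and the round-robin assignment to executors are hypothetical choices made for illustration; the patent does not specify how workload or traffic are measured or how tasks are spread across devices.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    workload: int  # assumed estimate of the amount of work
    traffic: int   # assumed estimate of network traffic


# Each strategy maps a task to a sort key; smaller keys are served first.
STRATEGIES = {
    "largest_workload_first": lambda t: -t.workload,
    "smallest_traffic_first": lambda t: t.traffic,
}


def order_tasks(tasks, strategy):
    """Return tasks in priority order under the named strategy."""
    key = STRATEGIES[strategy]
    # The index breaks ties so Task objects are never compared directly.
    heap = [(key(t), i, t) for i, t in enumerate(tasks)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


def dispatch(ordered_tasks, executors):
    """Assign ordered tasks to executors in rotation (an assumed policy)."""
    plan = {e: [] for e in executors}
    for i, task in enumerate(ordered_tasks):
        plan[executors[i % len(executors)]].append(task.name)
    return plan


tasks = [Task("a", 5, 30), Task("b", 9, 10), Task("c", 1, 20)]
by_workload = order_tasks(tasks, "largest_workload_first")
plan = dispatch(by_workload, ["exec-1", "exec-2"])
```

A user-supplied priority, which the text describes as the preferred case, would simply be one more entry in `STRATEGIES` that reads a priority field sent with the request list.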
The figure also shows a task persistence device 116, which stores the task list obtained by the task management device 104 in the data storage device 114 to improve the stability of the whole system.
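One way to picture task persistence is to write each task to a durable store before it is dispatched, so that pending tasks can be reloaded after a restart. The sketch below uses Python's built-in `sqlite3` module as a stand-in for the data storage device; the table layout, the `done` flag, and the function names are assumptions rather than anything the patent prescribes.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open the task store, creating the (assumed) task table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tasks ("
        "id INTEGER PRIMARY KEY, "
        "payload TEXT NOT NULL, "
        "done INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def persist_tasks(conn, tasks):
    """Store every task in one transaction before it is handed out."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO tasks (payload) VALUES (?)",
            [(json.dumps(task),) for task in tasks],
        )


def pending_tasks(conn):
    """Reload tasks that have not been marked done, in insertion order."""
    rows = conn.execute("SELECT payload FROM tasks WHERE done = 0 ORDER BY id")
    return [json.loads(payload) for (payload,) in rows]


conn = open_store()
persist_tasks(conn, [{"url": "http://example.com/a", "priority": 1}])
```

Passing a file path instead of `":memory:"` would make the stored list survive a process restart, which is the stability benefit the paragraph above refers to.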
The figure also shows a message queue device 110, which arranges the information collected by the information collection device 108 into a queue and submits the information to the data analysis devices 112 in queue order. The information collection device 108 sends the collected information to the message queue device 110 asynchronously.
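The asynchronous hand-off through a queue can be sketched with `asyncio.Queue`, with a collector coroutine standing in for the information collection device and an analyzer coroutine standing in for a data analysis device. The `None` sentinel and the upper-casing step, which is only a placeholder for real analysis, are assumptions made for this example.

```python
import asyncio


async def collector(queue, results):
    """Push collected items onto the queue without waiting for analysis."""
    for item in results:
        await queue.put(item)
    await queue.put(None)  # sentinel: no more items (an assumed convention)


async def analyzer(queue, parsed):
    """Consume items strictly in queue (FIFO) order."""
    while True:
        item = await queue.get()
        if item is None:
            break
        parsed.append(item.upper())  # placeholder for real data analysis


async def run_pipeline(results):
    queue = asyncio.Queue()
    parsed = []
    # Collector and analyzer run concurrently; the queue decouples them.
    await asyncio.gather(collector(queue, results), analyzer(queue, parsed))
    return parsed


parsed = asyncio.run(run_pipeline(["page-1", "page-2"]))
```

Because the queue is first-in, first-out, the analyzer sees items in the order they were collected, matching the "in queue order" requirement.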
The distributed information acquisition system provided by the invention is preferably applied to a crawler system. The request generation device 102 generates a task request for obtaining information according to a keyword entered by the user, and the task request is split into multiple tasks that form a task list. The request generation device 102 sends the generated task list to the task management devices 104. The task management devices 104 determine task priorities according to the task request and assign tasks to the task execution devices 106. The task execution devices 106 crawl information according to task priority. The information collection device 108 collects the information crawled by the task execution devices 106 and sends it to the data analysis devices 112. The data analysis devices 112 perform data analysis on the obtained information to obtain result URLs and store the parsed URLs in the data storage device 114.
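For the crawler case, the two ends of the flow (turning a keyword into a task list, and extracting result URLs from a fetched page) can be sketched with the Python standard library. The search URL templates and host names are hypothetical, and treating "data analysis" as link extraction from HTML is an assumption based on the mention of result URLs.

```python
from html.parser import HTMLParser
from urllib.parse import quote_plus, urljoin


def build_task_list(keyword, search_templates):
    """Split one keyword request into one fetch task per (assumed) template."""
    query = quote_plus(keyword)
    return [template.format(query=query) for template in search_templates]


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.urls.append(urljoin(self.base_url, value))


def parse_result_urls(html, base_url):
    """Parse one crawled page and return the URLs it links to."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.urls


task_list = build_task_list(
    "web crawler", ["http://search.example/?q={query}&page=1"]
)
urls = parse_result_urls(
    '<a href="/a">A</a><a href="http://x.example/b">B</a>',
    "http://site.example/",
)
```

The fetching step itself is left out on purpose; in the described system it would run on the task execution devices.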
Fig. 2 is a flowchart of the distributed information acquisition method according to the first embodiment of the invention. The distributed information acquisition system comprises a plurality of request generation devices, a plurality of task management devices, a task persistence device, a plurality of task execution devices, an information collection device, a message queue device, a plurality of data analysis devices, and a data storage device. The request generation devices are communicatively connected to the task management devices. The task management devices are communicatively connected to the task persistence device and to the task execution devices. The task execution devices are communicatively connected to the information collection device. The information collection device is communicatively connected to the data analysis devices through the message queue device. The data analysis devices are communicatively connected to the data storage device. Specifically, the figure shows 8 steps.
Step S101: generate a task request for obtaining information.
Specifically, the task request is generated by a request generation device; preferably, the request generation device obtains the task request by interacting with a user. In a variant, the task request is generated in response to an event trigger. A single task request can be split into a task request list for execution by multiple task execution devices. In another variant, multiple task requests form a task request list that is executed by multiple task execution devices.
Step S102: determine task priorities according to the task request and assign tasks to the task execution devices.
This step is performed by the task management device. The task priority is preferably determined by the user and sent to the task management device together with the list of task requests. In a variant, the task management device determines task priorities according to the content of the tasks and the state of the task execution devices, or according to a predetermined task strategy. For example, if the predetermined strategy is to complete tasks with a large workload first, the task management device orders task priorities by workload. As another example, if the predetermined strategy is to complete tasks with less traffic first, the task management device orders task priorities by traffic. After determining the task priorities, the task management device assigns different tasks to different task execution devices.
Step S103: execute the tasks according to task priority to obtain information.
The task persistence device stores the task list obtained by the task management device in the data storage device to improve the stability of the whole system.
Step S104: the task execution devices execute the tasks according to task priority to obtain information.
Step S105: the information collection device collects the information obtained by the plurality of task execution devices.
Step S106: the information collection device sends the obtained information to the message queue device asynchronously, and the message queue device forms or updates a queue with the obtained information and sends the information to the data analysis devices in queue order.
Step S107: the data analysis devices perform data analysis on the obtained information and store the analyzed information in the data storage device.
Step S108: the task execution devices also report their execution state to the task management device, for example whether a task is currently being executed and the network environment in which it runs. This step is not limited to following step S107; it can be performed at any time during the method.
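The use of execution-state feedback to hand out unexecuted tasks can be sketched as follows. The report format (an `online` flag and a set of completed task names) and the round-robin redistribution are assumptions; the patent only states that execution state is reported and that unexecuted tasks are then assigned.

```python
def reassign_unexecuted(assignments, reports):
    """Return a new plan that moves unfinished tasks off offline executors.

    assignments: executor -> list of task names originally assigned.
    reports: executor -> {"online": bool, "completed": set of task names}
             (a hypothetical status format).
    """
    online = sorted(e for e in assignments if reports.get(e, {}).get("online"))
    if not online:
        raise RuntimeError("no task execution device is online")

    plan = {}
    orphaned = []
    for executor in sorted(assignments):
        report = reports.get(executor, {})
        done = report.get("completed", set())
        remaining = [t for t in assignments[executor] if t not in done]
        if report.get("online"):
            plan[executor] = remaining   # keep its own unfinished tasks
        else:
            orphaned.extend(remaining)   # needs a new owner

    # Spread orphaned tasks over the online executors in rotation.
    for i, task in enumerate(orphaned):
        plan[online[i % len(online)]].append(task)
    return plan


assignments = {"e1": ["t1", "t2"], "e2": ["t3", "t4"], "e3": ["t5"]}
reports = {
    "e1": {"online": True, "completed": {"t1"}},
    "e2": {"online": False, "completed": {"t3"}},
    "e3": {"online": True, "completed": set()},
}
new_plan = reassign_unexecuted(assignments, reports)
```

Since the text notes that feedback can arrive at any time, a function like this would be called whenever a fresh set of reports is available rather than at one fixed point in the flow.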
Although the invention has been disclosed above by way of preferred embodiments, they are not intended to limit the invention. Those skilled in the art may make various changes and modifications without departing from the spirit and scope of the invention. The scope of protection of the invention is therefore defined by the claims.

Claims (10)

1. A distributed information acquisition system, characterized by comprising:
one or more request generation devices for generating task requests for obtaining information;
one or more task management devices for determining task priorities according to the task requests and assigning tasks;
a plurality of task execution devices for executing the tasks, as assigned by the task management devices, to obtain information;
an information collection device for collecting the information obtained by the plurality of task execution devices;
one or more data analysis devices for performing data analysis on the information; and
a data storage device for storing the analyzed information.
2. The distributed information acquisition system according to claim 1, characterized in that the task management device further comprises:
a task persistence device for storing each task request in the data storage device.
3. The distributed information acquisition system according to claim 1, characterized by further comprising:
a message queue device for arranging the information collected by the information collection device into a queue and submitting the information to the data analysis devices in queue order.
4. The distributed information acquisition system according to claim 3, characterized in that the information collection device sends the collected information to the message queue device asynchronously.
5. The distributed information acquisition system according to claim 1, characterized in that the plurality of task execution devices also report the execution state of each task execution device to the task management device, so that the task management device can assign unexecuted tasks.
6. A distributed information acquisition method, characterized by comprising:
generating task requests for obtaining information;
determining task priorities according to the task requests and assigning tasks to a plurality of task execution devices;
executing the tasks on the plurality of task execution devices to obtain information;
collecting the obtained information;
performing data analysis on the information; and
storing the analyzed information.
7. The distributed information acquisition method according to claim 6, characterized in that determining task priorities according to the task requests and assigning tasks to a plurality of task execution devices further comprises:
storing each task request in a data storage device.
8. The distributed information acquisition method according to claim 6, characterized in that, after collecting the obtained information and before performing data analysis on the information, the method further comprises:
arranging the collected information into a queue and performing data analysis in queue order.
9. The distributed information acquisition method according to claim 8, characterized in that arranging the collected information into a queue and performing data analysis in queue order comprises:
updating the queue with the collected information asynchronously.
10. The distributed information acquisition method according to claim 6, characterized by further comprising:
assigning unexecuted tasks to the plurality of task execution devices according to the execution state of the tasks assigned to them.
CN201410371132.5A 2014-07-30 2014-07-30 Distribution type information acquisition system and method Pending CN104102740A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410371132.5A CN104102740A (en) 2014-07-30 2014-07-30 Distribution type information acquisition system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410371132.5A CN104102740A (en) 2014-07-30 2014-07-30 Distribution type information acquisition system and method

Publications (1)

Publication Number Publication Date
CN104102740A (en) 2014-10-15

Family

ID=51670893

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410371132.5A Pending CN104102740A (en) 2014-07-30 2014-07-30 Distribution type information acquisition system and method

Country Status (1)

Country Link
CN (1) CN104102740A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105930246A (en) * 2016-04-08 2016-09-07 天翼阅读文化传播有限公司 High available database monitoring method capable of intelligently distributing tasks

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030187907A1 (en) * 2002-03-29 2003-10-02 Takafumi Ito Distributed control method and apparatus
CN101741885A (en) * 2008-11-19 2010-06-16 珠海市西山居软件有限公司 Distributed system and method for processing task flow thereof
CN102567086A (en) * 2010-12-30 2012-07-11 中国移动通信集团公司 Task scheduling method, equipment and system
CN102592067A (en) * 2011-01-17 2012-07-18 腾讯科技(深圳)有限公司 Webpage recognition method, device and system
CN102902785A (en) * 2012-09-29 2013-01-30 合一网络技术(北京)有限公司 Webpage information acquisition system and method
CN102915254A (en) * 2011-08-02 2013-02-06 中兴通讯股份有限公司 Task management method and device
WO2013030630A1 (en) * 2011-09-02 2013-03-07 Freescale Semiconductor, Inc. Data processing system and method for task scheduling in a data processing system
CN103294531A (en) * 2012-03-05 2013-09-11 阿里巴巴集团控股有限公司 Method and system for task distribution

Similar Documents

Publication Publication Date Title
CN107463582B (en) Distributed Hadoop cluster deployment method and device
TWI547817B (en) Method, system and apparatus of planning resources for cluster computing architecture
CN107729564A (en) A kind of distributed focused web crawler web page crawl method and system
CN106790718A (en) Service call link analysis method and system
CN107070890A (en) Flow data processing device and communication network major clique system in a kind of communication network major clique system
CN102147809B (en) Parallel file system and management method thereof
CN104915259A (en) Task scheduling method applied to distributed acquisition system
CN101741885A (en) Distributed system and method for processing task flow thereof
CN107145556B (en) Universal distributed acquisition system
CN108197486A (en) Big data desensitization method, system, computer-readable medium and equipment
CN102222112B (en) Resource management device and resource management method
CN102737065A (en) Method and device for acquiring data
CN104699736A (en) Distributed massive data acquisition system and method based on mobile devices
CN105491078B (en) Data processing method and device, SOA system in SOA system
CN108121511A (en) Data processing method, device and equipment in a kind of distributed edge storage system
CN102866424A (en) Seismic data remote processing system based on cloud computing
CN117751567A (en) Dynamic process distribution for utility communication networks
Wu et al. Towards collaborative storage scheduling using alternating direction method of multipliers for mobile edge cloud
CN106302742B (en) A kind of electrical power services resource information interactive system and method
CN103763353A (en) Water conservation data exchange model and method
CN103023990A (en) Image file upgrade system and method in stack system
CN101495978B (en) Reduction of message flow between bus-connected consumers and producers
CN110879753A (en) GPU-accelerated performance optimization method and system based on automated cluster resource management
CN104102740A (en) Distribution type information acquisition system and method
CN112711522A (en) Docker-based cloud testing method and system and electronic equipment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100041 Beijing, Shijingshan District Xing Xing street, building 30, room 3, building 9, room 9014

Applicant after: Jing Shuo Technology (Beijing) Limited by Share Ltd

Address before: 100010, 1007, 9 floor, 1 Hutong, South bamboo alley, Beijing, Dongcheng District

Applicant before: JINGSHUO CENTURY TECHNOLOGY (BEIJING) CO., LTD.

WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20141015
