Background Art
With the rapid development of Internet technology, the information resources on the network are growing richer by the day and expanding quickly, and more and more people prefer to obtain information from the network. Traditional search-engine information collection methods are all implemented with crawler programs (Spider, Crawler, etc.) and achieved a degree of success for a period. However, as network services are continually updated, and particularly with the launch of new network services such as Web 2.0, Web 3.0, Twitter, Facebook and microblogs, traditional information collection methods can no longer meet current demands.
A search of prior art documents found Chinese patent CN100501746C, announced on June 17, 2009, which describes a "web page crawling method and web page crawling server". The technique comprises: first, receiving a web page request; next, determining whether the requested page has already been crawled; if it has not, crawling it directly; if it has, deciding whether to attempt a re-crawl according to whether the interval between the two crawl attempts has reached a threshold, and proceeding if it has; finally, deciding whether to crawl the page again according to whether the page has been updated. This patented technique still relies mainly on the traditional search-engine collection method and has the following main drawbacks:
1. Waste of network resources
Traditional information collection methods must repeatedly probe or re-collect information on the network to determine whether it has been updated. Although some techniques now use timestamps to detect updates and download only what is new, many network services do not support timestamp-based detection. Repeated probing or re-collection therefore remains the only option, which wastes network resources.
2. Poor timeliness
Faced with a huge volume of network information, traditional collection techniques can only visit each collection point by polling. It often takes an interval of at least one week before the latest information on a given website is polled again, so timeliness is poor.
3. Incomplete information collection
Because of dynamic web pages and access restrictions such as user login, traditional information collection techniques struggle to achieve comprehensive coverage, so a large amount of network information cannot be collected.
4. Dynamic data cannot be collected
For new network services such as forums, microblogs and Twitter, data such as reply counts and view counts may change from moment to moment, so traditional network collection methods may fail to capture how this information changes over time.
Embodiments
The method and device provided by the embodiments of the invention are described in detail below with reference to the accompanying drawings.
Embodiment one:
This embodiment of the invention provides an Internet information acquisition method based on push technology which, with reference to Figure 1, comprises:
S10: the data acquisition side and the collected side negotiate a data acquisition protocol, wherein:
The data acquisition side is the party that collects network information data, such as the system's information aggregation center. It passively receives the data submitted by the collected side under the negotiated protocol and stores it in a corresponding storage medium, for example an information storage device.
The collected side is the party that provides network information data. It mainly comprises entities that offer Internet information publishing services, such as portal sites, forums, blogs, social networks, microblogs and dating sites. Under the negotiated data acquisition protocol, the collected side actively pushes its data and updates to the data acquisition side.
The data acquisition protocol is the set of data submission rules negotiated between the data acquisition side and the collected side; under these rules the collected side submits data to the data acquisition side in structured form. The specific rules of the protocol cover the acquisition side identifier, the push frequency, the channels to collect, main post data (for example identifier, title, body, publisher, publication time, reply count and view count), comment data (for example comment content, commenter, floor relationship, comment time and comment attribute), the synchronization timestamp, and so on.
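As an illustration only, the negotiated rules could be held in a small configuration object. The sketch below is a minimal Python rendering under stated assumptions: the class name `AcquisitionProtocol`, every field name, the default field lists and the helper `validate_record` are hypothetical and do not come from the specification, which fixes only the kinds of rules the protocol contains.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class AcquisitionProtocol:
    """Hypothetical container for the negotiated data submission rules."""
    identifier: str                 # identifier agreed in the protocol (assumed to be a URL)
    push_interval_seconds: int      # push frequency, expressed as an interval
    channels: List[str]             # channels to collect
    main_post_fields: List[str] = field(default_factory=lambda: [
        "post_id", "title", "body", "publisher", "published_at",
        "reply_count", "view_count"])
    comment_fields: List[str] = field(default_factory=lambda: [
        "content", "commenter", "floor", "commented_at", "attribute"])
    sync_timestamp_format: str = "%Y%m%d%H%M%S"   # assumed timestamp layout


def validate_record(record: dict, required_fields: List[str]) -> List[str]:
    """Return the required field names that are missing from a structured record."""
    return [name for name in required_fields if name not in record]
```

A record would be accepted for pushing only when `validate_record` returns an empty list for the relevant field set.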
S20: the collected side actively pushes its specific data to the data acquisition side.
Specifically, the collected side uses a synchronization engine to actively push the specific data to the data acquisition side.
The function of the synchronization engine is to obtain the specific data from the collected end and, under the data acquisition protocol negotiated by both sides, actively push that data to the data acquisition side. The synchronization engine may be hardware, software, or a combination of the two.
The specific data is the data updated on the collected side between two acquisitions, such as newly published posts, post view counts, post reply counts and other data items stipulated by the protocol.
Note: in this and subsequent embodiments, active push means that the collected side, on its own initiative, sends the specific data to the data acquisition side whenever the rules set by the data acquisition protocol are satisfied.
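One way to picture how a synchronization engine might select the data updated between two acquisitions is a comparison of per-post counters against the values sent at the previous push. The following Python sketch assumes that each post is tracked as a (reply count, view count) pair keyed by a post identifier; the function name and this data layout are illustrative assumptions, not part of the specification.

```python
from typing import Dict, Tuple

Counters = Tuple[int, int]  # (reply_count, view_count), an assumed layout


def changed_since_last_push(previous: Dict[str, Counters],
                            current: Dict[str, Counters]) -> Dict[str, Counters]:
    """Return posts that are new, or whose counters differ from the last push."""
    return {post_id: counters
            for post_id, counters in current.items()
            if previous.get(post_id) != counters}
```

Posts whose counters are unchanged are left out, so nothing is transmitted for them in that interval.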
S30: the data acquisition side receives the specific data sent by the collected side and stores it, comprising:
The information aggregation service of the data acquisition side receives the data pushed by the synchronization engine and stores the collected data through a storage engine, wherein:
The data acquisition side may store the received data in a large-capacity storage medium.
The function of the information aggregation service is to receive, concurrently, the data actively pushed by synchronization engines. With the support of peripheral equipment it can provide load balancing, capacity expansion and the like. It may be hardware, software, or a combination of the two.
The function of the storage engine is to store the collected data, classified by category and in structured form, on the large-capacity storage medium of the data acquisition side. It may be a device, software, or a combination of the two.
The large-capacity storage medium is a storage device capable of holding a large amount of data.
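The cooperation between the information aggregation service and the storage engine can be sketched in a few lines of Python. This is a toy, in-memory stand-in under stated assumptions: `StorageEngine`, `receive_pushes`, the category names and the use of a thread pool for concurrent reception are all hypothetical choices, and a real deployment would write to the large-capacity storage medium rather than to a dictionary.

```python
import threading
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable, List, Tuple


class StorageEngine:
    """Toy storage engine: keeps records grouped by category, guarded by a lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._tables: Dict[str, List[dict]] = defaultdict(list)

    def store(self, category: str, record: dict) -> None:
        with self._lock:
            self._tables[category].append(record)

    def count(self, category: str) -> int:
        with self._lock:
            return len(self._tables.get(category, []))


def receive_pushes(engine: StorageEngine,
                   pushes: Iterable[Tuple[str, dict]],
                   workers: int = 4) -> None:
    """Accept pushed (category, record) pairs concurrently and store each one."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Consume the iterator so that any exception raised in a worker surfaces here.
        list(pool.map(lambda push: engine.store(*push), pushes))
```

The lock keeps the per-category lists consistent when several pushes arrive at the same time.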
In the push-based Internet information acquisition method provided by this embodiment, the collected side actively sends its specific data to the data acquisition side under the negotiated data acquisition protocol, realizing a new Internet information acquisition method based on push technology. The method avoids wasting bandwidth when collecting network information, makes collection more comprehensive and timely, and can also collect special data.
Embodiment two:
This embodiment of the invention provides an Internet information acquisition method based on push technology, taking a typical forum as the collected side. The embodiment comprises the following steps:
101: the collected side and the data acquisition side formulate a data acquisition protocol, as shown in Figure 2, where the collected side is the forum and the data acquisition side is the data aggregation center. In this embodiment, given the information publishing characteristics of a typical forum, the specific data acquisition protocol negotiated between the forum and the data aggregation center is:
The forum identifier; the frequency at which the forum actively pushes data; the forum boards to be collected; the forum's main post data (for example the main post identifier, title, content, publisher, publication time, reply count and view count); comment or reply data for a main post (for example comment content, commenter, floor relationship, comment time and comment attribute); the synchronization timestamp; and any other items agreed in the negotiation.
103: the forum's specific data is actively pushed to the data acquisition side through the synchronization engine. Referring to Figure 3, this process comprises:
(1) The forum actively queries its configuration data
The forum submits its identifier (the forum's URL) to the information aggregation service to actively query its configuration data. The configuration data comprises the push frequency and the list of boards to collect, specifically:
REQ (request): (URL)
ACK (response): (5M; International Observation, Entertainment and Leisure, ..., Stock Market)
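A forum-side client would need to turn the ACK into usable settings. The Python sketch below parses a response of the shape shown above; it assumes that "5M" denotes a push interval of 5 minutes and that board names are comma-separated after the semicolon. The function name `parse_config_ack` and the sample board names in the usage note are hypothetical.

```python
from typing import List, Tuple


def parse_config_ack(ack: str) -> Tuple[int, List[str]]:
    """Parse "(<n>M; board, board, ...)" into (interval in seconds, board list)."""
    inner = ack.strip()
    if inner.startswith("(") and inner.endswith(")"):
        inner = inner[1:-1]
    frequency, _, boards = inner.partition(";")
    frequency = frequency.strip()
    # Assumption: the only unit used is "M" for minutes, as in the sample "5M".
    if not frequency.upper().endswith("M"):
        raise ValueError(f"unsupported push frequency: {frequency!r}")
    minutes = int(frequency[:-1])
    board_list = [name.strip() for name in boards.split(",") if name.strip()]
    return minutes * 60, board_list
```

For example, `parse_config_ack("(5M; World Watch, Leisure, Stocks)")` yields an interval of 300 seconds and a three-item board list.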
(2) Submission of newly created main posts
Every 5 minutes the forum checks whether new main posts have appeared on the collected boards; if so, it actively pushes the information about each new main post to the data acquisition side. The pushed main post data comprises the main post identifier, title, content, publisher, publication time, synchronization timestamp and so on, specifically:
REQ (request): (main post URL; Happy birthday to the motherland; National Day is almost here, wishing the motherland prosperity; Sam001; 20110929; 20110929080500)
ACK (response): OK
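The main post message above can be assembled mechanically from its fields. The following Python sketch builds a REQ string in the same semicolon-separated layout; the function name and parameter names are hypothetical, and rejecting semicolons inside field values is an assumption made here because the specification does not describe any escaping rule.

```python
from datetime import datetime


def build_main_post_req(url: str, title: str, body: str, publisher: str,
                        published_on: datetime, synced_at: datetime) -> str:
    """Build "(url; title; body; publisher; YYYYMMDD; YYYYMMDDHHMMSS)"."""
    fields = [
        url, title, body, publisher,
        published_on.strftime("%Y%m%d"),      # publication date
        synced_at.strftime("%Y%m%d%H%M%S"),   # synchronization timestamp
    ]
    for value in fields:
        # Assumption: no escaping is defined, so a ";" in a value would corrupt the message.
        if ";" in value:
            raise ValueError("field values must not contain ';'")
    return "(" + "; ".join(fields) + ")"
```

A production protocol would more likely use a format with well-defined escaping; the layout here simply mirrors the sample message.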
(3) submission of comment or answer model
The content of main card can not change basically, but the moment all might produce main obedient new comment and reply data.For a popular main card, constantly all can take place to its comment or answer.Also can be checked whether have new answer data to occur or the new behavior of browsing occurs in per 5 minutes by collections side,, and be pushed to collection side if having then by the classification of main card sign.The data that push comprise that main card indicates, replys number, browses number, review record (comment content, reviewer, comment time, floor, answer floor, comment attribute), synchronized timestamp etc.
REQ (Request, request): (main obedient URL; 1024; 3231; (with wish, user01, x, 0,20110929 ,+1; Pass by, user02, x+1,0,20110929,0); 20110929080500)
ACK (response): OK
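The reply message nests a list of comment records inside the outer tuple. The Python sketch below reproduces that layout, following the field order seen in the sample message (content, commenter, floor, floor replied to, comment date, attribute). The `Comment` class, the function name and the reading of a reply-to floor of 0 as "replies to the main post" are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Comment:
    content: str
    commenter: str
    floor: int
    reply_to_floor: int   # assumed: 0 means the comment answers the main post
    commented_on: str     # YYYYMMDD
    attribute: str        # e.g. "+1" or "0", as in the sample message


def build_reply_req(post_url: str, reply_count: int, view_count: int,
                    comments: List[Comment], synced_at: str) -> str:
    """Build "(url; replies; views; (record; record; ...); timestamp)"."""
    records = "; ".join(
        ", ".join([c.content, c.commenter, str(c.floor),
                   str(c.reply_to_floor), c.commented_on, c.attribute])
        for c in comments)
    return f"({post_url}; {reply_count}; {view_count}; ({records}); {synced_at})"
```

Because the counters travel with every push, the receiving side can track how reply and view counts evolve even when no new comment text is attached.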
105: the information aggregation service of the data acquisition side receives the data pushed by the synchronization engine and stores it in structured form in the large-capacity storage medium.
In this embodiment, the data acquisition side stores the data pushed by the forum in structured form, providing data support for later mining, retrieval, analysis and so on. For a forum, three data tables need to be stored: main posts, replies (comments), and dynamic data.
Main post storage is shown in Table 1, reply (including comment) storage in Table 2, and dynamic data storage in Table 3.
Table 1
Table 2
Table 3
Sequence number | Main post identifier | Reply count | View count | Timestamp
i | Main post URL | 1024 | 3231 | 20110929080500
i+1 | Main post URL | 1055 | 3445 | 20110929081000
i+2 | Main post URL | 1189 | 4007 | 20110929081500
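The dynamic data table lends itself to a relational layout. The Python sketch below uses the standard `sqlite3` module as a stand-in for the large-capacity storage medium; the table name, column names and helper functions are hypothetical. It stores one row per push and derives how much the reply count grew between consecutive pushes, which is the kind of change process that polling tends to miss.

```python
import sqlite3
from typing import List


def create_dynamic_table(conn: sqlite3.Connection) -> None:
    """Create a table mirroring the columns of the dynamic data table."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS dynamic_data (
            seq            INTEGER PRIMARY KEY AUTOINCREMENT,
            main_post_id   TEXT    NOT NULL,
            reply_count    INTEGER NOT NULL,
            view_count     INTEGER NOT NULL,
            sync_timestamp TEXT    NOT NULL
        )""")


def record_snapshot(conn: sqlite3.Connection, post_id: str,
                    replies: int, views: int, timestamp: str) -> None:
    """Append one pushed counter snapshot for a main post."""
    conn.execute(
        "INSERT INTO dynamic_data (main_post_id, reply_count, view_count, sync_timestamp) "
        "VALUES (?, ?, ?, ?)",
        (post_id, replies, views, timestamp))


def reply_growth(conn: sqlite3.Connection, post_id: str) -> List[int]:
    """Return the increase in reply count between consecutive snapshots."""
    rows = conn.execute(
        "SELECT reply_count FROM dynamic_data WHERE main_post_id = ? "
        "ORDER BY sync_timestamp", (post_id,)).fetchall()
    counts = [row[0] for row in rows]
    return [later - earlier for earlier, later in zip(counts, counts[1:])]
```

Loaded with the three rows of Table 3, `reply_growth` returns `[31, 134]`, the number of new replies in each 5-minute interval.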
The embodiment of the invention has been announced a kind of interaction flow when gathering typical forum data, comprises steps such as active data inquiry, the obedient submission of newly-built master, answer or review information propelling movement.Forum can be initiatively sends to data acquisition center according to data collecting rule with self data, and this collecting method can realize that network information gathering is more comprehensive, and collection network information that can be promptly and accurately, also can collect special data simultaneously.
Embodiment three:
This embodiment of the invention provides an Internet information acquisition device based on push technology, comprising a data acquisition side 201 and a collected side 203, wherein:
The collected side 203 is configured to negotiate a data acquisition protocol with the data acquisition side 201 and, under that protocol, to actively push the specific data of the collected side 203 to the data acquisition side 201;
The data acquisition side 201 is configured to negotiate the data acquisition protocol with the collected side 203, to receive the specific data sent by the collected side 203, and to store the specific data;
Wherein the data acquisition protocol is the set of data submission rules negotiated between the data acquisition side 201 and the collected side 203; the data acquisition side 201 is the party that collects network information data; the collected side 203 is the party that provides network information data; and the specific data is the data updated on the collected side 203 between two acquisitions.
In another embodiment of the invention, the device further comprises:
A synchronization engine, configured to obtain the specific data of the collected side and, under the data acquisition protocol negotiated by both sides, actively push the specific data to the data acquisition side.
In another embodiment of the invention, the device further comprises an information aggregation service and a storage engine, wherein:
The information aggregation service is configured to receive the specific data sent by the synchronization engine and to store the specific data through the storage engine;
The storage engine is configured to store the received specific data classified by category and in structured form.
In another embodiment of the invention, the collected side pushes its specific data to the data acquisition side in structured form according to the data acquisition protocol.
In another embodiment of the invention, the specific rules of the data acquisition protocol comprise at least one of: the data acquisition side identifier, the push frequency, the channels to collect, main post data, comment data, and the synchronization timestamp.
In the push-based Internet information acquisition device provided by this embodiment, the collected side actively sends its specific data to the data acquisition side under the negotiated data acquisition protocol, realizing a new way of acquiring Internet information based on push technology. The device avoids wasting bandwidth when collecting network information, makes collection more comprehensive and timely, and can also collect special data.
The above are some preferred implementations of the embodiments of the invention. A person skilled in the art may make various additions and modifications to the details described herein without departing from the spirit of the invention or exceeding its technical scope. The scope of protection of the invention is not limited to the embodiments listed and is defined by the claims.