CN101950312A - Method for analyzing webpage content of internet

Info

Publication number: CN101950312A
Application number: CN 201010512730
Authority: CN
Inventors: 赵清政
Original assignee: Individual
Current assignee: Individual
Priority date: 2010-08-18
Filing date: 2010-10-20
Publication date: 2011-01-19
Anticipated expiration: 2030-10-20
Also published as: CN101950312B

Abstract

The invention provides a method for analyzing Internet web page content and belongs to the field of network technology. The method comprises the following steps. First, a web page template library is initialized and the page to be parsed is read. Second, whether the page to be parsed is generated from a template is determined from its URL. If it is not, the page is parsed in the ordinary way. If it is, a hash value is generated for the directory of the page and looked up in the template hash table of the web page template library. If the value is found, the page is parsed with the corresponding template; otherwise, pages of the same type as the page to be parsed are found and used to generate the corresponding template, and the page is then parsed with that template. The method greatly improves the accuracy and the results of web page parsing.

Description

A method for analyzing Internet web page content

Technical field

The invention belongs to the field of network technology and specifically relates to a method for analyzing Internet web page content.

Background art

In recent years, with the spread of the Internet, rising bandwidth, and maturing service models, search engines have gradually become a mainstream Internet application. Technically, an Internet search engine generally consists of two parts: an offline processing part and an online processing part. The offline part mainly comprises functional modules for crawling web pages, parsing them, and building the index. The online part works as follows: given the query terms submitted by a user, it looks up the matching documents (that is, web pages) in the index and data produced by the offline part, ranks the retrieved documents by some criterion, and finally returns the ranked results to the user.

Throughout the operation of a search engine, web page parsing plays a fundamental and critical role: it effectively determines which data and content are used to build the index and can therefore eventually be found by user queries. For technical and commercial reasons, the content of today's web pages is very complex. Besides the content a page is really meant to convey, a great deal of irrelevant information is mixed in, such as advertisements and junk links. Because parsing accuracy strongly affects the end-user experience of a search service, many methods have been devised to improve the parsing of page content. They can be grouped into two kinds:

The first kind treats the page as a character stream, computes features of each part from the tags and their positions in the page, and identifies the title, body text, and other parts of the page by analyzing those features.

The second kind uses the Document Object Model (DOM) tree. A DOM tree is first built from the original page, and the content of the page is then determined by comparing the attributes of the tree's nodes.

Both kinds of method, in essence, apply a set of rules formulated in advance to select certain content from the page. Unfortunately, page layouts today vary endlessly and cannot be enumerated. In practice these methods may suit some pages but not others, so the final parsing result either contains junk information or loses genuinely useful information.

Summary of the invention

To address the problem that existing methods for analyzing Internet web page content cannot fully suit all pages and produce inaccurate parsing results, the present invention provides a method for analyzing Internet web page content.

The method for analyzing Internet web page content provided by the invention comprises the following steps:

Step 1: initialize the web page template library and read the page to be parsed. A template hash table is set up in the web page template library, and each identifier dirID recorded in the template hash table corresponds to one template;

Step 2: determine from the uniform resource locator (URL) of the page to be parsed whether the page is generated from a template. If not, go to step 3; otherwise go to step 4;

Step 3: parse the page in the ordinary way to obtain the parsing result;

Step 4: for the directory of the page to be parsed, generate an identifier dirID by a hash method and look it up in the template hash table of the web page template library. If the dirID is present, go to step 6; otherwise go to step 5;

Step 5: find other pages of the same type as the page to be parsed for use in generating a template; the number of pages used to generate the template must not be less than the minimum-page threshold. Generate the template corresponding to the page to be parsed from all the pages obtained. If a single fingerprint hash table records the feature values of the junk blocks of all templates in the web page template library, update that fingerprint hash table; if a separate fingerprint hash table is built for each generated template, save the fingerprint hash table of the newly generated template. Add the generated template to the web page template library by adding the dirID of the directory of the page to be parsed to the library's template hash table;

Step 6: parse the content of the page to be parsed using the template that corresponds to the dirID of its directory, and obtain the parsing result. Specifically: divide the content of the page into blocks and generate a hash value for each block from its content. For each hash value, check whether it exists in the library's fingerprint hash table or in the fingerprint hash table of the template corresponding to the page. If it exists, the block is left unprocessed; if it does not, the content of the block is extracted. All extracted block contents together form the parsing result.
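The control flow of steps 2 to 6 can be sketched as a small dispatcher. This is only an illustrative sketch: the function name parse_page and all of its callback parameters are hypothetical and do not appear in the patent, and the template is represented simply as a set of junk-block fingerprints.

```python
from typing import Callable, Dict, List, Set


def parse_page(
    url: str,
    html: str,
    template_table: Dict[str, Set[int]],
    is_templated: Callable[[str], bool],
    directory_key: Callable[[str], str],
    parse_generic: Callable[[str], List[str]],
    build_template: Callable[[str], Set[int]],
    parse_with_template: Callable[[str, Set[int]], List[str]],
) -> List[str]:
    """Dispatch one page through steps 2 to 6 (all callbacks are hypothetical)."""
    if not is_templated(url):                      # step 2: URL check
        return parse_generic(html)                 # step 3: ordinary parsing
    key = directory_key(url)                       # step 4: identifier of the directory
    if key not in template_table:
        template_table[key] = build_template(url)  # step 5: build and register template
    return parse_with_template(html, template_table[key])  # step 6
```

A template is built only the first time a directory is seen; later pages from the same directory go straight to step 6.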

The same-type pages in step 5 are, for static pages, the pages that share the same directory, and for dynamic pages, the pages that belong to the same basic class under the same entry point.

The template generation in step 5 specifically comprises the following steps. Step A: divide the content of all obtained pages into blocks. Step B: generate a feature value for each block from its content, using a hash method. Step C: count how often each kind of block occurs, based on the feature values of the blocks. Step D: mark blocks whose frequency of occurrence exceeds a preset threshold as junk blocks, and save the feature value of each junk block in the fingerprint hash table. Step E: if a separate fingerprint hash table is built for each generated template, associate the directory of the page to be parsed with the corresponding fingerprint hash table.

When page content is divided into blocks in step A and step 6, the segmentation must be consistent and stable. Pages are cut at the natural boundaries given by the tr, td, and div tags, with a block length of no less than 20 bytes; structurally simple parts of a page are cut into large blocks of unrestricted length.

The preset threshold in step D has a minimum of 3; when n is greater than 30 it is n^0.3 rounded up, subject to a maximum of 10, where n is the number of pages obtained for generating the template.

The method for analyzing Internet web page content provided by the invention automatically determines whether a page is generated from a template and automatically finds the template corresponding to the page, so that the best-fitting template is used to parse it. This greatly improves the accuracy of web page parsing and markedly improves its results.

Description of drawings

Fig. 1 is a flow chart of the steps of the Internet web page parsing method of the invention;

Fig. 2 is a flow chart of template generation in step 5 of the Internet web page parsing method of the invention.

Embodiment

The invention is described in further detail below with reference to the drawings and an embodiment.

The purpose of the invention is to remedy the defect that the prior art cannot parse all pages accurately and gives unsatisfactory results, by providing a method for analyzing Internet web page content that analyzes and processes pages in a targeted way for each website, or even for each channel of a website.

As shown in Fig. 1, the method for analyzing Internet web page content of the invention specifically comprises the following steps:

Step 1: initialize the web page template library and read the page to be parsed.

For example, if the uniform resource locator (URL) of the page to be parsed is news.sina.com.cn, the URL of the page and the corresponding original page need to be read.

In the initial state, the number of templates in the web page template library is 0. Each template in the library is identified by an identifier dirID, and all dirIDs are kept in the template hash table, which stores its data as a hash table.

Step 2: determine whether the page to be parsed is generated from a template. If not, go to step 3; otherwise go to step 4.

The determination is made by checking whether the URL of the page to be parsed contains, besides the "//" after "http:", a directory separator "/". If it does, the page is considered generated from a template; if it does not, the page is considered not generated from a template.

For example, for the page to be parsed news.sina.com.cn, the URL shows that this is the news channel page of sina.com.cn and is not generated from a template, so parsing proceeds to step 3.

For example, for the page to be parsed http://news.sina.com.cn/h/2010-07-15/141820685517.shtml, the URL readily shows that its directory is "http://news.sina.com.cn/h/2010-07-15", that is, the part before the last "/". Because this page is generated from a template, parsing proceeds to step 4.

For the page http://item.taobao.com/item.htm?id=6660646078&cm_cat=110207, the directory is http://item.taobao.com, and the page is generated from a template.
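The URL test of step 2 and the directory extraction used in the examples above can be sketched as follows. The function names are hypothetical, and dropping the query string before looking for the last "/" is an assumption made so that the taobao example yields the directory stated in the text.

```python
from typing import Tuple


def split_scheme(url: str) -> Tuple[str, str]:
    """Split 'http://host/path' into ('http://', 'host/path'); the scheme may be absent."""
    marker = "://"
    idx = url.find(marker)
    if idx == -1:
        return "", url
    return url[: idx + len(marker)], url[idx + len(marker):]


def is_template_generated(url: str) -> bool:
    """True if the URL contains a '/' besides the '//' that follows the scheme."""
    _, rest = split_scheme(url)
    return "/" in rest


def directory_of(url: str) -> str:
    """Return the part of the URL before its last '/', ignoring any query string."""
    scheme, rest = split_scheme(url)
    path = rest.split("?", 1)[0]  # assumption: the query string is not part of the directory
    return scheme + path.rsplit("/", 1)[0]
```

Applied literally, the rule also classifies a bare host with a trailing slash as generated from a template; the text does not say how that case should be handled.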

Step 3: parse the page in the ordinary way, obtain the parsing result, and finish parsing the page. The ordinary way means using the character-stream approach or the DOM-tree approach, applying a set of rules formulated in advance to select features from the page.

Step 4: determine whether the web page template library already contains a template matching the page to be parsed. If so, go to step 6; otherwise go to step 5.

An identifier dirID is generated for the directory of the page to be parsed; identical directories have identical dirIDs. The embodiment generates the dirID by a hash method. For the directory "http://news.sina.com.cn/h/2010-07-15", suppose the generated dirID is 14130464512028122877, which is 0xc4197e9b76b31efd in hexadecimal. The hexadecimal dirID is looked up in the template hash table of the web page template library. If the value is not in the table, the library has no corresponding template and processing goes to step 5; if the value is in the table, the library has a corresponding template and processing goes to step 6.
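A minimal sketch of the dirID lookup is given below. The patent does not name the hash function, so truncating a SHA-1 digest to 64 bits is purely an assumption chosen to match the 16 hexadecimal digits of the example value; it will not reproduce 0xc4197e9b76b31efd. The names dir_id, template_table, and lookup_template are hypothetical.

```python
import hashlib
from typing import Dict, Optional, Set


def dir_id(directory: str) -> int:
    """Map a directory string to a 64-bit identifier (assumed hash: truncated SHA-1)."""
    digest = hashlib.sha1(directory.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


# Template hash table: dirID -> template (here a set of junk-block fingerprints).
template_table: Dict[int, Set[int]] = {}


def lookup_template(directory: str) -> Optional[Set[int]]:
    """Return the template registered for this directory, or None (go to step 5)."""
    return template_table.get(dir_id(directory))
```

Identical directories always map to the same dirID, which is the only property step 4 relies on.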

Step 5: find pages of the same type as the page to be parsed, generate the template corresponding to the page to be parsed, and add the dirID of the page's directory to the template hash table of the web page template library.

For static pages, same-type pages generally means pages under the same directory; for dynamically generated pages, it means all pages of the same basic class under the same entry point.

For dynamically generated pages, all pages of the same subtype under the same entry point of a website are defined as one basic class, where a subtype is a type that can be used for classification. For example, the URLs http://item.taobao.com/item.htm?id=4283563695&cm_cat=110207 and http://item.taobao.com/item.htm?id=6660646078&cm_cat=110207 both belong to cm_cat=110207, and cm_cat=110207 is a basic class.
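For dynamic pages, grouping by basic class could look like the sketch below. Which query parameter carries the class is site-specific; using cm_cat as the default is an assumption taken from the example, and the function name basic_class is hypothetical.

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def basic_class(url: str, class_param: str = "cm_cat") -> Optional[str]:
    """Return a key made of the entry point plus the classifying parameter, or None."""
    parts = urlsplit(url)
    values = parse_qs(parts.query).get(class_param)
    if not values:
        return None  # no classifying parameter in this URL
    return f"{parts.scheme}://{parts.netloc}{parts.path}?{class_param}={values[0]}"
```

Two URLs that differ only in their id parameter receive the same key and therefore count as pages of the same type.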

When a new template is generated, other pages of the same type as the page to be parsed must first be found. The number of pages used to generate the new template must be greater than or equal to the minimum-page threshold for template generation. The template corresponding to the page to be parsed is then generated from all the pages obtained, including the page to be parsed itself. The minimum-page threshold is an integer of at least 3; from a probabilistic standpoint more pages are better, and more than 10 pages are generally used. If too few pages under the same directory are used, the extracted template will fail to capture template information that should be filtered out. When the minimum of 3 pages is used, the implicit premise is that the directory has only one template with no other nested templates, or that the 3 pages belong to the same template.

Starting from a known page, http://news.sina.com.cn/h/2010-07-15/075320682851.shtml, its content is analyzed to find as many pages of the same type as possible. If the page content contains no other same-type pages, or fewer than the minimum-page threshold needed to build a template, a path for discovering other pages must be sought. First, check whether the website already has other templates. If it does, reuse the way other pages were discovered when that template was generated, and look for the URLs of other pages by relative path. For example, if an existing template found a page under its directory at http://news.sina.com.cn/h/2010-08-26/105620979437.shtml, the page at the corresponding relative path is http://news.sina.com.cn/h/2010-07-15/105620979437.shtml; check whether that page exists, and if it does, one same-type page has been found. If no other template has been built for this website, check whether other websites offer a ready-made pattern for reference, where a ready-made pattern is a method of finding multiple same-type page URLs from one URL. If such patterns exist, try them one by one, looking up pages by relative path, until enough other pages under the directory are found. If none exist, look for the parent URL within the page and continue searching from the parent URL for other pages under the directory, until enough pages to generate a template have been found. The discovery path is recorded, both to help build other templates for this website and to enrich the means available for building templates for other websites.
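The relative-path lookup in the example above amounts to transplanting the file name of a known page into the target directory. The sketch below shows only that URL construction; the function name is hypothetical, and checking whether the candidate page actually exists (for instance with an HTTP request) is deliberately left out.

```python
def candidate_in_directory(known_url: str, target_directory: str) -> str:
    """Build a candidate URL by placing the known page's file name in the target directory."""
    filename = known_url.rsplit("/", 1)[1]
    return target_directory.rstrip("/") + "/" + filename
```

For the URLs in the text, the known page under /h/2010-08-26 yields the candidate under /h/2010-07-15 with the same file name.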

In the embodiment, 9 pages of the same type as the page to be parsed are found, and the 10 pages, including the page to be parsed, are analyzed to generate the template corresponding to the page to be parsed. Finally, the dirID 0xc4197e9b76b31efd of the directory of the page http://news.sina.com.cn/h/2010-07-15/075320682851.shtml is added to the template hash table of the web page template library.

As shown in Fig. 2, the template corresponding to the page to be parsed is generated as follows:

Step A: divide the content of all obtained pages into blocks;

Step B: generate a feature value, that is, a block fingerprint, for each block from its content. The block fingerprint is generated by a hash method; each block is represented by one hash value, and each page corresponds to multiple hash values;

Step C: count how often each kind of block occurs, based on the feature values of the blocks;

Step D: mark blocks whose frequency of occurrence exceeds a preset threshold as junk blocks. All junk blocks form the junk block set, and the feature value of each junk block in the set is saved in the template library;

Step E: associate the directory of the page to be parsed with the corresponding junk block set, that is, associate the directory of the page to be parsed with the corresponding fingerprint hash table.
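Steps B to D can be sketched as below, taking the already-split blocks of each page as input. The fingerprint function (a truncated SHA-1 digest) and the names fingerprint and build_junk_fingerprints are assumptions for illustration; the patent only says that a hash method is used. Occurrences are counted across all blocks of all pages, following the wording of step C.

```python
import hashlib
from collections import Counter
from typing import Iterable, List, Set


def fingerprint(block: str) -> int:
    """Step B: feature value of one block (assumed hash: 64 bits of SHA-1)."""
    return int.from_bytes(hashlib.sha1(block.encode("utf-8")).digest()[:8], "big")


def build_junk_fingerprints(pages_blocks: Iterable[List[str]], threshold: int) -> Set[int]:
    """Steps C and D: count each fingerprint and keep those occurring more than threshold times."""
    counts = Counter(fingerprint(block) for blocks in pages_blocks for block in blocks)
    return {fp for fp, count in counts.items() if count > threshold}
```

The returned set plays the role of the fingerprint hash table of one template; step E would store it under the dirID of the directory.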

To divide pages into blocks in step A, the page must be cut according to definite rules that ensure consistent cutting and make accidental collisions unlikely. A web page is made up of a great deal of structured data, such as p nodes, a nodes, the row tag tr, the column tag td, and the layer tag div; rule-based approaches to parsing also rely on this kind of structure in the page itself.

In general, the larger the blocks, the faster the processing, because fewer blocks mean less data. Precision is lower, however, because a large block may contain part of the template, so that the template portion is also treated as page-specific content; recall is correspondingly higher. For example, if each page is handled as a single block, recall is necessarily 100%. Conversely, the smaller the blocks, the slower the processing, the higher the precision, and the lower the recall.

To balance the two, pages should be cut at natural nodes. Cutting is generally done at tags such as tr, td, and div. The aim is consistency: as far as possible, the same content should be cut into the same blocks wherever it appears. This requirement matters especially toward the end of a page, because any inconsistency in cutting accumulates, so it shows up more clearly the further down the page one goes. Block length is generally kept to no less than 20 bytes, because blocks that are too short raise the probability that different content produces the same fingerprint, can cause repetition within a single page, and add computation without practical benefit. Choosing suitable cut points and block lengths serves to make collisions unlikely and to keep cutting consistent: statistically, different content should not share a fingerprint, and the larger the blocks, the more likely cutting consistency is to be broken. Block size is also closely tied to page structure. Structurally simple parts of a page can be handled as large blocks; the basic premise is that, where the structure is not complex, blocks should be made as large as possible, because a large, structurally simple block helps realign the cutting. A structurally simple part is one in which the byte ratio of its text to the markup of its tags, such as the row tag tr, the column tag td, and the layer tag div, is above 10:1.
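One possible reading of the 10:1 criterion is sketched below. It counts the bytes of opening and closing tr, td, and div tags as markup and everything else in the fragment as text; that split, the handling of fragments without such tags, and the function name are all assumptions, since the text does not define the ratio more precisely.

```python
import re

# Opening or closing tr, td, or div tags, including their attributes.
STRUCT_TAG = re.compile(rb"</?(?:tr|td|div)\b[^>]*>", re.IGNORECASE)


def is_structurally_simple(fragment: bytes, min_ratio: float = 10.0) -> bool:
    """True if text bytes outnumber tr/td/div tag bytes by more than min_ratio to 1."""
    tag_bytes = sum(len(m.group()) for m in STRUCT_TAG.finditer(fragment))
    text_bytes = len(fragment) - tag_bytes
    if tag_bytes == 0:
        return text_bytes > 0  # assumption: tag-free text counts as simple
    return text_bytes / tag_bytes > min_ratio
```

A fragment that passes this test would be kept as one large block instead of being cut at every tag.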

In summary: cut at natural tag nodes such as tr, td, and div; keep block length to no less than 20 bytes; and cut structurally simple parts of the page into blocks as large as possible, with no limit on length.

In practice, cutting can start from the first character of the page and scan for the configured nodes, for example td, tr, and div. When such a node is encountered, its position is set as the start of a block. The next node position is then found in the same way. If the distance between the two adjacent positions is greater than the configured minimum length, here 20 bytes, the part between them is taken as one block, and a fingerprint is generated from its content by a hash method. The end of this block is also the start of the next block. If the distance between adjacent positions is less than the minimum length, scanning continues to the next node and the intermediate node is treated as invalid, until a node is found whose distance from the start of the block is greater than 20 bytes or the end of the page is reached; a fingerprint is then generated for the block.
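The scanning procedure just described might be sketched as follows. Several details are assumptions: the page is handled as bytes, only opening tr, td, and div tags count as nodes, a block is accepted once it reaches the minimum length (the text says both "no less than" and "greater than" 20 bytes), content before the first node is ignored, and a page without any such node yields no blocks. The names are hypothetical.

```python
import re
from typing import List

# Opening tr, td, or div tags mark candidate cut points.
BOUNDARY = re.compile(rb"<(?:tr|td|div)\b", re.IGNORECASE)
MIN_BLOCK_BYTES = 20


def split_blocks(html: bytes, min_len: int = MIN_BLOCK_BYTES) -> List[bytes]:
    """Cut a page at tr/td/div nodes, skipping nodes that would give too short a block."""
    positions = [m.start() for m in BOUNDARY.finditer(html)]
    if not positions:
        return []
    blocks = []
    start = positions[0]
    for pos in positions[1:]:
        if pos - start >= min_len:
            blocks.append(html[start:pos])  # the end of this block starts the next one
            start = pos
        # otherwise the node is treated as invalid and scanning continues
    blocks.append(html[start:])  # final block runs to the end of the page
    return blocks
```

Only the last block may be shorter than the minimum length, which matches the rule that a fingerprint is generated when the end of the page is reached.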

When fingerprint values are generated, different blocks must have different fingerprint values, that is, fingerprint values must not collide, so a reliable method is chosen. The embodiment uses a hash method; experiments show that it is reliable and keeps fingerprint values free of collisions.

In step C, the number of same-type pages obtained is counted first. The fingerprints of all blocks under the directory are then put into the fingerprint hash table, and the number of occurrences of each kind of block, that is, the number of times each block fingerprint repeats, is counted. The fingerprint hash table stores its data as a hash table. Its size depends on the number of same-type pages: generally 20 times the number of same-type pages obtained, with a capacity of at least 10,000 nodes.
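The sizing rule for the fingerprint hash table is simple arithmetic, sketched here with hypothetical names. It matters for a fixed-capacity table; a Python dict or set grows on demand and would not need it.

```python
MIN_CAPACITY = 10_000    # at least 10,000 nodes
PAGES_MULTIPLIER = 20    # generally 20 times the number of same-type pages


def fingerprint_table_capacity(page_count: int) -> int:
    """Node capacity for the fingerprint hash table given the number of same-type pages."""
    return max(MIN_CAPACITY, PAGES_MULTIPLIER * page_count)
```

For the 10 pages of the embodiment the floor of 10,000 nodes applies, since 20 times 10 is only 200.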

In step D, the fingerprint hash table is traversed, block fingerprints that occur more often than a preset threshold are taken to be the template fingerprints of the directory, and these template fingerprints are saved in the fingerprint hash table built in the web page template library. The fingerprint hash table can be one table per template, or one large table for all templates.

This step derives the template of the directory according to definite rules; the template may automatically include several sub-templates. The preset threshold is an integer, at least 3 and at most 10, and preferably 5. Generally, when n is greater than 30, the threshold is n^0.3 rounded up, where n is the number of same-type pages obtained for generating the template, including the page to be parsed. The threshold value is a balance struck between mathematical collision probability and practical application. Because blocking already ensures that the probability of different content colliding on the same fingerprint is extremely low, and the way pages are generated ensures that non-template parts have different fingerprints, the accuracy of template identification is assured both qualitatively and quantitatively.
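The threshold rule can be written out as a short function. The text does not fix the value for n of 30 or fewer beyond the bounds of 3 and 10 and the preference for 5, so the small_n_default parameter is an assumption, as is the function name.

```python
import math


def junk_threshold(n: int, small_n_default: int = 3) -> int:
    """Frequency threshold for step D: ceil(n ** 0.3) when n > 30, clamped to [3, 10]."""
    if n > 30:
        value = math.ceil(n ** 0.3)
    else:
        value = small_n_default  # assumption: caller chooses, e.g. 3 or the preferred 5
    return max(3, min(10, value))
```

With 100 pages the threshold is 4, with 1000 pages it is 8, and from roughly 2200 pages upward the cap of 10 applies.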

Step E associates the dirID of the directory of the page to be parsed with the feature value of each junk block in the junk block set; the dirID of one directory corresponds to the fingerprint hash table built in step C. In practice, step E may instead omit the association between the directory's dirID and the junk-block feature values and simply keep the feature values of the junk blocks of all directories in one overall fingerprint hash table.

Step 6: parse the content of the page using the template corresponding to the page to be parsed.

For example, for the page http://news.sina.com.cn/h/2010-07-15/075320682851.shtml, the directory of the URL, http://news.sina.com.cn/h/2010-07-15, is obtained in the same way, the dirID of this directory, 0xc4197e9b76b31efd, is computed by the hash method, and the template hash table of the web page template library is searched for this dirID. Because the template has already been generated, the corresponding dirID is found in the template hash table.

The content of the page to be parsed is first divided into blocks, and a hash value is generated for each block from its content. Each hash value is looked up in the fingerprint hash table of the corresponding template, or in the overall fingerprint hash table. If the hash value is found, the block is a machine-generated template part; if it is not found, the block is a page-specific part. All page-specific parts extracted from the page make up its main content. The blocking method used here is the same as in step A of step 5.
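The filtering in step 6 reduces to a membership test per block, sketched below. As before, the fingerprint function (truncated SHA-1) and the names are assumptions; the junk fingerprints may come from a per-template table or from the overall table.

```python
import hashlib
from typing import Iterable, List, Set


def fingerprint(block: str) -> int:
    """Feature value of one block (assumed hash: 64 bits of SHA-1)."""
    return int.from_bytes(hashlib.sha1(block.encode("utf-8")).digest()[:8], "big")


def extract_content(blocks: Iterable[str], junk_fingerprints: Set[int]) -> List[str]:
    """Keep the blocks whose fingerprint is not in the fingerprint hash table."""
    return [block for block in blocks if fingerprint(block) not in junk_fingerprints]
```

The same blocking and hashing must be used here as during template generation, otherwise template blocks will not be recognized.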

Claims

1. A method for analyzing Internet web page content, characterized in that the method specifically comprises the following steps:

Step 1: initializing a web page template library and reading the page to be parsed; a template hash table is set up in the web page template library, and each identifier dirID recorded in the template hash table corresponds to one template;

Step 2: determining whether the page to be parsed is generated from a template, specifically by checking whether the uniform resource locator (URL) of the page to be parsed contains, besides the "//" after "http", a directory separator "/"; if it does, the page to be parsed is generated from a template, and if it does not, the page to be parsed is not generated from a template;

Step 3: parsing the page in the ordinary way to obtain the parsing result;

Step 4: for the directory of the page to be parsed, generating an identifier dirID by a hash method and looking it up in the template hash table of the web page template library; if the dirID is present, going to step 6, otherwise going to step 5;

Step 5: obtaining pages of the same type as the page to be parsed and generating the template corresponding to the page to be parsed; if a single fingerprint hash table records the feature values of the junk blocks of all templates in the web page template library, updating that fingerprint hash table; if a separate fingerprint hash table is built for each generated template, saving the fingerprint hash table of the newly generated template; adding the generated template to the web page template library by adding the dirID of the directory of the page to be parsed to the library's template hash table; the number of pages used to generate the template must not be less than the minimum-page threshold;

The pages of the same type are, for static pages, the pages that share the same directory, and for dynamic pages, the pages that belong to the same basic class under the same entry point;

Step 6: dividing the content of the page to be parsed into blocks and generating a hash value for each block from its content; for each hash value, checking whether it exists in the library's fingerprint hash table or in the fingerprint hash table of the template corresponding to the page to be parsed; if it exists, leaving the corresponding block unprocessed; if it does not exist, extracting the content of the corresponding block; all extracted block contents together form the parsing result.

2. The method for analyzing Internet web page content according to claim 1, characterized in that the minimum-page threshold for generating the template in step 5 is greater than or equal to 3.

3. The method for analyzing Internet web page content according to claim 1 or 2, characterized in that the minimum-page threshold for generating the template in step 5 is 10.

4. The method for analyzing Internet web page content according to claim 1, characterized in that the template generation in step 5 specifically comprises the following steps:

Step A: dividing the content of all obtained pages into blocks;

Step B: generating a feature value for each block, the feature value being generated by a hash method;

Step D: marking blocks whose frequency of occurrence exceeds a preset threshold as junk blocks, and saving the feature value of each junk block in the fingerprint hash table;

Step E: if a separate fingerprint hash table is built for each generated template, associating the dirID of the directory of the page to be parsed with the corresponding template fingerprint table.

5. The method for analyzing Internet web page content according to claim 1 or 4, characterized in that the content of a page is divided into blocks in step 6 or step A as follows: the page is cut at the natural boundaries given by the tr, td, and div tags, with a block length of no less than 20 bytes and with no limit on the cut length for structurally simple parts of the page, where tr, td, and div denote the row tag, column tag, and layer tag of a web page respectively.

6. The method for analyzing Internet web page content according to claim 4, characterized in that the preset threshold in step D has a minimum of 3; when n is greater than 30 it is n^0.3 rounded up, subject to a maximum of 10, where n is the number of pages obtained for generating the template.

7. The method for analyzing Internet web page content according to claim 6, characterized in that the preset threshold is set to 5.

8. The method for analyzing Internet web page content according to claim 1, characterized in that the template hash table and the fingerprint hash table both store their data as hash tables.