Triple-stage webpage text content identification and filtering method based on Chinese punctuation marks
Technical field
The invention belongs to the field of network information security, and relates to the identification and filtering of harmful text content in Chinese web pages.
Background technology
Most existing web content security products, such as "Net Nanny" and "Net Father", forbid access to illegal web pages and websites based on URL addresses and keywords. Given the diversity and dynamic nature of illegal online content, methods that rely on static address databases or manually updated URL and keyword lists fall far short of users' filtering requirements, and parents expect more effective and comprehensive information filtering products.
Existing webpage text content filtering methods are mainly built around the vector space model.
Liu Peide et al. used the vector space model, the TC3 classification algorithm, and the Rocchio feedback model to construct a Network Information Filtering System (NIFS) with a feedback mechanism; the system performs text filtering based on user interest profiles.
The information security filtering system based on the vector space model established by Cao Yi and He Weihong divides filtering into two stages: template training and adaptive filtering. In the training stage, an initial filtering template is built through topic processing and feature extraction, and an initial threshold is set. In the filtering stage, the template and threshold are adjusted adaptively according to user feedback. The distinguishing feature of this method lies in the design of the filtering-template training algorithm.
In 2002, Shian-Hua Lin and Jan-Ming Ho proposed a method for removing noise content from web pages. The method constructs a tag tree of a web page from its <TABLE> tags, partitioning the page into mutually nested content blocks. Then, over a collection of pages generated from the same template, content blocks that occur repeatedly across the collection are identified as noise content, while blocks that occur rarely in the collection are taken as the effective information blocks.
Fudan University proposed an Internet filtering system and filtering method based on a Content Filtering Agent (CFA). The system framework comprises three parts: the content filtering agent (CFA), a query server (QS), and a content analysis and management server (CAMS). The filtering process is as follows: when a user requests access to a URL, the CFA allows or denies the request according to the user's black and white lists. If the URL is in neither list, the CFA sends a query to the query server QS, which looks up the URL's rating information in its own URL database and returns the result to the CFA; the CFA then responds accordingly. Meanwhile, the QS periodically downloads updated URL rating information from the CAMS.
Microsoft's "information filtering technology for network browsing" provides a system and method for controlling whether a user may access certain Internet sites from a computer. When a user attempts to access an Internet site identified by a uniform resource locator (URL), the filter consults an allow-block list for the URL, cross-references the user's age group against a table mapping age groups to the content categories they are permitted to view, and determines access to the site accordingly.
To summarize, existing Internet information filtering methods still have the following shortcomings:
1. Methods based on URLs and keywords suffer from low filtering precision and recall, and are easily bypassed;
2. Methods based solely on the text vector space model filter slowly and cannot meet the real-time filtering requirements of broadband data transmission;
3. Web page preprocessing has received little study; in particular, no literature has reported on extracting the body text of general web pages, although research on this problem could effectively improve the speed of web data processing;
4. Content identification and filtering methods targeted at the characteristics of Chinese web pages have not been reported.
Summary of the invention
To overcome the limitations of existing web information filtering methods, whose precision, recall, and filtering speed cannot keep up with network traffic, the invention provides a triple-stage filtering method that organically combines the existing URL-based, keyword-based, and vector-space text filtering approaches. In URL filtering, legal and illegal URL tables, i.e., white and black lists, are maintained to improve filtering speed. Winsock 2 SPI is used to intercept HTTP packets directly at the application layer, avoiding the packet reassembly and protocol analysis required when intercepting at lower layers. A method for identifying and denoising the body text of Chinese web pages based on Chinese punctuation mark statistics is proposed.
To achieve the above goals, the invention adopts the following technical scheme:
The system adopts a three-stage filtering pattern: URL filtering, keyword filtering, and text content filtering.
The system structure is shown in Figure 1, where:
URL filtering module
Using preset illegal URL (blacklist) and legal URL (whitelist) tables, this module judges whether a user request is legitimate.
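The black/white list check can be sketched as follows; the list contents and the `classify_url` name are illustrative assumptions, not taken from the patent text.

```python
from urllib.parse import urlparse

# Illustrative list contents; a real deployment would load these tables.
BLACKLIST = {"bad.example.com"}
WHITELIST = {"news.example.com"}

def classify_url(url):
    """Return 'allow', 'block', or 'suspicious' for a requested URL."""
    host = urlparse(url).hostname or ""
    if host in WHITELIST:
        return "allow"        # legal URL: pass the request through
    if host in BLACKLIST:
        return "block"        # illegal URL: block and return a warning
    return "suspicious"       # neither list: hand off to content filtering
```

Requests classified as "suspicious" are the only ones that incur the cost of the later keyword and content filtering stages, which is how the lists improve filtering speed.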
Content interception and extraction module
This module first intercepts the response (HTTP packet) returned from the server for a suspicious request, then extracts the HTML document, and finally analyzes the HTML document to extract link information and body text.
Keyword filtering module
For the extracted link information, keywords are used to judge whether the page contains illegal links; a page containing any illegal link is blocked.
Information filtering module
For suspicious page text whose links are all legal, this module performs word segmentation, stop-word removal, weight calculation, and feature extraction; the text is then represented as a vector space model and matched against the trained feature vectors to judge whether its content is legal.
The operating steps of the system of the invention are summarized as follows:
1. When the user issues a link request, the requested URL is compared against the address lists in the black and white lists and handled accordingly. Requests whose addresses belong to neither the blacklist nor the whitelist are marked as suspicious.
2. The response to a suspicious request, i.e., the HTTP packet returned by the server, is intercepted. Because Winsock 2 SPI intercepts at the application layer, the packet reassembly and protocol analysis required when intercepting at lower layers are avoided, yielding high efficiency and low CPU usage.
3. The HTML file is extracted from the intercepted HTTP packets, link information is extracted from it, and the body text of the page is obtained using the webpage text content identification method based on Chinese punctuation mark statistics.
4. A keyword-based filtering method checks the link information; if an illegal link is found, a warning message is returned, otherwise control passes to the information filtering module.
5. A corpus of harmful Chinese webpage text categories is established as the sample training template for webpage text content. Content filtering is applied to the page text to check its legality: legal text is returned to the user, illegal text is blocked directly, and the URL lists are updated.
The effects and benefits of the invention are as follows. Winsock 2 SPI functions intercept HTTP packets directly at the application layer, avoiding the packet reassembly and protocol analysis required when intercepting at lower layers. The webpage text content identification and acquisition method based on Chinese punctuation mark statistics effectively removes noise such as navigation information, peer links, advertisement links, and copyright information. The invention can effectively improve the speed, accuracy, and precision of web information filtering. It can be used to filter harmful Chinese web content, and can be widely applied to personalized text classification information services.
Description of drawings
Fig. 1 is the overall structure diagram of the webpage text content filtering system based on Chinese punctuation marks.
Fig. 2 is the URL filtering flow chart.
Fig. 3 shows the nested HTML structure of web page information and its representation as an HTML tree.
Fig. 4 is the information filtering process flow chart.
Embodiment
The specific embodiment of the invention is described in detail below in conjunction with the technical scheme and the accompanying drawings.
Step 1
When the user enters a URL in the browser's address bar, or clicks a link in a page, the filter compares the requested URL against the address lists in the black and white lists (as shown in Figure 2). URL requests belonging to the whitelist are passed by the system; URL requests belonging to the blacklist are blocked and a warning message is returned; URLs belonging to neither the blacklist nor the whitelist are marked as suspicious requests, and step 2 is executed.
Step 2
Winsock 2 SPI technology is used to intercept the HTTP packets returned by the server for the suspicious request.
Step 3
The HTML file is extracted from the HTTP packets intercepted in step 2 and analyzed to extract link information. The HTML tree is then analyzed (as shown in Figure 3), and the webpage text extraction method based on Chinese punctuation marks is applied to effectively remove noise such as navigation information, peer links, advertisement links, and copyright information, yielding the body text of the page.
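The punctuation-statistics idea can be sketched as follows: body text blocks tend to contain many Chinese punctuation marks, while navigation bars, link lists, advertisements, and copyright lines contain few or none. The punctuation set, the block-level granularity, and the threshold of 2 here are illustrative assumptions, not values stated in the patent.

```python
# Chinese punctuation marks used as evidence of running body text.
CN_PUNCT = set("，。！？；：、“”‘’（）《》")

def punct_count(block):
    """Count Chinese punctuation marks in one text block."""
    return sum(1 for ch in block if ch in CN_PUNCT)

def extract_body(blocks, min_punct=2):
    """Keep blocks whose Chinese punctuation count reaches min_punct;
    low-punctuation blocks (menus, link lists, footers) are dropped."""
    return [b for b in blocks if punct_count(b) >= min_punct]
```

For example, a navigation bar such as "首页 新闻 体育" contains no punctuation and is discarded, while a sentence-bearing paragraph survives the filter.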
Step 4
For the hyperlink information extracted in step 3, pattern matching is used to check whether a link contains illegal keywords. If it does, the link is judged illegal, and the system blocks it and returns a warning message; otherwise step 5 is executed to perform information filtering and judge the legality of the page content.
Information filtering is the core of the system; its basic filtering flow is shown in Figure 4, with the following steps:
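A minimal sketch of the keyword check on links; the keyword list and the `is_illegal_link` name are assumptions for illustration only.

```python
# Illustrative blocked-keyword list; a real system would load a curated table.
ILLEGAL_KEYWORDS = ["gamble", "casino"]

def is_illegal_link(url):
    """A link is judged illegal if any blocked keyword appears in it."""
    low = url.lower()
    return any(kw in low for kw in ILLEGAL_KEYWORDS)
```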
Step 5
The suspicious page text extracted in steps 3 and 4 is segmented into words using a dictionary-based forward maximum matching algorithm.
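Forward maximum matching scans the text left to right and greedily takes the longest dictionary word at each position. A sketch under a toy dictionary (a real system would load a large lexicon):

```python
# Toy lexicon for illustration only.
DICT = {"网络", "信息", "过滤", "系统", "中文", "网页"}
MAX_LEN = max(len(w) for w in DICT)

def fmm_segment(text):
    """Dictionary-based forward maximum matching (FMM) segmentation."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a match is found;
        # an unmatched single character is emitted as its own token.
        for size in range(min(MAX_LEN, len(text) - i), 0, -1):
            cand = text[i:i + size]
            if size == 1 or cand in DICT:
                words.append(cand)
                i += size
                break
    return words
```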
Step 6
Stop words in the segmentation result are removed according to a stop-word list, i.e., meaningless words are removed to eliminate their influence on the judgment.
Step 7
Use the method for word frequency statistics, carry out the feature speech and extract, promptly extract the speech that more can show file characteristics, to improve program efficiency, the speed of service and nicety of grading.
Step 8
Feature word weights are calculated with the TF-IDF formula.
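A sketch of TF-IDF weighting over a small document collection, using raw term frequency and idf = log(N / df); the function and variable names are illustrative, and the patent does not specify the exact TF-IDF variant.

```python
import math

def tf_idf(docs):
    """docs: list of token lists -> list of {term: weight} dicts."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        w = {}
        for term in set(doc):
            tf = doc.count(term)          # raw term frequency
            w[term] = tf * math.log(n / df[term])
        weights.append(w)
    return weights
```

Note that a term appearing in every document gets weight 0, which is exactly the desired behavior: such terms carry no discriminating power.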
Step 9
The feature vector of the text is generated, and the cosine of the angle between this vector and each sample vector in the feature vector library is computed to obtain a similarity value.
Step 10
The similarity value is compared with a preset threshold to determine the nature of the page content; the invention sets the threshold to 0.6-0.8. If the similarity value is above the specified threshold, the page is judged illegal and the system denies access; if the similarity is below the threshold, the text is judged legal and the system permits access.
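Steps 9 and 10 can be sketched together: cosine similarity between the text vector and a sample vector, followed by the threshold decision. Vectors are represented here as term-to-weight dicts, and the default threshold of 0.6 is taken from the low end of the 0.6-0.8 range stated above; both representational choices are illustrative.

```python
import math

def cosine(v1, v2):
    """Cosine of the angle between two sparse term->weight vectors."""
    common = set(v1) & set(v2)
    dot = sum(v1[t] * v2[t] for t in common)
    n1 = math.sqrt(sum(x * x for x in v1.values()))
    n2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def judge(text_vec, sample_vec, threshold=0.6):
    """Illegal when similarity to the illegal-content sample exceeds
    the threshold; legal otherwise."""
    return "illegal" if cosine(text_vec, sample_vec) > threshold else "legal"
```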
Step 11
The legal and illegal URL lists are updated: the URL of a page judged illegal is added to the blacklist, and the URL of a page judged legal is added to the whitelist, so that the same page content is not repeatedly subjected to information filtering, improving filtering efficiency.
Execution of the above filtering method requires sample vector templates in the feature vector library. The sample vector templates are obtained by training on the text in the illegal corpus; the training process is shown in Figure 4, with the following steps:
1) Establish a corpus of harmful network content.
2) For the text samples in the illegal corpus, apply the dictionary-based forward maximum matching method to perform Chinese word segmentation on the training documents.
3) Remove stop words from the segmentation result according to the stop-word list, obtaining a high-dimensional word set.
4) Perform feature extraction on the above high-dimensional word set using word frequency statistics.
5) Calculate the feature word weights with the TF-IDF formula.
6) Generate the vector space model of each document and store it in the feature vector library, producing the sample vector templates.