
CN101035128A - Three-folded webpage text content recognition and filtering method based on the Chinese punctuation - Google Patents

Three-folded webpage text content recognition and filtering method based on the Chinese punctuation

Info

Publication number
CN101035128A
Authority
CN
China
Prior art keywords
text
information
filtering
web page
url
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2007100110571A
Other languages
Chinese (zh)
Other versions
CN101035128B (en)
Inventor
宋明秋
吴新涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2007-04-18
Filing date
2007-04-18
Publication date
2007-09-12
Application filed by Dalian University of Technology
Priority to CN2007100110571A
Publication of CN101035128A
Application granted
Publication of CN101035128B
Current legal status: Expired - Fee Related
Anticipated expiration

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

A triple method for recognizing and filtering webpage text content based on Chinese punctuation marks. Existing URL-based and keyword-based webpage filtering methods suffer from low filtering precision and recall. To address this, the method combines three filters: one based on URLs, one based on keywords, and one based on a text vector space knowledge representation. URL addresses are filtered against black and white lists. Statistical features of Chinese punctuation marks are used to remove page noise such as navigation information, related links, advertising links, and copyright notices, and to extract the body text. The text is represented with the vector space model; the cosine of the angle between the text vector and the feature vectors in the harmful-information template is computed and compared with a preset threshold to determine the category of the text. The invention can be widely applied to filtering harmful information on the network and to personalized webpage information services.

Figure 200710011057

Description

Triple webpage text content recognition and filtering method based on Chinese punctuation marks
Technical field
The invention belongs to the field of network information security and relates to the recognition and filtering of harmful text information in Chinese web pages.
Background technology
Existing web content safety products, such as "Network Nurse" and "Network Father", mostly block access to illegal pages and sites using methods based on URL addresses and keywords. Given the diversity and dynamism of illegal online content, an approach that relies on a static address database or on manually updated URLs and keywords falls far short of users' filtering needs, and parents hope for more effective and more comprehensive filtering products.
Existing filtering methods for webpage text content mainly revolve around the vector space model.
Liu Peide et al. used the vector space model, the TC3 classification algorithm, the Rocchio feedback model, and related techniques to build a network information filtering system (NIFS) with a feedback mechanism; the system supports text filtering based on a user interest profile.
The vector-space-model-based information security filtering system built by Cao Yi and He Weihong divides filtering into two stages: template training and adaptive filtering. In the training stage, an initial filtering template is built through topic processing and feature extraction, and an initial threshold is set. In the filtering stage, the template and the threshold are adjusted adaptively according to user feedback. The distinguishing feature of this method lies mainly in the design of the filtering-template training algorithm.
Shian-Hua Lin and Jan-Ming Ho proposed in 2002 a method for removing noise content from web pages. The method builds a tag tree of the page from its <table> tags, organizing each page into mutually nested content blocks. Then, for a set of pages generated from the same template, content blocks that occur repeatedly across the set are treated as noise, and content blocks that occur rarely across the set are treated as the informative blocks.
Fudan University proposed an Internet filtering system and filtering method built on a content filtering agent (CFA). The system framework has three parts: the content filtering agent (CFA), a query server (QS), and a content analysis and management server (CAMS). The filtering flow is as follows. When the user requests access to a URL, the CFA allows or forbids the request according to the black and white lists set by the user. If the URL is on neither list, the CFA sends a query to the QS. The QS looks up the rating information for the URL in its own URL database and returns the result to the CFA, which responds accordingly. The QS also periodically downloads updated URL rating information from the CAMS.
Microsoft's "Content filtering technology for network browsing" provides a system and method for controlling whether a user can access certain Internet sites when using a computer. When the computer user tries to visit an Internet site pointed to by a specified uniform resource locator (URL), the filter checks the URL against an allow/block list and, by cross-referencing the user's age group, consults a mapping table of the content categories that the age group is allowed to view, and determines access to the site accordingly.
Summing up previous work, current Internet information filtering methods still have the following shortcomings:
1. Filtering methods based on URLs and keywords have low precision and recall, and the filter is easily bypassed.
2. Content filtering based solely on the text vector space is slow and cannot meet the real-time filtering requirements of broadband network data transmission.
3. Webpage preprocessing has received little research. In particular, no published work has been found on general methods for extracting webpage body text, even though research on this problem can effectively improve the speed of webpage data processing.
4. Content recognition and filtering methods tailored to the characteristics of Chinese web pages have not been reported either.
Summary of the invention
To overcome the limitation that existing webpage filtering methods cannot deliver the precision, recall, and speed that network traffic demands, the invention provides a triple filtering method that combines the existing URL-based and keyword-based methods with text filtering based on the vector space model. In URL filtering, tables of legal and illegal URLs (the white list and the black list) are maintained to speed up filtering. Winsock 2 SPI is used to intercept HTTP packets directly at the application layer, which avoids the packet reassembly and protocol analysis required when packets are captured at a lower layer. A body-text recognition and denoising method for Chinese web pages, based on Chinese punctuation statistics, is proposed.
To achieve this goal, the invention adopts the following technical scheme.
The system uses a three-stage filtering mode: URL filtering, keyword filtering, and text content filtering.
The system structure is shown in Fig. 1, where:
URL filtering module
Using a preset illegal URL list (black list) and legal URL list (white list), it judges whether the user's request is legal.
Content interception and extraction module
It first intercepts the server's response to a suspicious request (the HTTP packets), then extracts the HTML document, and finally analyzes the HTML document to extract the link information and the body content.
Keyword filtering module
For the link information, it uses keywords to judge whether the page contains illegal links; if the page contains any illegal link, the page is blocked.
Content filtering module
The body text of a suspicious page whose links are legal is segmented into words, stop words are removed, weights are computed, and features are extracted. The text is then represented in the vector space model and matched against the trained feature vectors to judge whether its content is legal.
The operating steps of the system are summarized as follows:
1. When the user sends a connection request, the requested URL is compared with the addresses in the black and white lists and handled accordingly. A requested address that belongs to neither the black list nor the white list is marked as a suspicious request.
2. The response to the suspicious request, i.e. the HTTP packets returned by the server, is intercepted. Because Winsock 2 SPI intercepts at the application layer, the packet reassembly and protocol analysis needed when capturing packets at a lower layer are avoided, so efficiency is high and CPU usage is low.
3. The HTML file is extracted from the intercepted HTTP packets, link information is extracted from it, and the body text of the page is obtained using the body-content recognition method based on Chinese punctuation statistics.
4. The link information is checked with the keyword-based filtering method. If a link is illegal, a warning message is returned; otherwise processing passes to the content filtering module.
5. A text classification corpus of harmful Chinese webpage information is built and used as the training samples for the webpage text content template. Content filtering is applied to the page body to check its legality: legal text is returned to the user, illegal text is blocked outright, and the URL lists are updated.
The effects and benefits of the invention are as follows. Winsock 2 SPI functions intercept HTTP packets directly at the application layer, avoiding the packet reassembly and protocol analysis needed when capturing at a lower layer. The webpage text recognition and acquisition method based on Chinese punctuation statistics effectively removes noise such as navigation information, related links, advertising links, and copyright notices. The invention can effectively improve the speed, accuracy, and filtering precision of webpage information filtering. It can be used to filter harmful information in Chinese web pages and can be widely applied to personalized text classification information services for users.
Description of drawings
Fig. 1 is the overall structure diagram of the webpage text content filtering system based on Chinese punctuation marks.
Fig. 2 is the URL filtering flowchart.
Fig. 3 shows the nested HTML structure of webpage information and its knowledge representation as an HTML tree.
Fig. 4 is the content filtering processing flowchart.
Embodiment
The specific embodiments of the invention are described in detail below with reference to the technical scheme and the accompanying drawings.
Step 1
When the user types a URL into the browser's address bar or clicks a link in a web page, the filter compares the requested URL with the addresses in the black and white lists (as shown in Fig. 2). A request for a URL on the white list is let through. A request for a URL on the black list is blocked and a warning message is returned. A URL on neither list is marked as a suspicious request, and step 2 is executed.
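The three-way decision in step 1 can be sketched in a few lines of Python. This is only an illustration: the patent does not say how URLs are normalized or stored, so matching on the lower-cased host name, the `Verdict` enum, and the function names are all assumptions made here.

```python
from enum import Enum
from urllib.parse import urlsplit


class Verdict(Enum):
    ALLOW = "allow"            # URL is on the white list
    BLOCK = "block"            # URL is on the black list
    SUSPICIOUS = "suspicious"  # on neither list; continue with step 2


def host_of(url: str) -> str:
    """Return the lower-cased host of a URL (assumed normalization)."""
    return (urlsplit(url).hostname or "").lower()


def check_url(url: str, whitelist: set[str], blacklist: set[str]) -> Verdict:
    """Classify a requested URL against the white and black lists."""
    host = host_of(url)
    if host in whitelist:
        return Verdict.ALLOW
    if host in blacklist:
        return Verdict.BLOCK
    return Verdict.SUSPICIOUS
```

Keeping both lists as sets makes each lookup a constant-time hash probe, which fits the stated purpose of the lists: to decide most requests without running the slower content filter.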
Step 2
Winsock 2 SPI is used to intercept the HTTP packets that the server returns for the suspicious request.
Step 3
The HTML file is extracted from the HTTP packets intercepted in step 2 and analyzed to extract the link information. The HTML tree is also analyzed (as shown in Fig. 3), and the body-text extraction method based on Chinese punctuation marks is applied to remove noise such as navigation information, related links, advertising links, and copyright notices, yielding the body text of the page.
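The text does not spell out the punctuation statistic, so the following Python sketch shows only one plausible reading: split the page into text blocks, count Chinese punctuation marks in each block, and keep the blocks whose count reaches a threshold. The punctuation set, the block granularity (text between tags), the `min_punct` threshold, and all names are assumptions for illustration, not the patented procedure. Navigation bars, link lists, and copyright lines tend to contain few or no sentence punctuation marks, which is why such a count can separate them from body paragraphs.

```python
from html.parser import HTMLParser

# Assumed set of Chinese (full-width) punctuation marks.
CHINESE_PUNCT = set(",。;:!?、“”‘’()《》")


class BlockCollector(HTMLParser):
    """Collect hyperlinks and the text blocks found between tags."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.links = []
        self.blocks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.blocks.append(text)


def extract_links_and_body(html, min_punct=2):
    """Return (links, body_text), keeping only punctuation-rich blocks."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    body = [b for b in parser.blocks
            if sum(ch in CHINESE_PUNCT for ch in b) >= min_punct]
    return parser.links, "".join(body)
```

A real page would need a tuned threshold and probably a per-block density measure rather than a raw count; the sketch keeps the raw count to stay close to the wording "punctuation statistics".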
Step 4
For the hyperlink information extracted in step 3, pattern matching is used to check whether any link contains an illegal keyword. If so, the link is classed as illegal, and the system blocks it and returns a warning message. Otherwise step 5 is executed: content filtering is performed to judge the legality of the page content.
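A minimal sketch of the keyword check in step 4, assuming that "pattern matching" means a case-insensitive substring test on the link string. The function name and the example keyword are hypothetical.

```python
def find_illegal_links(links, keywords):
    """Return the links that contain any illegal keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords if k]
    return [link for link in links
            if any(k in link.lower() for k in lowered)]
```

If the returned list is non-empty, the page would be blocked and a warning returned; an empty list means processing continues with content filtering.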
Content filtering is the core of the system. Its basic flow is shown in Fig. 4, and the filtering steps are as follows:
Step 5
The suspicious body text obtained through steps 3 and 4 is segmented into words using a dictionary-based forward maximum matching algorithm.
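Forward maximum matching scans the text from left to right and, at each position, takes the longest dictionary word that starts there, falling back to a single character when nothing matches. The Python sketch below illustrates the idea; the function name, the `max_len` parameter, and the toy dictionary in the usage note are assumptions.

```python
def fmm_segment(text, dictionary, max_len=None):
    """Segment text by dictionary-based forward maximum matching."""
    if max_len is None:
        # Longest dictionary entry bounds the candidate window.
        max_len = max(1, max((len(w) for w in dictionary), default=1))
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary hit;
        # a single character is always accepted as a fallback.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words
```

For example, with the dictionary `{"中文", "标点", "符号", "标点符号"}`, the input `"中文标点符号"` is split into `"中文"` followed by `"标点符号"`, because the longer entry wins over `"标点"`.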
Step 6
Stop words are removed from the segmentation result according to a stop-word list; that is, words that carry no meaning are dropped so that they do not affect the judgment.
Step 7
Feature words are extracted using word frequency statistics; that is, the words that best characterize the document are selected, which improves program efficiency, running speed, and classification accuracy.
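Steps 6 and 7 can be illustrated together: drop the stop words, then keep the most frequent remaining words as feature words. The cut-off `top_k` and the function name are assumptions; the text only says that word frequency statistics are used.

```python
from collections import Counter


def select_features(words, stop_words, top_k=3):
    """Remove stop words, then keep the top_k most frequent words."""
    kept = [w for w in words if w not in stop_words]
    counts = Counter(kept)
    return [w for w, _ in counts.most_common(top_k)]
```

Reducing the vocabulary this way shrinks the dimension of the vectors compared later, which is where the speed gain mentioned in step 7 would come from.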
Step 8
The weights of the feature words are computed with the TF-IDF formula.
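The text does not state which TF-IDF variant is used. The sketch below assumes the common form, weight = tf × ln(N / df), where tf is the raw count of the term in the document, N is the number of corpus documents, and df is the number of corpus documents containing the term. The function and parameter names are hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(doc_words, corpus, features):
    """Compute tf * ln(N / df) for each feature word of one document.

    doc_words: segmented words of the document being weighted.
    corpus:    list of documents, each a collection of words.
    features:  feature words selected beforehand.
    """
    tf = Counter(doc_words)
    n_docs = len(corpus)
    weights = {}
    for term in features:
        df = sum(1 for d in corpus if term in d)
        # A term absent from the corpus gets weight 0 (assumed convention).
        idf = math.log(n_docs / df) if df else 0.0
        weights[term] = tf[term] * idf
    return weights
```

Under this variant a word that appears in every corpus document receives weight zero, since ln(1) = 0; smoothed variants avoid that, but whether the patent uses one is not stated.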
Step 9
The feature vector of the text is generated, and the cosine of the angle between this vector and the sample vectors in the feature vector library is computed to obtain the similarity value.
Step 10
The similarity value is compared with the preset threshold, which the invention sets to 0.6-0.8, to determine the nature of the page content. If the similarity is above the threshold, the page is classed as illegal and the system denies access. If the similarity is below the threshold, the text is classed as legal and the system allows access.
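Steps 9 and 10 amount to a cosine similarity followed by a threshold test. In the sketch below, vectors are sparse dictionaries mapping feature words to weights. Taking the maximum similarity over all sample vectors and the default threshold of 0.7 (a value inside the stated range) are assumptions, as are the function names.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (dict: term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_u * norm_v)


def is_illegal(doc_vec, templates, threshold=0.7):
    """Flag the text when its best similarity to any template exceeds the threshold."""
    best = max((cosine(doc_vec, t) for t in templates), default=0.0)
    return best > threshold
```

A lower threshold blocks more pages at the cost of more false alarms; a higher one does the opposite, which is presumably why a range rather than a single value is given.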
Step 11
The legal and illegal URL lists are updated: the URL of a text classed as illegal is added to the black list, and the URL of a legal text is added to the white list. This avoids repeating content filtering on the same page and improves filtering efficiency.
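The list update in step 11 acts as a cache of earlier verdicts. A minimal sketch, assuming the lists are in-memory sets keyed by the same URL form used in step 1; the function name and the choice to remove the entry from the opposite list are assumptions.

```python
def record_verdict(url_key, illegal, whitelist, blacklist):
    """Store the content-filter verdict so the same URL is not re-analyzed."""
    if illegal:
        blacklist.add(url_key)
        whitelist.discard(url_key)  # keep the two lists mutually exclusive
    else:
        whitelist.add(url_key)
        blacklist.discard(url_key)
```

Page content can change over time, so a deployment would likely need some expiry policy for cached verdicts; the text does not address this.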
The filtering method above requires the sample vector templates in the feature vector library. These templates are obtained by training on the texts of the illegal corpus. The training process is shown in Fig. 4, and its steps are as follows:
1) Build a corpus of harmful network information.
2) Segment the training documents in the illegal corpus into Chinese words using the dictionary-based forward maximum matching method.
3) Remove stop words from the segmentation result according to the stop-word list, obtaining a high-dimensional word set.
4) Apply word-frequency-based feature extraction to this high-dimensional word set.
5) Compute the weights of the feature words with the TF-IDF formula.
6) Generate the vector space model of the documents and store it in the feature vector library, producing the sample vector templates.
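Training steps 3) to 6) can be sketched end to end as follows, starting from documents that have already been segmented. The sketch assumes one template vector per training document, feature selection by corpus-wide word frequency, and the tf × ln(N / df) weighting; none of these details is fixed by the text, and the names are hypothetical.

```python
import math
from collections import Counter


def train_templates(segmented_docs, stop_words, top_k=50):
    """Turn segmented sample documents into sparse TF-IDF template vectors."""
    # Step 3: remove stop words from every document.
    cleaned = [[w for w in doc if w not in stop_words] for doc in segmented_docs]
    # Step 4: select feature words by overall frequency across the corpus.
    totals = Counter(w for doc in cleaned for w in doc)
    features = [w for w, _ in totals.most_common(top_k)]
    # Document frequency of each feature word (always >= 1 here).
    df = {f: sum(1 for doc in cleaned if f in doc) for f in features}
    n_docs = len(cleaned)
    templates = []
    for doc in cleaned:
        tf = Counter(doc)
        # Steps 5 and 6: weight the feature words present in this document.
        vec = {f: tf[f] * math.log(n_docs / df[f]) for f in features if tf[f]}
        templates.append(vec)
    return features, templates
```

The returned templates are the vectors against which a page's feature vector would be compared at filtering time. With this weighting, a word that occurs in every sample document ends up with weight zero, so a corpus of near-identical samples would produce weak templates.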

Claims (1)

1. A triple webpage text content recognition and filtering method based on Chinese punctuation marks, providing a triple webpage information filtering architecture that combines filtering based on URL addresses, keywords, and content, characterized in that: Winsock 2 SPI functions are used to intercept HTTP packets directly at the application layer; a general Chinese webpage noise removal and body-text acquisition method based on Chinese punctuation statistics is used; and a text classification corpus of harmful Chinese webpage information is built and used as the training samples for the webpage text content template.
CN2007100110571A 2007-04-18 2007-04-18 Recognition and filtering method of triple webpage text content based on Chinese punctuation marks Expired - Fee Related CN101035128B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2007100110571A CN101035128B (en) 2007-04-18 2007-04-18 Recognition and filtering method of triple webpage text content based on Chinese punctuation marks

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2007100110571A CN101035128B (en) 2007-04-18 2007-04-18 Recognition and filtering method of triple webpage text content based on Chinese punctuation marks

Publications (2)

Publication Number Publication Date
CN101035128A (en) 2007-09-12
CN101035128B (en) 2010-04-21

Family

ID=38731427

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2007100110571A Expired - Fee Related CN101035128B (en) 2007-04-18 2007-04-18 Recognition and filtering method of triple webpage text content based on Chinese punctuation marks

Country Status (1)

Country Link
CN (1) CN101035128B (en)



Cited By (76)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101855632B (en) * 2007-11-08 2013-10-30 上海惠普有限公司 URL and anchor text analysis for focused crawling
CN102106114A (en) * 2008-05-28 2011-06-22 兹斯卡勒公司 Distributed security provisioning
CN102106114B (en) * 2008-05-28 2014-10-22 兹斯卡勒公司 Distributed security provisioning method and its system
CN101727461A (en) * 2008-10-13 2010-06-09 中国科学院计算技术研究所 Method for extracting content of web page
CN101901314B (en) * 2009-06-19 2013-07-17 卡巴斯基实验室封闭式股份公司 Detection and minimization of false positives in anti-malware processing
CN101901314A (en) * 2009-06-19 2010-12-01 卡巴斯基实验室封闭式股份公司 The detection of wrong report and minimizing during anti-malware is handled
CN101739439B (en) * 2009-11-30 2014-03-12 中兴通讯股份有限公司 Method and system for dynamically customizing statistical object based on template
CN102236654A (en) * 2010-04-26 2011-11-09 广东开普互联信息科技有限公司 Web Invalid Link Filtering Method Based on Content Correlation
CN102136973A (en) * 2010-09-08 2011-07-27 乔永清 System and method for monitoring real data of website
CN102411587A (en) * 2010-09-21 2012-04-11 腾讯科技(深圳)有限公司 Webpage classification method and device
CN102411587B (en) * 2010-09-21 2013-08-21 腾讯科技(深圳)有限公司 Webpage classification method and device
CN102469117A (en) * 2010-11-08 2012-05-23 中国移动通信集团广东有限公司 Method and device for identifying abnormal access behaviors
CN102469117B (en) * 2010-11-08 2014-11-05 中国移动通信集团广东有限公司 Method and device for identifying abnormal access action
CN102054030A (en) * 2010-12-17 2011-05-11 惠州Tcl移动通信有限公司 Mobile terminal webpage display control method and device
CN102546576B (en) * 2010-12-31 2015-11-18 北京启明星辰信息技术股份有限公司 A kind of web page horse hanging detects and means of defence, system and respective code extracting method
CN102546576A (en) * 2010-12-31 2012-07-04 北京启明星辰信息技术股份有限公司 Webpagehanging trojan detecting and protecting method and system as well as method for extracting corresponding code
CN102754488A (en) * 2011-04-18 2012-10-24 华为技术有限公司 User access control method, device and system
CN102170640A (en) * 2011-06-01 2011-08-31 南通海韵信息技术服务有限公司 Mode library-based smart mobile phone terminal adverse content website identifying method
CN102929872B (en) * 2011-08-08 2016-04-27 阿里巴巴集团控股有限公司 By computer-implemented information filtering method, message screening Apparatus and system
CN102929872A (en) * 2011-08-08 2013-02-13 阿里巴巴集团控股有限公司 Computer-implemented message filtering method, message filtering device and system
US9331981B2 (en) 2011-12-31 2016-05-03 Huawei Technologies Co., Ltd. Method and apparatus for filtering URL
CN102624703A (en) * 2011-12-31 2012-08-01 成都市华为赛门铁克科技有限公司 Uniform resource locator URL filtering method and device
CN102567534B (en) * 2011-12-31 2014-02-19 凤凰在线(北京)信息技术有限公司 Interactive product user generated content intercepting system and intercepting method for the same
CN102567534A (en) * 2011-12-31 2012-07-11 凤凰在线(北京)信息技术有限公司 Interactive product user generated content intercepting system and intercepting method for the same
CN102624703B (en) * 2011-12-31 2015-01-21 华为数字技术(成都)有限公司 Method and device for filtering uniform resource locators (URLs)
CN102647408A (en) * 2012-02-27 2012-08-22 珠海市君天电子科技有限公司 Method for judging phishing website based on content analysis
CN102622435A (en) * 2012-02-29 2012-08-01 百度在线网络技术(北京)有限公司 Method and device for detecting black chain
CN102622435B (en) * 2012-02-29 2017-12-12 百度在线网络技术(北京)有限公司 A kind of method and apparatus for detecting black chain
CN104462613A (en) * 2012-06-20 2015-03-25 北京奇虎科技有限公司 Hot spot aggregating method and device
CN103581144A (en) * 2012-08-06 2014-02-12 无锡稳捷网络技术有限公司 Network safety access control method based on ICAP
CN102855320A (en) * 2012-09-04 2013-01-02 珠海市君天电子科技有限公司 Method and device for collecting keyword related URL (uniform resource locator) by search engine
CN102902793A (en) * 2012-09-29 2013-01-30 北京奇虎科技有限公司 Creation system and method of webpage classification knowledge base
CN102902793B (en) * 2012-09-29 2016-12-21 北京奇虎科技有限公司 Webpage category knowledge base set up system and method
CN103853747B (en) * 2012-11-30 2018-09-04 腾讯科技(深圳)有限公司 A kind of control method and device of sound source webpage
CN103853747A (en) * 2012-11-30 2014-06-11 腾讯科技(深圳)有限公司 Method and device for controlling sound source webpage
CN103092994A (en) * 2013-02-20 2013-05-08 苏州思方信息科技有限公司 Support vector machine (SVM) text automatic sorting method and system based on information concept lattice correction
CN104052722A (en) * 2013-03-15 2014-09-17 腾讯科技(深圳)有限公司 Web address security detection method, apparatus and system
CN104079528A (en) * 2013-03-26 2014-10-01 北大方正集团有限公司 Method and system of safety protection of Web application
CN103605691B (en) * 2013-11-04 2017-04-26 北京奇虎科技有限公司 Device and method used for processing issued contents in social network
CN103778226A (en) * 2014-01-23 2014-05-07 北京奇虎科技有限公司 Method for establishing language information recognition model and language information recognition device
CN105100904A (en) * 2014-05-09 2015-11-25 深圳市快播科技有限公司 Video advertisement blocking method, device and browser
CN105608083A (en) * 2014-11-13 2016-05-25 北京搜狗科技发展有限公司 Method and device for obtaining input library, and electronic equipment
CN105608083B (en) * 2014-11-13 2019-09-03 北京搜狗科技发展有限公司 Obtain the method, apparatus and electronic equipment of input magazine
CN105812417A (en) * 2014-12-29 2016-07-27 国基电子(上海)有限公司 Remote server, router and bad webpage information filtering method
CN105812417B (en) * 2014-12-29 2019-05-03 国基电子(上海)有限公司 Remote server, router and bad webpage information filtering method
CN104951553A (en) * 2015-06-30 2015-09-30 成都蓝码科技发展有限公司 Content collecting and data mining platform accurate in data processing and implementation method thereof
CN106649338A (en) * 2015-10-30 2017-05-10 中国移动通信集团公司 Information filtering policy generation method and apparatus
CN106649338B (en) * 2015-10-30 2020-08-21 中国移动通信集团公司 Information filtering strategy generation method and device
CN105491023B (en) * 2015-11-24 2020-10-27 国网智能电网研究院 Data isolation exchange and safety filtering method for power Internet of things
CN105491023A (en) * 2015-11-24 2016-04-13 国网智能电网研究院 Data isolation exchange and security filtering method orienting electric power internet of things
CN105975395A (en) * 2016-05-30 2016-09-28 深圳市华傲数据技术有限公司 Website state reconnaissance method and device
CN106789980A (en) * 2016-12-07 2017-05-31 北京亚鸿世纪科技发展有限公司 Method and device for monitoring and managing website legitimacy
CN107122350A (en) * 2017-04-27 2017-09-01 北京易麦克科技有限公司 Feature extraction system and method for multi-paragraph text
CN107122350B (en) * 2017-04-27 2021-02-05 北京易麦克科技有限公司 Method of multi-paragraph text feature extraction system
CN109274632A (en) * 2017-07-12 2019-01-25 中国移动通信集团广东有限公司 Method and device for identifying website
CN109274632B (en) * 2017-07-12 2021-05-11 中国移动通信集团广东有限公司 Method and device for identifying a website
CN110020075A (en) * 2017-10-20 2019-07-16 南京烽火软件科技有限公司 Automatic illegal website mining device
CN107766551A (en) * 2017-10-31 2018-03-06 广东小天才科技有限公司 Website auditing and controlling method based on big data analysis and terminal equipment
CN107835197A (en) * 2017-12-15 2018-03-23 江苏盖亚建筑工程有限公司 Network transmission system
CN108153872A (en) * 2017-12-25 2018-06-12 佛山市车品匠汽车用品有限公司 Method and apparatus for Internet web page information filtering
CN109688205A (en) * 2018-12-07 2019-04-26 麒麟合盛网络技术股份有限公司 Webpage resource interception method and device
CN109688205B (en) * 2018-12-07 2021-06-22 麒麟合盛网络技术股份有限公司 Webpage resource interception method and device
CN109743309B (en) * 2018-12-28 2021-09-10 微梦创科网络科技(中国)有限公司 Illegal request identification method and device and electronic equipment
CN109743309A (en) * 2018-12-28 2019-05-10 微梦创科网络科技(中国)有限公司 Illegal request identification method and device and electronic equipment
CN111382061B (en) * 2018-12-29 2024-05-17 北京搜狗科技发展有限公司 Test method, test device, test medium and electronic equipment
CN111382061A (en) * 2018-12-29 2020-07-07 北京搜狗科技发展有限公司 Test method, test device, test medium and electronic equipment
CN110110195A (en) * 2019-05-07 2019-08-09 宜人恒业科技发展(北京)有限公司 Impurity removal method and device
CN110750639A (en) * 2019-07-02 2020-02-04 厦门美域中央信息科技有限公司 Text classification and R language realization based on vector space model
CN111259237A (en) * 2020-01-13 2020-06-09 中国搜索信息科技股份有限公司 Method for identifying public harmful information
CN111741007A (en) * 2020-07-06 2020-10-02 桦蓥(上海)信息科技有限责任公司 Financial business real-time monitoring system and method based on network layer message analysis
CN114024947A (en) * 2022-01-05 2022-02-08 北京微步在线科技有限公司 Web access method and device based on browser
CN114024947B (en) * 2022-01-05 2022-04-01 北京微步在线科技有限公司 Web access method and device based on browser
EP4418141A1 (en) * 2023-02-17 2024-08-21 Kpmg Llp Document clustering using natural language processing
CN116502009A (en) * 2023-06-25 2023-07-28 北京奇虎科技有限公司 Web page filtering method, device, equipment and storage medium
CN116502009B (en) * 2023-06-25 2023-10-31 北京奇虎科技有限公司 Web page filtering methods, devices, equipment and storage media
CN117176483A (en) * 2023-11-03 2023-12-05 北京艾瑞数智科技有限公司 Abnormal URL identification method and device and related products

Also Published As

Publication number Publication date
CN101035128B (en) 2010-04-21

Similar Documents

Publication Publication Date Title
CN101035128A (en) Three-folded webpage text content recognition and filtering method based on the Chinese punctuation
CN101909079B (en) User online behavior data acquisition method in backbone link and system
CN104125209B (en) Malicious website prompt method and router
CN101694658B (en) Method for constructing webpage crawler based on repeated removal of news
CN108737423B (en) Phishing website discovery method and system based on webpage key content similarity analysis
CN103810425B (en) Method and device for detecting malicious network addresses
CN102710795B (en) Hotspot collecting method and device
CN102004764A (en) Internet bad information detection method and system
CN102054028B (en) Method for implementing web-rendering function by using web crawler system
CN105631050B (en) Method and system for extracting URL search keywords based on rule configuration
CN101231661A (en) Method and system for object-level knowledge mining
CN1588879A (en) Internet content filtering system and method
CN1955963A (en) System and method for searching dates in electronic documents
CN103064984B (en) Method and system for recognizing spam pages
CN102779170A (en) System and method for identifying text floor of webpage
CN102622451A (en) System for automatically generating television program labels
CN101075909A (en) Method and system for accounting webstation access information
CN1702651A (en) Recognition method and apparatus for information files of specific types
CN107577788B (en) E-commerce website topic crawler method for automatically structuring data
CN102567337B (en) Method and system for quickly identifying webpage type by links
CN105512143A (en) Method and device for web page classification
CN1417709A (en) Information search system and method
CN1912869A (en) Implementing method of network profile
CN101788988A (en) Information extraction method
CN103902619A (en) Internet public opinion monitoring method and system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee (Granted publication date: 20100421; Termination date: 20180418)