CN106294535A - Website identification method and device - Google Patents
- Publication number
- CN106294535A (application CN201610571258.6A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
Abstract
Embodiments of the invention disclose a method and a device for identifying websites. The method includes: acquiring, within a set time period, at least two historical updated pages associated with a website to be verified; parsing the content of each historical updated page to obtain at least one content domain corresponding to each page; calculating the information entropy of each content domain according to the content changes within the same content domain across the historical updated pages; and performing anomaly identification on the website to be verified according to the information entropy results. The information entropy feature discriminates well, is simple to compute, and is timely. It addresses the technical problems of existing cheating-website identification techniques, namely low discrimination, poor real-time performance, and the need for additional manual labeling or data curation. It thereby optimizes existing website identification techniques and improves the recognition accuracy for abnormal websites.
Description
Technical field
Embodiments of the invention relate to computer processing technology, and in particular to a method and a device for identifying websites.
Background
Information retrieval is the process of finding required documents in a collection of information resources, or of finding the information contained in those documents. A search engine is an information retrieval tool for finding information on the Internet. Search engines have made it convenient for people to obtain information from vast resources, but their emergence was followed by the problem of web page cheating. For economic or other gain, cheating websites mislead search engines by various means in order to raise the position of their pages in search result rankings. Because cheating websites are usually of low quality and often carry advertisements, especially for pornography and gambling, they can seriously harm the user experience. Identifying cheating websites is therefore an important problem in information retrieval, and improving identification techniques is significant for improving the effectiveness of search engines.
At present, the methods used by cheating websites change frequently, but they can generally be grouped into two broad classes: content cheating and link cheating. Content cheating usually stuffs popular queries into a page to raise the page's ranking in search engine results. Link cheating mainly targets graph algorithms modeled on the PageRank page-scoring algorithm, building link relationships to raise a website's weight; it also includes cheating by page redirection. Cheating-website identification has long been an active research topic in industry. Machine learning methods including naive Bayes, logistic regression, SVM (support vector machine), ensemble learning, and deep learning have all been applied, using features such as content features and link features. Some approaches also use external information such as user click behavior.
The main shortcomings of existing cheating-website identification techniques are as follows. Cheating pages whose structural features are not distinctive, and whose text is not stuffed with cheating words, are hard to identify in time. Graph-model algorithms that rely on link-relationship features are complex and struggle to meet the demands of real-time identification. Distinguishing newly emerging ordinary websites and niche websites from newly emerging cheating websites is also difficult. In addition, a major challenge is that cheating websites update quickly, so existing identification schemes or models gradually lose effectiveness over time. Reinforcement learning and active learning can partly solve this problem, but they require additional manual labeling or data curation.
Summary of the invention
In view of this, embodiments of the invention provide a method and a device for identifying websites, so as to optimize existing website identification techniques and improve the recognition accuracy for abnormal websites.
In a first aspect, an embodiment of the invention provides a website identification method, including:
acquiring, within a set time period, at least two historical updated pages associated with a website to be verified;
parsing the content of each historical updated page to obtain at least one content domain corresponding to each historical updated page;
calculating the information entropy of each content domain according to the content changes within the same content domain across the historical updated pages;
performing anomaly identification on the website to be verified according to the information entropy results.
In a second aspect, an embodiment of the invention further provides a website identification device, including:
a historical updated page acquisition module, configured to acquire, within a set time period, at least two historical updated pages associated with a website to be verified;
a content domain acquisition module, configured to parse the content of each historical updated page and obtain at least one content domain corresponding to each historical updated page;
a content domain entropy calculation module, configured to calculate the information entropy of each content domain according to the content changes within the same content domain across the historical updated pages;
an anomaly identification module, configured to perform anomaly identification on the website to be verified according to the information entropy results.
In embodiments of the invention, at least two historical updated pages associated with a website to be verified are acquired within a set time period; the content of each historical updated page is parsed to obtain at least one content domain corresponding to each page; the information entropy of each content domain is calculated according to the content changes within the same content domain across the pages; and anomaly identification is performed on the website to be verified according to the information entropy results. Because the information entropy feature discriminates well, is simple to compute, and is timely, this addresses the low discrimination, poor real-time performance, and need for additional manual labeling or data curation of existing cheating-website identification techniques, optimizes existing website identification techniques, and improves the recognition accuracy for abnormal websites.
Brief description of the drawings
Fig. 1 is a flow chart of a website identification method provided by embodiment one of the invention;
Fig. 2 is a flow chart of a website identification method provided by embodiment two of the invention;
Fig. 3 is a flow chart of a website identification method provided by embodiment three of the invention;
Fig. 4 is a structural diagram of a website identification device provided by embodiment four of the invention.
Detailed description
To make the objects, technical solutions, and advantages of the invention clearer, specific embodiments of the invention are described in further detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described here serve only to explain the invention and do not limit it.
It should also be noted that, for ease of description, the drawings show only the parts related to the invention rather than the entire content. Before the exemplary embodiments are discussed in more detail, it should be mentioned that some exemplary embodiments are described as processes or methods depicted as flow charts. Although a flow chart describes the operations (or steps) as a sequential process, many of the operations can be performed in parallel, concurrently, or simultaneously. In addition, the order of the operations can be rearranged. A process may be terminated when its operations are completed, but it may also have additional steps not included in the drawings. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, and so on.
To make what follows easier to understand, the inventive concept is first introduced briefly:
Through research, the inventors found that the purpose of a cheating website is to obtain a higher ranking so that the advertising content embedded in the site receives more visits. The advertisement categories of cheating websites are relatively concentrated, mostly gambling, pornography, cosmetic medicine, firearms and equipment, and the like. The cheating behavior of these websites follows discernible patterns. To be indexed by search engines and obtain high ranking positions, cheating websites frequently update page content and add currently popular, high-frequency queries to their pages; for cost reasons, cheating websites typically duplicate the same page content. To cope with the anti-cheating strategies of search engines, the content, style, and URLs of cheating websites also need to be updated frequently.
From the above analysis: cheating websites update frequently and contain advertising information, yet within a given period that advertising information is updated infrequently. That is, some important positions of a cheating website contain unreasonable redundancy, whereas normal websites, especially high-quality ones, have no need for such redundancy, because it provides no additional valuable information.
Shannon, the founder of information theory, introduced the concept of entropy into information theory as a measure of the amount of information. The amount of information is related to the degree of uncertainty: the higher the entropy, the higher the uncertainty, and the more information is needed to describe the outcome clearly.
In other words, from the viewpoint of information theory, if a normal website updates frequently, it provides a large amount of information and its entropy is relatively large; if it updates infrequently, it provides little information and its entropy is small. A cheating website updates often, so its entropy is expected to be large; but certain content domains or objects contain advertising information that is updated slowly, which lowers their entropy. The actual entropy of those content domains therefore differs from the expected entropy. Calculating the entropy of the different content domains of a website, and the degree of difference among them, can help identify cheating websites effectively.
Based on this analysis, the inventors propose introducing the concept of information entropy into the identification of abnormal websites: the information entropy of one or more content domains in a website is calculated, and anomaly identification is performed on the website accordingly.
Embodiment one
Fig. 1 is a flow chart of a website identification method provided by embodiment one of the invention. The method of this embodiment may be performed by a website identification device, which may be implemented in hardware and/or software and is typically integrated in a server used for abnormal-website identification. The method of this embodiment specifically includes:
110. Within a set time period, acquire at least two historical updated pages associated with a website to be verified.
In this embodiment, the website to be verified is a website on which anomaly identification needs to be performed. All websites indexed by a search engine could be treated as websites to be verified. However, abnormal websites (typically, cheating websites) update their page content often in order to obtain higher positions in search results, so it is possible to select only websites with newly generated or updated pages as websites to be verified, which also helps reduce the amount of computation.
As stated above, the core of the invention is to perform anomaly identification on a website to be verified by analyzing the information entropy of each content domain in that website. Information entropy mainly measures the uncertainty of the content appearing in a content domain. It is therefore necessary to acquire at least two historical updated pages associated with the website to be verified within a set time period (for example, 1 hour, 1 day, or 1 week) and to determine the information entropy of each content domain in the website by analyzing the content updated in those pages.
The at least two historical updated pages associated with the website to be verified may include: at least two historical updated pages corresponding to the domain name of the website to be verified; and/or at least two historical updated pages corresponding to the same web page address within the website to be verified.
In a specific example, the domain name of a website to be verified is www.A.com. All historical updated pages corresponding to this domain name within the set time period can be acquired as the historical updated pages associated with the website to be verified.
Further, a website can contain several different types of sub-pages at once (for example, a news website contains sub-pages such as "current affairs", "entertainment", and "sports"). For a finer-grained analysis, it is also possible to acquire all historical updated pages corresponding to the same web page address within the website to be verified (for example, www.A.com/B) as the historical updated pages associated with the website to be verified.
120. Parse the content of each historical updated page to obtain at least one content domain corresponding to each historical updated page.
In general, a page contains different types of data content. In this embodiment, each such type of data content is defined as a domain, for example: text title, text body, picture title, picture, and picture description text. Page parsing, that is, analysis of the page's HTML (HyperText Markup Language) file, divides a page into different domains and extracts the text, pictures, and other content contained in those domains.
Considering the computational complexity of the subsequent entropy calculation, in this embodiment the content domains selected for calculating information entropy may include at least one of the following: a text title domain, a picture domain, a picture title domain, and a picture description text domain.
The text title domain refers to the page positions of one or more text titles; the picture domain refers to the page positions of one or more pictures; the picture title domain refers to the page positions of one or more picture titles corresponding to pictures; and the picture description text domain refers to the page positions of one or more picture description texts corresponding to pictures.
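The division of a page into content domains could be sketched as follows. This is a minimal illustration only, not the parsing procedure of the patent: the class name `ContentDomainParser`, the function `parse_content_domains`, and the mapping of `h1`-`h3` tags to the text title domain and of `img` attributes (`src`, `title`, `alt`) to the picture, picture title, and picture description text domains are all assumptions made for the example.

```python
from html.parser import HTMLParser


class ContentDomainParser(HTMLParser):
    """Collects the contents of four illustrative content domains from one page."""

    TITLE_TAGS = {"h1", "h2", "h3"}  # assumed markers of the text title domain

    def __init__(self):
        super().__init__()
        self.domains = {
            "text_title": [],
            "picture": [],
            "picture_title": [],
            "picture_description": [],
        }
        self._in_title = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in self.TITLE_TAGS:
            self._in_title = True
            self._buffer = []
        elif tag == "img":
            # Assumed mapping: src -> picture, title -> picture title,
            # alt -> picture description text.
            if attrs.get("src"):
                self.domains["picture"].append(attrs["src"])
            if attrs.get("title"):
                self.domains["picture_title"].append(attrs["title"])
            if attrs.get("alt"):
                self.domains["picture_description"].append(attrs["alt"])

    def handle_data(self, data):
        if self._in_title:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in self.TITLE_TAGS and self._in_title:
            text = "".join(self._buffer).strip()
            if text:
                self.domains["text_title"].append(text)
            self._in_title = False


def parse_content_domains(html):
    """Return a dict mapping each content domain name to the items found in it."""
    parser = ContentDomainParser()
    parser.feed(html)
    parser.close()
    return parser.domains
```

A real page would likely need site-specific rules to locate titles and captions; the standard-library parser is used here only to keep the sketch self-contained.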
130. Calculate the information entropy of each content domain according to the content changes within the same content domain across the historical updated pages.
By the definition of information entropy, the more frequently the content in a content domain changes, the greater the uncertainty of that content and the larger the information entropy of the content domain. Conversely, the more fixed the content in a content domain, the smaller the uncertainty of that content and the smaller the information entropy of the content domain.
The information entropy is calculated as:
$$H(x) = -\sum_{i=1}^{n} P(x_i)\,\log_2 P(x_i)$$
where x takes n possible values $x_1, \ldots, x_i, \ldots, x_n$, with corresponding probabilities $P(x_1), \ldots, P(x_i), \ldots, P(x_n)$.
Typically, the information entropy of each content domain can be calculated from the number of times each distinct content item in the content domain occurs across the historical updated pages.
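The frequency-based calculation described above might look like the sketch below. It assumes that each historical updated page has already been reduced to a list of the items found in one content domain; the function name `content_domain_entropy` and this input layout are illustrative choices, not part of the source.

```python
import math
from collections import Counter


def content_domain_entropy(pages):
    """Shannon entropy (in bits) of one content domain.

    pages: list of lists; each inner list holds the items that appear in the
    same content domain of one historical updated page.
    """
    # Count how often each distinct item occurs across all pages.
    counts = Counter(item for page in pages for item in page)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Occurrence frequency is used as the probability estimate P(x_i).
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Under this sketch, a domain that always shows the same item (for example, a fixed advertisement picture) has entropy 0, while a domain showing four different items equally often has entropy 2 bits.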
140. Perform anomaly identification on the website to be verified according to the information entropy results.
In one preferred implementation of this embodiment, the information entropy results for the content domains of the website to be verified can be compared with the information entropy of the corresponding content domains of a trusted website, and anomaly identification is then performed on the website to be verified;
In another preferred implementation of this embodiment, the information entropies of different content domains within the website to be verified can be compared with one another, and anomaly identification is then performed on the website to be verified;
In yet another preferred implementation of this embodiment, the information entropy results can be used as at least one information entropy feature value, which is combined with other abnormal-website identification feature values to perform anomaly identification on the website to be verified.
In general, the prior art mainly uses a classifier to perform anomaly identification on a website to be verified, feeding the classifier one or more abnormal-website identification feature values (typically, content features, link features, and so on). In this embodiment, besides using information entropy directly to identify abnormal websites, it is also possible to build on existing abnormal-website identification techniques: the information entropy results for the content domains of the website to be verified are used as one or more information entropy feature values and are input to the classifier together with the other abnormal-website identification feature values. Combining the approach with existing techniques in this way further improves the recognition accuracy for abnormal websites.
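Combining entropy features with other identification features could be sketched as a simple concatenation into one feature vector. The function name `build_feature_vector`, the fixed domain ordering, and the default of 0.0 for a missing content domain are assumptions made for illustration; the source does not specify how the features are encoded.

```python
def build_feature_vector(domain_entropies, other_features, domain_order):
    """Concatenate per-domain entropy features with other feature values.

    domain_entropies: dict mapping content domain name -> information entropy
    other_features:   sequence of existing feature values (content, link, ...)
    domain_order:     fixed list of domain names so every website yields a
                      vector with the same layout
    """
    # Assumption: a domain absent from the page contributes an entropy of 0.0.
    entropy_features = [domain_entropies.get(name, 0.0) for name in domain_order]
    return entropy_features + list(other_features)
```

The resulting vector could then be passed to whichever classifier is already in use, such as the logistic regression or SVM models mentioned in the background section.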
In this embodiment of the invention, at least two historical updated pages associated with a website to be verified are acquired within a set time period; the content of each historical updated page is parsed to obtain at least one content domain corresponding to each page; the information entropy of each content domain is calculated according to the content changes within the same content domain across the pages; and anomaly identification is performed on the website to be verified according to the information entropy results. Because the information entropy feature discriminates well, is simple to compute, and is timely, this addresses the low discrimination, poor real-time performance, and need for additional manual labeling or data curation of existing cheating-website identification techniques, optimizes existing website identification techniques, and improves the recognition accuracy for abnormal websites.
Embodiment two
Fig. 2 is a flow chart of a website identification method provided by embodiment two of the invention. This embodiment is an optimization based on the embodiment above. In this embodiment, acquiring at least two historical updated pages associated with a website to be verified within a set time period is refined as: within a set time period, crawling newly generated and/or updated pages on the network with a web crawler; clustering the crawled pages by website domain name and taking the website corresponding to a cluster as the website to be verified; and acquiring, from the pages included in the cluster, at least two historical updated pages associated with the website to be verified;
Meanwhile, calculating the information entropy of each content domain according to the content changes within the same content domain across the historical updated pages is refined as: extracting at least one comparison object from the same target content domain of each historical updated page; calculating the occurrence probability of each comparison object according to the number of times it occurs in the target content domain of the historical updated pages; and calculating the information entropy corresponding to the target content domain according to the occurrence probabilities of the comparison objects.
Accordingly, the method of this embodiment specifically includes:
210. Within a set time period, crawl newly generated and/or updated pages on the network with a web crawler.
In this embodiment, it is considered that abnormal websites, and cheating websites in particular, are generally updated more frequently than other websites. Therefore, the newly generated and updated pages on the network can first be crawled by a web crawler, and these pages can then be merged and clustered by website to determine the corresponding websites to be verified.
220. Cluster the crawled pages by website domain name, and take the website corresponding to a cluster as the website to be verified.
230. Acquire, from the pages included in the cluster, at least two historical updated pages associated with the website to be verified.
If the at least two historical updated pages associated with the website to be verified are specifically the pages corresponding to the domain name of the website to be verified, then acquiring them from the pages included in the cluster may include:
taking all pages included in the cluster directly as the historical updated pages associated with the website to be verified;
If the at least two historical updated pages associated with the website to be verified are specifically the pages corresponding to the same web page address within the website to be verified, then acquiring them from the pages included in the cluster may include:
grouping the pages included in the cluster by URL (Uniform Resource Locator), where the pages in one group share the same URL; and taking the pages included in one group as the historical updated pages associated with the website to be verified.
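The clustering by domain name and the grouping by URL described in steps 220 and 230 could be sketched as below. The function names `cluster_by_domain` and `group_by_url`, the `(url, html)` tuple layout of the crawler output, and the choice to ignore the query string are assumptions for illustration; URLs are also assumed to include a scheme such as `http://`, since `urlsplit` only fills in the host name when one is present.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def cluster_by_domain(pages):
    """Cluster crawled pages by website domain name.

    pages: iterable of (url, html) tuples produced by a crawler.
    Returns a dict: domain name -> list of (url, html) tuples.
    """
    clusters = defaultdict(list)
    for url, html in pages:
        clusters[urlsplit(url).netloc.lower()].append((url, html))
    return dict(clusters)


def group_by_url(cluster):
    """Group the pages of one cluster by web page address.

    Returns a dict: address -> list of page snapshots, keeping only addresses
    with at least two snapshots, since at least two historical updated pages
    are required.
    """
    groups = defaultdict(list)
    for url, html in cluster:
        parts = urlsplit(url)
        key = f"{parts.netloc.lower()}{parts.path.rstrip('/')}"
        groups[key].append(html)
    return {key: snaps for key, snaps in groups.items() if len(snaps) >= 2}
```

Each cluster returned by `cluster_by_domain` corresponds to one website to be verified; `group_by_url` gives the finer-grained, per-address view.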
240. Parse the content of each historical updated page to obtain at least one content domain corresponding to each historical updated page.
250. Extract at least one comparison object from the same target content domain of each historical updated page.
In this embodiment, if the content of the target content domain includes text, the comparison object may include: original text, a semantic signature, or a semantic category. If the content of the target content domain includes pictures, the comparison object may include: the original picture or the picture category.
The original text refers to the text content that appears directly in a content domain. For example, if the text content in a text title domain is "On 2016.6.17, XX company went public in the United States", then that text content is the original text;
The semantic signature is a refinement of the original text: the original text undergoes semantic recognition and processing, its core semantic content is retained, and that content is expressed as a combination of several core words. This combination of core words is called the semantic signature. Continuing the example, for the original text "On 2016.6.17, XX company went public in the United States", the corresponding semantic signature is "XX company, United States, listing";
The semantic category refers to the semantic class of the original text content. Continuing the example, for the original text "On 2016.6.17, XX company went public in the United States", the corresponding semantic category is "finance".
It should be understood that original text, semantic signature, and semantic category represent information types of different granularity. Accordingly, calculating the information entropy of these three information types yields information measures of different granularity. In practice, those skilled in the art can choose the information type of the appropriate granularity as the comparison object according to the accuracy required for abnormal-website identification.
Similarly, the original picture refers to the picture content that appears directly in a content domain, and the picture category refers to the class to which the picture belongs under some classification system.
Of course, those skilled in the art will appreciate that other forms of comparison object can also be obtained from a content domain. In fact, any page section or page information type that can be clearly defined and identified can serve as a comparison object; this embodiment places no limitation on this.
260. Calculate the occurrence probability of each comparison object according to the number of times it occurs in the target content domain of the historical updated pages.
In a specific example, within one day the website to be verified corresponds to three historical updated pages: historical updated page 1, historical updated page 2, and historical updated page 3. The chosen target content domain is the text title domain, and the chosen comparison object is the original text.
The original texts appearing in the text title domain of historical updated page 1 are: text title 1, text title 2, and text title 3. The original texts appearing in the text title domain of historical updated page 2 are: text title 1, text title 3, and text title 4. The original texts appearing in the text title domain of historical updated page 3 are: text title 3 and text title 5.
Accordingly, a total of 8 text titles appear in the three historical updated pages. Text title 1 appears 2 times in the three pages, so its occurrence probability is 2/8. Text title 2 appears 1 time, so its occurrence probability is 1/8. Text title 3 appears 3 times, so its occurrence probability is 3/8. Text title 4 appears 1 time, so its occurrence probability is 1/8. Text title 5 appears 1 time, so its occurrence probability is 1/8.
270. Calculate the information entropy corresponding to the target content domain according to the occurrence probability of the comparison object.
According to the information entropy formula $H = -\sum_i p_i \log_2 p_i$, the information entropy H corresponding to the target content domain above is:
$$H = \tfrac{1}{4}\log_2 4 + \tfrac{1}{8}\log_2 8 + \tfrac{3}{8}\log_2\tfrac{8}{3} + \tfrac{1}{8}\log_2 8 + \tfrac{1}{8}\log_2 8 \approx 2.16$$
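The worked example of steps 260 and 270 can be reproduced with a short Python sketch. The function names `occurrence_probabilities` and `information_entropy` and the title strings are illustrative assumptions, not names used by the patent; the sketch assumes the comparison objects are plain strings and that the entropy is measured in bits.

```python
import math
from collections import Counter


def occurrence_probabilities(pages):
    """Map each comparison object to its share of all occurrences across pages."""
    counts = Counter(obj for page in pages for obj in page)
    total = sum(counts.values())
    return {obj: n / total for obj, n in counts.items()}


def information_entropy(probabilities):
    """Shannon entropy in bits: H = -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probabilities.values() if p > 0)


# Text titles found in the text title domain of the three historical update pages.
pages = [
    ["title 1", "title 2", "title 3"],
    ["title 1", "title 3", "title 4"],
    ["title 3", "title 5"],
]
probs = occurrence_probabilities(pages)
H = information_entropy(probs)
print(probs)
print(round(H, 4))
```

A domain whose objects repeat heavily yields a small entropy, while a domain whose objects rarely repeat yields a large one, which is the signal the later identification step relies on.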
280. Perform anomaly identification on the website to be verified according to the information entropy calculation result.
By analyzing the characteristics of various cheating websites, the inventors found the following. If, across multiple historical update pages of the same website, the main pictures of the pages repeat heavily (the information entropy of the pictures is small) while the picture description texts or text titles rarely repeat (the information entropy of the picture description texts or text titles is large), the website is likely to be a cheating website. In addition, if the information entropy of the picture categories differs markedly from the information entropy of the picture titles, the website is also likely to be a cheating website.
Accordingly, in a preferred implementation of this embodiment, performing anomaly identification on the website to be verified according to the information entropy calculation result may include:
If the sum of the information entropies of the content domains corresponding to the website to be verified is less than a set first threshold, determining that the website to be verified is an abnormal website; or
If the information entropy of at least one target content domain corresponding to the website to be verified is less than a set second threshold, determining that the website to be verified is an abnormal website; or
If the ratio between the information entropies of at least two target content domains corresponding to the website to be verified is less than a set third threshold, determining that the website to be verified is an abnormal website.
The first threshold, the second threshold and the third threshold can be preset according to the actual situation; this embodiment places no limitation on this.
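The three threshold rules can be sketched in Python as follows. The text presents them as alternatives; combining them with a logical OR in one function, the function name, the domain names and all threshold values are assumptions made for illustration only.

```python
def is_abnormal_by_entropy(entropies, sum_threshold, domain_threshold,
                           ratio_threshold, ratio_pair):
    """Apply the three rules to a dict mapping content-domain name -> entropy.

    ratio_pair is (numerator_domain, denominator_domain); which two domains
    to compare is a hypothetical choice, the source does not fix it.
    """
    # Rule 1: the sum of all domain entropies is below the first threshold.
    if sum(entropies.values()) < sum_threshold:
        return True
    # Rule 2: at least one domain entropy is below the second threshold.
    if any(h < domain_threshold for h in entropies.values()):
        return True
    # Rule 3: the ratio of two domain entropies is below the third threshold.
    numerator, denominator = ratio_pair
    if entropies[denominator] > 0:
        if entropies[numerator] / entropies[denominator] < ratio_threshold:
            return True
    return False


# Pictures repeat heavily while picture titles rarely repeat.
suspicious = {"picture": 0.2, "picture_title": 3.0}
ordinary = {"picture": 2.5, "picture_title": 3.0}
pair = ("picture", "picture_title")
print(is_abnormal_by_entropy(suspicious, 1.0, 0.1, 0.2, pair))
print(is_abnormal_by_entropy(ordinary, 1.0, 0.1, 0.2, pair))
```

In this example only the ratio rule fires for the first site, which mirrors the inventors' observation about heavily repeated pictures paired with rarely repeated titles.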
In the technical solution of this embodiment, pages that were newly generated or updated within a certain period are screened, pages from the same website are aggregated, and the websites to be verified are chosen according to the aggregation result. Compared with performing anomaly identification on every website indexed by the search engine, this greatly reduces the amount of computation without significantly increasing missed detections. In addition, because a website is identified according to the information entropy differences among its own content domains, no reference site needs to be introduced; abnormal websites can be identified simply and accurately from the entropy differences among the content domains of the website to be verified alone.
On the basis of the above embodiments, before calculating the occurrence probability of the comparison object according to the number of times it appears in each target content domain, the method may further include:
If the comparison object is determined to be a time-sensitive, simply repeated text, obtaining, from each historical update page, the body content associated with the comparison object; and if the body content corresponding to the same target comparison object differs between historical update pages, marking the target comparison object as different comparison objects.
The reason for this arrangement is that texts that are time-sensitive but identical in form need special handling when the information entropy is calculated. For example, headlines such as "Weekly News Flash" or "Domestic Briefs" correspond to different body content at different times, so the body content must be taken into account when the information entropy is calculated. That is, suppose the comparison object "Weekly News Flash" appears in both historical update page 1 and historical update page 2. If only the occurrences of "Weekly News Flash" are counted, the occurrence probability of this comparison object is 1. However, because "Weekly News Flash" is a time-sensitive text, the body content corresponding to it in historical update page 1 and in historical update page 2 must also be compared. If the two differ, the "Weekly News Flash" in historical update page 1 and the "Weekly News Flash" in historical update page 2 are identified as different comparison objects, and the occurrence probability of each is then 1/2.
With this arrangement, the information entropy can be calculated more accurately, which in turn improves the accuracy of abnormal website identification.
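One way to realize this special handling is to extend the comparison key of a time-sensitive title with a digest of its body content. The Python sketch below is a minimal illustration under assumptions: the source does not say how a title is judged to be time-sensitive, so the fixed set `TIME_SENSITIVE_TITLES` is hypothetical, and comparing bodies by SHA-256 digest is only one possible notion of "the body content differs".

```python
import hashlib
from collections import Counter

# Hypothetical list of recurring headlines; the source does not specify
# how time-sensitive titles are detected.
TIME_SENSITIVE_TITLES = {"Weekly News Flash", "Domestic Briefs"}


def comparison_key(title, body):
    """Return the key under which a title is counted.

    Time-sensitive titles are keyed by (title, body digest), so the same
    headline with different body content counts as different objects.
    """
    if title in TIME_SENSITIVE_TITLES:
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        return (title, digest)
    return (title, None)


def title_probabilities(pages):
    """pages: list of pages, each a list of (title, body) tuples."""
    counts = Counter(comparison_key(t, b) for page in pages for t, b in page)
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}


pages = [
    [("Weekly News Flash", "body published in week 1")],
    [("Weekly News Flash", "body published in week 2")],
]
probs = title_probabilities(pages)
print(sorted(probs.values()))
```

With different bodies the headline is split into two comparison objects with probability 1/2 each; with identical bodies it would remain a single object with probability 1.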
Embodiment three
Fig. 3 is a flow chart of a website identification method provided by Embodiment three of the present invention. This embodiment is optimized on the basis of the above embodiments. In this embodiment, performing anomaly identification on the website to be verified according to the information entropy calculation result is specifically optimized as: obtaining, from a reliable website list, a reference site associated with the website to be verified according to the data features of the website to be verified; obtaining the information entropy of at least one content domain corresponding to the reference site; choosing at least one key content domain in the website to be verified and the reference site; calculating a difference factor between the website to be verified and the reference site according to the information entropies respectively corresponding to the key content domain in the website to be verified and in the reference site; and, if the difference factor satisfies a set threshold condition, determining that the website to be verified is an abnormal website.
Accordingly, the method for the present embodiment specifically includes:
310, in setting the time period, obtain at least two history associated with website to be verified and update the page.
320, each described history is updated the page and carry out Context resolution, obtain corresponding extremely with each described history refresh page face
A few content domain.
330, the content change updated in the page in identical content territory according to each described history, calculates each described content domain
Comentropy.
340, according to the data characteristics of described website to be verified, obtain and described website to be verified in reliable website list
The reference site of association.
In this embodiment, the data features of the website to be verified may include at least one of the following: the website update frequency within the set time period, the number of newly added pages within the set time period, the content topic, and so on.
The reliable website list refers to a batch of reliable websites determined by methods such as mining user behavior logs or manual curation.
This embodiment considers that reliable websites with similar update frequencies, similar numbers of newly added pages within the set time period, or similar content topics will also show a certain similarity in the information entropies of the content domains of their pages. Therefore, by obtaining from the reliable website list a reference site that is similar to the website to be verified in these data features, and by comparing the information entropies of the respective domains in the reference site and in the website to be verified, abnormal websites can be identified.
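Step 340 leaves open how "similar in data features" is measured. The Python sketch below shows one possible reading under clearly stated assumptions: sites are plain dictionaries with the hypothetical keys `update_frequency`, `new_pages` and `topic`; sites with the same topic are preferred; and closeness is an unscaled Euclidean distance over the two numeric features. None of these choices comes from the source.

```python
import math


def pick_reference_site(candidate, reliable_sites):
    """Return the reliable site closest to the candidate site.

    Prefers sites sharing the candidate's topic and falls back to the whole
    list when none does. Raises ValueError if reliable_sites is empty.
    """
    def distance(site):
        return math.hypot(
            site["update_frequency"] - candidate["update_frequency"],
            site["new_pages"] - candidate["new_pages"],
        )

    same_topic = [s for s in reliable_sites if s["topic"] == candidate["topic"]]
    pool = same_topic or reliable_sites
    return min(pool, key=distance)


reliable_sites = [
    {"name": "r1", "update_frequency": 12, "new_pages": 45, "topic": "news"},
    {"name": "r2", "update_frequency": 10, "new_pages": 50, "topic": "sports"},
    {"name": "r3", "update_frequency": 100, "new_pages": 500, "topic": "news"},
]
candidate = {"update_frequency": 10, "new_pages": 50, "topic": "news"}
reference = pick_reference_site(candidate, reliable_sites)
print(reference["name"])
```

In a real system the numeric features would likely need normalizing before distances are compared, since page counts and update frequencies live on different scales.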
350. Obtain the information entropy of at least one content domain corresponding to the reference site.
360. Choose at least one key content domain in the website to be verified and the reference site.
All content domains that are included in both the website to be verified and the reference site may be taken as the key content domains, or one or more important content domains included in both (for example, the picture domain and the text title domain) may be taken as the key content domains; this embodiment places no limitation on this.
370. Calculate the difference factor between the website to be verified and the reference site according to the information entropies respectively corresponding to the key content domain in the website to be verified and in the reference site.
In a preferred implementation of this embodiment, calculating the difference factor between the website to be verified and the reference site according to the information entropies respectively corresponding to the key content domain in the website to be verified and in the reference site may specifically include:
Obtaining, in the website to be verified and the reference site, the difference between the information entropies corresponding to the same key content domain as the difference factor.
For example, in the website to be verified, the information entropy corresponding to key content domain 1 is A and the information entropy corresponding to key content domain 2 is B; in the reference site, the information entropy corresponding to key content domain 1 is C and the information entropy corresponding to key content domain 2 is D.
Then |A-C| and |B-D| can be used as the difference factors, where | | denotes the absolute value.
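The per-domain difference factors can be computed in a few lines of Python. The function name, the domain names and the entropy values below are illustrative assumptions; only the use of the absolute difference per key content domain comes from the text.

```python
def entropy_differences(candidate, reference, key_domains):
    """Return |H_candidate - H_reference| for every key content domain."""
    return {d: abs(candidate[d] - reference[d]) for d in key_domains}


# A, B for the website to be verified; C, D for the reference site.
candidate = {"picture": 0.3, "text_title": 4.1}
reference = {"picture": 3.2, "text_title": 3.9}
diffs = entropy_differences(candidate, reference, ["picture", "text_title"])
print(diffs)
```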
In another preferred implementation of this embodiment, calculating the difference factor between the website to be verified and the reference site according to the information entropies respectively corresponding to the key content domain in the website to be verified and in the reference site may specifically include:
Forming a first information vector from the information entropies respectively corresponding to at least two key content domains in the website to be verified;
Forming a second information vector from the information entropies respectively corresponding to the at least two key content domains in the reference site;
Calculating the distance value between the first information vector and the second information vector as the difference factor.
Continuing the previous example, in the website to be verified, the information entropy corresponding to key content domain 1 is A and the information entropy corresponding to key content domain 2 is B; in the reference site, the information entropy corresponding to key content domain 1 is C and the information entropy corresponding to key content domain 2 is D.
Then the first information vector corresponding to the website to be verified is [A, B], and the second information vector corresponding to the reference site is [C, D].
The distance value between the two vectors can be calculated in various ways, typically by calculating the cosine of the angle between them, and the calculated distance value is used as the difference factor.
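A cosine-based distance between the two information vectors can be sketched in Python as below. The text only says that the cosine of the angle is calculated; converting it to a distance as one minus the cosine similarity, so that larger values mean more different sites, is an assumption made here to match the "distance exceeds a threshold" rule.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


first_vector = [0.3, 4.1]   # [A, B] for the website to be verified
second_vector = [3.2, 3.9]  # [C, D] for the reference site
print(cosine_distance(first_vector, second_vector))
```

Cosine distance ignores vector magnitude, so two sites whose entropies differ by a constant scale factor would score as identical; a Euclidean distance is an alternative if magnitude matters.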
380. Judge whether the difference factor satisfies the set threshold condition. If so, perform 390; otherwise, perform 3100.
If the difference factors are information entropy differences, determining that the website to be verified is an abnormal website if the difference factor satisfies the set threshold condition may specifically include:
If a set number of information entropy differences exceed a set threshold, and/or the information entropy difference corresponding to a set key content domain exceeds a set threshold, determining that the website to be verified is an abnormal website; or
If the accumulated difference obtained by a weighted sum of at least two information entropy differences exceeds a set threshold, determining that the website to be verified is an abnormal website.
If the difference factor is the distance value, determining that the website to be verified is an abnormal website if the difference factor satisfies the set threshold condition may specifically include:
If the distance value exceeds a set threshold, determining that the website to be verified is an abnormal website.
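The threshold conditions for entropy differences can be sketched in Python as follows. The text offers the count-based rule and the weighted-sum rule as alternatives; checking both in one function, as well as the parameter names, weights and threshold values, are assumptions for illustration.

```python
def is_abnormal_by_differences(diffs, per_domain_threshold, min_count,
                               weights, weighted_threshold):
    """Decide from a dict mapping key content domain -> entropy difference.

    Count rule: at least min_count differences exceed per_domain_threshold.
    Weighted rule: the weighted sum of differences exceeds weighted_threshold.
    Domains missing from weights contribute nothing to the weighted sum.
    """
    exceeded = sum(1 for d in diffs.values() if d > per_domain_threshold)
    if exceeded >= min_count:
        return True
    weighted_sum = sum(weights.get(domain, 0.0) * d for domain, d in diffs.items())
    return weighted_sum > weighted_threshold


weights = {"picture": 2.0, "text_title": 1.0}
large = {"picture": 2.9, "text_title": 0.2}
small = {"picture": 0.4, "text_title": 0.2}
print(is_abnormal_by_differences(large, 1.0, 1, weights, 5.0))
print(is_abnormal_by_differences(small, 1.0, 1, weights, 1.5))
```

For the distance-based difference factor the decision reduces to a single comparison of the distance value against its threshold.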
390. Determine that the website to be verified is an abnormal website.
3100. Determine that the website to be verified is a normal website.
In the technical solution of this embodiment, after the information entropy of each content domain in the website to be verified is calculated, the information entropy of each content domain in a reliable website with similar data features is obtained, a difference factor between the two is calculated from their information entropies, and anomaly identification is then performed on the website to be verified. This achieves the technical effect of identifying abnormal websites simply and quickly according to the information entropy differences between abnormal websites and reliable websites, with high identification accuracy and good real-time performance.
Embodiment four
Fig. 4 is a structural diagram of a website identification device provided by Embodiment four of the present invention. As shown in Fig. 4, the device includes: a historical update page acquisition module 41, a content domain acquisition module 42, a content domain information entropy calculation module 43 and an anomaly identification module 44, wherein:
The historical update page acquisition module 41 is configured to obtain, within a set time period, at least two historical update pages associated with the website to be verified.
The content domain acquisition module 42 is configured to parse the content of each historical update page to obtain at least one content domain corresponding to each historical update page.
The content domain information entropy calculation module 43 is configured to calculate the information entropy of each content domain according to the content changes within the same content domain across the historical update pages.
The anomaly identification module 44 is configured to perform anomaly identification on the website to be verified according to the information entropy calculation result.
In the embodiment of the present invention, at least two historical update pages associated with the website to be verified are obtained within a set time period; the content of each historical update page is parsed to obtain at least one content domain corresponding to each historical update page; the information entropy of each content domain is calculated according to the content changes within the same content domain across the historical update pages; and anomaly identification is performed on the website to be verified according to the information entropy calculation result. Because information entropy features discriminate well, are simple to compute and are highly timely, this solves the technical problems of existing cheating website identification techniques, namely low discrimination, poor real-time performance and the need for additional manual labeling or data curation. It thereby optimizes existing website identification technology and improves the identification accuracy for abnormal websites.
On the basis of the above embodiments, the at least two historical update pages associated with the website to be verified may include:
At least two historical update pages corresponding to the website domain name of the website to be verified; and/or
At least two historical update pages corresponding to the same web page address in the website to be verified.
On the basis of the above embodiments, the historical update page acquisition module may specifically be configured to:
Crawl, by means of a web crawler and within the set time period, pages in the network that are newly generated and/or updated;
Cluster the crawled pages by website domain name, and take the website corresponding to a cluster as the website to be verified;
Obtain, according to the pages included in the cluster, at least two historical update pages associated with the website to be verified.
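The clustering step of the acquisition module can be sketched in Python using only the standard library. The sketch assumes the crawler has already produced a list of page URLs, groups them by host name as a stand-in for "website domain name", and keeps only hosts with at least two pages; the function names and the `min_pages` parameter are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def cluster_pages_by_domain(urls):
    """Group crawled page URLs by the host name of the site they belong to."""
    clusters = defaultdict(list)
    for url in urls:
        host = urlsplit(url).hostname  # None for URLs without a network location
        if host:
            clusters[host].append(url)
    return dict(clusters)


def sites_to_verify(urls, min_pages=2):
    """Keep clusters with at least min_pages pages as websites to be verified."""
    clusters = cluster_pages_by_domain(urls)
    return {host: pages for host, pages in clusters.items() if len(pages) >= min_pages}


crawled = [
    "http://a.example/p1",
    "http://a.example/p2",
    "http://b.example/x",
]
print(sites_to_verify(crawled))
```

Grouping by full host name treats subdomains as separate sites; grouping by registered domain would need a public-suffix list, which is outside this sketch.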
On the basis of the above embodiments, the content domain may include at least one of the following:
A text title domain, a picture domain, a picture title domain and a picture description text domain.
On the basis of the above embodiments, the content domain information entropy calculation module may specifically be configured to:
Extract at least one comparison object from the same target content domain of each historical update page;
Calculate the occurrence probability of the comparison object according to the number of times the comparison object appears in the target content domain of each historical update page;
Calculate the information entropy corresponding to the target content domain according to the occurrence probability of the comparison object.
On the basis of the above embodiments, if the content of the target content domain includes text, the comparison object may include: the original text, a semantic signature or a semantic category;
If the content of the target content domain includes pictures, the comparison object may include: the original picture or the picture category.
On the basis of the above embodiments, the device may further include a body content association comparison module, configured to:
Before the occurrence probability of the comparison object is calculated according to the number of times the comparison object appears in each target content domain, if the comparison object is determined to be a time-sensitive, simply repeated text, obtain, from each historical update page, the body content associated with the comparison object;
If the body content corresponding to the same target comparison object differs between historical update pages, mark the target comparison object as different comparison objects.
On the basis of the above embodiments, the anomaly identification module may specifically include:
A reference site acquisition unit, configured to obtain, from the reliable website list, a reference site associated with the website to be verified according to the data features of the website to be verified;
A reference site information entropy acquisition unit, configured to obtain the information entropy of at least one content domain corresponding to the reference site;
A key content domain selection unit, configured to choose at least one key content domain in the website to be verified and the reference site;
A difference factor calculation unit, configured to calculate the difference factor between the website to be verified and the reference site according to the information entropies respectively corresponding to the key content domain in the website to be verified and in the reference site;
An abnormal website identification subunit, configured to determine that the website to be verified is an abnormal website if the difference factor satisfies the set threshold condition.
On the basis of the above embodiments, the difference factor calculation unit may specifically be configured to:
Obtain, in the website to be verified and the reference site, the difference between the information entropies corresponding to the same key content domain as the difference factor;
The abnormal website identification subunit may specifically be configured to:
If a set number of information entropy differences exceed a set threshold, and/or the information entropy difference corresponding to a set key content domain exceeds a set threshold, determine that the website to be verified is an abnormal website; or
If the accumulated difference obtained by a weighted sum of at least two information entropy differences exceeds a set threshold, determine that the website to be verified is an abnormal website.
On the basis of the above embodiments, the difference factor calculation unit may specifically be configured to:
Form a first information vector from the information entropies respectively corresponding to at least two key content domains in the website to be verified;
Form a second information vector from the information entropies respectively corresponding to the at least two key content domains in the reference site;
Calculate the distance value between the first information vector and the second information vector as the difference factor;
The abnormal website identification subunit may specifically be configured to:
If the distance value exceeds a set threshold, determine that the website to be verified is an abnormal website.
On the basis of the above embodiments, the data features of the website to be verified may include at least one of the following:
The website update frequency within the set time period, the number of newly added pages within the set time period, and the content topic.
On the basis of the above embodiments, the anomaly identification module may specifically be configured to:
If the sum of the information entropies of the content domains corresponding to the website to be verified is less than a set first threshold, determine that the website to be verified is an abnormal website; or
If the information entropy of at least one target content domain corresponding to the website to be verified is less than a set second threshold, determine that the website to be verified is an abnormal website; or
If the ratio between the information entropies of at least two target content domains corresponding to the website to be verified is less than a set third threshold, determine that the website to be verified is an abnormal website.
On the basis of the above embodiments, the anomaly identification module may specifically be configured to:
Take the information entropy calculation result as at least one information entropy feature value, combine the information entropy feature value with other abnormal website identification feature values, and perform anomaly identification on the website to be verified.
The website identification device provided by the embodiment of the present invention can be used to perform the website identification method provided by any embodiment of the present invention; it has the corresponding functional modules and achieves the same beneficial effects.
Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented by the server described above. Alternatively, the embodiments of the present invention can be implemented by a program executable by a computer device, so that the program can be stored in a storage device and executed by a processor. The program can be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc or the like. Alternatively, the modules or steps can each be made into an individual integrated circuit module, or several of them can be made into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The foregoing is only the preferred embodiments of the present invention and is not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and changes. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.
Claims (20)
1. A website identification method, characterized by comprising:
Obtaining, within a set time period, at least two historical update pages associated with a website to be verified;
Parsing the content of each historical update page to obtain at least one content domain corresponding to each historical update page;
Calculating the information entropy of each content domain according to the content changes within the same content domain across the historical update pages;
Performing anomaly identification on the website to be verified according to the information entropy calculation result.
2. The method according to claim 1, characterized in that the at least two historical update pages associated with the website to be verified comprise:
At least two historical update pages corresponding to the website domain name of the website to be verified; and/or
At least two historical update pages corresponding to the same web page address in the website to be verified.
3. The method according to claim 1 or 2, characterized in that obtaining, within a set time period, at least two historical update pages associated with the website to be verified comprises:
Crawling, by means of a web crawler and within the set time period, pages in the network that are newly generated and/or updated;
Clustering the crawled pages by website domain name, and taking the website corresponding to a cluster as the website to be verified;
Obtaining, according to the pages included in the cluster, at least two historical update pages associated with the website to be verified.
4. The method according to claim 1, characterized in that the content domain comprises at least one of the following:
A text title domain, a picture domain, a picture title domain and a picture description text domain.
5. The method according to claim 1, characterized in that calculating the information entropy of each content domain according to the content changes within the same content domain across the historical update pages comprises:
Extracting at least one comparison object from the same target content domain of each historical update page;
Calculating the occurrence probability of the comparison object according to the number of times the comparison object appears in the target content domain of each historical update page;
Calculating the information entropy corresponding to the target content domain according to the occurrence probability of the comparison object.
6. The method according to claim 5, characterized in that:
If the content of the target content domain includes text, the comparison object comprises: the original text, a semantic signature or a semantic category;
If the content of the target content domain includes pictures, the comparison object comprises: the original picture or the picture category.
7. The method according to claim 5 or 6, characterized in that, before calculating the occurrence probability of the comparison object according to the number of times the comparison object appears in each target content domain, the method further comprises:
If the comparison object is determined to be a time-sensitive, simply repeated text, obtaining, from each historical update page, the body content associated with the comparison object;
If the body content corresponding to the same target comparison object differs between historical update pages, marking the target comparison object as different comparison objects.
8. The method according to claim 1, characterized in that performing anomaly identification on the website to be verified according to the information entropy calculation result comprises:
Obtaining, from a reliable website list, a reference site associated with the website to be verified according to the data features of the website to be verified;
Obtaining the information entropy of at least one content domain corresponding to the reference site;
Choosing at least one key content domain in the website to be verified and the reference site;
Calculating a difference factor between the website to be verified and the reference site according to the information entropies respectively corresponding to the key content domain in the website to be verified and in the reference site;
If the difference factor satisfies a set threshold condition, determining that the website to be verified is an abnormal website.
9. The method according to claim 8, characterized in that calculating the difference factor between the website to be verified and the reference site according to the information entropies respectively corresponding to the key content domain in the website to be verified and in the reference site specifically comprises:
Obtaining, in the website to be verified and the reference site, the difference between the information entropies corresponding to the same key content domain as the difference factor;
And determining that the website to be verified is an abnormal website if the difference factor satisfies the set threshold condition specifically comprises:
If a set number of information entropy differences exceed a set threshold, and/or the information entropy difference corresponding to a set key content domain exceeds a set threshold, determining that the website to be verified is an abnormal website; or
If the accumulated difference obtained by a weighted sum of at least two information entropy differences exceeds a set threshold, determining that the website to be verified is an abnormal website.
10. The method according to claim 8, characterized in that calculating the difference factor between the website to be verified and the reference website, according to the information entropies respectively corresponding to the key content domain in the website to be verified and in the reference website, specifically includes:
forming a first information vector from the information entropies respectively corresponding to at least two key content domains in the website to be verified;
forming a second information vector from the information entropies respectively corresponding to the same at least two key content domains in the reference website;
calculating the distance value between the first information vector and the second information vector as the difference factor;
and determining that the website to be verified is an abnormal website if the difference factor satisfies the set threshold condition specifically includes:
if the distance value exceeds a set threshold, determining that the website to be verified is an abnormal website.
11. The method according to any one of claims 8-10, characterized in that the data characteristics of the website to be verified include at least one of the following:
the update frequency within a set time period, the number of newly added pages within a set time period, and the content topic.
12. The method according to claim 1, characterized in that performing anomaly identification on the website to be verified according to the information entropy calculation result includes:
if the sum of the information entropies of the content domains corresponding to the website to be verified is less than a set first threshold, determining that the website to be verified is an abnormal website; or
if the information entropy of at least one target content domain corresponding to the website to be verified is less than a set second threshold, determining that the website to be verified is an abnormal website; or
if the ratio between the information entropies of at least two target content domains corresponding to the website to be verified is less than a set third threshold, determining that the website to be verified is an abnormal website.
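The three alternative threshold tests of claim 12 might look like the sketch below. The function names, the choice of which domain is the numerator, and the handling of a zero denominator are all assumptions made for illustration.

```python
def low_total_entropy(entropies, first_threshold):
    """Rule 1: the sum of all content-domain entropies is below the first threshold."""
    return sum(entropies.values()) < first_threshold


def low_target_entropy(entropies, target_domains, second_threshold):
    """Rule 2: at least one target content domain has entropy below the second threshold."""
    return any(entropies[d] < second_threshold for d in target_domains)


def low_entropy_ratio(entropies, numerator, denominator, third_threshold):
    """Rule 3: the entropy ratio of two target content domains is below the third threshold."""
    if entropies[denominator] == 0:
        # The ratio is undefined; not flagging the site here is an assumption.
        return False
    return entropies[numerator] / entropies[denominator] < third_threshold
```

A site whose titles never change while its bodies vary would, for example, yield a small title-to-body ratio under rule 3.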
13. The method according to claim 1, characterized in that performing anomaly identification on the website to be verified according to the information entropy calculation result includes:
using the information entropy calculation result as at least one information entropy feature value, and combining the information entropy feature value with other abnormal-website identification feature values to perform anomaly identification on the website to be verified.
14. A website identification device, characterized by comprising:
a historical update page acquisition module, configured to obtain, within a set time period, at least two historical update pages associated with a website to be verified;
a content domain acquisition module, configured to perform content parsing on each historical update page to obtain at least one content domain corresponding to each historical update page;
a content domain information entropy calculation module, configured to calculate the information entropy of each content domain according to the content changes of the same content domain across the historical update pages;
an anomaly identification module, configured to perform anomaly identification on the website to be verified according to the information entropy calculation result.
15. The device according to claim 14, characterized in that the historical update page acquisition module is specifically configured to:
crawl, by means of a web crawler and within a set time period, pages in the network that are newly generated and/or have been updated;
after clustering the crawled pages by website domain name, take the website corresponding to a cluster as the website to be verified;
obtain, from the pages included in the cluster, at least two historical update pages associated with the website to be verified.
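The clustering step in claim 15 can be approximated by grouping crawled pages on their host name, as in the sketch below. Treating the URL host name as the "website domain name" is a simplifying assumption, and the function name and the `(url, html)` input shape are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_pages_by_domain(pages):
    """Group crawled (url, html) pairs by host name.

    Only clusters with at least two pages are kept, since the method needs
    at least two historical update pages per website to be verified.
    """
    clusters = defaultdict(list)
    for url, html in pages:
        host = urlsplit(url).hostname  # None for URLs without a network location
        if host:
            clusters[host].append((url, html))
    return {host: items for host, items in clusters.items() if len(items) >= 2}
```

Each key of the returned dictionary then stands for one website to be verified, and its value holds that site's historical update pages.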
16. The device according to claim 14, characterized in that the content domain information entropy calculation module is specifically configured to:
extract at least one comparison object from the same target content domain of each historical update page;
calculate the occurrence probability of the comparison object according to the number of times it occurs in the target content domain of the historical update pages;
calculate the information entropy corresponding to the target content domain according to the occurrence probability of the comparison object.
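The calculation in claim 16 can be illustrated with the standard Shannon entropy, H = sum over objects of p * log2(1/p), where p is the occurrence probability of a comparison object. The logarithm base and the function name are assumptions; the claim only speaks of "information entropy".

```python
import math
from collections import Counter


def domain_entropy(observations):
    """Shannon entropy (in bits) of the comparison objects seen in one content domain.

    observations: one comparison object (e.g. a title string) per historical update page.
    """
    counts = Counter(observations)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Occurrence probability is count / total; entropy is sum of p * log2(1 / p).
    return sum((c / total) * math.log2(total / c) for c in counts.values())
```

A content domain whose value never changes across updates gets entropy 0, while a domain with a different value on every page gets the maximum, log2 of the number of pages.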
17. The device according to claim 16, characterized by further comprising a body content association comparison module, configured to:
before the occurrence probability of the comparison object is calculated according to the number of times the comparison object occurs in each target content domain, if the comparison object is determined to be time-sensitive simple repeated text, obtain, in each historical update page, the body content associated with the comparison object;
if, in different historical update pages, the body content corresponding to the same target comparison object differs, mark the target comparison object as different comparison objects.
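One way to picture the disambiguation in claim 17: a recurring title such as a daily bulletin heading looks identical on every page, yet each page carries different body content. The sketch below attaches a digest of the body to such objects so they count as distinct. The function name, the `(text, body)` input shape, the caller-supplied `is_time_sensitive` predicate, and the use of SHA-256 are all assumptions.

```python
import hashlib


def disambiguate(observations, is_time_sensitive):
    """Return one key per observation, usable as a comparison object.

    observations: list of (comparison_text, body_text) pairs, one per page.
    is_time_sensitive: hypothetical predicate marking simple repeated text.
    """
    keys = []
    for text, body in observations:
        if is_time_sensitive(text):
            # Same text with a different body yields a different key.
            digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
            keys.append((text, digest))
        else:
            keys.append((text, None))
    return keys
```

The resulting keys can be counted in place of the raw text, so that legitimately repeated headings do not drag the entropy of a healthy site toward zero.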
18. The device according to claim 14, characterized in that the anomaly identification module specifically includes:
a reference website acquisition unit, configured to obtain, from a list of reliable websites and according to the data characteristics of the website to be verified, a reference website associated with the website to be verified;
a reference website information entropy acquisition unit, configured to obtain the information entropy of at least one content domain corresponding to the reference website;
a key content domain selection unit, configured to select at least one key content domain in the website to be verified and the reference website;
a difference factor calculation unit, configured to calculate a difference factor between the website to be verified and the reference website according to the information entropies respectively corresponding to the key content domain in the website to be verified and in the reference website;
an abnormal website identification subunit, configured to determine that the website to be verified is an abnormal website if the difference factor satisfies a set threshold condition.
19. The device according to claim 14, characterized in that the anomaly identification module is specifically configured to:
if the sum of the information entropies of the content domains corresponding to the website to be verified is less than a set first threshold, determine that the website to be verified is an abnormal website; or
if the information entropy of at least one target content domain corresponding to the website to be verified is less than a set second threshold, determine that the website to be verified is an abnormal website; or
if the ratio between the information entropies of at least two target content domains corresponding to the website to be verified is less than a set third threshold, determine that the website to be verified is an abnormal website.
20. The device according to claim 14, characterized in that the anomaly identification module is specifically configured to:
use the information entropy calculation result as at least one information entropy feature value, and combine the information entropy feature value with other abnormal-website identification feature values to perform anomaly identification on the website to be verified.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610571258.6A CN106294535B (en) | 2016-07-19 | 2016-07-19 | The recognition methods of website and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106294535A (en) | 2017-01-04 |
CN106294535B (en) | 2019-06-25 |
Family
ID=57651792
Family Applications (1)
Application Number | Status | Priority Date | Filing Date | Title |
---|---|---|---|---|
CN201610571258.6A CN106294535B (en) | Active | 2016-07-19 | 2016-07-19 | The recognition methods of website and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106294535B (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107451180A (en) * | 2017-06-13 | 2017-12-08 | 百度在线网络技术(北京)有限公司 | Identify method, apparatus, equipment and the computer-readable storage medium of website affinity |
CN108280110A (en) * | 2017-05-15 | 2018-07-13 | 广州市动景计算机科技有限公司 | Website contrast difference's method, apparatus and client |
CN109150817A (en) * | 2017-11-24 | 2019-01-04 | 新华三信息安全技术有限公司 | A kind of web-page requests recognition methods and device |
CN109800378A (en) * | 2019-01-23 | 2019-05-24 | 北京字节跳动网络技术有限公司 | Content processing method, device and electronic equipment based on custom browser |
CN109818828A (en) * | 2019-02-20 | 2019-05-28 | 成都嗨翻屋科技有限公司 | A kind of distributed reptile system monitoring method and device |
CN110716778A (en) * | 2019-09-10 | 2020-01-21 | 阿里巴巴集团控股有限公司 | Application compatibility testing method, device and system |
CN111460763A (en) * | 2020-03-02 | 2020-07-28 | 南京南瑞继保电气有限公司 | Method, device and equipment for marking file differences and computer-readable storage medium |
CN113554131A (en) * | 2021-09-22 | 2021-10-26 | 四川大学华西医院 | Medical image processing and analysis method, computer equipment, system and storage medium |
CN113841156A (en) * | 2019-05-27 | 2021-12-24 | 西门子股份公司 | Control method and device based on image recognition |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101281521A (en) * | 2007-04-05 | 2008-10-08 | 中国科学院自动化研究所 | A sensitive web page filtering method and system based on multi-classifier fusion |
CN101751458A (en) * | 2009-12-31 | 2010-06-23 | 暨南大学 | Network public sentiment monitoring system and method |
CN105205061A (en) * | 2014-06-12 | 2015-12-30 | 中国银联股份有限公司 | Method for acquiring page information of E-commerce website |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108280110A (en) * | 2017-05-15 | 2018-07-13 | 广州市动景计算机科技有限公司 | Website contrast difference's method, apparatus and client |
CN107451180A (en) * | 2017-06-13 | 2017-12-08 | 百度在线网络技术(北京)有限公司 | Identify method, apparatus, equipment and the computer-readable storage medium of website affinity |
CN109150817A (en) * | 2017-11-24 | 2019-01-04 | 新华三信息安全技术有限公司 | A kind of web-page requests recognition methods and device |
CN109150817B (en) * | 2017-11-24 | 2020-11-27 | 新华三信息安全技术有限公司 | Webpage request identification method and device |
CN109800378A (en) * | 2019-01-23 | 2019-05-24 | 北京字节跳动网络技术有限公司 | Content processing method, device and electronic equipment based on custom browser |
CN109818828A (en) * | 2019-02-20 | 2019-05-28 | 成都嗨翻屋科技有限公司 | A kind of distributed reptile system monitoring method and device |
CN113841156A (en) * | 2019-05-27 | 2021-12-24 | 西门子股份公司 | Control method and device based on image recognition |
CN110716778A (en) * | 2019-09-10 | 2020-01-21 | 阿里巴巴集团控股有限公司 | Application compatibility testing method, device and system |
CN110716778B (en) * | 2019-09-10 | 2023-09-26 | 创新先进技术有限公司 | Application compatibility testing method, device and system |
CN111460763A (en) * | 2020-03-02 | 2020-07-28 | 南京南瑞继保电气有限公司 | Method, device and equipment for marking file differences and computer-readable storage medium |
CN113554131A (en) * | 2021-09-22 | 2021-10-26 | 四川大学华西医院 | Medical image processing and analysis method, computer equipment, system and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN106294535B (en) | 2019-06-25 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||