
CN106294535A - Website identification method and device - Google Patents

Website identification method and device

Info

Publication number
CN106294535A
Authority
CN
China
Prior art keywords
website
verified
page
content
information entropy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610571258.6A
Other languages
Chinese (zh)
Other versions
CN106294535B (en)
Inventor
邹红建
方高林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201610571258.6A priority Critical patent/CN106294535B/en
Publication of CN106294535A publication Critical patent/CN106294535A/en
Application granted granted Critical
Publication of CN106294535B publication Critical patent/CN106294535B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

An embodiment of the invention discloses a website identification method and device. The method includes: within a set time period, obtaining at least two historical update pages associated with a website to be verified; parsing the content of each historical update page to obtain at least one content domain corresponding to each historical update page; calculating the information entropy of each content domain according to how the content of the same content domain changes across the historical update pages; and performing anomaly identification on the website to be verified according to the information entropy calculation result. The information entropy features used by the technical solution discriminate well, are simple to compute, and are timely. The solution addresses the low discrimination and poor real-time performance of existing cheating-website identification techniques, as well as their need for additional manual labeling or data curation. It thereby optimizes existing website identification technology and improves the accuracy of identifying abnormal websites.

Description

Website identification method and device
Technical field
Embodiments of the present invention relate to computer processing technology, and in particular to a website identification method and device.
Background
Information retrieval refers to the process of finding required documents in a collection of information resources, or finding the information contained in those documents. A search engine is an information retrieval tool for searching information on the Internet. The emergence of search engines has made it convenient to obtain information from massive resources. Web page cheating followed soon after. For economic or other benefits, cheating websites mislead search engines by various means in order to raise the position of their pages in search engine rankings. Because cheating websites are of low quality and usually contain advertisements, especially for pornography, gambling, and the like, they seriously degrade the user experience. Identifying cheating websites is therefore an important problem in information retrieval, and improving the identification technology is significant for improving search engine quality.
At present, the cheating methods used by cheating websites change frequently, but they can generally be grouped into two categories: content cheating and link cheating. Content cheating usually piles up hot queries in a page to raise the page's ranking in search engine results. Link cheating mainly targets graph algorithms modeled on the page-scoring algorithm PageRank, building link relationships to raise the weight of a website; link cheating also includes cheating by page redirection. Cheating-website identification has long been a research focus in the industry. Machine learning methods such as naive Bayes, logistic regression, SVM (Support Vector Machine), ensemble learning, and deep learning have all been applied, using features that include content features, link features, and so on. External information such as user click behavior is also used for identification.
The main shortcomings of existing cheating-website identification techniques are as follows. Cheating pages whose page-structure features are not distinctive and whose text is not stuffed with cheating words are difficult to identify in time. Graph-model algorithms that rely on link-relationship features are complex and struggle to meet real-time identification requirements. Distinguishing newly emerging ordinary websites and relatively niche websites from newly emerging cheating websites is another difficulty. In addition, a major challenge for cheating-website identification is that cheating websites update quickly, so existing identification schemes or models gradually lose effectiveness over time. Reinforcement learning and active learning can partly solve this problem, but they require additional manual labeling or data curation.
Summary of the invention
In view of this, embodiments of the present invention provide a website identification method and device, so as to optimize existing website identification technology and improve the accuracy of identifying abnormal websites.
In a first aspect, an embodiment of the present invention provides a website identification method, including:
within a set time period, obtaining at least two historical update pages associated with a website to be verified;
parsing the content of each historical update page to obtain at least one content domain corresponding to each historical update page;
calculating the information entropy of each content domain according to how the content of the same content domain changes across the historical update pages;
performing anomaly identification on the website to be verified according to the information entropy calculation result.
In a second aspect, an embodiment of the present invention further provides a website identification device, including:
a historical update page acquisition module, configured to obtain, within a set time period, at least two historical update pages associated with a website to be verified;
a content domain acquisition module, configured to parse the content of each historical update page to obtain at least one content domain corresponding to each historical update page;
a content domain information entropy calculation module, configured to calculate the information entropy of each content domain according to how the content of the same content domain changes across the historical update pages;
an anomaly identification module, configured to perform anomaly identification on the website to be verified according to the information entropy calculation result.
In embodiments of the present invention, at least two historical update pages associated with a website to be verified are obtained within a set time period; the content of each historical update page is parsed to obtain at least one content domain corresponding to each historical update page; the information entropy of each content domain is calculated according to how the content of the same content domain changes across the historical update pages; and anomaly identification is performed on the website to be verified according to the information entropy calculation result. Because information entropy features discriminate well, are simple to compute, and are timely, this addresses the low discrimination and poor real-time performance of existing cheating-website identification techniques, as well as their need for additional manual labeling or data curation. It optimizes existing website identification technology and improves the accuracy of identifying abnormal websites.
Brief description of the drawings
Fig. 1 is a flowchart of a website identification method provided by Embodiment One of the present invention;
Fig. 2 is a flowchart of a website identification method provided by Embodiment Two of the present invention;
Fig. 3 is a flowchart of a website identification method provided by Embodiment Three of the present invention;
Fig. 4 is a structural diagram of a website identification device provided by Embodiment Four of the present invention.
Detailed description
To make the objects, technical solutions, and advantages of the present invention clearer, specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described here are only used to explain the present invention and do not limit it.
It should also be noted that, for ease of description, the drawings show only the parts related to the present invention rather than the full content. Before the exemplary embodiments are discussed in more detail, it should be mentioned that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart describes the operations (or steps) as sequential, many of the operations can be performed in parallel, concurrently, or simultaneously. In addition, the order of the operations can be rearranged. The process may be terminated when its operations are completed, but it may also have additional steps not included in the drawings. The process may correspond to a method, function, procedure, subroutine, subprogram, and so on.
To make the following easier to understand, the inventive concept of the present invention is first briefly introduced:
Through research, the inventors found that the purpose of a cheating website is to obtain a higher ranking so that the advertising content embedded in the website receives more visits. The advertisement categories of cheating websites are relatively concentrated, mostly gambling, pornography, cosmetic medicine, firearms and equipment, and the like. The cheating behavior of cheating websites follows discernible patterns. To get indexed by search engines and obtain high ranking positions, cheating websites frequently update page content and add currently popular, high-frequency queries to their pages. Because of cost, cheating websites generally copy the same page content. To counter the anti-cheating strategies of search engines, the content, style, and addresses of cheating websites also need to be updated frequently.
From the above analysis: cheating websites update frequently and contain advertising information, yet within a given period this advertising information is not updated frequently. In other words, some important positions in a cheating website exhibit unreasonable redundancy, whereas normal websites, especially high-quality ones, have no need for such redundancy because it would not provide more valuable information.
The concept of entropy was introduced into information theory by its founder, Shannon, as a measure of the amount of information. The amount of information is related to the degree of uncertainty: the higher the entropy, the higher the uncertainty, and the more information is needed to describe the situation clearly.
That is, from the point of view of information theory, if a normal website updates frequently, it provides a large amount of information and its entropy is relatively large; if it updates infrequently, it provides little information and its entropy is small. A cheating website updates often, so its entropy is expected to be large. However, some content domains or objects contain advertising information that is updated slowly, which makes their entropy small. The actual entropy of these content domains therefore differs from the expected entropy. Calculating the entropy of the different content domains of a website, and the degree of difference between them, can help to identify cheating websites effectively.
Based on the above analysis, the inventors creatively propose introducing the concept of information entropy into the identification of abnormal websites: a website is checked for anomalies by calculating the information entropy of one or more of its content domains.
Embodiment One
Fig. 1 is a flowchart of a website identification method provided by Embodiment One of the present invention. The method of this embodiment may be performed by a website identification device, which may be implemented in hardware and/or software and may generally be integrated in a server used to implement abnormal-website identification. The method of this embodiment specifically includes:
110. Within a set time period, obtain at least two historical update pages associated with a website to be verified.
In this embodiment, the website to be verified is a website on which anomaly identification needs to be performed. All websites indexed by a search engine could be treated as websites to be verified. However, abnormal websites (typically, cheating websites) update their page content often in order to obtain higher positions in search engine rankings. Websites with newly generated or updated pages can therefore be chosen as the websites to be verified, which also helps to reduce the amount of computation.
As mentioned above, the core of the present invention is to perform anomaly identification on a website to be verified by analyzing the information entropy of each of its content domains, where information entropy measures the uncertainty of the content appearing in a content domain. It is therefore necessary to obtain, within a set time period (for example, 1 hour, 1 day, or 1 week), at least two historical update pages associated with the website to be verified, and to determine the information entropy of each content domain of the website by analyzing the content updated in these historical update pages.
The at least two historical update pages associated with the website to be verified may include: at least two historical update pages corresponding to the website domain name of the website to be verified; and/or at least two historical update pages corresponding to the same web page address within the website to be verified.
In a specific example, the website domain name of a website to be verified is www.A.com. All historical update pages corresponding to this domain name within the set time period can be obtained as the historical update pages associated with the website to be verified. Further, a website may include several different types of subpages at the same time (for example, a news website may include subpages such as "current affairs", "entertainment", and "sports"). For a finer-grained analysis, all historical update pages corresponding to the same web page address within the website to be verified (for example, www.A.com/B) can also be obtained as the historical update pages associated with the website to be verified.
120. Parse the content of each historical update page to obtain at least one content domain corresponding to each historical update page.
In general, a page includes different types of data content. In this embodiment, these different types of data content are defined as domains, for example: text title, text body, picture title, picture, and picture description text. Page parsing, that is, analyzing the HTML (HyperText Markup Language) file of a page, divides the page into different domains and extracts the text, pictures, and other content contained in those domains.
In view of the computational complexity of the subsequent entropy calculation, in this embodiment the content domains chosen for calculating information entropy may include at least one of the following: a text title domain, a picture domain, a picture title domain, and a picture description text domain.
The text title domain refers to the page positions where one or more text titles are located; the picture domain refers to the page positions where one or more pictures are located; the picture title domain refers to the page positions where the one or more picture titles corresponding to the pictures are located; and the picture description text domain refers to the page positions where the one or more picture description texts corresponding to the pictures are located.
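The parsing step can be illustrated with a minimal sketch. The patent does not specify how domains map to HTML markup, so the mapping below is an assumption made only for illustration: text titles are taken from h1 to h3 elements, pictures from the src attribute of img elements, picture titles from the title attribute, and picture description text from the alt attribute. The names ContentDomainParser and parse_content_domains are hypothetical.

```python
from html.parser import HTMLParser


class ContentDomainParser(HTMLParser):
    """Collects four content domains from one page.

    The tag-to-domain mapping is an illustrative assumption, not part of the patent.
    """

    HEADING_TAGS = {"h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.domains = {
            "text_title": [],
            "picture": [],
            "picture_title": [],
            "picture_description": [],
        }
        self._in_heading = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS:
            # Start buffering the text of a heading element.
            self._in_heading = True
            self._buffer = []
        elif tag == "img":
            attr = dict(attrs)
            if attr.get("src"):
                self.domains["picture"].append(attr["src"])
            if attr.get("title"):
                self.domains["picture_title"].append(attr["title"])
            if attr.get("alt"):
                self.domains["picture_description"].append(attr["alt"])

    def handle_data(self, data):
        if self._in_heading:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADING_TAGS and self._in_heading:
            text = "".join(self._buffer).strip()
            if text:
                self.domains["text_title"].append(text)
            self._in_heading = False


def parse_content_domains(html):
    """Returns a dict mapping each content domain name to the items found in it."""
    parser = ContentDomainParser()
    parser.feed(html)
    parser.close()
    return parser.domains
```

Calling parse_content_domains on each historical update page yields one dictionary per page, so the same domain can then be compared across pages.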
130. Calculate the information entropy of each content domain according to how the content of the same content domain changes across the historical update pages.
By the definition of information entropy, the more frequently the content of a content domain changes, the greater the uncertainty of that content and the larger the information entropy of the domain. Conversely, the more fixed the content of a content domain is, the smaller the uncertainty of that content and the smaller the information entropy of the domain.
The information entropy is calculated as:
$$H(X) = \sum_{i=1}^{n} P(x_i)\, I(x_i) = -\sum_{i=1}^{n} P(x_i) \log_2 P(x_i)$$
where X takes n possible values $x_1, \dots, x_i, \dots, x_n$ with corresponding probabilities $P(x_1), \dots, P(x_i), \dots, P(x_n)$, and $I(x_i) = -\log_2 P(x_i)$ is the self-information of $x_i$.
Typically, the information entropy of each content domain can be calculated from the number of times each distinct piece of content in the domain appears across the historical update pages.
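The frequency-based calculation can be sketched as follows. This is a minimal illustration assuming each historical update page is represented as a list of the items observed in one content domain; the function name domain_entropy and that representation are assumptions, not part of the patent.

```python
import math
from collections import Counter


def domain_entropy(snapshots):
    """Shannon entropy (in bits) of the items seen in one content domain.

    snapshots: one list of items per historical update page (assumed layout).
    """
    # Count how often each distinct item appears across all pages.
    counts = Counter(item for snapshot in snapshots for item in snapshot)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Estimate P(x_i) as relative frequency and apply H = -sum p * log2(p).
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A domain that shows the same advertisement on every page has entropy zero under this estimate, while a domain whose items never repeat reaches the maximum for the number of items observed.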
140. Perform anomaly identification on the website to be verified according to the information entropy calculation result.
In one preferred implementation of this embodiment, the information entropy calculated for each content domain of the website to be verified can be compared with the information entropy of each content domain of a trusted website, and anomaly identification is then performed on the website to be verified;
In another preferred implementation of this embodiment, the information entropies of different content domains of the website to be verified can be compared with one another, and anomaly identification is then performed on the website to be verified;
In yet another preferred implementation of this embodiment, the information entropy calculation result can be used as at least one information entropy feature value, which is combined with other abnormal-website identification feature values to perform anomaly identification on the website to be verified.
In general, the prior art mainly uses a classifier to perform anomaly identification on a website to be verified: one or more abnormal-website identification feature values (typically content features, link features, and so on) are fed to the classifier to identify abnormal websites. In this embodiment, besides using information entropy directly to identify abnormal websites, the information entropy calculated for each content domain of the website to be verified can also be used as one or more information entropy feature values on top of an existing abnormal-website identification technique. These feature values are input to the classifier together with the other abnormal-website identification feature values, so that anomaly identification is performed in combination with the existing technique and the accuracy of identifying abnormal websites is further improved.
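Combining entropy features with other features can be sketched as a simple feature-vector builder. The domain names, their order, the default of 0.0 for a missing domain, and the function name build_feature_vector are all assumptions for illustration; the patent does not prescribe a feature layout or a particular classifier.

```python
# Fixed order of entropy features (hypothetical domain names).
DOMAIN_ORDER = ("text_title", "picture", "picture_title", "picture_description")


def build_feature_vector(entropy_by_domain, other_features):
    """Concatenates per-domain entropy features with existing feature values.

    entropy_by_domain: dict mapping domain name to its information entropy.
    other_features: iterable of existing content/link feature values.
    Missing domains default to 0.0, which is an illustrative choice only.
    """
    entropy_features = [float(entropy_by_domain.get(name, 0.0)) for name in DOMAIN_ORDER]
    return entropy_features + [float(value) for value in other_features]
```

The resulting list can be passed to whichever classifier an existing identification pipeline already uses.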
In this embodiment of the present invention, at least two historical update pages associated with a website to be verified are obtained within a set time period; the content of each historical update page is parsed to obtain at least one content domain corresponding to each historical update page; the information entropy of each content domain is calculated according to how the content of the same content domain changes across the historical update pages; and anomaly identification is performed on the website to be verified according to the information entropy calculation result. Because information entropy features discriminate well, are simple to compute, and are timely, this addresses the low discrimination and poor real-time performance of existing cheating-website identification techniques, as well as their need for additional manual labeling or data curation. It optimizes existing website identification technology and improves the accuracy of identifying abnormal websites.
Embodiment Two
Fig. 2 is a flowchart of a website identification method provided by Embodiment Two of the present invention. This embodiment is an optimization based on the above embodiment. In this embodiment, obtaining at least two historical update pages associated with a website to be verified within a set time period is specifically optimized as: within the set time period, crawling newly generated and/or updated pages on the network with a web crawler; clustering the crawled pages by website domain name and taking the website corresponding to a cluster as the website to be verified; and obtaining, from the pages included in the cluster, at least two historical update pages associated with the website to be verified;
Meanwhile, calculating the information entropy of each content domain according to how the content of the same content domain changes across the historical update pages is specifically optimized as: extracting at least one comparison object from the same target content domain of each historical update page; calculating the occurrence probability of each comparison object according to the number of times it appears in the target content domain of the historical update pages; and calculating the information entropy corresponding to the target content domain according to the occurrence probabilities of the comparison objects.
Accordingly, the method of this embodiment specifically includes:
210. Within a set time period, crawl newly generated and/or updated pages on the network with a web crawler.
In this embodiment, it is considered that abnormal websites, especially cheating websites, are generally websites that update relatively frequently. Therefore, the newly generated and updated pages crawled from the network by the web crawler can be obtained first, and these pages can be merged and clustered by website to determine the corresponding websites to be verified.
220. After clustering the crawled pages by website domain name, take the website corresponding to a cluster as the website to be verified.
230. Obtain, from the pages included in the cluster, at least two historical update pages associated with the website to be verified.
If the at least two historical update pages associated with the website to be verified are specifically at least two historical update pages corresponding to the website domain name of the website to be verified, then obtaining, from the pages included in the cluster, at least two historical update pages associated with the website to be verified may include:
taking all pages included in the cluster directly as the historical update pages associated with the website to be verified;
If the at least two historical update pages associated with the website to be verified are specifically at least two historical update pages corresponding to the same web page address within the website to be verified, then obtaining, from the pages included in the cluster, at least two historical update pages associated with the website to be verified may include:
grouping the pages included in the cluster by URL (Uniform Resource Locator) address, where the pages in the same group correspond to one identical URL address; and taking the pages included in the same group as the historical update pages associated with the website to be verified.
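Steps 220 and 230 can be sketched as follows. This is a minimal illustration under stated assumptions: crawled pages are represented as (url, html) tuples, URLs include a scheme such as http://, and the website domain name is taken to be the lowercased network location of the URL. The function names cluster_by_domain and group_by_url are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def cluster_by_domain(pages):
    """Clusters crawled pages by website domain name.

    pages: iterable of (url, html) tuples (assumed representation).
    Returns a dict mapping each domain name to its list of (url, html) pages.
    """
    clusters = defaultdict(list)
    for url, html in pages:
        domain = urlsplit(url).netloc.lower()
        clusters[domain].append((url, html))
    return dict(clusters)


def group_by_url(cluster):
    """Groups the pages of one cluster by identical URL address.

    Only groups with at least two versions are kept, because the method
    needs at least two historical update pages per address.
    """
    groups = defaultdict(list)
    for url, html in cluster:
        groups[url].append(html)
    return {url: versions for url, versions in groups.items() if len(versions) >= 2}
```

For domain-level analysis the whole cluster is used directly; for address-level analysis each group returned by group_by_url is analyzed separately.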
240. Parse the content of each historical update page to obtain at least one content domain corresponding to each historical update page.
250. Extract at least one comparison object from the same target content domain of each historical update page.
In this embodiment, if the content of the target content domain includes text, the comparison object may include: original text, a semantic signature, or a semantic category. If the content of the target content domain includes pictures, the comparison object may include: the original picture or the picture category.
The original text refers to the text content that appears directly in a content domain. For example, if the text content in the text title domain is "On 2016.6.17, XX company was listed in the United States", this text content is the original text;
The semantic signature is a refinement of the original text: semantic recognition and processing are applied to the original text, its core semantic content is retained, and it is expressed as a combination of several core words. This combination of core words is called the semantic signature. Continuing the previous example, for the original text "On 2016.6.17, XX company was listed in the United States", the corresponding semantic signature is "XX company, United States, listing";
The semantic category refers to the semantic class of the original text content. Continuing the previous example, for the original text "On 2016.6.17, XX company was listed in the United States", the corresponding semantic category is "finance".
It can be understood that original text, semantic signature, and semantic category represent information types of different granularity, from fine to coarse. Accordingly, calculating the information entropy of these three information types yields information measures at different granularities. In practice, those skilled in the art can choose the information type of the appropriate granularity as the comparison object according to the required accuracy of abnormal-website identification.
Similarly, the original picture refers to the picture content that appears directly in a content domain, and the picture category refers to the class a picture belongs to under a given classification system.
Of course, those skilled in the art will understand that other forms of comparison objects can also be obtained from a content domain. In fact, any clearly definable and identifiable data, page column, or page information type can serve as the comparison object, and this embodiment places no limitation on this.
260. Calculate the occurrence probability of each comparison object according to the number of times it appears in the target content domain of the historical update pages.
In an object lesson, within one day, website to be verified updates the page corresponding to three history, and history updates The page 1, history update the page 2 and history updates the page 3, and the object content territory chosen is text header territory, the comparison chosen Object is urtext.
Wherein, update the urtext occurred in the text header territory of the page 1 to include in history: text header 1, text mark Topic 2 and text header 3;The urtext occurred in text header territory in history updates the page 2 includes: text header 1, Text header 3 and text header 4;The urtext occurred in text header territory in history updates the page 3 includes: text Title 3 and text header 5.
Accordingly, occurring in that altogether 8 text headers in above three history updates the page, text header 1 is above-mentioned Three history update in the page and occur altogether 2 times, and then may determine that the probability of occurrence corresponding with text header is 2/8;Text mark Topic 2 occurs 1 time in above three history updates the page altogether, and then may determine that the probability of occurrence corresponding with text header is 1/ 8;Text header 3 occurs 3 times in above three history updates the page altogether, and then may determine that the appearance corresponding with text header Probability is 3/8;Text header 4 occurs 1 time in above three history updates the page altogether, and then may determine that and text header pair The probability of occurrence answered is 1/8;Text header 5 occurs 1 time in above three history updates the page altogether, and then may determine that and literary composition The probability of occurrence of this title 5 correspondence is 1/8.
270. Calculate the information entropy corresponding to the target content domain according to the occurrence probabilities of the comparison objects.
According to the information entropy formula $H = -\sum_i p_i \log_2 p_i$, the information entropy H corresponding to the above target content domain is:
$$H = \tfrac{1}{4}\log_2 4 + \tfrac{1}{8}\log_2 8 + \tfrac{3}{8}\log_2 \tfrac{8}{3} + \tfrac{1}{8}\log_2 8 + \tfrac{1}{8}\log_2 8 \approx 2.156$$
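The worked example in steps 260 and 270 can be reproduced with a short Python sketch. The function name `field_entropy` and the list-of-lists representation of the pages are assumptions made for illustration; the source only specifies counting occurrences across pages and applying the information entropy formula.

```python
import math
from collections import Counter

def field_entropy(pages):
    """Shannon entropy (in bits) of the comparison objects pooled over all pages.

    `pages` is a list with one entry per historical update page; each entry
    lists the comparison objects found in the target content domain.
    """
    counts = Counter(obj for page in pages for obj in page)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Text title domain of the three historical update pages in the example.
title_domain = [
    ["text title 1", "text title 2", "text title 3"],
    ["text title 1", "text title 3", "text title 4"],
    ["text title 3", "text title 5"],
]
h = field_entropy(title_domain)
print(round(h, 3))  # 2.156
```

A domain in which the same object repeats on every page yields an entropy near 0, while a domain whose objects never repeat yields the maximum, log2 of the number of objects.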
280. Perform anomaly identification on the website to be verified according to the information entropy calculation result.
By analyzing the characteristics of various cheating websites, the inventors found the following. If, across multiple historical update pages of the same website, the main picture of the page is repeated heavily (the information entropy of the picture is small) while the picture description text or text title is rarely repeated (the information entropy of the picture description text or text title is large), the website is likely to be a cheating website. In addition, if the information entropy of the picture category differs markedly from the information entropy of the picture title, the website is also likely to be a cheating website.
Accordingly, in a preferred implementation of this embodiment, performing anomaly identification on the website to be verified according to the information entropy calculation result may include:
If the sum of the information entropies of the content domains corresponding to the website to be verified is less than a set first threshold, determining that the website to be verified is an abnormal website; or
If the information entropy of at least one target content domain corresponding to the website to be verified is less than a set second threshold, determining that the website to be verified is an abnormal website; or
If the ratio between the information entropies of at least two target content domains corresponding to the website to be verified is less than a set third threshold, determining that the website to be verified is an abnormal website.
The first threshold, the second threshold and the third threshold can be preset according to the actual situation; this embodiment places no limitation on this.
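The three alternative rules above can be sketched as three small Python predicates. The function names, the dictionary layout and all threshold values are hypothetical; the third rule assumes that the "ratio" is the ratio between the information entropies of two chosen content domains.

```python
def below_total(entropies, first_threshold):
    """Rule 1: the sum of the entropies of all content domains is too small."""
    return sum(entropies.values()) < first_threshold

def below_single(entropies, second_threshold):
    """Rule 2: at least one content domain has an entropy below the threshold."""
    return any(h < second_threshold for h in entropies.values())

def below_ratio(entropies, numerator, denominator, third_threshold):
    """Rule 3: the entropy ratio of two chosen content domains is too small."""
    if entropies[denominator] == 0:
        return False  # assumption: an undefined ratio is not flagged by this rule
    return entropies[numerator] / entropies[denominator] < third_threshold

# Hypothetical site: pictures repeat heavily, titles almost never repeat.
site = {"picture": 0.2, "text_title": 3.0}
flags = (
    below_total(site, first_threshold=1.0),
    below_single(site, second_threshold=0.5),
    below_ratio(site, "picture", "text_title", third_threshold=0.1),
)
print(flags)  # (False, True, True)
```

In this made-up case the total entropy looks normal, but the picture domain alone and the picture-to-title ratio both fall under their thresholds, which matches the cheating pattern described by the inventors.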
The technical solution of this embodiment screens the pages that are newly generated or updated within a certain period, aggregates the pages that come from the same website, and selects the websites to be verified for anomaly identification according to the aggregation result. Compared with performing anomaly identification on every website indexed by the search engine, this can greatly reduce the amount of computation without significantly increasing the rate of missed detections. In addition, because anomaly identification is performed according to the information entropy differences among the content domains within a single website, no reference website needs to be introduced; relying only on the information entropy differences among the different content domains of the website to be verified, abnormal websites can be identified simply and accurately.
On the basis of the above embodiments, before calculating the occurrence probability of the comparison object according to the number of times it occurs in each target content domain, the method may further include:
If the comparison object is determined to be time-sensitive, simply repeated text, obtaining the body content associated with the comparison object in each historical update page; and if, in different historical update pages, the body content corresponding to the same target comparison object differs, marking the target comparison object as different comparison objects.
The reason for this arrangement is that, when information entropy is calculated, time-sensitive texts of identical form require special handling. For example, news titles such as "Weekly News Flash" and "Domestic Briefs" correspond to different body content at different times, so the body content must be taken into account when the information entropy is calculated. That is, if the comparison object "Weekly News Flash" occurs in both historical update page 1 and historical update page 2 and only the number of occurrences of "Weekly News Flash" is counted, the occurrence probability of this comparison object is 1. However, because "Weekly News Flash" is a time-sensitive text, the body content corresponding to it in historical update page 1 and in historical update page 2 should also be compared. If the two differ, the "Weekly News Flash" in historical update page 1 and the "Weekly News Flash" in historical update page 2 can be identified as different comparison objects, and the occurrence probability of each is then 1/2.
With this arrangement, the accuracy of the information entropy calculation can be improved, which in turn improves the accuracy with which abnormal websites are identified.
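One way to realize this special handling is to extend the comparison key of a recurring title with a digest of its body content, as in the minimal sketch below. The set `RECURRING_TITLES`, the function `comparison_key` and the use of SHA-256 are assumptions made for illustration; the source does not say how time-sensitive titles are detected or how body content is compared.

```python
import hashlib
from collections import Counter

# Hypothetical list of titles treated as time-sensitive, simply repeated text.
RECURRING_TITLES = {"Weekly News Flash", "Domestic Briefs"}

def comparison_key(title, body):
    """Return the key under which a title is counted.

    Recurring titles are split by body content, so the same title with a
    different body counts as a different comparison object.
    """
    if title in RECURRING_TITLES:
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        return (title, digest)
    return (title, None)

# The same title appears on two historical update pages with different bodies.
pages = [
    ("Weekly News Flash", "body of week 1"),
    ("Weekly News Flash", "body of week 2"),
]
counts = Counter(comparison_key(title, body) for title, body in pages)
total = sum(counts.values())
probabilities = sorted(c / total for c in counts.values())
print(probabilities)  # [0.5, 0.5]
```

Without the body digest both entries would collapse into one key with probability 1, which is the distortion the paragraph above describes.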
Embodiment three
Fig. 3 is a flowchart of a website identification method provided by Embodiment three of the present invention. This embodiment is optimized on the basis of the above embodiments. In this embodiment, performing anomaly identification on the website to be verified according to the information entropy calculation result is specifically optimized as: obtaining, from a reliable website list, a reference website associated with the website to be verified according to the data features of the website to be verified; obtaining the information entropy of at least one content domain corresponding to the reference website; selecting at least one key content domain in the website to be verified and the reference website; calculating a difference factor between the website to be verified and the reference website according to the information entropies that respectively correspond to the key content domain in the website to be verified and in the reference website; and, if the difference factor meets a set threshold condition, determining that the website to be verified is an abnormal website.
Accordingly, the method of this embodiment specifically includes:
310. Within a set time period, obtain at least two historical update pages associated with the website to be verified.
320. Perform content parsing on each historical update page to obtain at least one content domain corresponding to each historical update page.
330. Calculate the information entropy of each content domain according to the content changes within the same content domain across the historical update pages.
340. According to the data features of the website to be verified, obtain from the reliable website list a reference website associated with the website to be verified.
In this embodiment, the data features of the website to be verified may include at least one of the following: the website update frequency within a set time period, the number of newly added pages within a set time period, and the content topic.
The reliable website list specifically refers to a batch of reliable websites determined by methods such as mining user behavior logs or manual curation.
This embodiment takes into account that reliable websites with similar update frequencies, similar numbers of newly added pages within a set time period, or similar content topics will also show a certain similarity in the information entropy of each content domain of their web pages. Therefore, by obtaining from the reliable website list a reference website whose data features are similar to those of the website to be verified, and comparing the information entropy of each domain in the reference website with that in the website to be verified, abnormal websites can be identified.
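The source does not specify how similarity of data features is measured. Purely as an illustration, the sketch below keeps only reliable websites with the same content topic and picks the one whose update frequency and new-page count are closest in relative terms. The function `pick_reference`, the field names and the similarity measure are all assumptions.

```python
def pick_reference(site, reliable_sites):
    """Return the reliable site most similar to `site`, or None if no topic matches.

    Assumed similarity: same topic, then the smallest sum of relative gaps in
    update frequency and number of newly added pages.
    """
    def gap(a, b):
        return abs(a - b) / max(a, b, 1)

    candidates = [r for r in reliable_sites if r["topic"] == site["topic"]]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda r: gap(r["update_freq"], site["update_freq"])
        + gap(r["new_pages"], site["new_pages"]),
    )

# Hypothetical data features collected over the set time period.
site = {"name": "unknown.example", "topic": "news", "update_freq": 10, "new_pages": 100}
reliable_sites = [
    {"name": "a.example", "topic": "news", "update_freq": 12, "new_pages": 90},
    {"name": "b.example", "topic": "news", "update_freq": 100, "new_pages": 5000},
    {"name": "c.example", "topic": "sports", "update_freq": 10, "new_pages": 100},
]
reference = pick_reference(site, reliable_sites)
print(reference["name"])  # a.example
```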
350. Obtain the information entropy of at least one content domain corresponding to the reference website.
360. Select at least one key content domain in the website to be verified and the reference website.
The key content domains may be all of the content domains included in both the website to be verified and the reference website, or one or more important content domains included in both (for example, the picture domain and the text title domain); this embodiment places no limitation on this.
370. Calculate the difference factor between the website to be verified and the reference website according to the information entropies that respectively correspond to the key content domain in the website to be verified and in the reference website.
In a preferred implementation of this embodiment, calculating the difference factor between the website to be verified and the reference website according to the information entropies that respectively correspond to the key content domain in the website to be verified and in the reference website may specifically include:
Obtaining, for the website to be verified and the reference website, the information entropy difference corresponding to the same key content domain as the difference factor.
For example, in the website to be verified, the information entropy corresponding to key content domain 1 is A and the information entropy corresponding to key content domain 2 is B; in the reference website, the information entropy corresponding to key content domain 1 is C and the information entropy corresponding to key content domain 2 is D.
Then |A-C| and |B-D| can be used as the difference factors, where | | denotes the absolute value.
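The per-domain difference |A-C|, |B-D| can be written as a one-line Python helper. The function name `entropy_differences` and the numeric values standing in for A, B, C and D are hypothetical.

```python
def entropy_differences(site, reference, key_domains):
    """Absolute entropy difference per key content domain."""
    return {d: abs(site[d] - reference[d]) for d in key_domains}

site = {"domain_1": 0.5, "domain_2": 2.5}       # A, B (made-up values)
reference = {"domain_1": 2.0, "domain_2": 2.0}  # C, D (made-up values)
diffs = entropy_differences(site, reference, ["domain_1", "domain_2"])
print(diffs)  # {'domain_1': 1.5, 'domain_2': 0.5}
```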
In another preferred implementation of this embodiment, calculating the difference factor between the website to be verified and the reference website according to the information entropies that respectively correspond to the key content domain in the website to be verified and in the reference website may specifically include:
Forming a first information vector from the information entropies that respectively correspond to at least two key content domains in the website to be verified;
Forming a second information vector from the information entropies that respectively correspond to the at least two key content domains in the reference website;
Calculating the distance value between the first information vector and the second information vector as the difference factor.
Continuing the previous example, in the website to be verified, the information entropy corresponding to key content domain 1 is A and the information entropy corresponding to key content domain 2 is B; in the reference website, the information entropy corresponding to key content domain 1 is C and the information entropy corresponding to key content domain 2 is D.
Then the first information vector corresponding to the website to be verified is [A, B], and the second information vector corresponding to the reference website is [C, D].
The distance value between the two vectors can be calculated in various ways, typically by calculating the cosine of the angle between them, and the calculated distance value is used as the difference factor.
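A minimal sketch of the vector-based difference factor follows. The source names only the cosine of the angle between the two vectors; defining the distance as one minus the cosine similarity, so that a larger value means a larger difference, is an assumption, as are the function name and the sample vectors.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity of two equal-length vectors (assumed distance form)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)

first_vector = [0.5, 2.5]   # [A, B] for the website to be verified (made-up values)
second_vector = [2.0, 2.0]  # [C, D] for the reference website (made-up values)
difference_factor = cosine_distance(first_vector, second_vector)
```

Because information entropies are non-negative, this distance lies between 0 (same direction) and 1 (orthogonal). Note that it ignores vector length, so a site whose entropies are all scaled by the same factor relative to the reference would not be flagged by this measure alone.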
380. Judge whether the difference factor meets the set threshold condition; if so, perform 390; otherwise, perform 3100.
If the difference factor is the information entropy difference, determining that the website to be verified is an abnormal website if the difference factor meets the set threshold condition may specifically include:
If a set number of information entropy differences exceed a set threshold, and/or the information entropy difference corresponding to a set key content domain exceeds a set threshold, determining that the website to be verified is an abnormal website; or
If the accumulated difference obtained by weighted summation of at least two information entropy differences exceeds a set threshold, determining that the website to be verified is an abnormal website.
If the difference factor is the distance value, determining that the website to be verified is an abnormal website if the difference factor meets the set threshold condition may specifically include:
If the distance value exceeds a set threshold, determining that the website to be verified is an abnormal website.
390. Determine that the website to be verified is an abnormal website.
3100. Determine that the website to be verified is a normal website.
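The threshold conditions checked in step 380 for information entropy differences can be sketched as follows. The function names, the weights and every threshold value are hypothetical; the source leaves all of them to be set in practice.

```python
def abnormal_by_count(diffs, threshold, min_count):
    """True if at least `min_count` entropy differences exceed `threshold`."""
    return sum(d > threshold for d in diffs.values()) >= min_count

def abnormal_by_key_domain(diffs, key_domain, threshold):
    """True if the difference of one designated key content domain exceeds `threshold`."""
    return diffs[key_domain] > threshold

def abnormal_by_weighted_sum(diffs, weights, threshold):
    """True if the weighted sum of the entropy differences exceeds `threshold`."""
    return sum(weights[d] * value for d, value in diffs.items()) > threshold

# Made-up entropy differences between a website to be verified and its reference.
diffs = {"picture": 1.5, "text_title": 0.5}
by_count = abnormal_by_count(diffs, threshold=1.0, min_count=1)
by_key = abnormal_by_key_domain(diffs, "text_title", threshold=1.0)
by_sum = abnormal_by_weighted_sum(diffs, {"picture": 0.5, "text_title": 0.5}, threshold=0.8)
print(by_count, by_key, by_sum)  # True False True
```

When the difference factor is a single distance value, the check reduces to one comparison of that value against a set threshold.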
In the technical solution of this embodiment, after the information entropy of each content domain in the website to be verified is calculated, the information entropy of each content domain in a reliable website whose data features are similar to those of the website to be verified is obtained; the difference factor between the two is calculated from their information entropies, and anomaly identification is then performed on the website to be verified. Abnormal websites can thus be identified simply and quickly from the information entropy difference between abnormal websites and reliable websites, with high identification accuracy and good real-time performance.
Embodiment four
Fig. 4 is a structural diagram of a website identification device provided by Embodiment four of the present invention. As shown in Fig. 4, the device includes a historical update page acquisition module 41, a content domain acquisition module 42, a content domain information entropy calculation module 43 and an anomaly identification module 44, wherein:
The historical update page acquisition module 41 is configured to obtain, within a set time period, at least two historical update pages associated with the website to be verified.
The content domain acquisition module 42 is configured to perform content parsing on each historical update page to obtain at least one content domain corresponding to each historical update page.
The content domain information entropy calculation module 43 is configured to calculate the information entropy of each content domain according to the content changes within the same content domain across the historical update pages.
The anomaly identification module 44 is configured to perform anomaly identification on the website to be verified according to the information entropy calculation result.
In the embodiment of the present invention, at least two historical update pages associated with the website to be verified are obtained within a set time period; content parsing is performed on each historical update page to obtain at least one content domain corresponding to each historical update page; the information entropy of each content domain is calculated according to the content changes within the same content domain across the historical update pages; and anomaly identification is performed on the website to be verified according to the information entropy calculation result. Because information entropy features discriminate well, are simple to calculate and are timely, this solves the technical problems that existing cheating website identification techniques have low discrimination and poor real-time performance and require additional manual labeling or data compilation work. It optimizes existing website identification techniques and improves the accuracy with which abnormal websites are identified.
On the basis of the above embodiments, the at least two historical update pages associated with the website to be verified may include:
At least two historical update pages corresponding to the website domain name of the website to be verified; and/or
At least two historical update pages corresponding to the same web page address in the website to be verified.
On the basis of the above embodiments, the historical update page acquisition module may specifically be configured to:
Crawl, by a web crawler within a set time period, the pages in the network that are newly generated and/or updated;
After clustering the crawled pages by website domain name, take the website corresponding to each cluster as the website to be verified;
Obtain, according to the pages included in the cluster, at least two historical update pages associated with the website to be verified.
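The clustering step can be sketched with the Python standard library, as below. Treating the URL host name as the "website domain name" and keeping only clusters with at least two pages are assumptions made for illustration; the crawling itself is outside the sketch, and the example URLs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse

def cluster_by_domain(urls):
    """Group crawled page URLs by host name.

    Only clusters with at least two pages are kept, since the method needs
    at least two historical update pages per website to be verified.
    """
    clusters = defaultdict(list)
    for url in urls:
        host = urlparse(url).hostname  # lower-cased host, or None if absent
        if host:
            clusters[host].append(url)
    return {host: pages for host, pages in clusters.items() if len(pages) >= 2}

# Hypothetical URLs of pages crawled within the set time period.
crawled = [
    "http://a.example/news/1",
    "http://A.example/news/2",
    "http://b.example/index",
]
sites_to_verify = cluster_by_domain(crawled)
print(sorted(sites_to_verify))  # ['a.example']
```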
On the basis of the above embodiments, the content domain may include at least one of the following:
A text title domain, a picture domain, a picture title domain, and a picture description text domain.
On the basis of the above embodiments, the content domain information entropy calculation module may specifically be configured to:
Extract at least one comparison object from the same target content domain of each historical update page;
Calculate the occurrence probability of the comparison object according to the number of times it occurs in the target content domain of each historical update page;
Calculate the information entropy corresponding to the target content domain according to the occurrence probability of the comparison object.
On the basis of the above embodiments, if the content of the target content domain includes text, the comparison object may include: original text, a semantic signature or a semantic category;
If the content of the target content domain includes a picture, the comparison object may include: an original picture or a picture category.
On the basis of the above embodiments, the device may further include a body content association comparison module, configured to:
Before the occurrence probability of the comparison object is calculated according to the number of times it occurs in each target content domain, if the comparison object is determined to be time-sensitive, simply repeated text, obtain the body content associated with the comparison object in each historical update page;
If, in different historical update pages, the body content corresponding to the same target comparison object differs, mark the target comparison object as different comparison objects.
On the basis of the above embodiments, the anomaly identification module may specifically include:
A reference website acquisition unit, configured to obtain, from a reliable website list, a reference website associated with the website to be verified according to the data features of the website to be verified;
A reference website information entropy acquisition unit, configured to obtain the information entropy of at least one content domain corresponding to the reference website;
A key content domain selection unit, configured to select at least one key content domain in the website to be verified and the reference website;
A difference factor calculation unit, configured to calculate the difference factor between the website to be verified and the reference website according to the information entropies that respectively correspond to the key content domain in the website to be verified and in the reference website;
An abnormal website identification subunit, configured to determine that the website to be verified is an abnormal website if the difference factor meets a set threshold condition.
On the basis of the above embodiments, the difference factor calculation unit may specifically be configured to:
Obtain, for the website to be verified and the reference website, the information entropy difference corresponding to the same key content domain as the difference factor;
The abnormal website identification subunit may specifically be configured to:
Determine that the website to be verified is an abnormal website if a set number of information entropy differences exceed a set threshold, and/or the information entropy difference corresponding to a set key content domain exceeds a set threshold; or
Determine that the website to be verified is an abnormal website if the accumulated difference obtained by weighted summation of at least two information entropy differences exceeds a set threshold.
On the basis of the above embodiments, the difference factor calculation unit may specifically be configured to:
Form a first information vector from the information entropies that respectively correspond to at least two key content domains in the website to be verified;
Form a second information vector from the information entropies that respectively correspond to the at least two key content domains in the reference website;
Calculate the distance value between the first information vector and the second information vector as the difference factor;
The abnormal website identification subunit may specifically be configured to:
Determine that the website to be verified is an abnormal website if the distance value exceeds a set threshold.
On the basis of the above embodiments, the data features of the website to be verified may include at least one of the following:
The website update frequency within a set time period, the number of newly added pages within a set time period, and the content topic.
On the basis of the above embodiments, the anomaly identification module may specifically be configured to:
Determine that the website to be verified is an abnormal website if the sum of the information entropies of the content domains corresponding to the website to be verified is less than a set first threshold; or
Determine that the website to be verified is an abnormal website if the information entropy of at least one target content domain corresponding to the website to be verified is less than a set second threshold; or
Determine that the website to be verified is an abnormal website if the ratio between the information entropies of at least two target content domains corresponding to the website to be verified is less than a set third threshold.
On the basis of the above embodiments, the anomaly identification module may specifically be configured to:
Take the information entropy calculation result as at least one information entropy feature value, and combine the information entropy feature value with other abnormal website identification feature values to perform anomaly identification on the website to be verified.
The website identification device provided by the embodiment of the present invention can be used to perform the website identification method provided by any embodiment of the present invention; it has the corresponding functional modules and achieves the same beneficial effects.
Obviously, those skilled in the art should understand that each module or step of the present invention described above can be implemented by the server described above. Alternatively, the embodiments of the present invention can be implemented with a program executable by a computer device, so that the program can be stored in a storage device and executed by a processor; the program can be stored in a computer-readable storage medium, which may be a read-only memory, a magnetic disk, an optical disc or the like. Alternatively, the modules or steps can each be fabricated as a separate integrated circuit module, or multiple modules or steps among them can be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The foregoing is only the preferred embodiments of the present invention and is not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and changes. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (20)

1. A website identification method, characterized by comprising:
Obtaining, within a set time period, at least two historical update pages associated with a website to be verified;
Performing content parsing on each historical update page to obtain at least one content domain corresponding to each historical update page;
Calculating the information entropy of each content domain according to the content changes within the same content domain across the historical update pages;
Performing anomaly identification on the website to be verified according to the information entropy calculation result.
2. The method according to claim 1, characterized in that the at least two historical update pages associated with the website to be verified include:
At least two historical update pages corresponding to the website domain name of the website to be verified; and/or
At least two historical update pages corresponding to the same web page address in the website to be verified.
3. The method according to claim 1 or 2, characterized in that obtaining, within a set time period, at least two historical update pages associated with the website to be verified includes:
Crawling, by a web crawler within the set time period, the pages in the network that are newly generated and/or updated;
After clustering the crawled pages by website domain name, taking the website corresponding to each cluster as the website to be verified;
Obtaining, according to the pages included in the cluster, at least two historical update pages associated with the website to be verified.
4. The method according to claim 1, characterized in that the content domain includes at least one of the following:
A text title domain, a picture domain, a picture title domain, and a picture description text domain.
5. The method according to claim 1, characterized in that calculating the information entropy of each content domain according to the content changes within the same content domain across the historical update pages includes:
Extracting at least one comparison object from the same target content domain of each historical update page;
Calculating the occurrence probability of the comparison object according to the number of times it occurs in the target content domain of each historical update page;
Calculating the information entropy corresponding to the target content domain according to the occurrence probability of the comparison object.
6. The method according to claim 5, characterized in that:
If the content of the target content domain includes text, the comparison object includes: original text, a semantic signature or a semantic category;
If the content of the target content domain includes a picture, the comparison object includes: an original picture or a picture category.
7. The method according to claim 5 or 6, characterized in that, before calculating the occurrence probability of the comparison object according to the number of times it occurs in each target content domain, the method further includes:
If the comparison object is determined to be time-sensitive, simply repeated text, obtaining the body content associated with the comparison object in each historical update page;
If, in different historical update pages, the body content corresponding to the same target comparison object differs, marking the target comparison object as different comparison objects.
Method the most according to claim 1, it is characterised in that according to comentropy result of calculation, to described website to be verified Carry out anomalous identification to include:
According to the data characteristics of described website to be verified, reliable website list obtains the ginseng associated with described website to be verified Examine website;
Obtain the comentropy of at least one content domain corresponding with described reference site;
In described website to be verified and described reference site, choose at least one key content territory;
According in described website to be verified and described reference site, distinguish the most corresponding comentropy, meter with described key content territory Calculate the diversity factor factor between described website to be verified and described reference site;
If the described diversity factor factor meets sets threshold condition, it is determined that described website to be verified is abnormal website.
Method the most according to claim 8, it is characterised in that according to described website to be verified and described reference site In, the comentropy the most corresponding with described key content territory, calculate the difference between described website to be verified and described reference site The different degree factor specifically includes:
In described website to be verified and described reference site, obtain the comentropy difference corresponding with same key content territory and make For the described diversity factor factor;
If the described diversity factor factor meets sets threshold condition, it is determined that described website to be verified is that abnormal website is concrete Including:
Comentropy if the comentropy difference setting quantity exceedes setting threshold value and/or corresponding with setting key content territory is poor Value exceedes setting threshold value, it is determined that described website to be verified is abnormal website;Or
If being weighted at least two comentropy difference suing for peace, the difference accumulated value obtained exceedes setting threshold value, it is determined that institute State website to be verified for abnormal website.
10. The method according to claim 8, characterized in that calculating the difference factor between the website to be verified and the reference site, according to the information entropies respectively corresponding to the key content domain in the website to be verified and in the reference site, specifically includes:
in the website to be verified, forming a first information vector from the information entropies respectively corresponding to at least two key content domains;
in the reference site, forming a second information vector from the information entropies respectively corresponding to the at least two key content domains;
calculating the distance value between the first information vector and the second information vector as the difference factor;
and determining that the website to be verified is an abnormal website if the difference factor meets a set threshold condition specifically includes:
if the distance value exceeds a set threshold, determining that the website to be verified is an abnormal website.
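A short Python sketch of the vector comparison described above follows. The claim does not name a distance metric, so Euclidean distance is an assumption here, as are the function names and the dictionary layout of the entropies.

```python
import math
from typing import Dict, List, Sequence


def entropy_vector(entropies: Dict[str, float],
                   domains: Sequence[str]) -> List[float]:
    """Order the per-domain entropies into a vector using a fixed domain order."""
    return [entropies[domain] for domain in domains]


def euclidean_distance(first: Sequence[float], second: Sequence[float]) -> float:
    """Euclidean distance (an assumed metric; the claim only says "distance")."""
    if len(first) != len(second):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(first, second)))


def abnormal_by_distance(site: Dict[str, float], reference: Dict[str, float],
                         domains: Sequence[str], threshold: float) -> bool:
    """Flag the site when its entropy vector is too far from the reference's."""
    distance = euclidean_distance(entropy_vector(site, domains),
                                  entropy_vector(reference, domains))
    return distance > threshold
```

Both vectors must be built over the same key content domains in the same order, otherwise the components being compared would not correspond.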
11. The method according to any one of claims 8-10, characterized in that the data features of the website to be verified include at least one of the following:
the network update frequency within a set time period, the number of newly added pages within a set time period, and the content topic.
12. The method according to claim 1, characterized in that performing anomaly identification on the website to be verified according to the information entropy calculation result includes:
if the sum of the information entropies of the content domains corresponding to the website to be verified is less than a set first threshold, determining that the website to be verified is an abnormal website; or
if the information entropy of at least one target content domain corresponding to the website to be verified is less than a set second threshold, determining that the website to be verified is an abnormal website; or
if the ratio of the information entropies of at least two target content domains corresponding to the website to be verified is less than a set third threshold, determining that the website to be verified is an abnormal website.
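The three alternative tests in the claim above can be sketched in Python as follows. The function names, the dictionary layout, and the handling of a zero denominator are assumptions for illustration; the claim does not specify them.

```python
from typing import Dict, Iterable


def total_entropy_too_low(entropies: Dict[str, float],
                          first_threshold: float) -> bool:
    """Test 1: the sum of all content-domain entropies is below the threshold."""
    return sum(entropies.values()) < first_threshold


def any_target_entropy_too_low(entropies: Dict[str, float],
                               targets: Iterable[str],
                               second_threshold: float) -> bool:
    """Test 2: at least one target content domain has entropy below the threshold."""
    return any(entropies[domain] < second_threshold for domain in targets)


def entropy_ratio_too_low(entropies: Dict[str, float], numerator: str,
                          denominator: str, third_threshold: float) -> bool:
    """Test 3: the entropy ratio of two target domains is below the threshold.

    Assumption: when the denominator entropy is zero the ratio is undefined
    and this test simply does not fire.
    """
    if entropies[denominator] == 0:
        return False
    return entropies[numerator] / entropies[denominator] < third_threshold
```

A low entropy means a content domain barely changes across the historical update pages, which is why each test fires when a value falls below its threshold.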
13. The method according to claim 1, characterized in that performing anomaly identification on the website to be verified according to the information entropy calculation result includes:
using the information entropy calculation result as at least one information entropy feature value, and combining the information entropy feature value with other abnormal-website identification feature values to perform anomaly identification on the website to be verified.
14. A website identification device, characterized by comprising:
a historical update page acquisition module, configured to obtain, within a set time period, at least two historical update pages associated with a website to be verified;
a content domain acquisition module, configured to perform content parsing on each historical update page and obtain at least one content domain corresponding to each historical update page;
a content domain information entropy calculation module, configured to calculate the information entropy of each content domain according to the content changes within the same content domain across the historical update pages;
an anomaly identification module, configured to perform anomaly identification on the website to be verified according to the information entropy calculation result.
15. The device according to claim 14, characterized in that the historical update page acquisition module is specifically configured to:
crawl, by a web crawler within the set time period, pages in the network that are newly generated and/or have been updated;
after clustering the crawled pages by website domain name, take the website corresponding to a cluster as the website to be verified;
obtain, from the pages included in the cluster, at least two historical update pages associated with the website to be verified.
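The grouping step above can be sketched in Python. This is a simplified illustration, not the patented clustering: it assumes the crawled pages are available as URLs, uses the URL hostname as the "website domain name", and does not try to reduce hostnames to registered domains.

```python
from collections import defaultdict
from typing import Dict, Iterable, List
from urllib.parse import urlsplit


def cluster_by_domain(urls: Iterable[str]) -> Dict[str, List[str]]:
    """Group crawled page URLs by hostname (assumed stand-in for the domain name)."""
    clusters: Dict[str, List[str]] = defaultdict(list)
    for url in urls:
        host = urlsplit(url).hostname  # None when the URL has no network location
        if host:
            clusters[host].append(url)
    return dict(clusters)


def sites_to_verify(clusters: Dict[str, List[str]],
                    min_pages: int = 2) -> Dict[str, List[str]]:
    """Keep only clusters with enough pages to compare content across updates."""
    return {host: pages for host, pages in clusters.items()
            if len(pages) >= min_pages}
```

The `min_pages=2` default mirrors the claim's requirement of at least two historical update pages per website to be verified.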
16. The device according to claim 14, characterized in that the content domain information entropy calculation module is specifically configured to:
extract at least one comparison object from the same target content domain of each historical update page;
calculate the occurrence probability of the comparison object according to the number of times the comparison object occurs in the target content domain of each historical update page;
calculate the information entropy corresponding to the target content domain according to the occurrence probability of the comparison object.
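The counting and entropy steps above correspond to the standard Shannon entropy, H = sum over objects of p * log2(1/p), where p is the occurrence probability of each comparison object. A minimal Python sketch follows; the function name, the base-2 logarithm, and the representation of comparison objects as hashable values (for example title strings) are assumptions.

```python
import math
from collections import Counter
from typing import Hashable, Iterable


def content_domain_entropy(observations: Iterable[Hashable]) -> float:
    """Shannon entropy (in bits) of the comparison objects seen in one content domain.

    `observations` holds one comparison object per historical update page,
    all taken from the same target content domain.
    """
    counts = Counter(observations)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # p = count / total, and log2(1 / p) = log2(total / count)
    return sum((count / total) * math.log2(total / count)
               for count in counts.values())
```

A domain whose content is identical on every page has entropy 0, while a domain that differs on every page reaches the maximum of log2(n) for n pages.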
17. The device according to claim 16, characterized by further comprising a body content association comparison module, configured to:
before the occurrence probability of the comparison object is calculated according to the number of times the comparison object occurs in each target content domain, if the comparison object is determined to be time-sensitive simple repeated text, obtain, in each historical update page, the body content associated with the comparison object;
if, in different historical update pages, the body content corresponding to the same target comparison object differs, mark the target comparison object as different comparison objects.
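One way to picture the marking step above is to relabel a repeated comparison object with a fingerprint of its associated body content, so that identical text with different bodies is counted as different objects. The Python sketch below is only an illustration: the function name, the caller-supplied `is_simple_repeat` predicate, and the SHA-256 fingerprint are assumptions, not details from the claim.

```python
import hashlib
from typing import Callable, Iterable, List, Tuple


def disambiguate(pairs: Iterable[Tuple[str, str]],
                 is_simple_repeat: Callable[[str], bool]) -> List[str]:
    """Relabel comparison objects so that repeated text with different bodies differs.

    `pairs` holds (comparison_object_text, body_content) for each historical
    update page. `is_simple_repeat` is a hypothetical predicate deciding
    whether a text counts as time-sensitive simple repeated text.
    """
    labels = []
    for text, body in pairs:
        if is_simple_repeat(text):
            # Same text + same body -> same label; different body -> new label.
            digest = hashlib.sha256(body.encode("utf-8")).hexdigest()[:12]
            labels.append(f"{text}#{digest}")
        else:
            labels.append(text)
    return labels
```

The resulting labels can then be counted in place of the raw comparison objects when occurrence probabilities are computed.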
18. The device according to claim 14, characterized in that the anomaly identification module specifically includes:
a reference site acquisition unit, configured to obtain, from a list of reliable websites and according to the data features of the website to be verified, a reference site associated with the website to be verified;
a reference site information entropy acquisition unit, configured to obtain the information entropy of at least one content domain corresponding to the reference site;
a key content domain selection unit, configured to select at least one key content domain in the website to be verified and the reference site;
a difference factor calculation unit, configured to calculate the difference factor between the website to be verified and the reference site according to the information entropies respectively corresponding to the key content domain in the website to be verified and in the reference site;
an abnormal website identification subunit, configured to determine that the website to be verified is an abnormal website if the difference factor meets a set threshold condition.
19. The device according to claim 14, characterized in that the anomaly identification module is specifically configured to:
if the sum of the information entropies of the content domains corresponding to the website to be verified is less than a set first threshold, determine that the website to be verified is an abnormal website; or
if the information entropy of at least one target content domain corresponding to the website to be verified is less than a set second threshold, determine that the website to be verified is an abnormal website; or
if the ratio of the information entropies of at least two target content domains corresponding to the website to be verified is less than a set third threshold, determine that the website to be verified is an abnormal website.
20. The device according to claim 14, characterized in that the anomaly identification module is specifically configured to:
use the information entropy calculation result as at least one information entropy feature value, and combine the information entropy feature value with other abnormal-website identification feature values to perform anomaly identification on the website to be verified.
CN201610571258.6A 2016-07-19 2016-07-19 The recognition methods of website and device Active CN106294535B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610571258.6A CN106294535B (en) 2016-07-19 2016-07-19 The recognition methods of website and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610571258.6A CN106294535B (en) 2016-07-19 2016-07-19 The recognition methods of website and device

Publications (2)

Publication Number Publication Date
CN106294535A true CN106294535A (en) 2017-01-04
CN106294535B CN106294535B (en) 2019-06-25

Family

ID=57651792

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610571258.6A Active CN106294535B (en) 2016-07-19 2016-07-19 The recognition methods of website and device

Country Status (1)

Country Link
CN (1) CN106294535B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107451180A (en) * 2017-06-13 2017-12-08 百度在线网络技术(北京)有限公司 Identify method, apparatus, equipment and the computer-readable storage medium of website affinity
CN108280110A (en) * 2017-05-15 2018-07-13 广州市动景计算机科技有限公司 Website contrast difference's method, apparatus and client
CN109150817A (en) * 2017-11-24 2019-01-04 新华三信息安全技术有限公司 A kind of web-page requests recognition methods and device
CN109800378A (en) * 2019-01-23 2019-05-24 北京字节跳动网络技术有限公司 Content processing method, device and electronic equipment based on custom browser
CN109818828A (en) * 2019-02-20 2019-05-28 成都嗨翻屋科技有限公司 A kind of distributed reptile system monitoring method and device
CN110716778A (en) * 2019-09-10 2020-01-21 阿里巴巴集团控股有限公司 Application compatibility testing method, device and system
CN111460763A (en) * 2020-03-02 2020-07-28 南京南瑞继保电气有限公司 Method, device and equipment for marking file differences and computer-readable storage medium
CN113554131A (en) * 2021-09-22 2021-10-26 四川大学华西医院 Medical image processing and analysis method, computer equipment, system and storage medium
CN113841156A (en) * 2019-05-27 2021-12-24 西门子股份公司 Control method and device based on image recognition

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101281521A (en) * 2007-04-05 2008-10-08 中国科学院自动化研究所 A sensitive web page filtering method and system based on multi-classifier fusion
CN101751458A (en) * 2009-12-31 2010-06-23 暨南大学 Network public sentiment monitoring system and method
CN105205061A (en) * 2014-06-12 2015-12-30 中国银联股份有限公司 Method for acquiring page information of E-commerce website

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108280110A (en) * 2017-05-15 2018-07-13 广州市动景计算机科技有限公司 Website contrast difference's method, apparatus and client
CN107451180A (en) * 2017-06-13 2017-12-08 百度在线网络技术(北京)有限公司 Identify method, apparatus, equipment and the computer-readable storage medium of website affinity
CN109150817A (en) * 2017-11-24 2019-01-04 新华三信息安全技术有限公司 A kind of web-page requests recognition methods and device
CN109150817B (en) * 2017-11-24 2020-11-27 新华三信息安全技术有限公司 Webpage request identification method and device
CN109800378A (en) * 2019-01-23 2019-05-24 北京字节跳动网络技术有限公司 Content processing method, device and electronic equipment based on custom browser
CN109818828A (en) * 2019-02-20 2019-05-28 成都嗨翻屋科技有限公司 A kind of distributed reptile system monitoring method and device
CN113841156A (en) * 2019-05-27 2021-12-24 西门子股份公司 Control method and device based on image recognition
CN110716778A (en) * 2019-09-10 2020-01-21 阿里巴巴集团控股有限公司 Application compatibility testing method, device and system
CN110716778B (en) * 2019-09-10 2023-09-26 创新先进技术有限公司 Application compatibility testing method, device and system
CN111460763A (en) * 2020-03-02 2020-07-28 南京南瑞继保电气有限公司 Method, device and equipment for marking file differences and computer-readable storage medium
CN113554131A (en) * 2021-09-22 2021-10-26 四川大学华西医院 Medical image processing and analysis method, computer equipment, system and storage medium

Also Published As

Publication number Publication date
CN106294535B (en) 2019-06-25

Similar Documents

Publication Publication Date Title
CN106294535A (en) The recognition methods of website and device
US7685201B2 (en) Person disambiguation using name entity extraction-based clustering
Sun et al. Dom based content extraction via text density
WO2021159632A1 (en) Intelligent questioning and answering method and apparatus, computer device, and computer storage medium
US9449271B2 (en) Classifying resources using a deep network
US8346701B2 (en) Answer ranking in community question-answering sites
CN111813905B (en) Corpus generation method, corpus generation device, computer equipment and storage medium
CN108763321B (en) Related entity recommendation method based on large-scale related entity network
CN113822067A (en) Key information extraction method and device, computer equipment and storage medium
US20200004792A1 (en) Automated website data collection method
AU2005203239A1 (en) Phrase-based indexing in an information retrieval system
US20100211533A1 (en) Extracting structured data from web forums
Reinanda et al. Document filtering for long-tail entities
CN113722478B (en) Multi-dimensional feature fusion similar event calculation method and system and electronic equipment
CN104715063A (en) Search ranking method and search ranking device
CN106599215A (en) Question generation method and question generation system based on deep learning
CN104881428A (en) Information graph extracting and retrieving method and device for information graph webpages
CN107391638A (en) The new ideas of rule-associated model find method and device
Kang et al. Learning to re-rank web search results with multiple pairwise features
Saha et al. A large scale study of SVM based methods for abstract screening in systematic reviews
CN106293114B (en) Predict the method and device of user's word to be entered
Zhe et al. An adaptive topic tracking approach based on single-pass clustering with sliding time window
CN116561402A (en) Method, device and server for acquiring target content information in webpage
JP2010282403A (en) Document retrieval method
Mukherjee et al. Browsing fatigue in handhelds: semantic bookmarking spells relief

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant