CN105653550B - Webpage filtering method and device - Google Patents
- Publication number
- CN105653550B CN201410648193.1A
- Authority
- CN
- China
- Prior art keywords
- node
- webpage
- nodes
- filtering
- webpages
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a webpage filtering method and device, belonging to the field of Internet technology. The method includes: acquiring a webpage set to be analyzed, where the webpage set includes a plurality of webpages and each webpage includes a plurality of nodes; for each node in each webpage, calculating a likelihood feature value of the node, where the likelihood feature value indicates how likely the node is to be a node of a specified type; determining each node whose likelihood feature value is greater than a specified threshold as a node of the specified type; and filtering a webpage to be displayed based on the determined nodes of the specified type. By calculating the likelihood feature value of each node in each webpage of the webpage set and treating nodes whose value exceeds the specified threshold as nodes of the specified type, the invention can filter the webpage to be displayed directly from the determined nodes, without a manually configured filtering template. The operation is simple and fast, and saves time and labor costs.
Description
Technical Field
The invention relates to the technical field of internet, in particular to a webpage filtering method and device.
Background
With the popularization of the internet, many manufacturers publish advertisements in web pages to promote their products. As a result, web pages contain all kinds of advertisements, which seriously interferes with normal browsing.
In order to filter out advertisements in web pages, website operators can manually configure a filtering template according to the advertisements in each web page and upload the filtering template to a website server, and the website server filters the web pages according to the filtering template. The filtering template can be a blacklist or a whitelist. When the filtering template is a blacklist, the website server extracts the webpage content that matches the filtering template and filters out the extracted content. When the filtering template is a whitelist, the website server extracts the webpage content that matches the filtering template and filters out all other content in the webpage.
In the process of implementing the invention, the inventor found that the prior art has at least the following defect: configuring filtering templates for a massive number of webpages consumes excessive labor cost.
Disclosure of Invention
In order to solve the problems in the prior art, embodiments of the present invention provide a method and an apparatus for web page filtering. The technical scheme is as follows:
In a first aspect, a method for filtering a web page is provided, where the method includes:
acquiring a webpage set to be analyzed, wherein the webpage set comprises a plurality of webpages, and each webpage comprises a plurality of nodes;
for each node in each webpage, calculating a likelihood feature value of the node, wherein the likelihood feature value indicates how likely the node is to be a node of a specified type;
determining each node whose likelihood feature value is greater than a specified threshold as a node of the specified type;
and filtering the webpage to be displayed based on the determined nodes of the specified type.
In a second aspect, an apparatus for filtering web pages is provided, the apparatus comprising:
a webpage set acquisition module, configured to acquire a webpage set to be analyzed, wherein the webpage set comprises a plurality of webpages, and each webpage comprises a plurality of nodes;
a calculation module, configured to calculate, for each node in each webpage, a likelihood feature value of the node, wherein the likelihood feature value indicates how likely the node is to be a node of a specified type;
a specified type node determining module, configured to determine each node whose likelihood feature value is greater than a specified threshold as a node of the specified type;
and a filtering module, configured to filter the webpage to be displayed based on the determined nodes of the specified type.
The technical scheme provided by the embodiment of the invention has the following beneficial effects:
According to the method and the device provided by the embodiment of the invention, the likelihood feature value of each node in each webpage in the webpage set is calculated, and each node whose likelihood feature value is greater than the specified threshold is treated as a node of the specified type. The webpage to be displayed can then be filtered directly based on the determined nodes of the specified type, without a manually configured filtering template. The operation is simple and fast, and saves time and labor costs.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a web page filtering method according to an embodiment of the present invention;
FIG. 2 is a flowchart of a web page filtering method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a web page provided by an embodiment of the invention;
FIG. 4 is a diagram illustrating a tree structure according to an embodiment of the present invention;
FIG. 5 is a flow chart of a calculation of a likelihood feature value provided by an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a web page filtering apparatus according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a server according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
FIG. 1 is a flowchart of a web page filtering method according to an embodiment of the present invention. The method is executed by a server. Referring to FIG. 1, the method includes:
101. Acquire a webpage set to be analyzed, where the webpage set includes a plurality of webpages and each webpage includes a plurality of nodes.
102. For each node in each webpage, calculate a likelihood feature value of the node, where the likelihood feature value indicates how likely the node is to be a node of a specified type.
103. Determine each node whose likelihood feature value is greater than a specified threshold as a node of the specified type.
104. Filter the webpage to be displayed based on the determined nodes of the specified type.
According to the method provided by the embodiment of the invention, the likelihood feature value of each node in each webpage in the webpage set is calculated, and each node whose likelihood feature value is greater than the specified threshold is treated as a node of the specified type. The webpage to be displayed can then be filtered directly based on the determined nodes of the specified type, without a manually configured filtering template. The operation is simple and fast, and saves time and labor costs.
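Steps 101 to 104 can be illustrated with a minimal Python sketch. It is not the patented implementation: each page is assumed to be a plain list of node strings, similarity is reduced to exact content match (1 or 0), and the function names and the threshold value 0.4 are hypothetical choices made for illustration.

```python
def likelihood_values(pages):
    """Step 102: score every node by the fraction of other pages holding an identical node.

    pages: dict mapping a page id to a list of node content strings (assumed format).
    """
    scores = {}
    for pid, nodes in pages.items():
        others = [set(n) for q, n in pages.items() if q != pid]
        for i, content in enumerate(nodes):
            hits = sum(1 for other in others if content in other)
            scores[(pid, i)] = hits / len(others) if others else 0.0
    return scores


def specified_nodes(pages, threshold):
    """Step 103: keep the content of nodes whose score is strictly above the threshold."""
    scores = likelihood_values(pages)
    return {pages[pid][i] for (pid, i), s in scores.items() if s > threshold}


def filter_page(nodes, blocked):
    """Step 104: drop nodes of the page to be displayed that were marked as specified type."""
    return [n for n in nodes if n not in blocked]


# Step 101: a small webpage set (made-up content).
pages = {
    "a": ["Header ad", "Article 1 text", "Footer ad"],
    "b": ["Header ad", "Article 2 text", "Footer ad"],
    "c": ["Header ad", "Article 3 text"],
}
blocked = specified_nodes(pages, threshold=0.4)
shown = filter_page(["Header ad", "Article 4 text", "Footer ad"], blocked)
```

Here the header and footer nodes recur across pages and end up in `blocked`, while the article text of the new page is kept.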
Optionally, the calculating, for each node in each web page, a likelihood feature value of the node includes:
calculating, according to the content of each node, the similarity between the node and each node in the other webpages of the webpage set, that is, the webpages other than the one containing the node;
and performing statistics on the similarities between the node and the nodes in the other webpages to obtain the likelihood feature value of the node.
Optionally, the method further comprises:
and grouping the multiple nodes in the multiple webpages according to the position of each node in the corresponding webpage to obtain multiple node sets, wherein the multiple nodes in each node set are positioned at the same position in different webpages.
Optionally, the calculating, for each node in each web page, a likelihood feature value of the node includes:
for each node in each node set, calculating the similarity between the node and the other nodes in the node set according to the content of each node;
and performing statistics on the similarities between the node and the other nodes in the node set to obtain the likelihood feature value of the node.
Optionally, the acquiring the set of web pages to be analyzed includes:
acquiring a plurality of webpages generated within a specified duration before a current time point;
and grouping the multiple webpages to obtain multiple webpage sets.
Optionally, the grouping the multiple webpages to obtain multiple webpage sets includes:
grouping the multiple webpages according to the publishing account of each webpage to obtain multiple webpage sets; or,
grouping the multiple webpages according to the storage directory of each webpage to obtain multiple webpage sets; or,
and grouping the multiple webpages according to the subdomain name of each webpage to obtain multiple webpage sets.
Optionally, the filtering, based on the determined node of the specified type, the to-be-displayed web page includes:
outputting the determined nodes of the specified type to a blacklist template configuration file;
when a webpage filtering and displaying request is received, acquiring an original webpage corresponding to the webpage filtering and displaying request;
and filtering the original webpage based on the blacklist template configuration file so as to filter out the specified type nodes included in the original webpage.
Optionally, the filtering, based on the determined node of the specified type, the to-be-displayed web page includes:
outputting nodes except the nodes of the specified type in the multiple webpages to a white list template configuration file;
when a webpage filtering and displaying request is received, acquiring an original webpage corresponding to the webpage filtering and displaying request;
and filtering the original webpage based on the white list template configuration file so as to filter out the specified type nodes included in the original webpage.
All the above-mentioned optional technical solutions can be combined arbitrarily to form the optional embodiments of the present invention, and are not described herein again.
FIG. 2 is a flowchart of a web page filtering method according to an embodiment of the present invention. The method is executed by a server. Referring to FIG. 2, the method includes:
201. The server groups a plurality of webpages to be analyzed to obtain a plurality of webpage sets.
In the embodiment of the present invention, the server provides webpages to a terminal, where the terminal may be a fixed terminal or a mobile terminal, such as a computer or a mobile phone. When a user wants to browse a webpage, the user can trigger a webpage access operation on the terminal. On detecting the operation, the terminal sends a webpage display request carrying a webpage address to the server. On receiving the webpage display request, the server obtains the corresponding original webpage according to the webpage address. If the terminal is a fixed terminal, the server sends the original webpage to the fixed terminal, which displays it. If the terminal is a mobile terminal, the server transcodes the original webpage and sends the transcoded webpage to the mobile terminal, which displays it.
In practical applications, the original web page may include content such as advertisements, instructions for use, recommendation information, and spam, which is unrelated to the content of the web page itself and tends to interfere with browsing, and many users wish to filter out this content when browsing the web page. To meet this need, each time before sending a webpage to be displayed to the terminal, the server may determine the content to be filtered in that webpage and filter it accordingly. To determine the content to be filtered, the server may train on a plurality of web pages to identify the content to be filtered in each web page.
Further, in order to improve the training accuracy, the server may group a plurality of web pages to obtain a plurality of web page sets, and train on each web page set separately. Specifically, the server may group all the webpages, or select a plurality of sample webpages from all the webpages and group the sample webpages, or obtain a webpage snapshot of each webpage and group the obtained snapshots, which is not limited in the embodiment of the present invention.
Optionally, the server groups the multiple webpages according to a specified rule to obtain multiple webpage sets. The specified rule may be the publishing account, the storage directory, or the subdomain name of a webpage, among others, which is not limited in the embodiment of the present invention. When the server hosts webpages published by multiple accounts and the specified rule is the publishing account, the server groups the webpages by the publishing account of each webpage, so that webpages in the same set share a publishing account and webpages in different sets have different publishing accounts. When the server stores webpages in different storage directories and the specified rule is the storage directory, the server groups the webpages by storage directory, so that webpages in the same set share a storage directory and webpages in different sets do not. The server also generates a webpage address for each webpage, and the address includes a subdomain name; when the specified rule is the subdomain name, the server groups the webpages by subdomain name, so that webpages in the same set share a subdomain name and webpages in different sets do not. In practice, the server may also group the webpages by other specified rules, which is not limited in the embodiment of the present invention.
In the embodiment of the present invention, different web page sets belong to different groups. Later, when the server acquires a web page to be displayed, it can classify that web page according to the same specified rule and determine the web page set that belongs to the same group, so as to determine the content to be filtered in the web page to be displayed from the training result of that web page set. For example, when the server acquires a webpage to be displayed, it obtains the publishing account of the webpage and determines the webpage set corresponding to that publishing account, which is the webpage set belonging to the same group as the webpage to be displayed.
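The grouping step, and the later lookup of the set that matches a webpage to be displayed, might be sketched as follows. The page record format (a dict with hypothetical `url`, `account`, and `storage_path` fields) is an assumption, and the full host name of the URL is used here as a stand-in for the subdomain name.

```python
from collections import defaultdict
from urllib.parse import urlsplit
import posixpath


def group_key(page, rule):
    """Return the grouping key of one page under the given specified rule."""
    if rule == "account":
        return page["account"]
    if rule == "directory":
        return posixpath.dirname(page["storage_path"])
    if rule == "subdomain":
        return urlsplit(page["url"]).hostname
    raise ValueError(f"unknown rule: {rule}")


def group_pages(pages, rule):
    """Split the pages into webpage sets, one set per distinct key."""
    sets = defaultdict(list)
    for page in pages:
        sets[group_key(page, rule)].append(page)
    return dict(sets)


pages = [
    {"url": "http://news.example.com/a.html", "account": "alice", "storage_path": "/data/news/a.html"},
    {"url": "http://news.example.com/b.html", "account": "bob", "storage_path": "/data/news/b.html"},
    {"url": "http://blog.example.com/c.html", "account": "alice", "storage_path": "/data/blog/c.html"},
]
by_account = group_pages(pages, "account")
by_subdomain = group_pages(pages, "subdomain")

# A page to be displayed is matched to its set with the same rule.
incoming = {"url": "http://blog.example.com/d.html", "account": "alice", "storage_path": "/data/blog/d.html"}
matching_set = by_account[group_key(incoming, "account")]
```

Using the same `group_key` function for training and for lookup keeps the two sides consistent, whichever rule is chosen.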
202. For each node in each web page in each set of web pages, the server calculates a likelihood feature value of the node, where the likelihood feature value indicates how likely the node is to be a node of a specified type.
The server can divide the webpage into a plurality of nodes, and the plurality of nodes can comprise nodes in various formats such as text nodes, picture nodes, video nodes, webpage link address nodes and the like. Specifically, the server may divide text content in a web page into a plurality of text nodes according to paragraphs, use each picture in the web page as a picture node, use each video in the web page as a video node, and use each web page link address in the web page as a web page link address node.
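One way to picture this division is a small parser built on Python's standard `html.parser` module. This is only a sketch under assumptions: paragraphs are taken to be `<p>` elements, pictures `<img>`, videos `<video>` with a `src` attribute, and link addresses `<a href>`; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class NodeSplitter(HTMLParser):
    """Collect (kind, content) pairs for text, picture, video and link nodes."""

    def __init__(self):
        super().__init__()
        self.nodes = []
        self._in_paragraph = False
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "p":
            self._in_paragraph = True
            self._text = []
        elif tag == "img" and attrs.get("src"):
            self.nodes.append(("picture", attrs["src"]))
        elif tag == "video" and attrs.get("src"):
            self.nodes.append(("video", attrs["src"]))
        elif tag == "a" and attrs.get("href"):
            self.nodes.append(("link", attrs["href"]))

    def handle_data(self, data):
        # Only text inside a paragraph contributes to a text node.
        if self._in_paragraph:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            text = "".join(self._text).strip()
            if text:
                self.nodes.append(("text", text))
            self._in_paragraph = False


def split_nodes(html):
    parser = NodeSplitter()
    parser.feed(html)
    parser.close()
    return parser.nodes


sample = ('<p>First paragraph.</p><img src="ad.png">'
          '<p>Second paragraph.</p><a href="http://example.com/x">more</a>')
nodes = split_nodes(sample)
```

A link that sits inside a paragraph is recorded before the paragraph's text node in this sketch, because the text node is only emitted when the paragraph closes.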
The content of some nodes is the content of the web page itself, while the content of other nodes is unrelated to the content of the web page. A node whose content is unrelated to the content of the webpage containing it is taken as a node of the specified type, and the nodes of the specified type are the nodes to be filtered out of the webpage.
For each set of web pages, to filter out the nodes of the specified type, the server analyzes each web page in the set to find the nodes most likely to be nodes of the specified type. Specifically, for each node in each web page in the web page set, the server calculates a likelihood feature value of the node. The likelihood feature value indicates how likely the node is to be a node of the specified type: the greater the value, the more likely the node is of the specified type, and the smaller the value, the less likely.
In practical applications, the nodes of the specified type included in different web pages of the same web page set often have the same or similar content. For example, FIG. 3 is a schematic diagram of web pages provided in an embodiment of the present invention. It shows two web pages published by the same account that contain two different articles, article 1 and article 2, yet nodes with the same content appear both above and below the article in each web page; these nodes are likely to be nodes of the specified type.
Based on this feature, for each node, the more nodes similar to it the web page set contains, the more likely the node is to be a node of the specified type, and the fewer similar nodes the set contains, the less likely.
For each node in each webpage, the server can calculate, according to the content of each node, the similarity between the node and each node in the other webpages of the webpage set, that is, the webpages other than the one containing the node. This yields a plurality of similarities between the node and a plurality of other nodes. The server performs statistics on these similarities to obtain the likelihood feature value of the node, which represents how likely the node is to be a node of the specified type. For the statistics, the server may use the sum or the average of the similarities as the likelihood feature value of the node, which is not limited in the embodiment of the present invention.
Referring to Table 1, the web page set includes a web page A and a web page B, where web page A includes node 1 and web page B includes node 2 and node 3. For node 1, the similarity between node 1 and node 2 and the similarity between node 1 and node 3 are calculated, and the average of the two similarities is used as the likelihood feature value of node 1.
TABLE 1
| Web page | Nodes |
| Web page A | Node 1 |
| Web page B | Node 2 and node 3 |
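The Table 1 example can be worked through in a few lines of Python. The text does not fix a similarity measure at this point, so word-level Jaccard similarity is used purely as a stand-in, and the node contents are made up for illustration.

```python
def jaccard(a, b):
    """Stand-in similarity: shared words divided by all distinct words."""
    words_a, words_b = set(a.split()), set(b.split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def likelihood(node, other_nodes, statistic="average"):
    """Combine the similarities to the other pages' nodes by sum or by average."""
    sims = [jaccard(node, other) for other in other_nodes]
    if not sims:
        return 0.0
    total = sum(sims)
    return total if statistic == "sum" else total / len(sims)


node_1 = "buy cheap phones now"          # web page A
node_2 = "an article about travel"       # web page B
node_3 = "buy cheap phones now"          # web page B

value_avg = likelihood(node_1, [node_2, node_3])          # (0 + 1) / 2
value_sum = likelihood(node_1, [node_2, node_3], "sum")   # 0 + 1
```

With these contents node 1 is identical to node 3 and shares nothing with node 2, so the average is 0.5 and the sum is 1.0.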
Further, for text nodes, the server may preset a correspondence between node content and feature values, for example a feature value for each word in a text node. According to this correspondence, the server determines the feature values of each text node and combines them into a feature vector, thereby obtaining the feature vector of each text node. For picture nodes and web page link address nodes, the server may preset a correspondence between a URL (Uniform Resource Locator) and a feature vector; the server then obtains the URL of each picture node or web page link address node and determines its feature vector according to the correspondence. For each node in each web page, the server may calculate the similarity between the feature vector of the node and the feature vector of each node in the other web pages, obtaining a plurality of similarities. The server may use cosine similarity or a Euclidean-distance-based similarity between the feature vectors, which is not limited in the embodiments of the present invention.
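A small sketch of the vector comparison follows. Several details are assumptions rather than part of the text: the word-to-index `vocabulary`, the use of word counts as vector components, and the conversion of Euclidean distance d into a similarity as 1 / (1 + d).

```python
import math


def text_vector(text, vocabulary):
    """Build a word-count vector; `vocabulary` maps each known word to an index."""
    vec = [0.0] * len(vocabulary)
    for word in text.lower().split():
        index = vocabulary.get(word)
        if index is not None:
            vec[index] += 1.0
    return vec


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector is treated as dissimilar to everything
    return dot / (norm_u * norm_v)


def euclidean_similarity(u, v):
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return 1.0 / (1.0 + distance)  # assumed mapping from distance to similarity


vocabulary = {"buy": 0, "cheap": 1, "phones": 2, "travel": 3}
vec_a = text_vector("Buy cheap phones", vocabulary)   # [1, 1, 1, 0]
vec_b = text_vector("cheap phones", vocabulary)       # [0, 1, 1, 0]
vec_c = text_vector("travel", vocabulary)             # [0, 0, 0, 1]

sim_ab = cosine_similarity(vec_a, vec_b)
sim_ac = cosine_similarity(vec_a, vec_c)
sim_aa = euclidean_similarity(vec_a, vec_a)
```

Cosine similarity ignores vector length, so a short ad text and a longer variant of it still score high, whereas the distance-based measure penalizes the extra words.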
In practical applications, the nodes of the specified type included in different web pages of the same web page set often occupy the same or similar positions in their respective web pages; for example, the website server may add an advertisement node in the lower right corner of every web page of a website. Based on this feature, to reduce the amount of computation, the server may calculate the similarity of each node only with the nodes at the same position in other web pages. Specifically, the server groups the nodes of the multiple webpages according to the position of each node in its webpage to obtain a plurality of node sets, where the nodes in each node set are located at the same position in different webpages. For each node in each node set, the server calculates the similarity between the node and the other nodes in the node set according to the content of each node, and performs statistics on these similarities to obtain the likelihood feature value of the node.
Based on the example of Table 1, assuming that the position of node 3 in web page B is the same as the position of node 1 in web page A, the server calculates the similarity between node 1 and node 3 as the likelihood feature value of node 1.
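The position-based variant can be sketched as follows. The position labels (`"bottom-right"`, `"body"`), the page format, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in content similarity in the range 0..1."""
    return SequenceMatcher(None, a, b).ratio()


def build_node_sets(pages):
    """Group nodes by position; pages maps page id -> {position: content}."""
    node_sets = defaultdict(list)
    for page_id, nodes in pages.items():
        for position, content in nodes.items():
            node_sets[position].append((page_id, content))
    return node_sets


def likelihood_by_position(pages):
    """Average each node's similarity to the other nodes at the same position."""
    scores = {}
    for position, members in build_node_sets(pages).items():
        for i, (page_id, content) in enumerate(members):
            sims = [similarity(content, other)
                    for j, (_, other) in enumerate(members) if j != i]
            scores[(page_id, position)] = sum(sims) / len(sims) if sims else 0.0
    return scores


pages = {
    "A": {"bottom-right": "Buy cheap phones"},                                  # node 1
    "B": {"body": "An article about travel", "bottom-right": "Buy cheap phones"},  # nodes 2, 3
}
scores = likelihood_by_position(pages)
```

Node 1 is compared only with node 3, which shares its position, so the comparison with node 2 is never computed.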
Optionally, the server may analyze each web page in the web page set to establish a specified tree structure for each web page, where the specified tree structure includes a plurality of nodes, and the server may calculate the likelihood feature value of each node based on the specified tree structure. The specified tree structure may be a DOM (Document Object Model) tree structure or another tree structure, which is not limited in the embodiments of the present invention.
In the specified tree structure, the nodes have a hierarchical relationship: each node has one node at the level above it and possibly several nodes at the level below it. For example, a paragraph text node in a web page may contain several line-level text nodes.
Take the calculation of the likelihood feature value of a first node in a first web page as an example, where a second web page is any web page in the web page set other than the first web page. For each node in the second web page, when the first node is similar to that node, the first node may also be similar to the node one level above it.
FIG. 4 is a schematic diagram of the specified tree structure of a second web page provided by an embodiment of the present invention, and FIG. 5 is a flowchart of calculating a likelihood feature value provided by an embodiment of the present invention. Referring to FIG. 4 and FIG. 5, when the server calculates the likelihood feature value of the first node, it may perform the following steps (1) to (9):
(1) The server selects the node 111 in the lowest layer of the specified tree structure of the second web page.
(2) The server calculates the first similarity between the first node and the node 111, judges whether the first similarity is larger than a first threshold value, if so, executes the step (4), and if not, executes the step (3).
In the embodiment of the present invention, when the first similarity is greater than the first threshold, it indicates that the first node is similar to the node 111, and when the first similarity is not greater than the first threshold, it indicates that the first node is not similar to the node 111. The first threshold may be predetermined by a technician, or determined by the server through statistics of similarity between the first node and each lowest node, which is not limited in the embodiment of the present invention.
(3) The server selects another lowest level node 112 and continues to perform step (2) until each lowest level node is selected.
(4) The server selects node 11, which is one level above node 111.
(5) The server calculates a second similarity between the first node and the node 11, judges whether the second similarity is larger than the first threshold, if so, executes the step (8), and if not, executes the step (6).
(6) The server takes the first similarity as the similarity to be counted.
When the first similarity is greater than the first threshold and the second similarity is not greater than the first threshold, it may be determined that the first node is similar to node 111 but not to node 11, and the server selects the first similarity as a similarity to be used later in the statistics for the likelihood feature value of the first node.
(7) The server selects a node 121 located in a different branch from the node 11 from the lowest node of the designated tree structure, and proceeds to step (2).
(8) The server selects the node 1 located at the upper layer of the node 11, and continues to execute the step (5) until the node at the uppermost layer is selected.
(9) The server repeats the above steps for each webpage in the webpage set other than the first webpage. Once the similarity to be counted has been obtained for each of these webpages, the server performs statistics on the obtained similarities to obtain the likelihood feature value of the first node.
Steps (1) to (9) are only an example of how the server may calculate the likelihood feature value. In practical applications, the server may also use other ways to determine the largest node in each web page that is similar to the first node and to obtain the similarity to be counted for each web page, and then calculate the likelihood feature value.
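The bottom-up search can be sketched in Python as below. Several points are assumptions because the steps leave them open: each tree node is taken to carry a `content` string, `difflib.SequenceMatcher` stands in for the similarity measure, a page with no similar node contributes 0.0, and when several branches yield a candidate the one that climbed highest is kept. The branch skipping of step (7) is omitted for brevity, so every lowest-level node is tried.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import List, Optional


@dataclass(eq=False)
class TreeNode:
    content: str
    children: List["TreeNode"] = field(default_factory=list)
    parent: Optional["TreeNode"] = field(default=None, repr=False)

    def add(self, child):
        child.parent = self
        self.children.append(child)
        return child


def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()


def leaves(root):
    """Return the lowest-level nodes of the tree, left to right."""
    if not root.children:
        return [root]
    found = []
    for child in root.children:
        found.extend(leaves(child))
    return found


def page_similarity(first_content, root, first_threshold):
    """Return the similarity to be counted for one other page."""
    best = (0, 0.0)  # (levels reached, similarity at that level)
    for leaf in leaves(root):
        sim = similarity(first_content, leaf.content)
        if sim <= first_threshold:
            continue  # step (3): try another lowest-level node
        node, height = leaf, 1
        while node.parent is not None:  # steps (4), (5), (8): climb while similar
            parent_sim = similarity(first_content, node.parent.content)
            if parent_sim <= first_threshold:
                break  # step (6): keep the last similarity that passed
            node, sim, height = node.parent, parent_sim, height + 1
        best = max(best, (height, sim))
    return best[1]


def likelihood(first_content, other_roots, first_threshold=0.8):
    """Step (9): average the per-page similarities (the sum would also fit the text)."""
    sims = [page_similarity(first_content, r, first_threshold) for r in other_roots]
    return sum(sims) / len(sims) if sims else 0.0


# Second page: root -> branch "ad line" -> leaf "ad line", plus an unrelated leaf.
root = TreeNode("page")
branch = root.add(TreeNode("ad line"))
branch.add(TreeNode("ad line"))
root.add(TreeNode("xyz"))
third_page = TreeNode("zzz")  # a page with a single, unrelated node

per_page = page_similarity("ad line", root, 0.8)
value = likelihood("ad line", [root, third_page])
```

In the example the climb stops at the branch node, because the root's content is no longer similar to the first node.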
203. The server determines each node in the webpage set whose likelihood feature value is greater than a specified threshold as a node of the specified type.
The specified threshold may be obtained by the server by analyzing the likelihood feature value of each node and the number of nodes in the web page set, and the specified thresholds of different web page sets may be the same or different, which is not limited in the embodiment of the present invention.
In the embodiment of the present invention, a node whose likelihood feature value is greater than the specified threshold can be considered similar to many nodes of other web pages in the web page set, that is, the node appears frequently in the web page set, and is therefore regarded as a node of the specified type. A node whose likelihood feature value is not greater than the specified threshold is similar to few nodes of other web pages in the web page set, that is, the node appears rarely in the web page set, and is therefore not a node of the specified type.
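The selection itself is a strict comparison against the threshold. The text does not say how the threshold is derived from the values and the node count, so the mean used below is only a hypothetical placeholder, and the node names and values are made up.

```python
def mean_threshold(values):
    """Hypothetical rule: use the mean likelihood feature value of the set as threshold."""
    return sum(values.values()) / len(values)


def select_specified(values, threshold):
    """Keep nodes whose likelihood feature value is strictly greater than the threshold."""
    return {node for node, value in values.items() if value > threshold}


values = {"header ad": 0.9, "article 1": 0.1, "article 2": 0.2}
threshold = mean_threshold(values)
selected = select_specified(values, threshold)
```

Because the threshold is computed per webpage set, two sets with different value distributions end up with different thresholds, which the text allows.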
204. The server filters the webpage to be displayed based on the determined nodes of the specified type.
When the server determines the designated type nodes in the webpage set, the webpage to be displayed belonging to the same group with the webpage set can be filtered, and the designated type nodes in the webpage to be displayed are filtered. Specifically, the server generates a template configuration file according to the determined specified type node, and then filters the web page to be displayed based on the template configuration file.
In the embodiment of the present invention, when a user wants to browse a webpage with the specified-type nodes filtered out, the user can trigger an access-and-filter operation on the terminal. On detecting this operation, the terminal sends a webpage filtering display request carrying the webpage address to the server. On receiving the webpage filtering display request, the server obtains the corresponding original webpage according to the webpage address, determines according to the specified rule the webpage set that belongs to the same group as the original webpage, and obtains the template configuration file corresponding to that webpage set. The server then filters the original webpage based on the template configuration file to remove the specified-type nodes it contains, and sends the filtered webpage to the terminal, which displays it. The filtered webpage contains the content of the webpage itself and none of the unrelated specified-type nodes, so the user can browse without interference from those nodes and enjoys a cleaner browsing experience.
The template configuration file may be a whitelist or a blacklist, and correspondingly, step 204 may include either of the following steps 204a and 204b:
204a, the server outputs the determined specified type nodes to a blacklist template configuration file, when a webpage filtering display request is received, an original webpage corresponding to the webpage filtering display request is obtained, and the original webpage is filtered based on the blacklist template configuration file so as to filter the specified type nodes included in the original webpage.
The server can generate a blacklist template configuration file for the webpage set, output the determined specified type nodes to it, and store it. The nodes in the blacklist template configuration file are the specified type nodes to be filtered out. When the server receives a webpage filtering display request sent by a terminal, it acquires the corresponding original webpage and, based on the blacklist template configuration file, filters out of the original webpage the nodes that are included in the file, so that the specified type nodes included in the original webpage are removed.
204b, the server outputs the nodes except the specified type nodes in the multiple webpages to a white list template configuration file, acquires the original webpage corresponding to the webpage filtering display request when the webpage filtering display request is received, and filters the original webpage based on the white list template configuration file so as to filter the specified type nodes included in the original webpage.
The server can generate a whitelist template configuration file for the webpage set, output the nodes of the webpages other than the specified type nodes to it, and store it. The nodes in the whitelist template configuration file are the webpage nodes to be retained. When the server receives a webpage filtering display request sent by a terminal, it acquires the corresponding original webpage and, based on the whitelist template configuration file, filters out of the original webpage the nodes that are not included in the file, so that the specified type nodes included in the original webpage are removed.
When the user uses a mobile terminal, step 204 may be applied in the transcoding process of the server: when the server acquires the original webpage, it transcodes the original webpage based on the template configuration file, so that the transcoded webpage does not include the specified type nodes.
It should be noted that the embodiment of the present invention takes the currently generated webpages as the webpages to be analyzed as an example. In practical applications, however, the web server hosting a webpage may update the webpage because of service upgrades, anti-crawling measures, and the like. Once the webpage is updated, the content of the webpage or the position of that content within the webpage may change, and the specified type nodes in the webpage may change as well. To keep the template configuration file up to date, the server also updates the template configuration file.
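The two alternatives 204a and 204b can be contrasted in a short sketch. Identifying a node by its position string and storing a template as a set of positions are assumptions made for illustration; the function names and the example paths are hypothetical.

```python
from typing import List, Set, Tuple

Node = Tuple[str, str]  # (position, content); keying on position is an assumption


def filter_with_blacklist(page: List[Node], blacklist: Set[str]) -> List[Node]:
    """Step 204a: drop every node whose position is listed in the blacklist."""
    return [node for node in page if node[0] not in blacklist]


def filter_with_whitelist(page: List[Node], whitelist: Set[str]) -> List[Node]:
    """Step 204b: keep only the nodes whose position is listed in the whitelist."""
    return [node for node in page if node[0] in whitelist]


# Hypothetical original webpage with a navigation bar, an article and an advert.
original = [
    ("/html/body/div[1]", "Home | About"),
    ("/html/body/div[2]", "Article text"),
    ("/html/body/div[3]", "Advertisement"),
]
by_blacklist = filter_with_blacklist(original, {"/html/body/div[1]", "/html/body/div[3]"})
by_whitelist = filter_with_whitelist(original, {"/html/body/div[2]"})
```

Under these assumptions the two templates differ in how they treat a node that was never seen during analysis: the blacklist keeps it, while the whitelist removes it.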
Optionally, once every specified duration, the server acquires the plurality of webpages generated within the specified duration before the current time point, performs the above steps 201 to 204 on them to obtain an updated template configuration file, and filters the webpages to be displayed based on the updated template configuration file. The specified duration may be determined by the server according to the interval between webpage updates, and may be, for example, one day or several days.
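Selecting the webpages generated within the specified duration before the current time point can be sketched as below. The function `pages_in_window`, the representation of each page as a (URL, generation time) pair, and the inclusive window boundaries are assumptions for illustration only.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def pages_in_window(
    pages: List[Tuple[str, datetime]], now: datetime, window: timedelta
) -> List[str]:
    """Return the URLs of pages generated in [now - window, now]."""
    start = now - window
    return [url for url, created in pages if start <= created <= now]


# Hypothetical crawl records; the specified duration is taken to be one day.
now = datetime(2014, 11, 14, 12, 0)
records = [
    ("a", datetime(2014, 11, 14, 8, 0)),   # inside the window
    ("b", datetime(2014, 11, 12, 8, 0)),   # older than one day
    ("c", datetime(2014, 11, 13, 12, 0)),  # exactly on the window boundary
]
recent = pages_in_window(records, now, timedelta(days=1))
```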
To prevent the update process from affecting the current service of the server, the server may perform steps 201 to 204 offline after acquiring the plurality of webpages. During this process the server continues to filter webpages to be displayed based on the old template configuration file; once the updated template configuration file is obtained, the server loads it and filters webpages to be displayed based on it.
In the prior art, filtering templates are configured manually. When a web server updates a webpage, the originally configured filtering template becomes invalid, and operators must monitor the updates of every webpage to find invalid templates and then configure new ones, which consumes excessive labor. In practice, operators can hardly find invalid templates in time, so timeliness is poor. In the embodiment of the present invention, the server automatically acquires a plurality of newly generated webpages once every specified duration, repeats the training steps on a rolling basis, and updates the template configuration file in time. The whole training process is unsupervised and repeats automatically, which greatly reduces labor cost and keeps the template configuration file up to date, and performing the training offline avoids affecting the current service.
According to the method provided by the embodiment of the present invention, the likelihood feature value of each node in each webpage of the webpage set is calculated, and the nodes whose likelihood feature value is greater than the specified threshold are taken as the specified type nodes. The webpage to be displayed can then be filtered directly based on the determined specified type nodes without manually configuring a filtering template, so the operation is simple and fast and saves time and labor. Furthermore, newly generated webpages are acquired automatically, the training steps are repeated, and the template configuration file is updated in time, which greatly reduces labor cost and keeps the template configuration file up to date, while performing the training offline avoids affecting the current service.
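The behaviour of serving with the old template while a new one is being prepared can be sketched as follows. Modelling the offline training as a background thread, the class name `TemplateHolder`, the function `retrain_offline`, and the blacklist-as-set representation are assumptions for illustration; an actual deployment might run the training in a separate process or on a separate machine.

```python
import threading
from typing import Callable, FrozenSet, Iterable


class TemplateHolder:
    """Holds the template currently used for filtering and allows it to be swapped."""

    def __init__(self, initial: Iterable[str]) -> None:
        self._lock = threading.Lock()
        self._blacklist: FrozenSet[str] = frozenset(initial)

    def current(self) -> FrozenSet[str]:
        with self._lock:
            return self._blacklist

    def replace(self, updated: Iterable[str]) -> None:
        new_template = frozenset(updated)   # built before taking the lock
        with self._lock:
            self._blacklist = new_template  # requests see old or new, never a mix


def retrain_offline(holder: TemplateHolder, train: Callable[[], Iterable[str]]) -> threading.Thread:
    """Run the (hypothetical) training callable in the background, then swap."""
    worker = threading.Thread(target=lambda: holder.replace(train()))
    worker.start()
    return worker


holder = TemplateHolder({"old_ad"})
before = holder.current()                   # template in use before retraining
worker = retrain_offline(holder, lambda: {"new_ad"})
worker.join()
after = holder.current()                    # updated template, now loaded
```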
Fig. 6 is a schematic structural diagram of a web page filtering apparatus according to an embodiment of the present invention, referring to fig. 6, the apparatus includes:
a web page set obtaining module 601, configured to obtain a web page set to be analyzed, where the web page set includes multiple web pages, and each web page includes multiple nodes;
a calculating module 602, configured to calculate, for each node in each web page, a likelihood feature value of the node, where the likelihood feature value is used to indicate a likelihood size that the node is a node of a specified type;
a designated type node determining module 603, configured to determine a node with a likelihood feature value greater than a designated threshold as the designated type node;
and a filtering module 604, configured to filter the to-be-displayed web page based on the determined specified type node.
According to the device provided by the embodiment of the present invention, the likelihood feature value of each node in each webpage of the webpage set is calculated, and the nodes whose likelihood feature value is greater than the specified threshold are taken as the specified type nodes. The webpage to be displayed can then be filtered directly based on the determined specified type nodes without manually configuring a filtering template, so the operation is simple and fast and saves time and labor.
Optionally, the calculating module 602 is configured to calculate, according to the content of each node, the similarity between the node and each node in the other webpages of the webpage set (that is, the webpages other than the one containing the node), and to perform statistics on these similarities to obtain the likelihood feature value of the node.
Optionally, the apparatus further comprises:
and the node grouping module is used for grouping a plurality of nodes in the plurality of webpages according to the position of each node in the corresponding webpage to obtain a plurality of node sets, and the plurality of nodes in each node set are positioned at the same position in different webpages.
Optionally, the calculating module 602 is configured to calculate, for each node in each node set, the similarity between the node and the other nodes in the node set according to the content of each node, and to perform statistics on these similarities to obtain the likelihood feature value of the node.
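The combination of the node grouping module and the calculating module can be sketched as follows. The similarity measure (`difflib.SequenceMatcher.ratio` on the node text) and the statistic (the arithmetic mean of the similarities) are stand-ins chosen for illustration and are not stated in this passage; the function name `likelihood_by_position` is hypothetical, and each position is assumed to occur at most once per webpage.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from typing import Dict, List, Tuple

Node = Tuple[str, str]  # (position, content)


def likelihood_by_position(pages: List[List[Node]]) -> Dict[Tuple[int, str], float]:
    """Group nodes by position across pages, then score each node by its mean
    similarity to the other nodes at the same position."""
    groups: Dict[str, List[Tuple[int, str]]] = defaultdict(list)
    for page_index, page in enumerate(pages):
        for position, content in page:
            groups[position].append((page_index, content))

    scores: Dict[Tuple[int, str], float] = {}
    for position, members in groups.items():
        for page_index, content in members:
            others = [c for i, c in members if i != page_index]
            if not others:                      # node set with a single member
                scores[(page_index, position)] = 0.0
                continue
            sims = [SequenceMatcher(None, content, other).ratio() for other in others]
            scores[(page_index, position)] = sum(sims) / len(sims)
    return scores


# Three hypothetical pages sharing a navigation bar but not their main content.
pages = [
    [("/nav", "Home | About"), ("/main", "aaaa")],
    [("/nav", "Home | About"), ("/main", "bbbb")],
    [("/nav", "Home | About"), ("/main", "cccc")],
]
scores = likelihood_by_position(pages)
```

Comparing only nodes at the same position keeps the number of comparisons far below an all-pairs comparison across the webpage set.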
Optionally, the web page set obtaining module 601 is configured to obtain a plurality of web pages generated within a specified duration before a current time point; and grouping the multiple webpages to obtain multiple webpage sets.
Optionally, the web page set obtaining module 601 is specifically configured to group the multiple web pages according to the publishing account of each web page to obtain multiple web page sets; or grouping the multiple webpages according to the storage directory of each webpage to obtain multiple webpage sets; or grouping the multiple webpages according to the subdomain name of each webpage to obtain multiple webpage sets.
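Two of the grouping rules can be sketched using only the webpage address. Treating the storage directory as the directory part of the URL path is an assumption, and the function names are hypothetical; grouping by publishing account would need account metadata that is not available from the address alone and is therefore not shown.

```python
from collections import defaultdict
from posixpath import dirname
from typing import Dict, List, Optional, Tuple
from urllib.parse import urlsplit


def group_by_subdomain(urls: List[str]) -> Dict[Optional[str], List[str]]:
    """Group webpages by the host name (subdomain) of their address."""
    groups: Dict[Optional[str], List[str]] = defaultdict(list)
    for url in urls:
        groups[urlsplit(url).hostname].append(url)
    return dict(groups)


def group_by_directory(urls: List[str]) -> Dict[Tuple[Optional[str], str], List[str]]:
    """Group webpages by host name plus the directory part of the URL path."""
    groups: Dict[Tuple[Optional[str], str], List[str]] = defaultdict(list)
    for url in urls:
        parts = urlsplit(url)
        groups[(parts.hostname, dirname(parts.path))].append(url)
    return dict(groups)


# Hypothetical addresses used only to show the resulting webpage sets.
urls = [
    "http://news.example.com/a/1.html",
    "http://news.example.com/a/2.html",
    "http://news.example.com/b/3.html",
    "http://blog.example.com/b/4.html",
]
by_subdomain = group_by_subdomain(urls)
by_directory = group_by_directory(urls)
```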
Optionally, the filtering module 604 is configured to output the determined node of the specified type to a blacklist template configuration file; when a webpage filtering and displaying request is received, acquiring an original webpage corresponding to the webpage filtering and displaying request; and filtering the original webpage based on the blacklist template configuration file so as to filter out the specified type nodes included in the original webpage.
Optionally, the filtering module 604 is configured to output nodes of the multiple webpages except the node of the specified type to a white list template configuration file; when a webpage filtering and displaying request is received, acquiring an original webpage corresponding to the webpage filtering and displaying request; and filtering the original webpage based on the white list template configuration file so as to filter out the specified type nodes included in the original webpage.
All the above-mentioned optional technical solutions can be combined arbitrarily to form the optional embodiments of the present invention, and are not described herein again.
It should be noted that: in the web page filtering apparatus provided in the above embodiment, when filtering a web page, only the division of the above functional modules is taken as an example, and in practical applications, the above function distribution may be completed by different functional modules according to needs, that is, the internal structure of the server is divided into different functional modules to complete all or part of the above described functions. In addition, the web page filtering apparatus and the web page filtering method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments in detail and are not described herein again.
Fig. 7 is a schematic structural diagram of a server according to an embodiment of the present invention. The server may be used to perform the functions performed by the server in the web page filtering method shown in the foregoing embodiments. Specifically, referring to Fig. 7, the server 700 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPUs) 722 (e.g., one or more processors), a memory 732, and one or more storage media 730 (e.g., one or more mass storage devices) storing applications 742 or data 744. The memory 732 and the storage medium 730 may be transient storage or persistent storage. The program stored on the storage medium 730 may include one or more modules (not shown).
The server 700 may also include one or more power supplies 726, one or more wired or wireless network interfaces 750, one or more input-output interfaces 758, and/or one or more operating systems 741, such as Windows Server, Mac OS X™, Unix™, Linux™, FreeBSD™, and so forth.
One or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for:
acquiring a webpage set to be analyzed, wherein the webpage set comprises a plurality of webpages, and each webpage comprises a plurality of nodes;
for each node in each webpage, calculating a possibility characteristic value of the node, wherein the possibility characteristic value is used for representing the possibility size that the node is a node of a specified type;
determining the node with the possibility characteristic value larger than a specified threshold value as the node with the specified type;
and filtering the webpage to be displayed based on the determined specified type node.
Optionally, further comprising instructions for:
according to the content of each node, calculating the similarity between the node and each node in other webpages except the webpage in the webpage set;
and carrying out statistics on the similarity of the node and each node in other webpages to obtain the probability characteristic value of the node.
Optionally, further comprising instructions for:
and grouping the multiple nodes in the multiple webpages according to the position of each node in the corresponding webpage to obtain multiple node sets, wherein the multiple nodes in each node set are positioned at the same position in different webpages.
Optionally, further comprising instructions for:
for each node in each node set, calculating the similarity between the node and other nodes in the node set according to the content of each node;
and carrying out statistics on the similarity of the node and other nodes in the node set to obtain the probability characteristic value of the node.
Optionally, further comprising instructions for:
acquiring a plurality of webpages generated within a specified duration before a current time point;
and grouping the multiple webpages to obtain multiple webpage sets.
Optionally, further comprising instructions for:
grouping the multiple webpages according to the issuing account number of each webpage to obtain multiple webpage sets; or,
grouping the multiple webpages according to the storage directory of each webpage to obtain multiple webpage sets; or,
and grouping the multiple webpages according to the subdomain name of each webpage to obtain multiple webpage sets.
Optionally, further comprising instructions for:
outputting the determined appointed type node to a blacklist template configuration file;
when a webpage filtering and displaying request is received, acquiring an original webpage corresponding to the webpage filtering and displaying request;
and filtering the original webpage based on the blacklist template configuration file so as to filter out the specified type nodes included in the original webpage.
Optionally, further comprising instructions for:
outputting nodes except the nodes of the specified type in the multiple webpages to a white list template configuration file;
when a webpage filtering and displaying request is received, acquiring an original webpage corresponding to the webpage filtering and displaying request;
and filtering the original webpage based on the white list template configuration file so as to filter out the specified type nodes included in the original webpage.
All the above-mentioned optional technical solutions can be combined arbitrarily to form the optional embodiments of the present invention, and are not described herein again.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
Claims (10)
1. A method for filtering web pages, the method comprising:
acquiring a plurality of webpages generated within a specified duration before a current time point;
grouping the multiple webpages according to the issuing account number of each webpage to obtain multiple webpage sets; or grouping the multiple webpages according to the storage directory of each webpage to obtain multiple webpage sets; or grouping the multiple webpages according to the subdomain names of the webpages to obtain multiple webpage sets, wherein each webpage set comprises the multiple webpages;
for each webpage, dividing the webpage into a plurality of nodes according to formats of text nodes, picture nodes, video nodes and webpage link address nodes;
for each node, calculating a likelihood characteristic value of the node, wherein the likelihood characteristic value is used for representing the likelihood size of the node being a node of a specified type;
determining nodes with the possibility characteristic values larger than a specified threshold value as the specified type nodes;
filtering the webpage to be displayed based on the determined designated type node;
the method further comprises the following steps:
grouping a plurality of nodes in the plurality of webpages according to the position of each node in the corresponding webpage to obtain a plurality of node sets, wherein the plurality of nodes in each node set are positioned at the same position in different webpages;
for each node, the calculating the likelihood feature value of the node comprises:
for each node in each node set, calculating the similarity between the node and other nodes in the node set according to the content of each node;
and carrying out statistics on the similarity of the node and other nodes in the node set to obtain the probability characteristic value of the node.
2. The method of claim 1, wherein for each node, calculating the likelihood eigenvalue for the node comprises:
according to the content of each node, calculating the similarity between the node and each node in other webpages except the webpage in the webpage set;
and carrying out statistics on the similarity of the node and each node in the other webpages to obtain the probability characteristic value of the node.
3. The method of claim 1, wherein filtering the web page to be displayed based on the determined node of the specified type comprises:
outputting the determined appointed type node to a blacklist template configuration file;
when a webpage filtering display request is received, acquiring an original webpage corresponding to the webpage filtering display request;
and filtering the original webpage based on the blacklist template configuration file so as to filter out the specified type nodes included in the original webpage.
4. The method of claim 1, wherein filtering the web page to be displayed based on the determined node of the specified type comprises:
outputting nodes except the nodes of the specified type in the multiple webpages to a white list template configuration file;
when a webpage filtering display request is received, acquiring an original webpage corresponding to the webpage filtering display request;
and filtering the original webpage based on the white list template configuration file so as to filter out the specified type nodes included in the original webpage.
5. A web page filtering apparatus, the apparatus comprising:
the webpage set acquisition module is used for acquiring a plurality of webpages generated within a specified time before the current time point;
the webpage set acquisition module is further used for grouping the plurality of webpages according to the issuing account of each webpage to obtain a plurality of webpage sets; or grouping the multiple webpages according to the storage directory of each webpage to obtain multiple webpage sets; or grouping the multiple webpages according to the subdomain names of the webpages to obtain multiple webpage sets, wherein each webpage set comprises the multiple webpages;
the webpage set acquisition module is also used for dividing each webpage into a plurality of nodes according to the formats of a text node, a picture node, a video node and a webpage link address node;
a calculation module, configured to calculate, for each node, a likelihood feature value of the node, where the likelihood feature value is used to indicate a likelihood size that the node is a node of a specified type;
a designated type node determining module, configured to determine a node with a likelihood feature value greater than a designated threshold as the designated type node;
the filtering module is used for filtering the webpage to be displayed based on the determined specified type node;
the device further comprises:
the node grouping module is used for grouping a plurality of nodes in the plurality of webpages according to the position of each node in the corresponding webpage to obtain a plurality of node sets, and the plurality of nodes in each node set are positioned at the same position in different webpages;
the calculation module is further configured to calculate, for each node in each node set, a similarity between the node and another node in the node set according to the content of each node; and carrying out statistics on the similarity of the node and other nodes in the node set to obtain the probability characteristic value of the node.
6. The apparatus according to claim 5, wherein the calculation module is configured to calculate a similarity between each node and each node in the other webpages in the set of webpages except the webpage according to the content of each node; and carrying out statistics on the similarity of the node and each node in the other webpages to obtain the probability characteristic value of the node.
7. The apparatus of claim 5, wherein the filtering module is configured to output the determined nodes of the specified type into a blacklist template configuration file; when a webpage filtering display request is received, acquiring an original webpage corresponding to the webpage filtering display request; and filtering the original webpage based on the blacklist template configuration file so as to filter out the specified type nodes included in the original webpage.
8. The apparatus of claim 5, wherein the filtering module is configured to output nodes of the plurality of web pages other than the specified type of node into a whitelist template configuration file; when a webpage filtering display request is received, acquiring an original webpage corresponding to the webpage filtering display request; and filtering the original webpage based on the white list template configuration file so as to filter out the specified type nodes included in the original webpage.
9. A server for filtering web pages, the server comprising a memory and one or more processors, one or more programs being stored in the memory and configured to be executed by the one or more processors to implement the web page filtering method according to any one of claims 1 to 4.
10. A computer-readable storage medium storing one or more programs and configured for execution by one or more processors to implement the web page filtering method of any one of claims 1 to 4.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201410648193.1A CN105653550B (en) | 2014-11-14 | 2014-11-14 | Webpage filtering method and device |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN105653550A (en) | 2016-06-08 |
| CN105653550B (en) | 2019-11-05 |
Family
ID=56479084
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201410648193.1A Active CN105653550B (en) | 2014-11-14 | 2014-11-14 | Webpage filtering method and device |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN105653550B (en) |
Families Citing this family (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN106326455A (en) * | 2016-08-26 | 2017-01-11 | 乐视控股(北京)有限公司 | Web page browsing filtering processing method and system, terminal and cloud acceleration server |
| CN106599246B (en) * | 2016-12-20 | 2020-02-11 | 维沃移动通信有限公司 | Display content interception method, mobile terminal and control server |
| CN108628888A (en) * | 2017-03-21 | 2018-10-09 | 中兴通讯股份有限公司 | A kind of browser Ad blocking method, apparatus and terminal |
| CN107423059A (en) * | 2017-07-07 | 2017-12-01 | 北京小米移动软件有限公司 | Display methods, device and the terminal of the page |
| CN109756393B (en) * | 2018-12-27 | 2021-04-30 | 阿里巴巴(中国)有限公司 | Information processing method, system, medium, and computing device |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101399818A (en) * | 2007-09-25 | 2009-04-01 | 日电(中国)有限公司 | Theme related webpage filtering method and system based on navigation route information |
| CN103678313A (en) * | 2012-08-31 | 2014-03-26 | 北京百度网讯科技有限公司 | A method and device for evaluating the authority of a webpage |
| CN103870590A (en) * | 2014-03-28 | 2014-06-18 | 北京奇虎科技有限公司 | Webpage identification method and device with error-reported characteristic |
Non-Patent Citations (1)
| Title |
|---|
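| Research on Methods for Web Page Noise Identification and Elimination; Qin Chao; China Master's Theses Full-text Database, Information Science and Technology; 2012-05-15; pp. I139-278 * |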
Also Published As
| Publication number | Publication date |
|---|---|
| CN105653550A (en) | 2016-06-08 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| C06 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||