CN106250402B

CN106250402B - Website classification method and device

Info

Publication number: CN106250402B
Application number: CN201610574744.3A
Authority: CN
Inventors: 张惊申; 任方英
Original assignee: New H3C Technologies Co Ltd
Current assignee: New H3C Technologies Co Ltd
Priority date: 2016-07-19
Filing date: 2016-07-19
Publication date: 2022-01-21
Anticipated expiration: 2036-07-19
Also published as: CN106250402A

Abstract

The embodiment of the invention discloses a website classification method and a device, wherein the method comprises the following steps: acquiring first label information and first webpage content of a website to be classified, wherein the first label information is a part of the first webpage content; determining the website category corresponding to the first tag information according to a preset tag classification dictionary, wherein the tag classification dictionary comprises: the corresponding relation between the label information and the website category; and determining the website category of the website to be classified according to the website classification dictionary corresponding to the determined website category and the first webpage content. By applying the technical scheme provided by the embodiment of the invention, the efficiency of website classification is improved.

Description

Website classification method and device

Technical Field

The invention relates to the technical field of internet, in particular to a website classification method and device.

Background

The internet contains an extremely large number of websites of many kinds, such as news websites, sports websites, shopping websites, and the like. Faced with this variety, businesses or organizations often need to filter websites so that insiders cannot access websites of a given category. To determine whether a website needs to be filtered out, the website must first be classified.

Currently, website classification generally proceeds as follows: the content of the pages of the website to be accessed is determined and matched against the words in all preset website classification dictionaries, where each website category corresponds to one website classification dictionary and each website classification dictionary comprises a correspondence between words and weight values; the category of the website to be accessed is then determined from the matched weight values. Because the page content is matched against every word in every website classification dictionary, the efficiency of website classification is low.

Disclosure of Invention

The embodiment of the invention discloses a website classification method and device, which improve the efficiency of website classification.

In order to achieve the above object, the embodiment of the present invention discloses a website classification method, which comprises:

acquiring first label information and first webpage content of a website to be classified, wherein the first label information is a part of the first webpage content;

determining the website category corresponding to the first tag information according to a preset tag classification dictionary, wherein the tag classification dictionary comprises: the corresponding relation between the label information and the website category;

and determining the website category of the website to be classified according to the website classification dictionary corresponding to the determined website category and the first webpage content.

In order to achieve the above object, an embodiment of the present invention further discloses a website classification apparatus, where the apparatus includes:

a first acquiring unit, configured to acquire first tag information and first webpage content of a website to be classified, wherein the first tag information is a part of the first webpage content;

a first determining unit, configured to determine, according to a preset tag classification dictionary, a website category corresponding to the first tag information, where the tag classification dictionary includes: the corresponding relation between the label information and the website category;

and the second determining unit is used for determining the website category of the website to be classified according to the website classification dictionary corresponding to the determined website category and the first webpage content.

The embodiment of the invention provides a website classification method and device. First tag information and first webpage content of a website to be classified are acquired, where the first tag information is small in volume and the first webpage content is large. The website categories to which the website may belong are first screened out from all website categories according to the first tag information and a preset tag classification dictionary; the website category of the website is then determined according to the first webpage content and the website classification dictionaries corresponding to the screened-out website categories. Because the webpage content is matched only against those dictionaries, the website classification efficiency is effectively improved.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

Fig. 1 is a schematic flowchart of a website classification method according to an embodiment of the present invention;

Fig. 2 is a schematic flowchart of the construction of the classification dictionaries in the website classification method according to the embodiment of the present invention;

Fig. 3 is a schematic structural diagram of a website classification device according to an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of a device for constructing a classification dictionary used in the website classification device according to the embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The present invention will be described in detail below with reference to specific examples.

Referring to fig. 1, fig. 1 is a schematic flowchart of a website classification method according to an embodiment of the present invention, where the method includes:

S101: acquiring first tag information and first webpage content of a website to be classified;

here, the website to be classified may be a website that the user needs to visit, or a website preset by the user. The first tag information is a part of the first webpage content. It may be the title information of the first webpage content, such as "Tmall Supermarket" or "Baidu Tieba"; it may also be a column title in the first webpage content, such as "entertainment stars", "movies", or "novels" in "Baidu Tieba".

It should be noted that this embodiment does not limit the first tag information: any content that can represent the features of the website can be used as the first tag information.

In an embodiment of the present invention, a URL (Uniform Resource Locator) of a website to be classified may be first obtained, a web crawler tool is used to access the URL, and tag information and web page content of the website are extracted from content fed back by the website.
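The extraction step can be illustrated with a minimal sketch. It assumes that the page title serves as the tag information and that the page has already been fetched; the class name `PageExtractor`, the function `extract_tag_info_and_content`, and the sample page are hypothetical and are not part of the embodiment.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collects the <title> text (tag information) and the visible text (webpage content)."""

    SKIPPED = {"script", "style"}  # elements whose text is not page content

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.text_parts = []
        self._stack = []  # currently open <title>/<script>/<style> elements

    def handle_starttag(self, tag, attrs):
        if tag == "title" or tag in self.SKIPPED:
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        current = self._stack[-1] if self._stack else None
        if current in self.SKIPPED:
            return
        if current == "title":
            self.title_parts.append(text)
        self.text_parts.append(text)  # the title is itself part of the content


def extract_tag_info_and_content(html):
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.title_parts), " ".join(parser.text_parts)


# Hypothetical page standing in for the content a crawler would receive from the URL.
sample_html = (
    "<html><head><title>Sports Supermarket</title>"
    "<script>var x = 1;</script></head>"
    "<body><h1>Running shoes</h1><p>Free delivery</p></body></html>"
)
tag_info, content = extract_tag_info_and_content(sample_html)
```

Fetching the page itself is omitted here; any web crawler tool or HTTP client could supply the HTML string.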

S102: determining a website category corresponding to the first label information according to a preset label classification dictionary;

wherein the tag classification dictionary comprises the correspondence between tag information and website categories.

Because the tag information contains few words, matching the first tag information against the tag classification dictionary quickly determines the website category corresponding to the first tag information.

In one embodiment of the invention, the tag information in the tag classification dictionary may be tag words. In this case, matching the first tag information directly against the tag classification dictionary may produce a mismatch. For example, in the tag information "Beijing university research center", the two characters "big" and "study" are adjacent and can therefore be matched by the tag word "university"; in practice, however, the two characters belong to different words, "maximum" and "study" respectively, so matching the tag information with the tag word "university" is a mismatch.

To avoid such mismatches, the first tag information may be segmented to obtain at least one first tag word. For example, once the tag information "Beijing university research center" has been segmented into tag words, it can no longer be matched by the tag word "university", which effectively avoids the mismatch.
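The difference between matching unsegmented text and matching segmented tag words can be sketched with an English analogy. This is not the example of the embodiment: the tag information "best art supplies", the tag word "start", and both function names are hypothetical, and a whitespace split stands in for a real word segmenter.

```python
def substring_match(tag_info, tag_word):
    """Naive matching on unsegmented text: a tag word may straddle two adjacent words."""
    return tag_word in tag_info.replace(" ", "")


def segmented_match(tag_words, tag_word):
    """Matching after segmentation: only whole tag words can match."""
    return tag_word in tag_words


tag_info = "best art supplies"  # hypothetical tag information
tag_words = tag_info.split()    # whitespace split stands in for word segmentation

naive = substring_match(tag_info, "start")       # "be[st art]" gives a spurious hit
segmented = segmented_match(tag_words, "start")  # no first tag word equals "start"
```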

After at least one first tag word is obtained, the website category corresponding to each first tag word can be determined, and from these the website category corresponding to the first tag information. Specifically, each first tag word is matched against the tag words in the tag classification dictionary, and the website categories corresponding to the matched tag words are gathered into an initial classification set corresponding to the first tag information; repeated website categories are then removed from the initial classification set, and the remaining website categories are determined to be the website categories corresponding to the first tag information. In one embodiment, the website categories corresponding to the first tag information may be collected as a suspected classification set of the website to be classified.

Suppose the first tag information obtained for the website to be classified is "Decathlon sports supermarket | professional sporting goods store monopoly". Segmenting this first tag information yields 7 first tag words: "Decathlon", "sports", "supermarket", "professional", "sporting goods", "store", and "monopoly". Matching each first tag word against the tag words in the tag classification dictionary determines that:

the website category corresponding to "sports" is: "sports";

the website categories corresponding to "supermarket" are: "shopping" and "business";

the website categories corresponding to "store" are: "shopping" and "business";

the other 4 words do not belong to any website category.

At this time, the initial classification set corresponding to the first tag information is { "sports", "shopping", "business", "shopping", "business" }. Removing the repeated website categories "shopping" and "business" from the initial classification set gives the website categories corresponding to the first tag information, that is, the suspected classification set of the website to be classified: { "sports", "shopping", "business" }.
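The matching and de-duplication described above can be sketched as follows. The dictionary contents follow the example; the names `TAG_DICT` and `suspected_categories` are hypothetical, and the brand-name word is left out of the word list because it matches no dictionary entry.

```python
# Hypothetical tag classification dictionary: tag word -> website categories.
TAG_DICT = {
    "sports": ["sports"],
    "supermarket": ["shopping", "business"],
    "store": ["shopping", "business"],
}


def suspected_categories(first_tag_words, tag_dict):
    """Return the de-duplicated categories matched by the tag words, in first-seen order."""
    initial = []  # initial classification set, may contain repeats
    for word in first_tag_words:
        initial.extend(tag_dict.get(word, []))
    return list(dict.fromkeys(initial))  # remove repeats, keep order


# Segmented first tag words from the example (brand name omitted).
words = ["sports", "supermarket", "professional", "sporting goods", "store", "monopoly"]
suspected = suspected_categories(words, TAG_DICT)
```

Here `suspected` holds "sports", "shopping", and "business", each once. An empty list would indicate that no tag word matched.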

S103: and determining the website category of the website to be classified according to the website classification dictionary corresponding to the determined website category and the first webpage content.

In one embodiment of the present invention, to reduce the number of words to be matched, invalid words are not added to the website classification dictionaries: the website classification dictionary corresponding to each website category comprises the valid words of that website category and a weight value for each valid word. Here, invalid words include webpage code that is not effective webpage content, script character sets, comment character sets, and the like.

In this case, when the website category of the website to be classified is determined, the text information of the first webpage content can be extracted and segmented to obtain at least one first valid word; a first weight value of each first valid word for each website category is then obtained according to each first valid word and the website classification dictionaries corresponding to the determined website categories; finally, the sum of the first weight values is calculated for each website category, and the website category with the largest sum of first weight values is taken as the website category of the website to be classified.

Continuing the example in S102, suppose the first valid words obtained from the first webpage content are X₁, X₂, X₃, X₄, and X₅. Matching each first valid word against the valid words in the website classification dictionaries corresponding to the "sports", "shopping", and "business" website categories determines the following first weight values:

"sports" website category: X₁ is 100; X₂ is 200; X₃ is 240; X₄ is 70; X₅ is 300;

"shopping" website category: X₁ is 400; X₂ is 300; X₃ is 500; X₄ is 1460; X₅ is 1330;

"business" website category: X₁ is 50; X₂ is 100; X₃ is 300; X₄ is 20; X₅ is 150;

according to the obtained first weight values, the sum of the first weight values calculated for each website category is as follows:

the sum of the weight values for the "sports" website category is: 910;

the sum of the weight values for the "shopping" website category is: 3990;

the sum of the weight values for the "business" website category is: 620;

at this time, the website category of the website to be classified may be determined as "shopping", which has the largest sum.
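The weight summation can be sketched as follows, using the weight values listed in the example. The names `SITE_DICTS` and `classify` are hypothetical, and the strings "X1" to "X5" stand in for the first valid words.

```python
# Hypothetical website classification dictionaries: category -> {valid word: weight}.
SITE_DICTS = {
    "sports":   {"X1": 100, "X2": 200, "X3": 240, "X4": 70,   "X5": 300},
    "shopping": {"X1": 400, "X2": 300, "X3": 500, "X4": 1460, "X5": 1330},
    "business": {"X1": 50,  "X2": 100, "X3": 300, "X4": 20,   "X5": 150},
}


def classify(valid_words, suspected, site_dicts):
    """Sum the weights of the valid words per suspected category and pick the largest sum."""
    totals = {
        category: sum(site_dicts[category].get(word, 0) for word in valid_words)
        for category in suspected
    }
    best = max(totals, key=totals.get)
    return best, totals


category, totals = classify(
    ["X1", "X2", "X3", "X4", "X5"],
    ["sports", "shopping", "business"],
    SITE_DICTS,
)
```

Words absent from a dictionary contribute a weight of 0 in this sketch, which is an assumption; the embodiment does not state how unmatched words are treated.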

In practical applications, some website categories are prone to cause misclassification when a website is matched against all website categories. For example, the "news" website category covers a wide variety of webpage types, which may include "shopping", "sports", "business", and "education". Another example is the "advertisement" website category, whose websites have particular characteristics of their own. In the embodiment of the invention, the website categories corresponding to the tag information, namely the suspected classification set, are first determined from the tag information, and the website category of the website to be classified is then determined according to the website classification dictionaries corresponding to the determined website categories and the first webpage content, so that website categories outside the suspected classification set cannot cause such misclassification.

In an embodiment of the present invention, it may happen that no website category can be determined from the tag information, so that the determined website category is empty, that is, the suspected classification set is empty. In this case, the website category of the website to be classified may be determined according to all the website classification dictionaries and the first webpage content.
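The choice between the narrowed set of dictionaries and the full set can be sketched as follows. The function name `dictionaries_to_match` and the sample dictionaries are hypothetical.

```python
def dictionaries_to_match(suspected, site_dicts):
    """Select the website classification dictionaries to match against the page content."""
    if not suspected:  # the tag information matched no website category
        return dict(site_dicts)  # fall back to every dictionary
    return {category: site_dicts[category] for category in suspected}


# Hypothetical dictionaries: category -> {valid word: weight}.
site_dicts = {"sports": {"ball": 5}, "shopping": {"cart": 4}, "news": {"report": 3}}

narrowed = dictionaries_to_match(["shopping"], site_dicts)
fallback = dictionaries_to_match([], site_dicts)
```

In the fallback case the classification costs as much as matching against all dictionaries, so the efficiency gain applies only when the tag information matches at least one category.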

In an embodiment of the present invention, so that websites can be classified, the tag classification dictionary and the website classification dictionaries need to be constructed before the first tag information and first webpage content of the website to be classified are acquired. Referring to fig. 2, the construction includes:

S201: configuring N initial website categories, wherein N is a positive integer;

here, the initial website categories may include "news", "sports", "finance", and so on. All website categories may be set as first-level categories, or may be subdivided into second-level and third-level categories. For example, "news" may be set as a first-level category with second-level categories such as "current events", "sports", and "shopping" under it; "finance" may be set as a first-level category with second-level categories such as "bank" and "securities" under it.
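A multi-level category configuration of this kind could be represented as a nested mapping. The following is only a sketch: the names `CATEGORY_TREE` and `category_paths` are hypothetical, and the embodiment does not prescribe any particular representation.

```python
# Hypothetical category tree following the examples in the text.
CATEGORY_TREE = {
    "news": {"current events": {}, "sports": {}, "shopping": {}},
    "finance": {"bank": {}, "securities": {}},
}


def category_paths(tree, prefix=()):
    """Yield every category as a tuple path, parents before children."""
    for name, children in tree.items():
        path = prefix + (name,)
        yield path
        yield from category_paths(children, path)


paths = list(category_paths(CATEGORY_TREE))
```

Each path, such as ("finance", "bank"), could then serve as the key of one initial website category.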

S202: acquiring second label information and second webpage content of at least one sample website corresponding to each initial website category;

specifically, the URL of at least one sample website corresponding to each initial website category is obtained, each sample website URL is accessed category by category with a web crawler tool, and the tag information and webpage content of the sample website are extracted from the content the sample website returns. Suppose the determined initial website categories are "sports" and "shopping". For the "sports" initial website category, the URLs of sports websites such as Sina Sports, Sohu Sports, and Tencent Sports may be obtained and accessed to acquire the tag information and webpage content corresponding to that category; for the "shopping" initial website category, the URLs of shopping websites such as Taobao, Vipshop, and Jumei Youpin may be obtained and accessed to acquire the tag information and webpage content corresponding to that category.

S203: for each initial website category, extracting second label words from second label information of each corresponding sample website, and correspondingly storing the second label words and the initial website category to a label classification dictionary;

the second tag information of each sample website is segmented, the words closely related to the initial website category of the sample website are extracted from the segmented words, and the extracted words are used as second tag words. Storing only these second tag words with the initial website category in the tag classification dictionary reduces the amount of data in the tag classification dictionary and can further improve the website classification speed. In the example above, for the "shopping" initial website category, second tag words such as "supermarket", "flagship store", and "store" may be extracted from the sample websites Taobao, Vipshop, and Jumei Youpin and stored in the tag classification dictionary in correspondence with "shopping".
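The construction of the tag classification dictionary can be sketched as follows. The embodiment does not state how "closely related" words are selected, so this sketch assumes a simple rule: a word is kept when it occurs in the tag information of at least `min_sites` sample websites of the category. The function name, the parameter, and the sample data are all hypothetical.

```python
from collections import Counter, defaultdict


def build_tag_dict(samples, min_sites=2):
    """samples: category -> list of segmented tag-word lists, one list per sample website.

    Assumption: a word is "closely related" to a category when it occurs in the
    tag information of at least `min_sites` sample websites of that category.
    """
    tag_dict = defaultdict(list)  # tag word -> website categories
    for category, sites in samples.items():
        site_counts = Counter(word for words in sites for word in set(words))
        for word, count in site_counts.items():
            if count >= min_sites:
                tag_dict[word].append(category)
    return dict(tag_dict)


# Hypothetical segmented tag information of sample websites.
samples = {
    "shopping": [
        ["brand-a", "supermarket", "store"],
        ["brand-b", "flagship store", "supermarket"],
        ["brand-c", "store", "flagship store"],
    ],
    "sports": [
        ["brand-d", "sports", "news"],
        ["brand-e", "sports", "live"],
    ],
}
tag_dict = build_tag_dict(samples)
```

Brand names and other words that appear on only one sample website are dropped, which keeps the tag classification dictionary small.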

S204: for each initial website category, segmenting the text information of the second webpage content of each corresponding sample website, removing invalid words to obtain at least one second valid word, and configuring a second weight value for each second valid word; and correspondingly storing each second valid word and each second weighted value to the website classification dictionary of the initial website category.

It should be noted that S204 may be executed before S203 or simultaneously with S203; the invention does not limit the order. The website classification dictionary may be in table form or text form. All the website classification dictionaries may be placed in one classification dictionary set, that is, in a single table or text; alternatively, each website classification dictionary may be stored separately, that is, each in its own table or text.
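The construction of one website classification dictionary can be sketched as follows. The embodiment does not state how the second weight values are configured, so this sketch assumes the weight of a valid word is its total occurrence count in the sample texts. The names `INVALID_WORDS` and `build_site_dict` and the sample data are hypothetical.

```python
from collections import Counter

# Hypothetical invalid words: residue of code, scripts, and comments.
INVALID_WORDS = {"var", "function", "<!--", "-->"}


def build_site_dict(sample_texts):
    """sample_texts: segmented page texts (lists of words) of one category's sample websites.

    Assumption: the second weight value of a valid word is its total occurrence count.
    """
    counts = Counter()
    for words in sample_texts:
        counts.update(word for word in words if word not in INVALID_WORDS)
    return dict(counts)


shopping_dict = build_site_dict([
    ["cart", "discount", "var", "cart"],
    ["discount", "checkout", "function", "cart"],
])

# All dictionaries kept in one classification dictionary set, keyed by category.
dictionary_set = {"shopping": shopping_dict}
```

Storing each dictionary separately would amount to writing each value of `dictionary_set` to its own table or text.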

The embodiment of the invention provides a website classification method. First tag information and first webpage content of a website to be classified are acquired, where the first tag information is small in volume and the first webpage content is large. The website categories to which the website may belong are first screened out from all website categories according to the first tag information and a preset tag classification dictionary; the website category of the website is then determined according to the first webpage content and the website classification dictionaries corresponding to the screened-out website categories. Because the webpage content is matched only against those dictionaries, the website classification efficiency is effectively improved.

Referring to fig. 3, fig. 3 is a schematic structural diagram of a website classification device according to an embodiment of the present invention, the device including:

a first obtaining unit 301, configured to obtain first tag information and first web content of a website to be classified, where the first tag information is a part of the first web content;

a first determining unit 302, configured to determine, according to a preset tag classification dictionary, a website category corresponding to the first tag information, where the tag classification dictionary includes: the corresponding relation between the label information and the website category;

a second determining unit 303, configured to determine a website category of the website to be classified according to the website classification dictionary corresponding to the determined website category and the first webpage content.

In an embodiment of the present invention, the first obtaining unit 301 is specifically configured to:

acquiring a Uniform Resource Locator (URL) of a website to be classified; and accessing the URL to acquire first label information and first webpage content of the website to be classified.

In one embodiment of the present invention, the tag information in the tag classification dictionary is a tag word;

in this case, the first determining unit 302 may include:

a first word segmentation subunit (not shown in fig. 3) configured to perform word segmentation on the first tag information to obtain at least one first tag word;

a first determining subunit (not shown in fig. 3) configured to determine, according to a preset tag classification dictionary, a website category corresponding to each first tag word;

a second determining subunit (not shown in fig. 3), configured to determine the website category corresponding to each first tag word as the website category corresponding to the first tag information.

In one embodiment of the present invention, the website classification dictionary of each website category includes valid words of the website category and a weight value of each valid word;

in this case, the second determining unit 303 may include:

a second word segmentation subunit (not shown in fig. 3) configured to segment the text information of the first web content to obtain at least one first valid word;

an obtaining subunit (not shown in fig. 3) configured to obtain, according to the website classification dictionary corresponding to the determined website category and each first valid word, a first weight value of each first valid word for each website category;

and a third determining subunit (not shown in fig. 3) configured to determine the website category with the largest sum of the first weight values as the website category of the website to be classified.

So that websites can be classified, an embodiment of the present invention further provides a device for constructing the classification dictionaries used by the website classification device. Referring to fig. 4, the device includes:

a configuration unit 401, configured to configure N initial website categories, where N is a positive integer;

a second obtaining unit 402, configured to obtain second tag information and second web page content of at least one sample website corresponding to each initial website category;

a first extracting unit 403, configured to, for each initial website category, extract a second tagged word from second tag information of each corresponding sample website, and store the second tagged word and the initial website category in the tag classification dictionary in a corresponding manner;

a second extracting unit 404, configured to, for each initial website category, perform word segmentation on text information of the second web content of each corresponding sample website, remove an invalid word, obtain at least one second valid word, and configure a second weight value for each second valid word; and correspondingly storing each second valid word and each second weighted value to the website classification dictionary of the initial website category.

In an embodiment of the present invention, the website classifying device may further include:

a third determining unit (not shown in fig. 3), configured to determine, if the determined website category is empty, a website category of the website to be classified according to all the website classification dictionaries and the first webpage content.
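How the units of the device cooperate can be sketched as a single class. This is a hypothetical composition: the class name, the method names, and the signature of the `acquire` callable are assumptions, and crawling and word segmentation are replaced by a stand-in function.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class WebsiteClassifier:
    # First acquiring unit: URL -> (first tag words, first valid words).
    acquire: Callable[[str], Tuple[List[str], List[str]]]
    tag_dict: Dict[str, List[str]]         # tag word -> website categories
    site_dicts: Dict[str, Dict[str, int]]  # category -> {valid word: weight}

    def suspected(self, tag_words: List[str]) -> List[str]:
        """First determining unit: de-duplicated categories matched by the tag words."""
        matched = [c for w in tag_words for c in self.tag_dict.get(w, [])]
        return list(dict.fromkeys(matched))

    def best(self, words: List[str], categories: List[str]) -> str:
        """Second and third determining units: category with the largest weight sum."""
        return max(
            categories,
            key=lambda c: sum(self.site_dicts[c].get(w, 0) for w in words),
        )

    def classify(self, url: str) -> str:
        tag_words, words = self.acquire(url)
        # An empty suspected set falls back to all website categories.
        categories = self.suspected(tag_words) or list(self.site_dicts)
        return self.best(words, categories)


def fake_acquire(url):
    # Stand-in for crawling and segmentation; returns (tag words, valid words).
    return ["sports", "supermarket"], ["ball", "cart", "cart"]


classifier = WebsiteClassifier(
    acquire=fake_acquire,
    tag_dict={"sports": ["sports"], "supermarket": ["shopping"]},
    site_dicts={"sports": {"ball": 5}, "shopping": {"cart": 4}, "news": {"cart": 100}},
)
result = classifier.classify("http://example.invalid/")
```

In this sample the "news" dictionary gives "cart" a high weight, but "news" is not in the suspected set, so it is never matched and the result is "shopping".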

The embodiment of the invention provides a website classification device. First tag information and first webpage content of a website to be classified are acquired, where the first tag information is small in volume and the first webpage content is large. The website categories to which the website may belong are first screened out from all website categories according to the first tag information and a preset tag classification dictionary; the website category of the website is then determined according to the first webpage content and the website classification dictionaries corresponding to the screened-out website categories. Because the webpage content is matched only against those dictionaries, the website classification efficiency is effectively improved.

Since the device embodiment is basically similar to the method embodiment, it is described only briefly; for relevant details, refer to the corresponding description of the method embodiment.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Those skilled in the art will appreciate that all or part of the steps in the above method embodiments may be implemented by a program to instruct relevant hardware to perform the steps, and the program may be stored in a computer-readable storage medium, which is referred to herein as a storage medium, such as: ROM/RAM, magnetic disk, optical disk, etc.

The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims

1. A method for classifying a website, the method comprising:

determining the website category corresponding to the first tag information according to a preset tag classification dictionary, wherein the tag classification dictionary comprises: the corresponding relation between the label information and the website category; the website category corresponding to the first label information is a suspected website category of the website to be classified;

2. The method of claim 1, wherein the obtaining the first label information and the first webpage content of the website to be classified comprises:

acquiring a Uniform Resource Locator (URL) of a website to be classified;

and accessing the URL to acquire first label information and first webpage content of the website to be classified.

3. The method of claim 1, wherein the label information in the label classification dictionary is a label word;

the determining the website category corresponding to the first tag information according to a preset tag classification dictionary includes:

performing word segmentation on the first label information to obtain at least one first label word;

determining the website category corresponding to each first label word according to a preset label classification dictionary;

and determining the website type corresponding to each first label word as the website category corresponding to the first label information.

4. The method of claim 1, wherein the website classification dictionary of each website category comprises valid words of the website category and a weight value of each valid word;

determining the website category of the website to be classified according to the website classification dictionary corresponding to the determined website category and the first webpage content, wherein the determining the website category of the website to be classified comprises the following steps:

segmenting words of the text information of the first webpage content to obtain at least one first effective word;

obtaining a first weight value of each first effective word aiming at each website category according to the website classification dictionary corresponding to the determined website category and each first effective word;

and determining the website category with the maximum sum of the first weight values as the website category of the websites to be classified.

5. The method of claim 1, wherein the label information in the label classification dictionary is a label word;

before the obtaining of the first label information and the first webpage content of the website to be classified, the method further includes:

configuring N initial website categories, wherein N is a positive integer;

acquiring second label information and second webpage content of at least one sample website corresponding to each initial website category;

for each initial website category, extracting second tag words from second tag information of each corresponding sample website, and correspondingly storing the second tag words and the initial website category to the tag classification dictionary;

for each initial website category, segmenting the text information of the second webpage content of each corresponding sample website, removing invalid words to obtain at least one second valid word, and configuring a second weight value for each second valid word; and correspondingly storing each second valid word and each second weighted value to the website classification dictionary of the initial website category.

6. The method of claim 1, wherein if the determined website category is empty, the method further comprises:

and determining the website category of the website to be classified according to all the website classification dictionaries and the first webpage content.

7. An apparatus for classifying a website, the apparatus comprising:

a first determining unit, configured to determine, according to a preset tag classification dictionary, a website category corresponding to the first tag information, where the tag classification dictionary includes: the corresponding relation between the label information and the website category; the website category corresponding to the first label information is a suspected website category of the website to be classified;

8. The apparatus according to claim 7, wherein the first obtaining unit is specifically configured to:

9. The apparatus of claim 7, wherein the label information in the label classification dictionary is a label word;

the first determination unit includes:

the first word segmentation subunit is used for segmenting the first label information to obtain at least one first label word;

the first determining subunit is used for determining the website category corresponding to each first label word according to a preset label classification dictionary;

and the second determining subunit is used for determining the website type corresponding to each first label word as the website category corresponding to the first label information.

10. The apparatus of claim 7, wherein the website classification dictionary of each website category comprises valid words of the website category and a weight value of each valid word;

the second determination unit includes:

the second word segmentation subunit is used for performing word segmentation on the text information of the first webpage content to obtain at least one first effective word;

the obtaining subunit is configured to obtain, according to the website classification dictionary corresponding to the determined website category and each first valid word, a first weight value of each first valid word for each website category;

and the third determining subunit is used for determining the website category with the maximum sum of the first weight values as the website category of the websites to be classified.

11. The apparatus according to any one of claims 7-10, further comprising:

the configuration unit is used for configuring N initial website categories, wherein N is a positive integer;

the second acquisition unit is used for acquiring second label information and second webpage content of at least one sample website corresponding to each initial website type;

the first extraction unit is used for extracting second label words from the second label information of each corresponding sample website for each initial website category and storing the second label words and the initial website category into the label classification dictionary in a corresponding mode;

the second extraction unit is used for segmenting the text information of the second webpage content of each corresponding sample website according to each initial website category, removing invalid terms, obtaining at least one second valid term, and configuring a second weight value for each second valid term; and correspondingly storing each second valid word and each second weighted value to the website classification dictionary of the initial website category.

12. The apparatus of claim 7, further comprising:

and the third determining unit is used for determining the website category of the website to be classified according to all the website classification dictionaries and the first webpage content if the determined website category is empty.