CN106339459B - Method for pre-classifying Chinese web pages based on keyword matching - Google Patents
Method for pre-classifying Chinese web pages based on keyword matching
- Publication number
- CN106339459B (application CN201610741134.8A)
- Authority
- CN
- China
- Prior art keywords
- keyword
- webpage
- tec
- key
- value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a method for pre-classifying Chinese web pages based on keyword matching. While the training set required by a classification algorithm is being built, each training web page is manually labeled with its category and, at the same time, the keywords that characterize the page are marked, producing a keyword table. For each test web page, the keywords occurring in the page are first extracted according to the keyword table; the labels of the training set are then transferred to the test page through keyword matching against the training set. If the pre-classification method fails to produce a classification result for a test web page, that page undergoes further classification computation. The method reduces the running time of computationally expensive classification techniques such as SVM, KNN, and naive Bayes, and also improves the precision and recall of the classification results.
Description
Technical field
The invention relates to information processing in the computer field, and in particular to a method for pre-classifying Chinese web pages based on keyword matching.
Background art
With the rapid development of the Internet, the amount of information stored as web pages continues to grow explosively, so classifying web page information has become one of the indispensable ways for people to obtain useful information. The mainstream classification algorithms at present are SVM, KNN, and naive Bayes. Of these, SVM needs only a small training set and classifies English web pages very well. However, the precision and recall of Chinese web page classification systems built around SVM fail to meet requirements. The reason is that English has natural word delimiters, whereas Chinese text must first be segmented with a Chinese word segmenter before the page text can be vectorized. Even the best Chinese word segmenter cannot segment with complete accuracy, which greatly degrades the results of Chinese web page classification.
Summary of the invention
In view of the above problems, the invention proposes a method for pre-classifying Chinese web pages based on keyword matching. The method can greatly reduce the running time of mainstream classification techniques while improving the precision and recall of the classification results.
The technical solution by which the invention solves the above technical problem is as follows:
A method for pre-classifying Chinese web pages based on keyword matching, comprising the following steps:
1) Label each training web page TR in the training set TRS with its category label TAG and with the keyword set KWS that characterizes the page's category, and generate the keyword table KWT;
2) According to KWT, extract the keywords contained in each test web page TE in the test set TES to form the keyword set TEK;
3) Compute each 2-tuple (keyword pair) TEC of TEK and traverse the training set; transfer the TAG of every TR whose KWS contains the TEC to the TE corresponding to that TEC, and store the TAG in the TE's tag set TAGS;
4) Count the frequency of the tags in TAGS and, as required, take the several most frequent tags as the pre-classification tags of the test web page.
On the basis of the above technical solution, the invention can also be improved as follows.
Further, step 1) also includes deduplicating the keyword sets KWS of all training web pages belonging to the same category and then generating the keyword table KWT.
Further, the specific steps for generating the keyword table KWT are as follows:
1-1) Create a new map M, use each keyword K in the KWS of the first training web page TR as a key of M, and set each corresponding initial value to 1;
1-2) For each keyword K in the KWS of the second TR, first check whether M already contains K; if it does, add 1 to the value of the key-value pair whose key is K; if it does not, add the key-value pair <K,1> to M;
1-3) Repeat step 1-2) for the remaining TRs until the last TR;
1-4) Set a threshold s; when the value of a key-value pair is less than s, set the value to 0; otherwise set it to 1.
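Steps 1-1) to 1-4) amount to counting, for each keyword, how many training pages list it and then reducing each count to 0 or 1. The following is a minimal Python sketch under stated assumptions: each KWS is available as a collection of strings, and the function name `build_keyword_table` and the sample keywords are hypothetical rather than taken from the patent.

```python
from collections import Counter

def build_keyword_table(keyword_sets, s):
    """Return the map M: keyword -> 1 (count >= s) or 0 (count < s).

    keyword_sets: one KWS (collection of keyword strings) per training page TR.
    s: the count threshold of step 1-4.
    """
    counts = Counter()
    for kws in keyword_sets:
        # Steps 1-1 to 1-3: a new keyword enters with value 1,
        # an existing keyword has its value increased by 1.
        # set() counts a keyword once per page (assumption: a KWS has no repeats).
        counts.update(set(kws))
    # Step 1-4: binarize the counts against the threshold s.
    return {k: (1 if c >= s else 0) for k, c in counts.items()}

if __name__ == "__main__":
    sample = [{"football", "league"}, {"football", "stock"}, {"football", "league"}]
    print(build_keyword_table(sample, s=2))
```

Keywords whose final value is 1 play the role of the "important" keywords used when keyword pairs are filtered later in the method.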
Further, the specific steps for computing the 2-tuples TEC of TEK in step 2) are as follows:
3-1-1) For a TEK containing N keywords K1, K2, ..., KN, look up keyword pairs in the following order, passing each pair to step 3-1-2) for checking: the pairs containing K1 are <K1,K2>, <K1,K3>, ..., <K1,KN>; the pairs containing K2 are <K2,K3>, ..., <K2,KN>; and so on up to the pair containing K(N-1), namely <K(N-1),KN>;
3-1-2) If at least one keyword of the current pair has the value 1 in M, traverse the training set TRS for this TEC; otherwise return to step 3-1-1) and look up the next keyword pair.
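The pair ordering in step 3-1-1) is the order in which `itertools.combinations` emits 2-element combinations of an ordered list, so the enumeration and the filter of step 3-1-2) can be sketched briefly. This is an illustrative sketch only; `candidate_pairs` is a hypothetical name, and M is assumed to be a plain dictionary mapping keywords to 0 or 1.

```python
from itertools import combinations

def candidate_pairs(tek, m):
    """Yield the keyword pairs <Ki, Kj> (i < j) of TEK that pass step 3-1-2.

    tek: ordered list of keywords K1..KN found in a test page.
    m:   map M from keyword to 1 (important) or 0 (secondary).
    """
    # combinations() yields (K1,K2), (K1,K3), ..., (K2,K3), ... in that order,
    # matching step 3-1-1.
    for ki, kj in combinations(tek, 2):
        # Step 3-1-2: keep the pair only if at least one keyword has value 1 in M.
        if m.get(ki, 0) == 1 or m.get(kj, 0) == 1:
            yield (ki, kj)

if __name__ == "__main__":
    m = {"K1": 0, "K2": 1, "K3": 0}
    print(list(candidate_pairs(["K1", "K2", "K3"], m)))
```

With N keywords there are N(N-1)/2 possible pairs, so the filter on M is what keeps the number of training-set traversals small.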
Further, in step 2), every 2-tuple (keyword pair) TEC of TEK contains at least one important keyword, an important keyword being a keyword with the highest frequency of occurrence.
Further, step 2) also includes counting the frequency of the keywords that occur in the test web pages. When the frequency of a secondary keyword exceeds the set threshold, it can be relabeled as an important keyword, so that more 2-tuples are obtained when the keyword 2-tuples of a test web page are computed, improving the precision and recall of the classification results.
Here, a "secondary keyword" is a keyword whose frequency was not very high when first counted but whose frequency may change as counting continues, so that it becomes a high-frequency keyword. An "important keyword" is a keyword with the highest frequency, at level 0. Keywords are initially divided into secondary and important by manual labeling, and a threshold is also set to distinguish them. For example, with a threshold of 0.2, a keyword labeled secondary becomes an important keyword once its counted frequency rises above 0.2.
Further, the process by which each TEC traverses the training set TRS in step 3) specifically includes:
3-2-1) If the KWS of a TR contains the first keyword of the TEC, go to step 3-2-2); otherwise, compute the next TEC of TEK and start traversing the training set again;
3-2-2) If the KWS of the TR contains at least one important keyword of the TEC, add the TAG of the TR to the tag set TAGS of the TE: if TAGS already contains the TAG, add 1 to the value of the key-value pair keyed by that TAG; otherwise add the key-value pair <TAG,1> to TAGS.
Further, step 4) also includes performing classification computation on test web pages TE that contain no keywords (that is, test web pages for which pre-classification failed).
The beneficial effects of the invention are:
1. While the category labels of the training set are manually annotated, the keywords characterizing each category are given as well. The frequency of all keywords is then counted and a threshold is set to divide the keywords into important and secondary ones, yielding the keyword table. Dividing keywords into two levels by frequency makes full use of keyword frequency information, so that the keyword 2-tuples of a test web page better reflect the properties of the page itself, improving the precision of Chinese web page classification.
2. For each test web page in the test set, the keyword table is first traversed to obtain the set of keywords the page contains. All 2-tuples of that keyword set are then computed, with the requirement that at least one keyword in each 2-tuple be important. The candidate tags obtained for the test page by matching against such keyword pairs are more accurate, which likewise improves the accuracy of the classification results.
3. Then, for each 2-tuple, the training set is traversed; if the keyword set of a training web page contains the 2-tuple, that page's tag is added to the tag set of the test web page. Finally, the frequency of the tags in the test page's tag set is counted and, as needed, the several most frequent tags are taken as the pre-classification tags of the test page. Reasonably assigning multiple tags to training web pages improves the precision and recall of the classification results.
4. Because the training set is small, the time consumed by the whole process grows linearly with the size of the test set. This greatly reduces the running time of Chinese web page classification while improving the precision and recall of the classification results.
Brief description of the drawings
Figure 1 is a structural diagram of the composition of the training set and the test set.
Figure 2 is a flowchart of the method for pre-classifying Chinese web pages based on keyword matching.
Detailed description
The principles and features of the invention are described below with reference to the drawings. The examples given serve only to explain the invention and are not intended to limit its scope.
A Chinese web page classification system has been implemented based on SVM. The training set and test set provided are TRS and TES respectively, as shown in Figure 1: TRS is decomposed into a number of TRs, and TRID is the number of each TR; TES is decomposed into a number of TEs, and TEID is the number of each TE; TECS is the set of 2-tuples TEC; KET is the keyword table, comprising each keyword KW and the value of its corresponding key-value pair.
The method for pre-classifying Chinese web pages based on keyword matching, that is, the steps for pre-classifying TES by means of TRS, is shown in Figure 2 and is as follows:
Step 1: For each training web page TR in the training set TRS, label its category label TAG and the keyword set KWS characterizing the page's category; repeat step 1 until the whole training set has been labeled;
Step 2: Deduplicate the KWS of all training web pages and store the result in the keyword table KWT; go to step 3;
Step 3: For each test web page TE in the test set TES, traverse KWT and find the keywords contained in TE to form the keyword set TEK. If TEK is not empty, go to step 4; if TEK is empty, go to step 8;
Step 4: Compute the first 2-tuple TEC (keyword pair) of TEK; go to step 5;
Step 5: For the TEC, traverse the training set TRS; if the KWS of a TR contains the TEC, transfer that TR's TAG to the TE, storing it in the TE's tag set TAGS while counting the number of occurrences of each tag. Go to step 6;
Step 6: Repeat steps 4 and 5 until the last 2-tuple of TEK; go to step 7;
Step 7: Sort the tags in TAGS in descending order of occurrence count and take the top n of TAGS (n is an integer; one or more may be taken as needed) as the pre-classification tags of the TE; go to step 8;
Step 8: Repeat steps 3 to 7 until pre-classification of the last TE is finished. Chinese web pages for which pre-classification failed enter the classification computation stage, and classification of the test set is complete once that computation finishes; otherwise pre-classification ends directly.
The specific steps in step 2 for deduplicating the KWS of all training web pages and storing them in the keyword table KWT are as follows:
Step 2.1: Create a new map M, use each keyword K in the KWS of the first training web page TR as a key of M, and set each corresponding value to 1;
Step 2.2: For each keyword K in the KWS of the second TR, first check whether M already contains K; if it does, add 1 to the value of the key-value pair whose key is K; if it does not, add the key-value pair <K,1> to M;
Step 2.3: Repeat step 2.2 for the remaining TRs until the last TR;
Step 2.4: Set a threshold s; when the value of a key-value pair is less than s, set the value to 0; otherwise set it to 1.
The specific steps in step 4 for computing the 2-tuples TEC (keyword pairs) of TEK are as follows:
Step 4.1: TEK contains N keywords K1, K2, ..., KN in total. Look up keyword pairs in the following order, passing each pair to step 4.2 for checking: the pairs containing K1 are <K1,K2>, <K1,K3>, ..., <K1,KN>; the pairs containing K2 are <K2,K3>, ..., <K2,KN>; and so on up to the pair containing K(N-1), namely <K(N-1),KN>;
Step 4.2: If at least one keyword of the current pair has the value 1 in M, go to step 5; otherwise return to step 4.1 and look up the next keyword pair.
The specific steps in step 5 for transferring the TR's TAG to the TE when the KWS of the TR contains the TEC, storing it in the TE's tag set TAGS, and counting tag occurrences are as follows:
Step 5.1: If the KWS of the TR contains the first keyword of the TEC, go to step 5.2; otherwise go to step 6;
Step 5.2: If the KWS of the TR contains the second keyword of the TEC, add the TR's TAG to the TE's tag set TAGS: if TAGS already contains the TAG, add 1 to the value of the key-value pair keyed by that TAG; otherwise add the key-value pair <TAG,1> to TAGS.
Step 5.2 requires only two keywords from the TE because the manually labeled keywords have been counted and a suitable threshold has been set to obtain the important keywords; if at least one important keyword of the TE appears in a TR, the TR's tag is considered transferable to the TE.
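Steps 5.1 and 5.2 can be read as a containment test of both keywords of a pair against each training page's keyword set, with a counter standing in for the `<TAG, count>` key-value pairs. The sketch below is one possible reading, not the patent's definitive procedure: it assumes that a failed test in step 5.1 simply moves on to the next training page, and `transfer_tags` and the `(tag, keyword_set)` tuple layout are hypothetical choices.

```python
from collections import Counter

def transfer_tags(pairs, training_set):
    """Accumulate TAGS for one test page.

    pairs:        iterable of keyword pairs (TEC) for the test page.
    training_set: list of (tag, keyword_set) tuples, one per training page TR.
    Returns a Counter mapping each transferred TAG to its occurrence count.
    """
    tags = Counter()
    for first, second in pairs:
        for tag, kws in training_set:
            # Step 5.1: the TR's KWS must contain the first keyword of the TEC.
            # Assumption: on failure we move on to the next TR.
            if first not in kws:
                continue
            # Step 5.2: it must also contain the second keyword; then the TAG
            # is added as <TAG, 1> or its existing count is increased by 1.
            if second in kws:
                tags[tag] += 1
    return tags

if __name__ == "__main__":
    trs = [("sports", {"football", "league", "goal"}),
           ("finance", {"stock", "market", "league"})]
    print(transfer_tags([("football", "league")], trs))
```

Storing each KWS as a Python `set` makes both membership tests constant-time on average, so the cost per test page is proportional to the number of pairs times the number of training pages.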
Example
The method is illustrated with 7 categories, a training set of 200 pages, and a test set of 1000 pages.
Each training web page in the training set is labeled with one tag, and the 3 keywords that best characterize its category are marked (as its keyword set); all of these are stored in memory.
The threshold for identifying important keywords is set to 0.2; after frequency counting, a keyword table containing 30 important keywords and 40 secondary keywords is obtained.
For each test web page in the test set, the keyword table is first traversed to find the keywords the page contains, 3 on average (excluding test pages that contain no keywords, for which pre-classification fails), so there are at most 3 keyword 2-tuples satisfying the condition.
For each 2-tuple, the training set is traversed, and the tags of the training web pages whose keyword sets contain the 2-tuple are added to the tag set of the test web page. Finally the tag set is sorted by frequency, and the two most frequent tags are taken as the tags of the test web page.
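The example above can be condensed into one function that goes from raw page text to the two most frequent tags. The following sketch rests on several assumptions that the patent does not spell out: keywords are located by plain substring search (no word segmentation), a page that yields no tags is treated as a pre-classification failure and returned as `None` for a downstream classifier such as SVM, and the name `preclassify` and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def preclassify(page_text, kwt, training_set, top_n=2):
    """Return up to top_n pre-classification tags, or None on failure.

    kwt:          keyword table, keyword -> 1 (important) or 0 (secondary).
    training_set: list of (tag, keyword_set) tuples for the training pages.
    """
    # Find the keywords of the table that occur in the page
    # (assumption: substring matching on the raw text).
    tek = [k for k in kwt if k in page_text]
    if not tek:
        return None  # no keywords: leave the page to the full classifier

    tags = Counter()
    for a, b in combinations(tek, 2):
        # Keep only pairs with at least one important keyword.
        if kwt[a] != 1 and kwt[b] != 1:
            continue
        for tag, kws in training_set:
            if a in kws and b in kws:
                tags[tag] += 1

    if not tags:
        return None  # assumption: no transferred tag also counts as failure
    # Sort by frequency and keep the top_n tags.
    return [tag for tag, _ in tags.most_common(top_n)]

if __name__ == "__main__":
    kwt = {"足球": 1, "联赛": 0, "股票": 1}
    trs = [("体育", {"足球", "联赛", "进球"}), ("财经", {"股票", "基金", "联赛"})]
    print(preclassify("今晚足球联赛开赛", kwt, trs))
```

Returning `None` rather than an empty list keeps the fallback decision explicit at the call site, which is where the slower classifier would be invoked.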
The method of the invention for pre-classifying Chinese web pages based on keyword matching was tested on real Chinese web pages. Compared with classification of Chinese web pages without pre-classification, the final results improved precision and recall by at least 10 and 15 percentage points respectively, bringing the classification performance of the whole Chinese web page classification system up to the expected level. The formulas are as follows:
Recall = number of relevant documents retrieved by the system / total number of relevant documents
Precision = number of relevant documents retrieved by the system / total number of documents retrieved by the system
Original method: Recall = 25%, Precision = 18%;
Method of the invention: Recall = 40%, Precision = 28%.
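The two formulas can be expressed as set operations over document identifiers. The sketch below is illustrative only; `precision_recall` is a hypothetical helper and the identifiers in the usage example are made up, not data from the patent's experiment.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) from two collections of document ids.

    retrieved: documents the system returned for a category.
    relevant:  documents that truly belong to the category.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    # Guard against empty sets to avoid division by zero.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

if __name__ == "__main__":
    p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
    print(f"precision={p:.2f} recall={r:.2f}")
```

Both measures share the same numerator, so a pre-classifier that adds correct tags raises recall, while precision also depends on how many incorrect tags it adds.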
The above are only preferred embodiments of the invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.
Claims (7)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201610741134.8A CN106339459B (en) | 2016-08-26 | 2016-08-26 | Method for pre-classifying Chinese web pages based on keyword matching |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201610741134.8A CN106339459B (en) | 2016-08-26 | 2016-08-26 | Method for pre-classifying Chinese web pages based on keyword matching |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN106339459A (en) | 2017-01-18 |
| CN106339459B (en) | 2019-11-26 |
Family
ID=57822407
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201610741134.8A Active CN106339459B (en) | 2016-08-26 | 2016-08-26 | Method for pre-classifying Chinese web pages based on keyword matching |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN106339459B (en) |
Families Citing this family (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN107545020A (en) * | 2017-05-10 | 2018-01-05 | 新华三信息安全技术有限公司 | A kind of determination method and device of Web page classifying |
| CN107506472B (en) * | 2017-09-05 | 2020-09-08 | 淮阴工学院 | Method for classifying browsed webpages of students |
| CN108874996B (en) * | 2018-06-13 | 2021-08-24 | 北京知道创宇信息技术股份有限公司 | Web site classification method and device |
| CN113377467B (en) * | 2021-06-29 | 2022-04-01 | 中国平安财产保险股份有限公司 | Information decoupling method and device, server and storage medium |
| CN113934848B (en) * | 2021-10-22 | 2023-04-07 | 马上消费金融股份有限公司 | Data classification method and device and electronic equipment |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101593200A (en) * | 2009-06-19 | 2009-12-02 | 淮海工学院 | Chinese Web Page Classification Method Based on Keyword Frequency Analysis |
| CN101814083A (en) * | 2010-01-08 | 2010-08-25 | 上海复歌信息科技有限公司 | Automatic webpage classification method and system |
| CN104424308A (en) * | 2013-09-04 | 2015-03-18 | 中兴通讯股份有限公司 | Web page classification standard acquisition method and device and web page classification method and device |
| CN105512143A (en) * | 2014-09-26 | 2016-04-20 | 中兴通讯股份有限公司 | Method and device for classifying web pages |
Family Cites Families (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US5778367A (en) * | 1995-12-14 | 1998-07-07 | Network Engineering Software, Inc. | Automated on-line information service and directory, particularly for the world wide web |
| US20060282416A1 (en) * | 2005-04-29 | 2006-12-14 | William Gross | Search apparatus and method for providing a collapsed search |
| US20090119276A1 (en) * | 2007-11-01 | 2009-05-07 | Antoine Sorel Neron | Method and Internet-based Search Engine System for Storing, Sorting, and Displaying Search Results |
-
2016
- 2016-08-26 CN CN201610741134.8A patent/CN106339459B/en active Active
Also Published As
| Publication number | Publication date |
|---|---|
| CN106339459A (en) | 2017-01-18 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN114817553B (en) | Knowledge graph construction method, knowledge graph construction system and computing device | |
| CN106649597B (en) | Method for auto constructing is indexed after a kind of books book based on book content | |
| CN107133213B (en) | A method and system for automatic extraction of text summaries based on algorithm | |
| CN107609121B (en) | News text classification method based on LDA and word2vec algorithm | |
| CN102591988B (en) | Short text classification method based on semantic graphs | |
| CN104699763B (en) | The text similarity gauging system of multiple features fusion | |
| CN102542014B (en) | Image searching feedback method based on contents | |
| CN113011533A (en) | Text classification method and device, computer equipment and storage medium | |
| CN101872351B (en) | Method, device for identifying synonyms, and method and device for searching by using same | |
| CN106339459B (en) | Method for pre-classifying Chinese web pages based on keyword matching | |
| CN103218444B (en) | Based on semantic method of Tibetan language webpage text classification | |
| CN108255813B (en) | Text matching method based on word frequency-inverse document and CRF | |
| CN108829780B (en) | Text detection method and device, computing equipment and computer readable storage medium | |
| CN108537240A (en) | Commodity image semanteme marking method based on domain body | |
| CN107180026B (en) | A method and device for learning event phrases based on word embedding semantic mapping | |
| US20140032207A1 (en) | Information Classification Based on Product Recognition | |
| CN103617157A (en) | Text similarity calculation method based on semantics | |
| CN104834735A (en) | A method for automatic extraction of document summaries based on word vectors | |
| CN110688836A (en) | An automatic construction method of domain dictionary based on supervised learning | |
| CN103744984B (en) | Method of retrieving documents by semantic information | |
| CN103838833A (en) | Full-text retrieval system based on semantic analysis of relevant words | |
| CN105469096A (en) | Feature bag image retrieval method based on Hash binary code | |
| CN106547864B (en) | A Personalized Information Retrieval Method Based on Query Expansion | |
| CN110633365A (en) | A hierarchical multi-label text classification method and system based on word vectors | |
| CN103559193B (en) | A kind of based on the theme modeling method selecting unit |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | C06 | Publication | |
| | PB01 | Publication | |
| | C10 | Entry into substantive examination | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |