
CN109993216A - A text classification method based on K nearest neighbors KNN and its equipment - Google Patents


Info

Publication number
CN109993216A
CN109993216A, CN201910178920.5A, CN201910178920A
Authority
CN
China
Prior art keywords
word
character string
string vector
text
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910178920.5A
Other languages
Chinese (zh)
Other versions
CN109993216B (en)
Inventor
Chen Haibo (陈海波)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Deep Blue Technology Shanghai Co Ltd
Original Assignee
Deep Blue Technology Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Deep Blue Technology Shanghai Co Ltd
Priority to CN201910178920.5A
Publication of CN109993216A
Application granted
Publication of CN109993216B
Legal status: Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F 18/24133 Distances to prototypes
    • G06F 18/24143 Distances to neighbourhood prototypes, e.g. restricted Coulomb energy networks [RCEN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text classification method and device based on K-nearest neighbors (KNN), intended to reduce the amount of computation in text classification, represent text feature information more effectively, and improve classification accuracy. The method includes: decomposing a text into words and extracting from them the words that represent the feature information of the text; encoding the text as a string vector using the extracted words; computing, with a KNN model, the similarity between the string vector and the sample string vectors in the KNN model; and determining and outputting the classification label of the string vector according to the similarity and the classification labels corresponding to the sample string vectors.

Description

A text classification method based on K-nearest neighbors (KNN), and a device therefor

Technical Field

The present invention relates to the field of artificial intelligence, and in particular to a text classification method based on K-nearest neighbors (KNN), and a device therefor.

Background Art

At present, text classification automatically labels a collection of texts according to a given classification scheme or standard; it is a form of automatic categorization based on a classification scheme. The text classification process can be understood as matching the data to be classified against sample data according to certain features of that data. In general, there are two ways to extract features from text data and classify it, as follows:

In the first, the feature information in the text data is encoded as a numeric vector, the similarity between this numeric vector and sample numeric vectors is computed, and the classification result of the corresponding text data is determined from the resulting similarity.

In this approach, however, the numeric vector is high-dimensional, with at least several hundred dimensions, so computing similarities with it tends to be computationally expensive. Moreover, because of the high dimensionality, the numeric vector is sparsely distributed and poorly interpretable, which reduces the accuracy of text classification.

In the second, the feature information in the text data is encoded in a structured form, for example a feature information table; a table-matching algorithm computes the similarity between this table and sample feature information tables, and the classification result of the corresponding text data is determined from the resulting similarity.

In this approach, however, the computational performance of the table-matching algorithm is not sufficiently robust to noise.

Summary of the Invention

The present invention provides a KNN-based text classification method and device. Text feature information is extracted by encoding the text as a string vector, and similarities are computed between string vectors. This reduces the amount of computation and alleviates the sparse distribution of the extracted feature information; moreover, string vectors are more symbolic and more transparent, represent text feature information more effectively, and thereby help improve the accuracy of text classification.

In a first aspect, the present invention provides a text classification method based on K-nearest neighbors (KNN), the method comprising:

decomposing a text into words, and extracting from these words the words that represent the feature information of the text;

encoding the text as a string vector using the extracted words; and

computing, with a KNN model, the similarity between the string vector and the sample string vectors in the KNN model, and determining and outputting the classification label of the string vector according to the similarity and the classification labels corresponding to the sample string vectors.

In an optional embodiment, computing, with the KNN model, the similarity between the string vector and the sample string vectors in the KNN model comprises:

computing, with the KNN model, the similarity between the string vector and the sample string vectors in the KNN model using a cosine similarity algorithm between vectors; or

computing, with the KNN model, the similarity between the string vector and the sample string vectors in the KNN model using a similarity matrix algorithm that obtains the similarity between string vectors from the similarities between their words.
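The two similarity options above can be sketched as follows, treating a string vector as the list of its component words. This is a minimal illustration rather than the patent's implementation; `word_sim` in the second variant is a hypothetical word-level similarity function, since the claim does not fix one.

```python
from collections import Counter
from math import sqrt

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two string vectors, treating each
    vector as a bag of its component words."""
    a, b = Counter(vec_a), Counter(vec_b)
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def matrix_similarity(vec_a, vec_b, word_sim):
    """Similarity-matrix variant: for each word of vec_a, take its best
    word-to-word similarity against vec_b, then average the scores."""
    scores = [max(word_sim(wa, wb) for wb in vec_b) for wa in vec_a]
    return sum(scores) / len(scores)
```

With an exact-match `word_sim`, the matrix variant reduces to the fraction of components of one vector that also occur in the other.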

In an optional embodiment, the method further comprises:

obtaining training samples comprising a plurality of string vectors, together with the classification labels corresponding to the string vectors in the training samples;

initializing the model parameters of the KNN model and inputting the training samples into the KNN model;

computing, with the KNN model, the similarity between each string vector and the sample string vectors in the KNN model using a cosine similarity algorithm between vectors; and

adjusting the current model parameters of the KNN model according to the output classification labels of the string vectors in the training samples and the classification labels obtained during training, until a preset condition is met.
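The claim leaves open which model parameters are adjusted and what the preset condition is. One plausible reading, sketched below with hypothetical helper names, is tuning the neighbourhood size K until held-out accuracy stops improving; this is an illustration only, not the patent's prescribed procedure.

```python
from collections import Counter

def tune_k(train, held_out, similarity, k_values=(1, 3, 5, 7)):
    """train / held_out: lists of (string_vector, label) pairs.
    For each candidate K, classify every held-out vector by majority
    vote over the K most similar training samples, and keep the K
    with the best accuracy (one plausible 'preset condition')."""
    def predict(vec, k):
        top = sorted(train, key=lambda s: similarity(vec, s[0]), reverse=True)[:k]
        return Counter(label for _, label in top).most_common(1)[0][0]

    best_k, best_acc = None, -1.0
    for k in k_values:
        acc = sum(predict(v, k) == y for v, y in held_out) / len(held_out)
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k, best_acc
```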

In an optional embodiment, the method comprises:

decomposing the text into a long character string based on the words in a corpus on a server; and

dividing the long character string into paragraphs by a text segmentation method, extracting the characters belonging to word stems in each paragraph, and deleting characters with no extraction value.

In an optional embodiment, extracting from the words the words that represent the feature information of the text comprises:

extracting words representing the feature information of the text from the words according to their occurrence frequency, grammatical attributes, and positional distribution, respectively.

In an optional embodiment, extracting words representing the feature information of the text from the words according to their occurrence frequency, grammatical attributes, and positional distribution comprises:

extracting at least one word in descending order of occurrence frequency;

extracting, according to the grammatical attributes of the words, at least one word in descending order of term frequency-inverse document frequency (TF-IDF) weight; and

extracting, according to the positional distribution of the words, the words located in set paragraphs.

In an optional embodiment, the set paragraphs are the first paragraph and/or the last paragraph.

In an optional embodiment, extracting the words located in set paragraphs according to the positional distribution of the words comprises:

extracting the last word of the last paragraph, the first word of the first paragraph, the last word of the first paragraph, and the first word of the last paragraph.

In an optional embodiment, determining the classification label of the string vector according to the similarity and the classification labels corresponding to the sample string vectors comprises:

selecting a preset number of sample string vectors in descending order of similarity to the string vector;

computing a sampling probability for each of the preset number of sample string vectors; and

determining the classification label of the string vector from the classification label corresponding to the sample string vector selected according to the sampling probabilities.
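The claims do not specify how the sampling probabilities are computed; a minimal sketch, assuming probabilities proportional to similarity, might look like this (`classify_by_sampling` is a hypothetical name):

```python
import random

def classify_by_sampling(neighbors, k, seed=0):
    """neighbors: list of (similarity, label) pairs.
    Keep the k most similar samples, turn their similarities into
    sampling probabilities, sample one sample, and return its label."""
    top = sorted(neighbors, key=lambda t: t[0], reverse=True)[:k]
    total = sum(sim for sim, _ in top)
    rng = random.Random(seed)
    r, acc = rng.uniform(0, total), 0.0
    for sim, label in top:
        acc += sim
        if r <= acc:
            return label
    return top[-1][1]
```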

In a second aspect, the present invention provides a text classification device based on K-nearest neighbors (KNN), the device comprising a processor and a memory, wherein the memory stores program code which, when executed by the processor, causes the processor to:

decompose a text into words, and extract from these words the words that represent the feature information of the text;

encode the text as a string vector using the extracted words; and

compute, with a KNN model, the similarity between the string vector and the sample string vectors in the KNN model, and determine and output the classification label of the string vector according to the similarity and the classification labels corresponding to the sample string vectors.

In an optional embodiment, the processor is specifically configured to:

compute, with the KNN model, the similarity between the string vector and the sample string vectors in the KNN model using a cosine similarity algorithm between vectors; or

compute, with the KNN model, the similarity between the string vector and the sample string vectors in the KNN model using a similarity matrix algorithm that obtains the similarity between string vectors from the similarities between their words.

In an optional embodiment, the processor is further configured to:

obtain training samples comprising a plurality of string vectors, together with the classification labels corresponding to the string vectors in the training samples;

initialize the model parameters of the KNN model and input the training samples into the KNN model;

compute, with the KNN model, the similarity between each string vector and the sample string vectors in the KNN model using a cosine similarity algorithm between vectors; and

adjust the current model parameters of the KNN model according to the output classification labels of the string vectors in the training samples and the classification labels obtained during training, until a preset condition is met.

In an optional embodiment, the processor is specifically configured to:

decompose the text into a long character string based on the words in a corpus on a server; and

divide the long character string into paragraphs by a text segmentation method, extract the characters belonging to word stems in each paragraph, and delete characters with no extraction value.

In an optional embodiment, the processor is specifically configured to:

extract words representing the feature information of the text from the words according to their occurrence frequency, grammatical attributes, and positional distribution, respectively.

In an optional embodiment, the processor is specifically configured to:

extract at least one word in descending order of occurrence frequency;

extract, according to the grammatical attributes of the words, at least one word in descending order of term frequency-inverse document frequency (TF-IDF) weight; and

extract, according to the positional distribution of the words, the words located in set paragraphs.

In an optional embodiment, the set paragraphs are the first paragraph and/or the last paragraph.

In an optional embodiment, the processor is specifically configured to:

extract the last word of the last paragraph, the first word of the first paragraph, the last word of the first paragraph, and the first word of the last paragraph.

In an optional embodiment, the processor is specifically configured to:

select a preset number of sample string vectors in descending order of similarity to the string vector;

compute a sampling probability for each of the preset number of sample string vectors; and

determine the classification label of the string vector from the classification label corresponding to the sample string vector selected according to the sampling probabilities.

In a third aspect, the present invention provides another text classification device based on K-nearest neighbors (KNN), the device comprising a decomposition module, an encoding module, and a classification module, wherein:

the decomposition module is configured to decompose a text into words and to extract from these words the words that represent the feature information of the text;

the encoding module is configured to encode the text as a string vector using the extracted words; and

the classification module is configured to compute, with a KNN model, the similarity between the string vector and the sample string vectors in the KNN model, and to determine and output the classification label of the string vector according to the similarity and the classification labels corresponding to the sample string vectors.

In an optional embodiment, the classification module is specifically configured to:

compute, with the KNN model, the similarity between the string vector and the sample string vectors in the KNN model using a cosine similarity algorithm between vectors; or

compute, with the KNN model, the similarity between the string vector and the sample string vectors in the KNN model using a similarity matrix algorithm that obtains the similarity between string vectors from the similarities between their words.

In an optional embodiment, the device is further configured to:

obtain training samples comprising a plurality of string vectors, together with the classification labels corresponding to the string vectors in the training samples;

initialize the model parameters of the KNN model and input the training samples into the KNN model;

compute, with the KNN model, the similarity between each string vector and the sample string vectors in the KNN model using a cosine similarity algorithm between vectors; and

adjust the current model parameters of the KNN model according to the output classification labels of the string vectors in the training samples and the classification labels obtained during training, until a preset condition is met.

In an optional embodiment, the decomposition module is specifically configured to:

decompose the text into a long character string based on the words in a corpus on a server; and

divide the long character string into paragraphs by a text segmentation method, extract the characters belonging to word stems in each paragraph, and delete characters with no extraction value.

In an optional embodiment, the decomposition module is specifically configured to:

extract words representing the feature information of the text from the words according to their occurrence frequency, grammatical attributes, and positional distribution, respectively.

In an optional embodiment, the decomposition module is specifically configured to:

extract at least one word in descending order of occurrence frequency;

extract, according to the grammatical attributes of the words, at least one word in descending order of term frequency-inverse document frequency (TF-IDF) weight; and

extract, according to the positional distribution of the words, the words located in set paragraphs.

In an optional embodiment, the set paragraphs are the first paragraph and/or the last paragraph.

In an optional embodiment, the decomposition module is specifically configured to:

extract the last word of the last paragraph, the first word of the first paragraph, the last word of the first paragraph, and the first word of the last paragraph.

In an optional embodiment, the classification module is specifically configured to:

select a preset number of sample string vectors in descending order of similarity to the string vector;

compute a sampling probability for each of the preset number of sample string vectors; and

determine the classification label of the string vector from the classification label corresponding to the sample string vector selected according to the sampling probabilities.

In a fourth aspect, the present invention provides a computer storage medium on which a computer program is stored; when the program is executed by a processor, the steps of the method of the first aspect are implemented.

The KNN-based text classification method and device provided by the present invention have the following beneficial effects:

The text is encoded as a string vector in order to extract text feature information, and the similarity between the text's string vector and sample string vectors is computed. Because the string vectors of the present invention have few dimensions, the amount of computation needed for the similarity calculation is reduced, and the sparse distribution of the extracted feature information is alleviated. Moreover, string vectors are more symbolic and more transparent, represent text feature information more effectively, and thereby help improve the accuracy of text classification.

Brief Description of the Drawings

FIG. 1 is a flowchart of a KNN-based text classification method provided by an embodiment of the present invention;

FIG. 2 is a detailed flowchart of a KNN-based text classification method provided by an embodiment of the present invention;

FIG. 3 is a diagram of a KNN-based text classification device provided by an embodiment of the present invention;

FIG. 4 is a diagram of another KNN-based text classification device provided by an embodiment of the present invention.

Detailed Description of the Embodiments

To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

Embodiment 1

The traditional KNN text classification method encodes the text as a numeric vector before feeding it into the KNN model. However, this numeric encoding suffers from high vector dimensionality and sparsely distributed text feature information, so classifying such numeric vectors with KNN yields results of low accuracy. The embodiment of the present invention therefore provides a KNN-based text classification method that encodes the text as a string vector before inputting it into the KNN model, which produces text classification results quickly and effectively.

As shown in FIG. 1, the specific implementation steps are as follows:

Step 10: decompose the text into words, and extract from these words the words that represent the feature information of the text.

The words may be English words, Chinese words, or any single tokens that can be processed by a computer. The text is decomposed into words as follows:

the text is decomposed into a long character string based on the words in a corpus on a server; and

the long character string is divided into paragraphs by a text segmentation method, the characters belonging to word stems in each paragraph are extracted, and the characters with no feature extraction value are deleted.

The corpus stores a large number of words and the characters corresponding to each word. The text is matched against the words in the corpus to obtain the words of the text that appear in the corpus, and the characters corresponding to these words are concatenated, in their order of appearance in the text, into one long character string. The long string is divided into paragraphs by a text segmentation method, and the characters belonging to word stems are extracted from each paragraph according to stemming rules. To improve the efficiency of extracting the words that represent the text's features, characters with no extraction value, such as prepositions, conjunctions, and pronouns, are deleted. The word stems may include, but are not limited to, verbs, nouns, and adjectives; after this decomposition, the text is reduced to a character string composed of verb, noun, and adjective characters. The text segmentation method is known in the prior art and is not detailed here.
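As an illustration of this decomposition for English text, the following sketch uses a small hard-coded stop list and a crude suffix-stripping stemmer in place of the server-side corpus and stemming rules, which the patent does not spell out:

```python
import re

# Hypothetical mini stop list; the patent assumes a large server-side
# corpus, which this set merely stands in for.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "it", "is"}

def decompose(text):
    """Lowercase, split into word tokens, drop stop words (characters
    'with no extraction value'), and apply a crude suffix-stripping stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = []
    for tok in tokens:
        if tok in STOP_WORDS:
            continue
        for suffix in ("ing", "ed", "s"):
            if tok.endswith(suffix) and len(tok) > len(suffix) + 2:
                tok = tok[: -len(suffix)]
                break
        stems.append(tok)
    return stems
```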

Words representing the feature information of the text are then extracted; the feature information may include, but is not limited to, the occurrence frequency, grammatical attributes, and positional distribution of the words.

Extracting words representing the feature information of the text according to their occurrence frequency, grammatical attributes, and positional distribution comprises the following:

at least one word is extracted in descending order of occurrence frequency.

Specifically, the word with the highest occurrence frequency, the word with the second highest occurrence frequency, and the word with the third highest occurrence frequency may be extracted; there may be one word, or several, at each of these frequency ranks.

According to the grammatical attributes of the words, at least one word is extracted in descending order of term frequency-inverse document frequency (TF-IDF) weight.

TF-IDF is a commonly used weighting statistic; the higher a word's TF-IDF weight, the better it discriminates between categories, making it well suited to text classification. Specifically, the words with the highest, second highest, and third highest TF-IDF weights may be extracted, and there may be one word, or several, at each of these ranks.
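A minimal TF-IDF ranking over a toy corpus can be computed as follows; the smoothing used here is one common convention, not something the patent prescribes:

```python
from math import log

def tf_idf_rank(doc_words, corpus_docs):
    """Rank the words of one document by TF-IDF weight.
    doc_words: list of words in the document;
    corpus_docs: list of word lists, one per corpus document."""
    n = len(corpus_docs)
    weights = {}
    for w in set(doc_words):
        tf = doc_words.count(w) / len(doc_words)            # term frequency
        df = sum(1 for d in corpus_docs if w in d)          # document frequency
        idf = log((1 + n) / (1 + df)) + 1                   # smoothed IDF
        weights[w] = tf * idf
    return sorted(weights, key=weights.get, reverse=True)
```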

根据所述单词的位置分布,提取分布在设定段落的单词。所述设定段落可以是第一段落,或者是最后段落,或者是第一段落和最后段落。According to the position distribution of the words, the words distributed in the set paragraph are extracted. The set paragraph may be the first paragraph, or the last paragraph, or the first paragraph and the last paragraph.

具体的,根据所述单词的位置分布,提取分布在设定段落的单词,包括:提取分布在最后段落的最后一个单词、第一段落的第一个单词、第一段落的最后一个单词以及最后段落的第一个单词。Specifically, extracting, according to the position distribution of the words, the words distributed in the set paragraph includes: extracting the last word of the last paragraph, the first word of the first paragraph, the last word of the first paragraph, and the first word of the last paragraph.

步骤11:利用所述提取的单词,将所述文本编码为字符串向量;Step 11: using the extracted words to encode the text into a string vector;

具体的,利用所述提取的单词,编码的字符串向量由以下字符组成:Specifically, using the extracted words, the encoded string vector consists of the following characters:

所述单词中出现频率最高的单词、所述单词中出现频率第二高的单词、所述单词中出现频率第三高的单词、所述单词中TF-IDF加权值最高的单词、所述单词中TF-IDF加权值第二高的单词、所述单词中TF-IDF加权值第三高的单词、分布在最后段落的最后一个单词、第一段落的第一个单词、第一段落的最后一个单词以及最后段落的第一个单词,根据上述10个字符组成十维字符串向量。The word with the highest frequency among the words, the word with the second highest frequency, the word with the third highest frequency, the word with the highest TF-IDF weight, the word with the second highest TF-IDF weight, the word with the third highest TF-IDF weight, the last word of the last paragraph, the first word of the first paragraph, the last word of the first paragraph, and the first word of the last paragraph: these ten characters form a ten-dimensional string vector.
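The ten-dimensional string vector described above can be sketched as follows; `paragraphs` (a list of per-paragraph word lists) and `tfidf_scores` (a precomputed word-to-weight mapping) are hypothetical inputs assumed for illustration:

```python
from collections import Counter

def encode_string_vector(paragraphs, tfidf_scores):
    """Build the ten-component string vector: the three most frequent words,
    the three words with the highest TF-IDF weights, and the four boundary
    words of the first and last paragraphs."""
    words = [w for p in paragraphs for w in p]
    freq_top = [w for w, _ in Counter(words).most_common(3)]
    tfidf_top = sorted(set(words), key=lambda w: tfidf_scores.get(w, 0.0),
                       reverse=True)[:3]
    first, last = paragraphs[0], paragraphs[-1]
    position_words = [last[-1], first[0], first[-1], last[0]]
    return freq_top + tfidf_top + position_words  # ten components in total
```

The low, fixed dimensionality of this vector is what keeps the later similarity computations cheap.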

步骤12:利用所述KNN模型计算所述字符串向量与KNN模型中的样本字符串向量之间的相似度,根据所述相似度以及所述样本字符串向量对应的分类标签,确定所述字符串向量的分类标签并输出。Step 12: Calculate the similarity between the string vector and the sample string vectors in the KNN model by using the KNN model, determine the classification label of the string vector according to the similarity and the classification labels corresponding to the sample string vectors, and output it.

具体的,利用所述KNN模型计算所述字符串向量与KNN模型中的样本字符串向量之间的相似度,包括以下任一种方式:Specifically, using the KNN model to calculate the similarity between the character string vector and the sample character string vector in the KNN model, including any of the following methods:

方式一:利用所述KNN模型,采用向量之间的余弦相似性算法,计算所述字符串向量与KNN模型中的样本字符串向量之间的相似度。Method 1: Using the KNN model, the cosine similarity algorithm between vectors is used to calculate the similarity between the string vector and the sample string vectors in the KNN model.

具体的,本发明实施例中的字符串向量是一种有限的字符串,可以根据用户需求定义不同的字符串的长度,字符串向量之间的相似度计算可以采用向量之间的余弦相似性算法,该余弦相似性算法的计算公式如下:Specifically, the string vector in the embodiment of the present invention is a finite string, and strings of different lengths can be defined according to user requirements; the similarity between string vectors can be computed with the cosine similarity algorithm between vectors, whose formula is as follows:

sim(str_1, str_2) = ( Σ_{i=1..d} d_{1i} × d_{2i} ) / ( √(Σ_{i=1..d} d_{1i}²) × √(Σ_{i=1..d} d_{2i}²) )

其中,上述公式中str_i是字符串向量,d表示字符串向量中字符的个数,d_{1i}表示文本标识是d_1的字符串向量str_1中第i个字符,d_{2i}表示文本标识是d_2的字符串向量str_2中第i个字符。sim(str_1, str_2)为字符串向量str_1和字符串向量str_2之间的余弦相似性。上述文本标识d_1可以标识的是待分类的文本,上述文本标识d_2可以标识的是KNN模型中样本字符串向量对应的文本。In the above formula, str_i is a string vector, d denotes the number of characters in a string vector, d_{1i} denotes the i-th character of the string vector str_1 whose text identifier is d_1, and d_{2i} denotes the i-th character of the string vector str_2 whose text identifier is d_2. sim(str_1, str_2) is the cosine similarity between the string vectors str_1 and str_2. The text identifier d_1 may identify the text to be classified, and the text identifier d_2 may identify the text corresponding to a sample string vector in the KNN model.
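One way to realize the cosine computation above is to first map the word components of each string vector to bag-of-words counts; this numeric reading of the formula is an assumption, since the text leaves the character-to-number mapping open:

```python
import math
from collections import Counter

def cosine_similarity(vec1, vec2):
    """Cosine similarity between two string vectors, with the word
    components mapped to bag-of-words counts so that the standard
    dot-product formula applies."""
    c1, c2 = Counter(vec1), Counter(vec2)
    dot = sum(c1[w] * c2[w] for w in c1)
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)
```

Identical vectors score 1, vectors sharing no word score 0, matching the quantization range stated later for the similarity matrix.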

方式二:利用所述KNN模型,采用通过计算单词间相似度获得字符串向量间相似度的相似矩阵算法,计算所述字符串向量与KNN模型中的样本字符串向量之间的相似度。Method 2: Using the KNN model, a similarity matrix algorithm that obtains the similarity between string vectors by calculating the similarity between words is used to calculate the similarity between the string vector and the sample string vector in the KNN model.

具体的,首先根据服务器获取的语料库能够构建相似矩阵,所述相似矩阵中每一行及每一列中的每一项都对应所述语料库中的两个单词,其中,所述相似矩阵中的每一项内容表示所述语料库中任意两个单词之间的相似度。Specifically, a similarity matrix can first be constructed from the corpus obtained by the server; each entry in each row and each column of the similarity matrix corresponds to two words in the corpus, wherein the content of each entry in the similarity matrix represents the similarity between the corresponding two words of the corpus.

假设所述语料库中包含N个文本,任意选取两个文本,并从所述两个文本中分别选取一个单词,计算两个文本中选取的两个单词之间的相似性,将所述两个单词之间的相似性量化为0和1之间的量化值,如果所述两个文本完全相同,则计算出的相似矩阵中每一项的相似度为1,如果两个文档完全不同,则计算出的相似矩阵中每一项的相似度为0。Assuming the corpus contains N texts, two texts are selected arbitrarily, one word is selected from each of the two texts, and the similarity between the two selected words is calculated and quantified as a value between 0 and 1. If the two texts are exactly the same, the similarity of each entry in the computed similarity matrix is 1; if the two documents are completely different, the similarity of each entry in the computed similarity matrix is 0.

其中,计算上述两个单词之间的相似度的公式如下:Among them, the formula for calculating the similarity between the above two words is as follows:

其中,上述公式中的T_i和T_j表示所述语料库中两个不同的文本,t_i为文本T_i中的单词,t_j为文本T_j中的单词,所述T_i和T_j中单词的数量都为N,则有0<i,j≤N,且i,j均为整数。In the above formula, T_i and T_j denote two different texts in the corpus, t_i is a word in the text T_i, and t_j is a word in the text T_j; the number of words in each of T_i and T_j is N, so 0 < i, j ≤ N, where both i and j are integers.

在得到上述两个文本中任意两个单词之间相似度后,可以基于得到的相似度,计算所述字符串向量与KNN模型中的样本字符串向量之间的相似度。After obtaining the similarity between any two words in the above two texts, the similarity between the character string vector and the sample character string vector in the KNN model can be calculated based on the obtained similarity.
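Construction of the similarity matrix can be sketched as below; `word_sim` stands in for the word-similarity formula above (not reproduced here), so any function returning a value in [0, 1] can be plugged in:

```python
def build_similarity_matrix(vocab, word_sim):
    """Build the word-to-word similarity matrix over a vocabulary: entry
    (i, j) holds the similarity between vocab[i] and vocab[j], with 1.0 on
    the diagonal for identical words."""
    n = len(vocab)
    matrix = [[0.0] * n for _ in range(n)]
    for i, wi in enumerate(vocab):
        for j, wj in enumerate(vocab):
            matrix[i][j] = 1.0 if wi == wj else word_sim(wi, wj)
    return matrix
```

Precomputing the matrix once per corpus is what makes the later string-vector comparisons cheap: each comparison becomes a handful of table lookups.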

进一步地,可以采用余弦相似性运算,基于得到的相似度,计算所述字符串向量与KNN模型中的样本字符串向量之间的相似度。因此,本发明实施例结合方式一、方式二,提供如下方式:Further, a cosine similarity operation may be used to calculate, based on the obtained similarities, the similarity between the string vector and the sample string vectors in the KNN model. Accordingly, combining Method 1 and Method 2, the embodiment of the present invention provides the following approach:

首先利用所述KNN模型,采用通过计算单词间相似度获得字符串向量间相似度的相似矩阵算法,得到相似矩阵,基于所述相似矩阵中任一项相似度,利用向量之间的余弦相似性算法,计算所述字符串向量与KNN模型中的样本字符串向量之间的相似度。First, using the KNN model, the similarity matrix algorithm that obtains the similarity between string vectors by calculating the similarity between words is applied to obtain a similarity matrix; then, based on the entries of the similarity matrix, the cosine similarity algorithm between vectors is used to calculate the similarity between the string vector and the sample string vectors in the KNN model.

本发明实施例不仅利用所述KNN模型对所述字符串向量进行分类,而且还利用大量的字符串向量训练样本,对所述KNN模型进行训练,具体如下:The embodiment of the present invention not only uses the KNN model to classify the string vector, but also uses a large number of string vector training samples to train the KNN model, as follows:

1)获取包括多个字符串向量训练样本,以及所述训练样本中字符串向量对应的分类标签;1) obtaining training samples including multiple string vectors, and the classification labels corresponding to the string vectors in the training samples;

具体的,字符串向量的分类标签分为两类,正类和负类。其中,正类和负类对应的文本的分类类别不同,用户可以根据需求确定所述正类标签对应的文本的分类类别,以及所述负类标签对应的文本的分类类别。本发明实施例中对文本具体的分类类别的数量及分类的具体类别不作具体限定,可以根据实际需求设定。Specifically, the classification labels of string vectors are divided into two categories, positive and negative. Wherein, the classification categories of the text corresponding to the positive class and the negative class are different, and the user can determine the classification class of the text corresponding to the positive class label and the classification class of the text corresponding to the negative class label according to requirements. In the embodiment of the present invention, the number of specific classification categories and the specific categories of the text are not specifically limited, and may be set according to actual needs.

2)初始化KNN模型的模型参数,将所述训练样本输入所述KNN模型;2) initialize the model parameters of the KNN model, and input the training sample into the KNN model;

具体的,在KNN模型建立时,KNN模型中存储了多个样本字符串向量以及对应的分类标签,以用于对KNN模型的模型参数进行训练,同时,KNN模型中还可以存储上述语料库,即基于KNN模型中存储的语料库中的单词,将所述文本分解为长字符串。Specifically, when the KNN model is established, multiple sample string vectors and their corresponding classification labels are stored in the KNN model for training the model parameters of the KNN model; at the same time, the KNN model can also store the above corpus, that is, the text is decomposed into long strings based on the words in the corpus stored in the KNN model.

3)利用所述KNN模型,采用向量之间的余弦相似性算法,计算所述字符串向量与KNN模型中的样本字符串向量之间的相似度;3) utilize the KNN model, adopt the cosine similarity algorithm between the vectors, calculate the similarity between the character string vector and the sample character string vector in the KNN model;

作为一种可选的实施方式,利用所述KNN模型,采用通过计算单词间相似度获得字符串向量间相似度的相似矩阵算法,计算所述字符串向量与KNN模型中的样本字符串向量之间的相似度。As an optional implementation, using the KNN model, the similarity matrix algorithm that obtains the similarity between string vectors by calculating the similarity between words is used to calculate the similarity between the string vector and the sample string vectors in the KNN model.

4)根据输出所述训练样本中字符串向量对应的分类标签及训练中字符串向量对应的分类标签,调整当前的KNN模型的模型参数,直至满足预设条件。4) According to the classification label corresponding to the character string vector in the output training sample and the classification label corresponding to the character string vector in the training, the model parameters of the current KNN model are adjusted until the preset conditions are met.

所述满足的预设条件可以是调整当前的KNN模型的模型参数,直至KNN模型输出的训练样本中字符串向量的分类标签与所述训练样本中预先获取的字符串向量对应的分类标签的分类准确率满足预设条件。The preset condition may be that the model parameters of the current KNN model are adjusted until the classification accuracy, measured between the classification labels the KNN model outputs for the string vectors in the training samples and the pre-obtained classification labels corresponding to those string vectors, meets the preset condition.
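Steps 1) to 4) can be sketched as the loop below; the `model.classify` / `model.set_k` interface and the accuracy-based stopping rule are illustrative assumptions, since the text does not fix which model parameter is adjusted:

```python
def train_knn(model, train_vectors, train_labels, target_accuracy=0.9, max_rounds=20):
    """Classify every training vector with the current model, compare the
    output labels with the known labels, and keep adjusting a model
    parameter (here k) until the accuracy threshold is met."""
    for _ in range(max_rounds):
        predictions = [model.classify(v) for v in train_vectors]
        correct = sum(p == t for p, t in zip(predictions, train_labels))
        if correct / len(train_labels) >= target_accuracy:
            break
        model.set_k(model.k + 1)  # one possible parameter adjustment
    return model
```

The loop terminates either when the accuracy condition is met or after a bounded number of rounds, which corresponds to the "until the preset condition is met" wording above.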

作为一种可选的实施方式,根据所述相似度以及所述样本字符串向量对应的分类标签,确定所述字符串向量的分类标签,包括以下步骤:As an optional implementation manner, determining the classification label of the character string vector according to the similarity and the classification label corresponding to the sample character string vector includes the following steps:

1)选取与所述字符串向量相似度由高到低对应的预设数量的样本字符串向量;1) select a preset number of sample string vectors corresponding to the string vector similarity from high to low;

2)计算所述预设数量的样本字符串向量的抽样概率;2) calculating the sampling probability of the preset number of sample character string vectors;

具体的,可以在初始化KNN模型的模型参数时,预先设置所述各个样本字符串向量的抽样概率,在利用训练样本对KNN模型进行训练时,根据所述各个样本字符串向量被抽中的概率,对所述预先设置的抽样概率进行调整,当KNN完成训练时,所述抽样概率调整完成。Specifically, when the model parameters of the KNN model are initialized, the sampling probability of each sample string vector may be preset; when the KNN model is trained with the training samples, the preset sampling probabilities are adjusted according to the probability of each sample string vector being selected; when the KNN finishes training, the adjustment of the sampling probabilities is complete.

3)根据所述抽样概率选取的样本字符串向量对应的分类标签,确定所述字符串向量的分类标签。3) Determine the classification label of the character string vector according to the classification label corresponding to the sample character string vector selected according to the sampling probability.

本发明实施例中的KNN是通过计算所述字符串向量与KNN模型中的样本字符串向量之间的相似度进行文本分类,根据所述预设数量的样本字符串向量的抽样概率来选取样本字符串向量对应的分类标签,具体的,利用KNN对字符串向量进行文本分类时,如果一个字符串向量在k个最相似的样本字符串向量中的预设数量内属于某一个类别,则该字符串向量也属于这个类别,否则选取与k个最相似的样本字符串向量中相似度最高的样本字符串向量对应的分类标签。The KNN in the embodiment of the present invention performs text classification by calculating the similarity between the string vector and the sample string vectors in the KNN model, and selects the classification labels of sample string vectors according to the sampling probabilities of the preset number of sample string vectors. Specifically, when KNN is used to classify a string vector, if at least the preset number of the k most similar sample string vectors belong to a certain category, the string vector also belongs to that category; otherwise, the classification label corresponding to the sample string vector with the highest similarity among the k most similar sample string vectors is selected.
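The label selection just described, top-k filtering followed by a draw according to the trained sampling probabilities, can be sketched as follows (the data layout, parallel lists indexed by sample, is an assumption for illustration):

```python
import random

def knn_classify(similarities, sample_labels, sampling_prob, k=5):
    """similarities[i] is the similarity of the query to sample i; take the
    k most similar samples, draw one according to its sampling probability,
    and return that sample's classification label."""
    order = sorted(range(len(similarities)),
                   key=lambda i: similarities[i], reverse=True)
    top_k = order[:k]
    weights = [sampling_prob[i] for i in top_k]
    chosen = random.choices(top_k, weights=weights, k=1)[0]
    return sample_labels[chosen]
```

When the trained probabilities concentrate on one sample, the draw degenerates to picking the most probable neighbor, which mirrors the fallback to the single most similar sample described above.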

但本发明实施例中,通过采用向量之间的余弦相似性算法,以及所述相似矩阵算法,一方面,能够使得字符串之间相似度的运算量减小,并且,由于字符串向量维度少,能够改善提取的特征信息分布稀疏的问题;另一方面,字符串向量更具象征性,更透明,能够更有效的表示文本特征信息,便于提高文本分类的准确率。In the embodiment of the present invention, by adopting the cosine similarity algorithm between vectors together with the similarity matrix algorithm, on the one hand, the amount of computation for the similarity between strings can be reduced, and since the string vector has few dimensions, the problem of sparse distribution of the extracted feature information can be alleviated; on the other hand, string vectors are more symbolic and more transparent, represent text feature information more effectively, and thus help improve the accuracy of text classification.

如图2所示,基于KNN的文本分类的具体步骤如下:As shown in Figure 2, the specific steps of KNN-based text classification are as follows:

步骤20:将文本编码为字符串向量,输入KNN模型;Step 20: Encode the text into a string vector and input it into the KNN model;

步骤21:利用所述KNN模型计算所述字符串向量与KNN模型中的样本字符串向量之间的相似度;Step 21: utilize the KNN model to calculate the similarity between the character string vector and the sample character string vector in the KNN model;

步骤22:按照与所述字符串向量相似度的递增关系将对应的样本字符串向量进行排序;Step 22: Sort the corresponding sample string vectors according to the increasing relationship with the similarity of the string vectors;

步骤23:根据排序的结果,按字符串向量相似度从高到低的顺序选取所述样本字符串向量中的K个样本字符串向量,其中,所述K为正整数;Step 23: According to the sorting result, select K sample character string vectors in the sample character string vectors according to the sequence of the similarity of character string vectors from high to low, wherein, the K is a positive integer;

步骤24:计算所述K个样本字符串向量的抽样概率;Step 24: Calculate the sampling probability of the K sample string vectors;

步骤25:根据所述抽样概率选取样本字符串向量;Step 25: select a sample string vector according to the sampling probability;

步骤26:将所述选取的样本字符串向量对应的分类标签,作为所述字符串向量的分类标签。Step 26: Use the classification label corresponding to the selected sample character string vector as the classification label of the character string vector.
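Steps 20 to 26 of Figure 2 chain together as below; `similarity_fn` is any string-vector similarity (for instance the cosine variant described earlier), and the model state is passed in explicitly for the sketch:

```python
import random

def classify_text(text_vector, samples, labels, sampling_prob, similarity_fn, k=5):
    """Run the Figure 2 pipeline on one encoded text."""
    sims = [similarity_fn(text_vector, s) for s in samples]    # step 21
    order = sorted(range(len(samples)), key=sims.__getitem__)  # step 22: ascending sort
    top_k = order[-k:][::-1]                                   # step 23: k most similar
    weights = [sampling_prob[i] for i in top_k]                # step 24
    chosen = random.choices(top_k, weights=weights, k=1)[0]    # step 25
    return labels[chosen]                                      # step 26
```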

实施例二Embodiment 2

基于同一发明构思,本发明实施例还提供一种基于K最近邻KNN的文本分类设备,如图3所示,该设备包括:处理器30以及存储器31,其中,所述存储器存储有程序代码,当所述程序代码被所述处理器执行时,使得所述处理器30用于:Based on the same inventive concept, an embodiment of the present invention further provides a text classification device based on K-nearest-neighbors (KNN). As shown in FIG. 3, the device includes a processor 30 and a memory 31, wherein the memory stores program code which, when executed by the processor, causes the processor 30 to:

将文本分解为单词,从所述单词中提取表示文本的特征信息的单词;Decomposing the text into words, and extracting words representing the feature information of the text from the words;

利用所述提取的单词,将所述文本编码为字符串向量;Using the extracted words, encoding the text into a string vector;

利用所述KNN模型计算所述字符串向量与KNN模型中的样本字符串向量之间的相似度,根据所述相似度以及所述样本字符串向量对应的分类标签,确定所述字符串向量的分类标签并输出。Calculate the similarity between the string vector and the sample string vectors in the KNN model by using the KNN model, determine the classification label of the string vector according to the similarity and the classification labels corresponding to the sample string vectors, and output it.

作为一种可选的实施方式,所述处理器30具体用于:As an optional implementation manner, the processor 30 is specifically configured to:

利用所述KNN模型,采用向量之间的余弦相似性算法,计算所述字符串向量与KNN模型中的样本字符串向量之间的相似度;或者Using the KNN model, the cosine similarity algorithm between vectors is used to calculate the similarity between the character string vector and the sample character string vector in the KNN model; or

利用所述KNN模型,采用通过计算单词间相似度获得字符串向量间相似度的相似矩阵算法,计算所述字符串向量与KNN模型中的样本字符串向量之间的相似度。Using the KNN model, a similarity matrix algorithm for obtaining the similarity between string vectors by calculating the similarity between words is used to calculate the similarity between the string vector and the sample string vector in the KNN model.

作为一种可选的实施方式,所述处理器30具体还用于:As an optional implementation manner, the processor 30 is further configured to:

获取包括多个字符串向量训练样本,以及所述训练样本中字符串向量对应的分类标签;Obtain training samples including multiple string vectors, and the classification labels corresponding to the string vectors in the training samples;

初始化KNN模型的模型参数,将所述训练样本输入所述KNN模型;Initialize the model parameters of the KNN model, and input the training sample into the KNN model;

利用所述KNN模型,采用向量之间的余弦相似性算法,计算所述字符串向量与KNN模型中的样本字符串向量之间的相似度;Utilize the KNN model, adopt the cosine similarity algorithm between vectors, calculate the similarity between the character string vector and the sample character string vector in the KNN model;

根据输出所述训练样本中字符串向量对应的分类标签及训练中字符串向量对应的分类标签,调整当前的KNN模型的模型参数,直至满足预设条件。According to the classification label corresponding to the string vector in the output training sample and the classification label corresponding to the character string vector in the training, the model parameters of the current KNN model are adjusted until the preset conditions are met.

作为一种可选的实施方式,所述处理器30具体用于:As an optional implementation manner, the processor 30 is specifically configured to:

基于服务器中语料库中的单词,将所述文本分解为长字符串;decompose the text into long strings based on the words in the corpus in the server;

通过文本分割方法将所述长字符串划分出段落,提取各段落中属于词干的字符,并删除无提取意义的字符。The long character string is divided into paragraphs by a text segmentation method, the characters belonging to the stem in each paragraph are extracted, and the characters with no extraction meaning are deleted.
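The decomposition just described, splitting into paragraphs, keeping word stems, and dropping characters with no extraction value, can be sketched as below; the suffix-stripping stemmer and the stop-word list are simplified stand-ins (a real system would use, e.g., a Porter stemmer and a full stop-word list):

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}  # illustrative stop list

def decompose_text(text):
    """Split a long string into paragraphs, tokenize each paragraph,
    crudely stem the tokens, and drop stop words."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    result = []
    for p in paragraphs:
        words = re.findall(r"[A-Za-z]+", p.lower())
        stemmed = [re.sub(r"(ing|ed|s)$", "", w) for w in words]
        result.append([w for w in stemmed if w and w not in STOP_WORDS])
    return result
```

Keeping the paragraph boundaries in the output is deliberate: the positional features (first/last words of the first/last paragraphs) need them later.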

作为一种可选的实施方式,所述处理器30具体用于:As an optional implementation manner, the processor 30 is specifically configured to:

根据所述单词中单词的出现频率、语法属性以及位置分布,分别从所述单词中提取表示文本的特征信息的单词。According to the occurrence frequency, grammatical attribute and position distribution of the words in the words, words representing the feature information of the text are respectively extracted from the words.

作为一种可选的实施方式,所述处理器30具体用于:As an optional implementation manner, the processor 30 is specifically configured to:

按照所述单词出现频率由高到低的顺序,提取至少一个单词;Extract at least one word according to the order of occurrence frequency of the words from high to low;

根据所述单词的语法属性,按照所述单词中词频-逆文本频率指数TF-IDF加权值由高到低的顺序,提取至少一个单词;According to the grammatical attribute of the word, according to the order of the word frequency-inverse text frequency index TF-IDF weighted value in the word from high to low, extract at least one word;

根据所述单词的位置分布,提取分布在设定段落的单词。According to the position distribution of the words, the words distributed in the set paragraph are extracted.

作为一种可选的实施方式,所述设定段落为第一段落和/或最后段落。As an optional implementation manner, the set paragraph is the first paragraph and/or the last paragraph.

作为一种可选的实施方式,所述处理器30具体用于:As an optional implementation manner, the processor 30 is specifically configured to:

提取分布在最后段落的最后一个单词、第一段落的第一个单词、第一段落的最后一个单词以及最后段落的第一个单词。Extract the last word distributed in the last paragraph, the first word of the first paragraph, the last word of the first paragraph, and the first word of the last paragraph.

作为一种可选的实施方式,所述处理器30具体用于:As an optional implementation manner, the processor 30 is specifically configured to:

选取与所述字符串向量相似度由高到低对应的预设数量的样本字符串向量;Selecting a preset number of sample string vectors corresponding to the string vector similarity from high to low;

计算所述预设数量的样本字符串向量的抽样概率;calculating the sampling probability of the preset number of sample string vectors;

根据所述抽样概率选取的样本字符串向量对应的分类标签,确定所述字符串向量的分类标签。The classification label of the character string vector is determined according to the classification label corresponding to the sample character string vector selected by the sampling probability.

实施例三Embodiment 3

本发明提供另一种基于K最近邻KNN的文本分类设备,如图4所示,该设备包括:分解模块40、编码模块41以及分类模块42,其中:The present invention provides another text classification device based on K nearest neighbors KNN. As shown in FIG. 4, the device includes: a decomposition module 40, an encoding module 41 and a classification module 42, wherein:

分解模块40,用于将文本分解为单词,从所述单词中提取表示文本的特征信息的单词;The decomposition module 40 is used for decomposing the text into words, and extracting the words representing the feature information of the text from the words;

编码模块41,用于利用所述提取的单词,将所述文本编码为字符串向量;an encoding module 41, for using the extracted word to encode the text into a string vector;

分类模块42,用于利用所述KNN模型计算所述字符串向量与KNN模型中的样本字符串向量之间的相似度,根据所述相似度以及所述样本字符串向量对应的分类标签,确定所述字符串向量的分类标签并输出。The classification module 42 is configured to calculate the similarity between the string vector and the sample string vectors in the KNN model by using the KNN model, determine the classification label of the string vector according to the similarity and the classification labels corresponding to the sample string vectors, and output it.

作为一种可选的实施方式,所述分类模块42具体用于:As an optional implementation manner, the classification module 42 is specifically used for:

利用所述KNN模型,采用向量之间的余弦相似性算法,计算所述字符串向量与KNN模型中的样本字符串向量之间的相似度;或者Using the KNN model, the cosine similarity algorithm between vectors is used to calculate the similarity between the character string vector and the sample character string vector in the KNN model; or

利用所述KNN模型,采用通过计算单词间相似度获得字符串向量间相似度的相似矩阵算法,计算所述字符串向量与KNN模型中的样本字符串向量之间的相似度。Using the KNN model, a similarity matrix algorithm for obtaining the similarity between string vectors by calculating the similarity between words is used to calculate the similarity between the string vector and the sample string vector in the KNN model.

作为一种可选的实施方式,所述设备还用于:As an optional implementation manner, the device is also used for:

获取包括多个字符串向量训练样本,以及所述训练样本中字符串向量对应的分类标签;Obtain training samples including multiple string vectors, and the classification labels corresponding to the string vectors in the training samples;

初始化KNN模型的模型参数,将所述训练样本输入所述KNN模型;Initialize the model parameters of the KNN model, and input the training sample into the KNN model;

利用所述KNN模型,采用向量之间的余弦相似性算法,计算所述字符串向量与KNN模型中的样本字符串向量之间的相似度;Utilize the KNN model, adopt the cosine similarity algorithm between vectors, calculate the similarity between the character string vector and the sample character string vector in the KNN model;

根据输出所述训练样本中字符串向量对应的分类标签及训练中字符串向量对应的分类标签,调整当前的KNN模型的模型参数,直至满足预设条件。According to the classification label corresponding to the string vector in the output training sample and the classification label corresponding to the character string vector in the training, the model parameters of the current KNN model are adjusted until the preset conditions are met.

作为一种可选的实施方式,所述分解模块40具体用于:As an optional implementation manner, the decomposition module 40 is specifically used for:

基于服务器中语料库中的单词,将所述文本分解为长字符串;decompose the text into long strings based on the words in the corpus in the server;

通过文本分割方法将所述长字符串划分出段落,提取各段落中属于词干的字符,并删除无提取意义的字符。The long character string is divided into paragraphs by a text segmentation method, the characters belonging to the stem in each paragraph are extracted, and the characters with no extraction meaning are deleted.

作为一种可选的实施方式,所述分解模块40具体用于:As an optional implementation manner, the decomposition module 40 is specifically used for:

根据所述单词中单词的出现频率、语法属性以及位置分布,分别从所述单词中提取表示文本的特征信息的单词。According to the occurrence frequency, grammatical attribute and position distribution of the words in the words, words representing the feature information of the text are respectively extracted from the words.

作为一种可选的实施方式,所述分解模块40具体用于:As an optional implementation manner, the decomposition module 40 is specifically used for:

按照所述单词出现频率由高到低的顺序,提取至少一个单词;Extract at least one word according to the order of occurrence frequency of the words from high to low;

根据所述单词的语法属性,按照所述单词中词频-逆文本频率指数TF-IDF加权值由高到低的顺序,提取至少一个单词;According to the grammatical attribute of the word, according to the order of the word frequency-inverse text frequency index TF-IDF weighted value in the word from high to low, extract at least one word;

根据所述单词的位置分布,提取分布在设定段落的单词。According to the position distribution of the words, the words distributed in the set paragraph are extracted.

作为一种可选的实施方式,所述设定段落为第一段落和/或最后段落。As an optional implementation manner, the set paragraph is the first paragraph and/or the last paragraph.

作为一种可选的实施方式,所述分解模块40具体用于:As an optional implementation manner, the decomposition module 40 is specifically used for:

提取分布在最后段落的最后一个单词、第一段落的第一个单词、第一段落的最后一个单词以及最后段落的第一个单词。Extract the last word distributed in the last paragraph, the first word of the first paragraph, the last word of the first paragraph, and the first word of the last paragraph.

作为一种可选的实施方式,所述分类模块42具体用于:As an optional implementation manner, the classification module 42 is specifically used for:

选取与所述字符串向量相似度由高到低对应的预设数量的样本字符串向量;Selecting a preset number of sample string vectors corresponding to the string vector similarity from high to low;

计算所述预设数量的样本字符串向量的抽样概率;calculating the sampling probability of the preset number of sample string vectors;

根据所述抽样概率选取的样本字符串向量对应的分类标签,确定所述字符串向量的分类标签。The classification label of the character string vector is determined according to the classification label corresponding to the sample character string vector selected by the sampling probability.

实施例四Embodiment 4

本发明提供一种计算机存储介质,其上存储有计算机程序,该程序被处理器执行时实现如下步骤:The present invention provides a computer storage medium on which a computer program is stored, and when the program is executed by a processor, the following steps are implemented:

将文本分解为单词,从所述单词中提取表示文本的特征信息的单词;Decomposing the text into words, and extracting words representing the feature information of the text from the words;

利用所述提取的单词,将所述文本编码为字符串向量;Using the extracted words, encoding the text into a string vector;

利用所述KNN模型计算所述字符串向量与KNN模型中的样本字符串向量之间的相似度,根据所述相似度以及所述样本字符串向量对应的分类标签,确定所述字符串向量的分类标签并输出。Calculate the similarity between the string vector and the sample string vectors in the KNN model by using the KNN model, determine the classification label of the string vector according to the similarity and the classification labels corresponding to the sample string vectors, and output it.

本领域内的技术人员应明白,本发明的实施例可提供为方法、系统、或计算机程序产品。因此,本发明可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且,本发明可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器和光学存储器等)上实施的计算机程序产品的形式。As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media having computer-usable program code embodied therein, including but not limited to disk storage, optical storage, and the like.

本发明是参照根据本发明实施例的方法、设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器,使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的设备。The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce a device for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理设备以特定方式工作的计算机可读存储器中,使得存储在该计算机可读存储器中的指令产生包括指令设备的制造品,该指令设备实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, which implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

这些计算机程序指令也可装载到计算机或其他可编程数据处理设备上,使得在计算机或其他可编程设备上执行一系列操作步骤以产生计算机实现的处理,从而在计算机或其他可编程设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

显然,本领域的技术人员可以对本发明进行各种改动和变型而不脱离本发明的精神和范围。这样,倘若本发明的这些修改和变型属于本发明权利要求及其等同技术的范围之内,则本发明也意图包含这些改动和变型在内。It will be apparent to those skilled in the art that various modifications and variations can be made in the present invention without departing from the spirit and scope of the invention. Thus, provided that these modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include these modifications and variations.

Claims (11)

1. a kind of file classification method based on K arest neighbors KNN, which is characterized in that this method comprises:
Text is decomposed into word, the word for indicating the characteristic information of text is extracted from the word;
It is character string vector by the text code using the word of the extraction;
The similarity between the sample character string vector in the character string vector and KNN model is calculated using the KNN model, According to the similarity and the corresponding tag along sort of the sample character string vector, the contingency table of the character string vector is determined It signs and exports.
2. the method according to claim 1, wherein using the KNN model calculate the character string vector with The similarity between sample character string vector in KNN model, comprising:
The character string vector and KNN model are calculated using the cosine similarity algorithm between vector using the KNN model In sample character string vector between similarity;Or
Using the KNN model, using the similar matrix for obtaining similarity between character string vector by similarity between calculating word Algorithm calculates the similarity between the sample character string vector in the character string vector and KNN model.
3. The method according to claim 1 or 2, characterized by further comprising:
obtaining a training sample comprising a plurality of character string vectors and the classification labels corresponding to the character string vectors in the training sample;
initializing model parameters of the KNN model, and inputting the training sample into the KNN model;
calculating, using the KNN model, the similarities between the character string vectors and the sample character string vectors in the KNN model with a cosine similarity algorithm between vectors;
adjusting the model parameters of the current KNN model according to the output classification labels of the character string vectors in the training sample and the classification labels corresponding to those character string vectors in training, until a preset condition is met.
4. The method according to claim 1, characterized in that decomposing the text into words comprises:
decomposing the text into a long character string based on the words in a corpus on a server;
dividing the long character string into paragraphs by a text segmentation method, extracting the characters belonging to word stems in each paragraph, and deleting characters of no significance for extraction.
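One possible reading of claim 4 is sketched below: split the long string into paragraphs, tokenize each paragraph, and drop words of no extraction value. The stop-word set, the paragraph delimiter, and the punctuation-stripping rule are illustrative assumptions; the patent does not specify them.

```python
def decompose(text, stopwords):
    # Divide the long character string into paragraphs (blank-line delimiter),
    # split each paragraph into words, and delete stop words, which carry
    # no significance for feature extraction.
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    result = []
    for p in paragraphs:
        words = [w.strip(".,!?;:").lower() for w in p.split()]
        result.append([w for w in words if w and w not in stopwords])
    return result
```

The per-paragraph structure is kept deliberately, since claims 6–8 later select words by their paragraph position.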
5. The method according to claim 1, characterized in that extracting, from the words, the words representing the characteristic information of the text comprises:
extracting, from the words, the words representing the characteristic information of the text according to the occurrence frequency, the grammatical attributes, and the position distribution of the words, respectively.
6. The method according to claim 5, characterized in that extracting, from the words, the words representing the characteristic information of the text according to the occurrence frequency, the grammatical attributes, and the position distribution of the words, respectively, comprises:
extracting at least one word in descending order of word occurrence frequency;
extracting at least one word, according to the grammatical attributes of the words, in descending order of the term frequency-inverse document frequency (TF-IDF) weight of the words;
extracting the words distributed in a set paragraph according to the position distribution of the words.
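The TF-IDF weight that claim 6 ranks words by can be illustrated with a standard formulation. The smoothing choice (adding 1 inside and outside the logarithm) is a common convention and an assumption here, since the patent does not give the exact formula.

```python
import math

def tf_idf(word, doc, corpus):
    # Term frequency: share of this document's words that are `word`.
    tf = doc.count(word) / len(doc)
    # Document frequency: number of corpus documents containing `word`.
    df = sum(1 for d in corpus if word in d)
    # Smoothed inverse document frequency: rare words score higher.
    idf = math.log(len(corpus) / (1 + df)) + 1
    return tf * idf
```

A word that appears in every document (a near stop word) receives a low weight even when frequent, while a word concentrated in few documents is promoted, which is exactly why claim 6 uses TF-IDF rather than raw frequency alone.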
7. The method according to claim 6, characterized in that the set paragraph is the first paragraph and/or the last paragraph.
8. The method according to claim 7, characterized in that extracting the words distributed in the set paragraph according to the position distribution of the words comprises:
extracting the last word of the last paragraph, the first word of the first paragraph, the last word of the first paragraph, and the first word of the last paragraph.
9. The method according to claim 1, characterized in that determining the classification label of the character string vector according to the similarities and the classification labels corresponding to the sample character string vectors comprises:
selecting a preset number of sample character string vectors in descending order of similarity to the character string vector;
calculating the sampling probabilities of the preset number of sample character string vectors;
determining the classification label of the character string vector according to the classification labels corresponding to the sample character string vectors selected by the sampling probabilities.
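The three steps of claim 9 can be sketched as follows. Reading "sampling probability" as similarity-normalized weighting, and resolving the final label by the largest probability mass, are interpretive assumptions; the patent leaves the exact selection rule open.

```python
from collections import defaultdict

def classify(sims_with_labels, k):
    # sims_with_labels: list of (similarity, label) pairs, one per sample
    # character string vector in the KNN model.
    # Step 1: keep the k samples most similar to the input vector.
    top = sorted(sims_with_labels, key=lambda t: t[0], reverse=True)[:k]
    # Step 2: convert the k similarities into sampling probabilities.
    total = sum(s for s, _ in top) or 1.0
    mass = defaultdict(float)
    for s, label in top:
        mass[label] += s / total
    # Step 3: the label holding the largest probability mass wins.
    return max(mass, key=mass.get)
```

Weighting by similarity rather than counting votes lets one very close neighbor outvote several distant ones, which matters when the preset number k is small.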
10. A text classification device based on K-nearest neighbors (KNN), characterized in that the device comprises a processor and a memory, wherein the memory stores program code which, when executed by the processor, causes the processor to perform the steps of the method according to any one of claims 1 to 9.
11. A computer storage medium having a computer program stored thereon, characterized in that the program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 9.
CN201910178920.5A 2019-03-11 2019-03-11 Text classification method and device based on K nearest neighbor KNN Expired - Fee Related CN109993216B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910178920.5A CN109993216B (en) 2019-03-11 2019-03-11 Text classification method and device based on K nearest neighbor KNN


Publications (2)

Publication Number Publication Date
CN109993216A true CN109993216A (en) 2019-07-09
CN109993216B CN109993216B (en) 2021-05-11

Family

ID=67129638

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910178920.5A Expired - Fee Related CN109993216B (en) 2019-03-11 2019-03-11 Text classification method and device based on K nearest neighbor KNN

Country Status (1)

Country Link
CN (1) CN109993216B (en)


Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102033949A (en) * 2010-12-23 2011-04-27 南京财经大学 Correction-based K nearest neighbor text classification method
CN102930063A (en) * 2012-12-05 2013-02-13 电子科技大学 Feature item selection and weight calculation based text classification method
JP5439235B2 (en) * 2010-03-12 2014-03-12 株式会社日立製作所 Document classification method, document classification device, and program
CN104063472A (en) * 2014-06-30 2014-09-24 电子科技大学 KNN text classifying method for optimizing training sample set
CN104239436A (en) * 2014-08-27 2014-12-24 南京邮电大学 Network hot event detection method based on text classification and clustering analysis
CN106649255A (en) * 2015-11-04 2017-05-10 江苏引跑网络科技有限公司 Method for automatically classifying and identifying subject terms of short texts
CN108984724A (en) * 2018-07-10 2018-12-11 凯尔博特信息科技(昆山)有限公司 Method for improving the accuracy of attribute-specific sentiment classification using high-dimensional representations
CN109408636A (en) * 2018-09-29 2019-03-01 新华三大数据技术有限公司 Text classification method and device


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
朱颢东: 《文本挖掘中若干核心技术研究》 (Research on Several Core Technologies in Text Mining), 31 March 2017, 北京理工大学出版社 (Beijing Institute of Technology Press) *
樊重俊 et al.: 《大数据分析与应用》 (Big Data Analysis and Application), 31 January 2016, 上海立信会计出版社 (Shanghai Lixin Accounting Press) *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110674635A (en) * 2019-09-27 2020-01-10 北京妙笔智能科技有限公司 Method and device for text paragraph division
CN110674635B (en) * 2019-09-27 2023-04-25 北京妙笔智能科技有限公司 Method and device for dividing text paragraphs
CN111078885B (en) * 2019-12-18 2023-04-07 腾讯科技(深圳)有限公司 Label classification method, related device, equipment and storage medium
CN111078885A (en) * 2019-12-18 2020-04-28 腾讯科技(深圳)有限公司 Label classification method, related device, equipment and storage medium
CN111737464A (en) * 2020-06-12 2020-10-02 网易(杭州)网络有限公司 Text classification method and device and electronic equipment
CN113901203A (en) * 2020-07-06 2022-01-07 上海流利说信息技术有限公司 Text classification method and device, electronic equipment and storage medium
CN113901203B (en) * 2020-07-06 2026-01-30 上海流利说信息技术有限公司 A text classification method, apparatus, electronic device, and storage medium
CN112100381A (en) * 2020-09-22 2020-12-18 福建天晴在线互动科技有限公司 Method and system for quantizing text similarity
CN112100381B (en) * 2020-09-22 2022-05-17 福建天晴在线互动科技有限公司 Method and system for quantizing text similarity
CN114417962B (en) * 2021-12-08 2025-04-08 航天科工网络信息发展有限公司 Abnormal data detection method, system, equipment and medium based on K neighbor algorithm
CN114417962A (en) * 2021-12-08 2022-04-29 航天科工网络信息发展有限公司 Abnormal data detection method, system, device and medium based on K nearest neighbor algorithm
CN116361450A (en) * 2021-12-23 2023-06-30 北京中关村科金技术有限公司 Text category analysis method, device and storage medium based on information entropy
CN114528818A (en) * 2021-12-31 2022-05-24 安徽航天信息有限公司 Text similarity detection method, computing device and storage medium

Also Published As

Publication number Publication date
CN109993216B (en) 2021-05-11

Similar Documents

Publication Publication Date Title
CN109960724B (en) A Text Summarization Method Based on TF-IDF
CN109993216A (en) A text classification method based on K nearest neighbors KNN and its equipment
CN107609121B (en) News text classification method based on LDA and word2vec algorithm
CN110287328B (en) Text classification method, device and equipment and computer readable storage medium
CN113011533A (en) Text classification method and device, computer equipment and storage medium
CN110674289A (en) Method, device and storage medium for determining the category of an article based on word segmentation weights
CN112417153B (en) Text classification method, device, terminal device and readable storage medium
CN105975459B (en) A lexical item weight annotation method and device
CN114372465B (en) Mixup and BQRNN-based legal naming entity identification method
US20150199567A1 (en) Document classification assisting apparatus, method and program
CN102591854A (en) Advertisement filtering system and advertisement filtering method specific to text characteristics
CN109829151B (en) Text segmentation method based on hierarchical dirichlet model
CN110516098A (en) An Image Annotation Method Based on Convolutional Neural Network and Binary Coded Features
CN110874408A (en) Model training method, text recognition device and computing equipment
WO2018213783A1 (en) Computerized methods of data compression and analysis
CN111506726A (en) Short text clustering method and device based on part-of-speech coding and computer equipment
CN112988952A (en) Multi-level-length text vector retrieval method and device and electronic equipment
CN114297388B (en) A text keyword extraction method
CN107688630A (en) A semantics-based multi-sentiment-dictionary expansion method for weakly supervised microblogs
CN116910599A (en) Data clustering methods, systems, electronic devices and storage media
CN115086182A (en) Optimization method, device, electronic device and storage medium for mail recognition model
CN117972025B (en) Massive text retrieval matching method based on semantic analysis
CN110489759B (en) Text feature weighting and short text similarity calculation method, system and medium based on word frequency
CN119741156A (en) Intellectual property operation management system based on big data
CN109284392B (en) A text classification method, device, terminal and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210511
