CN109408641A - A text classification method and system based on a supervised topic model - Google Patents
A text classification method and system based on a supervised topic model
- Publication number
- CN109408641A CN201811398232.1A
- Authority
- CN
- China
- Prior art keywords
- topic
- text
- slda
- model
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
Abstract
The present disclosure provides a text classification method and system based on a supervised topic model. The text classification method based on a supervised topic model includes: constructing an SLDA-TC text classification model; during training of the SLDA-TC text classification model, sampling the latent topic of each word according to the SLDA-TC-Gibbs algorithm, where latent topics are sampled only from other training texts that carry the same category label as the text containing the word; after the latent topic of every word has been determined, computing the text-topic probability distribution, the topic-word probability distribution and the topic-category probability distribution from the frequency counts; establishing an accurate mapping between topics and categories; and inputting the text to be classified into the trained SLDA-TC text classification model, inferring its topics and then predicting its category.
Description
Technical Field
The present disclosure relates to the field of data classification, and in particular to a text classification method and system based on a supervised topic model.
Background
The statements in this section merely provide background information related to the present disclosure and do not necessarily constitute prior art.
Text representation is an important step in text mining, and the most widely used representation is the bag-of-words (BOW) model. BOW treats a text as a collection of words, assumes that each word occurs independently of the others, and ignores word order, syntax and similar information. Under BOW, a text is represented by an n-dimensional vector in which each dimension corresponds to a word, usually weighted by a function of that word's frequency; this is the widely used vector space model (VSM). Because of the complexity of natural language, such text representations suffer from the "curse of dimensionality", "sparsity" and "loss of semantics". Since BOW discards word order and syntax, the semantic information of words is hard to extract and quantify, and semantic representation of text remains very difficult.
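As an illustration of the representation discussed above, the following Python sketch builds a TF-IDF weighted vector space model with scikit-learn; the toy corpus and the choice of library are only assumptions for the example, not part of the disclosure.

```python
# Minimal sketch of a bag-of-words / TF-IDF vector space model,
# assuming scikit-learn is available; the toy corpus is illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the topic model maps texts to topics",
    "support vector machines classify texts",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)    # one n-dimensional row vector per text
print(vectorizer.get_feature_names_out())
print(X.toarray())                      # high-dimensional, sparse representation
```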
The word2vec model proposed by Mikolov et al. is a method for training word vectors: it uses a word's context to map the word to a low-dimensional real-valued vector, so that more similar words lie closer together in the vector space. Training word2vec yields a vector for each word, and the vectors of all words of a text form the text representation. Word vectors trained with word2vec have been fed into deep neural networks and used successfully for Chinese word segmentation, POS tagging, sentiment classification, syntactic dependency parsing and other tasks. word2vec alleviates the "sparsity" problem, and although it can quantify word-to-word similarity, it does not solve the "loss of semantics" and "curse of dimensionality" problems of text representation.
A topic model is one way to address the "curse of dimensionality" and "sparsity", and it can extract the semantic information of words to some extent. Topic models originate from Latent Semantic Indexing (LSI) and the probabilistic Latent Semantic Indexing (pLSI) proposed by Hofmann. Building on pLSI, Blei et al. proposed the LDA (Latent Dirichlet Allocation) topic model. In LDA a topic is viewed as a probability distribution over words; semantically related words are linked through latent topics, semantic information can be extracted from text, and the text representation is transformed from the high-dimensional word space into a low-dimensional topic space. Topic models are used, directly or in extended form, in natural language processing tasks such as clustering and classification, word sense disambiguation and sentiment analysis, and in image processing tasks such as object discovery and localization and image segmentation.
The LDA topic model transforms the text representation from the high-dimensional word space into the low-dimensional topic space, after which algorithms such as KNN, Naive Bayes or SVM are applied directly for classification; the results are not good. The reason is that LDA is unsupervised learning: it does not consider the category of a text and therefore fails to exploit the important information carried by the category labels of the training texts.
Among existing improvements, Li et al. proposed the Labeled-LDA model. The inventors found that this model trains one LDA model per document class, so the number of parameters to be estimated grows many times over, which increases the complexity of the model.
Summary
According to one aspect of one or more embodiments of the present disclosure, a text classification method based on a supervised topic model is provided, which can identify the semantic relationship between topics and categories and establish an accurate mapping between topics and categories.
One or more embodiments of the present disclosure provide a text classification method based on a supervised topic model, including:
Constructing an SLDA-TC text classification model, where every document in the training document set of the SLDA-TC text classification model carries a category label, and the parameters to be estimated in the SLDA-TC text classification model include not only the text-topic probability distribution and the topic-word probability distribution but also the topic-category probability distribution;
Training the SLDA-TC text classification model and estimating the SLDA-TC model parameters with the SLDA-TC-Gibbs algorithm, where the parameter estimation proceeds as follows: the latent topic of each word is sampled, and latent topics are sampled only from other training texts whose category label is the same as that of the text containing the word; after the latent topic of every word has been determined, the text-topic, topic-word and topic-category probability distributions are computed from the topic-word, document-topic and topic-category frequency counts, thereby establishing an accurate mapping between topics and categories;
Topic inference and classification of the text to be classified: the text to be classified is input into the trained SLDA-TC text classification model; first the latent topic of each word of the document is sampled, then the topic probability distribution of the text is inferred, and finally the category label of the text is output from its topic distribution and the topic-category distribution of the SLDA-TC model.
In one or more embodiments, the text-topic probability distribution, the topic-word probability distribution and the topic-category probability distribution all follow Dirichlet distributions.
In one or more embodiments, the SLDA-TC model used for classification is generated by training over multiple iterations; when the iterations finish, the similarity between topics is evaluated with the JS divergence and the semantic relevance between topics and categories is evaluated with the topic-category distribution parameters of SLDA-TC.
In one or more embodiments, the evaluation metrics for the classification results of the SLDA-TC text classification model include the macro-averaged precision (Macro-Precision), the macro-averaged recall (Macro-Recall) and the macro-averaged F1 value (Macro-F1).
本公开的一个或多个实施例,还提供了一种文本分类系统,包括文本输入装 置、控制器和显示装置,所述控制器包括存储器和处理器,所述存储器存储有计 算机程序,所述程序被处理器执行时能够实现以下步骤:One or more embodiments of the present disclosure further provide a text classification system, including a text input device, a controller and a display device, the controller includes a memory and a processor, the memory stores a computer program, the A program can perform the following steps when executed by a processor:
构建SLDA-TC文本分类模型,SLDA-TC文本分类模型的训练文档集的每 个文档带有类别标签;SLDA-TC文本分类模型中需要估计的参数不仅包括文本- 主题概率分布、主题-词概率分布,还包括主题-类别概率分布;Build the SLDA-TC text classification model. Each document in the training document set of the SLDA-TC text classification model has a class label; the parameters that need to be estimated in the SLDA-TC text classification model include not only text-topic probability distribution, topic-word probability distributions, including topic-category probability distributions;
训练SLDA-TC文本分类模型,按照SLDA-TC-Gibbs算法进行SLDA-TC模 型参数估计;其中,按照SLDA-TC-Gibbs算法进行SLDA-TC模型参数估计的 过程为:对每个词的隐含主题进行采样,且只从与该词所在文本类别标签相同的 其它训练文本中进行隐含主题采样;在确定每个词的隐含主题之后,通过统计主 题-词、文档-主题、主题-类别的频次,计算得到文本-主题概率分布、主题-词概 率分布和主题-类别概率分布,进而建立出主题与类别之间的准确映射;The SLDA-TC text classification model is trained, and the parameters of the SLDA-TC model are estimated according to the SLDA-TC-Gibbs algorithm. The topic is sampled, and only the implicit topic is sampled from other training texts with the same text category label as the word; after determining the implicit topic of each word, by statistical topic-word, document-topic, topic-category The frequency of text-topic, topic-word probability distribution and topic-category probability distribution are calculated, and then an accurate mapping between topics and categories is established;
待测文本主题推断和分类;将待测文本输入至训练完成的SLDA-TC文本分 类模型,首先对待测文档每个词进行隐含主题采样;然后推断待测文本的主题概 率分布;根据待测文档的主题分布和SLDA-TC模型的主题-类别分布,输出待测 文本的类别标签。Inference and classification of the subject of the text to be tested; input the text to be tested into the trained SLDA-TC text classification model, first perform implicit topic sampling for each word of the document to be tested; then infer the subject probability distribution of the text to be tested; The topic distribution of the document and the topic-category distribution of the SLDA-TC model, output the category label of the text to be tested.
The beneficial effects of the present disclosure are:
The text classification method and system of the present disclosure construct and train an SLDA-TC text classification model and use the text-topic, topic-word and topic-category probability distributions to extract the latent semantic mappings between words and topics, documents and topics, and topics and categories. Moreover, the number of topics K only needs to be slightly larger than the number of categories C, which not only improves the text classification accuracy but also improves time efficiency.
Brief Description of the Drawings
The accompanying drawings, which form a part of the present disclosure, are provided for further understanding of the disclosure; the exemplary embodiments of the disclosure and their descriptions are used to explain the disclosure and do not constitute an improper limitation of the disclosure.
FIG. 1 is a flowchart of an SLDA-TC text classification method of the present disclosure.
FIG. 2 shows the LDA topic model.
FIG. 3 shows the SLDA-TC text classification model.
FIG. 4(a) compares the Macro-Precision of the classification results of SLDA-TC, LDA-TC and SVM on the 20news-rec dataset with C=4, K=8.
FIG. 4(b) compares the Macro-Recall of the classification results of SLDA-TC, LDA-TC and SVM on the 20news-rec dataset with C=4, K=8.
FIG. 4(c) compares the Macro-F1 of the classification results of SLDA-TC, LDA-TC and SVM on the 20news-rec dataset with C=4, K=8.
FIG. 5(a) compares the Macro-Precision of the classification results of SLDA-TC, LDA-TC and SVM on the sogou dataset with C=5, K=8.
FIG. 5(b) compares the Macro-Recall of the classification results of SLDA-TC, LDA-TC and SVM on the sogou dataset with C=5, K=8.
FIG. 5(c) compares the Macro-F1 of the classification results of SLDA-TC, LDA-TC and SVM on the sogou dataset with C=5, K=8.
FIG. 6(a) compares the Macro-Precision of the classification results of SLDA-TC, LDA-TC and SVM on the 20news-sci dataset with C=4, when K=90.
FIG. 6(b) compares the Macro-Recall of the classification results of SLDA-TC, LDA-TC and SVM on the 20news-sci dataset with C=4, when K=90.
FIG. 6(c) compares the Macro-F1 of the classification results of SLDA-TC, LDA-TC and SVM on the 20news-sci dataset with C=4, when K=90.
FIG. 7(a) compares the Macro-Precision of the classification results of SLDA-TC, LDA-TC and SVM on the 20news-talk dataset with C=3, K=90.
FIG. 7(b) compares the Macro-Recall of the classification results of SLDA-TC, LDA-TC and SVM on the 20news-talk dataset with C=3, K=90.
FIG. 7(c) compares the Macro-F1 of the classification results of SLDA-TC, LDA-TC and SVM on the 20news-talk dataset with C=3, K=90.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the present disclosure. Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It should be noted that the terminology used herein is only for describing specific embodiments and is not intended to limit the exemplary embodiments of the present disclosure. As used herein, unless the context clearly indicates otherwise, the singular forms are intended to include the plural forms as well; furthermore, it should be understood that when the terms "comprising" and/or "including" are used in this specification, they indicate the presence of features, steps, operations, devices, components and/or combinations thereof.
Terminology:
Dirichlet distribution: a family of continuous multivariate probability distributions, the multivariate generalization of the Beta distribution; the Dirichlet distribution is often used as the prior in Bayesian statistics.
Gibbs Sampling: an algorithm based on Markov chain Monte Carlo (MCMC), used to approximately draw a sequence of samples from a multivariate probability distribution when direct sampling is difficult.
LDA-TC: the method of first extracting a given number of topics with the LDA topic model and then classifying texts according to the LDA model.
SVM: Support Vector Machine, a common discriminative method; in machine learning it is a supervised learning model commonly used for pattern recognition, classification and regression analysis.
Macro-Precision: macro-averaged classification precision.
Macro-Recall: macro-averaged recall.
Macro-F1: macro-averaged F1 value.
FIG. 1 is a flowchart of a text classification method based on a supervised topic model according to the present disclosure.
As shown in FIG. 1, a text classification method of this embodiment includes:
S110: construct the SLDA-TC text classification model. Unlike the unsupervised LDA model, every document in the training document set of the SLDA-TC text classification model carries a category label; the parameters to be estimated in the SLDA-TC text classification model include the text-topic probability distribution, the topic-word probability distribution and the topic-category probability distribution.
As shown in FIG. 2, the LDA topic model is defined over a text collection, where M denotes the number of texts and K the number of topics. The model has two sets of parameters, θ_m and φ_k, where θ_m is the topic probability distribution of the m-th text and φ_k is the word probability distribution of topic k; w_m is the bag-of-words vector of the m-th text, N_m is the length of the m-th text, w_{m,n} is the n-th word of the m-th text, and z_{m,n} is the topic assigned to w_{m,n}. θ_m and φ_k follow Dirichlet distributions and, used as multinomial parameters, generate the topics and the words respectively; α and β are the prior parameters of the corresponding Dirichlet distributions. The LDA topic model does not consider the category of each document.
As shown in FIG. 3, unlike LDA, in the training text set of the SLDA-TC model every text w_m has an observable category label y_m ∈ [1, C], where C is the number of categories; the category label is assumed to follow a multinomial distribution related to the topic probabilities of the text. w_{m,n} and y_m are observable, and z_{m,n} is a latent topic. In addition to the parameters φ_k and θ_m, the present disclosure introduces a new parameter δ_k, which denotes the category probability distribution of the k-th topic. θ_m, φ_k and δ_k follow Dirichlet distributions and, used as multinomial parameters, generate the topics, the words and the categories respectively; α, β and γ are the prior parameters of the corresponding Dirichlet distributions.
S120: estimate the parameters of the SLDA-TC topic model, train the SLDA-TC text classification model and establish the mapping between topics and categories.
The parameters estimated by LDA with the Gibbs Sampling algorithm are φ and θ; the parameters the SLDA-TC model needs to estimate are φ, θ and δ. On the basis of Gibbs Sampling, the present disclosure proposes the SLDA-TC-Gibbs algorithm: instead of computing θ_m, φ_k and δ_k directly, it first samples the latent topic of each word; once the latent topic of every word has been determined, θ_m, φ_k and δ_k can be computed from the frequency counts.
In each step, the Gibbs Sampling algorithm samples one latent component z_i = k for a word w while keeping the other components z_{¬i} of z unchanged; the sampling probability is computed as in formula (1):

$p(z_i=k\mid \mathbf z_{\neg i},\mathbf w)\propto \dfrac{n_{k,\neg i}^{(t)}+\beta_t}{\sum_{v=1}^{V}\left(n_{k,\neg i}^{(v)}+\beta_v\right)}\cdot\left(n_{m,\neg i}^{(k)}+\alpha_k\right)$ (1)

In each step, the SLDA-TC-Gibbs algorithm samples one latent component z_i = k of the word w_i = t while keeping the other components z_{¬i} of z unchanged and also keeping y_{¬m} unchanged; the sampling probability is computed as in formula (2):

$p(z_i=k\mid \mathbf z_{\neg i},\mathbf w,\mathbf y)\propto \dfrac{n_{k,\neg i}^{(t)}+\beta_t}{\sum_{v=1}^{V}\left(n_{k,\neg i}^{(v)}+\beta_v\right)}\cdot\left(n_{m,\neg i}^{(k)}+\alpha_k\right)\cdot\dfrac{n_{k,\neg m}^{(j)}+\gamma_j}{\sum_{c=1}^{C}\left(n_{k,\neg m}^{(c)}+\gamma_c\right)}$ (2)

Here w = (w_1, ..., w_M) is the word vector of all documents, z = (z_1, ..., z_V) is the topic vector (V is the size of the word dictionary), and y = (y_1, ..., y_M) is the category vector. Suppose the i-th word is w_i = t; z_i denotes the topic variable of the i-th word, and z_{¬i} denotes the vector z with its i-th item removed. Suppose further that the document containing the i-th word w_i = t is the m-th document w_m and that its class label is y_m = j, j ∈ [1, C]; then y_{¬m} denotes the vector y with its m-th item removed. n_k^{(v)} denotes the number of times topic k is assigned to word v, and β_v is the Dirichlet prior of word v; n_m^{(z)} denotes the number of times document m is assigned to topic z, and α_z is the Dirichlet prior of topic z; n_{k,¬i}^{(t)} denotes the number of times topic k is assigned to word t when the i-th item of z (i.e. the i-th word w_i = t) is excluded, and β_t is the Dirichlet prior of word t; n_{k,¬m}^{(j)} denotes the number of documents of category j assigned to topic k when the m-th item (the m-th document w_m, with class label y_m = j) is excluded, and γ_j is the Dirichlet prior parameter of category j.
Compared with formula (1), formula (2) introduces y on the left-hand side and adds one factor, (n_{k,¬m}^{(j)} + γ_j) / Σ_c (n_{k,¬m}^{(c)} + γ_c), on the right-hand side, which restricts the latent topic sampling for the words of the m-th document: latent topics are sampled only from the other training documents of the same category as the m-th document. Because the topic distributions of documents of the same category are similar, formula (2) is more reasonable when category labels are available. The present disclosure generates only one SLDA-TC model and only needs to estimate one set of parameters θ, φ and δ. The derivation of formula (2) is proved as follows:
Proof: Given the training document set, let w = (w_1, ..., w_M), y = (y_1, ..., y_M) and z = (z_1, ..., z_V). The joint distribution of the SLDA-TC probabilistic model is given by formula (3):
p(w, z, y | α, β, γ) = p(w | z, β) p(z | α) p(y | z, γ) (3)
From the Dirichlet distributions, the three factors can be written in closed form, where Γ(·) is the Gamma function.
According to the SLDA-TC-Gibbs algorithm, the conditional sampling probability of z_i follows from this joint distribution.
For the word w_i = t of the m-th document, the factors that do not involve z_i are constant; hence,
from (3)-(6) and (8), formula (2) is obtained.
Thus formula (2) is proved.
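To illustrate the class-constrained sampling step of formula (2), a possible Python sketch is given below. It is an illustrative reading of the algorithm, not the patented implementation; the count arrays (n_kt_class, n_mk, n_kj) and the simplification of counting topic-class assignments per word rather than per document are assumptions introduced here.

```python
import numpy as np

def sample_topic(i, t, m, j, z_m, n_kt_class, n_mk, n_kj, alpha, beta, gamma):
    """One class-constrained Gibbs step in the spirit of formula (2).

    i, t, m, j : word position, word id, document index, class label of document m
    z_m        : current topic assignments of document m
    n_kt_class : (C, K, V) topic-word counts restricted to documents of each class
    n_mk       : (M, K)    document-topic counts
    n_kj       : (K, C)    topic-class counts (counted per word assignment here,
                 a simplification of the per-document count in the description)
    """
    C, K, V = n_kt_class.shape
    old_k = z_m[i]

    # remove the current assignment (the "not-i" exclusion)
    n_kt_class[j, old_k, t] -= 1
    n_mk[m, old_k] -= 1
    n_kj[old_k, j] -= 1

    # word factor restricted to same-class documents, document factor, class factor
    word_term = (n_kt_class[j, :, t] + beta) / (n_kt_class[j].sum(axis=1) + V * beta)
    doc_term = n_mk[m, :] + alpha
    class_term = (n_kj[:, j] + gamma) / (n_kj.sum(axis=1) + C * gamma)

    p = word_term * doc_term * class_term
    new_k = np.random.choice(K, p=p / p.sum())

    # add the new assignment back into the counts
    n_kt_class[j, new_k, t] += 1
    n_mk[m, new_k] += 1
    n_kj[new_k, j] += 1
    z_m[i] = new_k
    return new_k
```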
After the topic label z of every word w has been obtained, the parameters φ_k, θ_m and δ_k of the SLDA-TC model are computed as follows:

$\varphi_{k,t}=\dfrac{n_k^{(t)}+\beta_t}{\sum_{v=1}^{V}\left(n_k^{(v)}+\beta_v\right)}$ (10), $\theta_{m,k}=\dfrac{n_m^{(k)}+\alpha_k}{\sum_{z=1}^{K}\left(n_m^{(z)}+\alpha_z\right)}$ (11), $\delta_{k,j}=\dfrac{n_k^{(j)}+\gamma_j}{\sum_{c=1}^{C}\left(n_k^{(c)}+\gamma_c\right)}$ (12)

where φ_{k,t} is the probability that topic k generates word t, θ_{m,k} is the probability that document m is assigned to topic k, δ_{k,j} is the probability that topic k belongs to category j, n_k^{(t)} is the number of times topic k is assigned to word t, n_m^{(k)} is the number of times document m is assigned to topic k, and n_k^{(j)} is the number of documents of category j assigned to topic k, with k = 1..K, t = 1..V, m = 1..M.
With this, the parameters φ, θ and δ of the SLDA-TC model have been estimated.
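The count-to-distribution step of formulas (10)-(12) can be written as a short sketch; the array names are assumptions carried over from the previous sketch.

```python
import numpy as np

def estimate_parameters(n_kt, n_mk, n_kj, alpha, beta, gamma):
    """Turn the accumulated counts into the smoothed distributions
    phi (topic-word), theta (document-topic) and delta (topic-category)."""
    K, V = n_kt.shape
    C = n_kj.shape[1]

    phi = (n_kt + beta) / (n_kt.sum(axis=1, keepdims=True) + V * beta)      # K x V
    theta = (n_mk + alpha) / (n_mk.sum(axis=1, keepdims=True) + K * alpha)  # M x K
    delta = (n_kj + gamma) / (n_kj.sum(axis=1, keepdims=True) + C * gamma)  # K x C
    return phi, theta, delta
```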
Description of the SLDA-TC-Gibbs algorithm:
Algorithm: SLDA-TC-Gibbs.
Input: the document vectors and their category labels, hyperparameters α, β, γ, the number of topics K, the number of iterations T.
Output: the topic assignments z and the parameters θ and δ.
Initialization: the count variables are initialized to 0.
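Tying the steps together, the training procedure described by the algorithm (initialize the counts, run T Gibbs sweeps, then estimate the parameters) might be organized as below. This is only a sketch that reuses the hypothetical sample_topic and estimate_parameters functions given earlier, not the patented implementation.

```python
import numpy as np

def train_slda_tc(docs, labels, K, C, V, alpha=0.01, beta=0.01, gamma=0.01, T=10):
    """docs: list of lists of word ids; labels: list of class ids in [0, C)."""
    M = len(docs)
    n_kt_class = np.zeros((C, K, V))   # class-restricted topic-word counts
    n_mk = np.zeros((M, K))            # document-topic counts
    n_kj = np.zeros((K, C))            # topic-category counts
    z = [np.random.randint(K, size=len(d)) for d in docs]

    # initialize the counts from the random topic assignments
    for m, (doc, j) in enumerate(zip(docs, labels)):
        for n, t in enumerate(doc):
            k = z[m][n]
            n_kt_class[j, k, t] += 1
            n_mk[m, k] += 1
            n_kj[k, j] += 1

    for _ in range(T):                 # T Gibbs sweeps
        for m, (doc, j) in enumerate(zip(docs, labels)):
            for n, t in enumerate(doc):
                # class-constrained step sketched after formula (2)
                z[m][n] = sample_topic(n, t, m, j, z[m], n_kt_class, n_mk, n_kj,
                                       alpha, beta, gamma)

    # count-normalization sketch for formulas (10)-(12); the topic-word counts
    # are summed over classes to obtain one global phi; returns (phi, theta, delta)
    return estimate_parameters(n_kt_class.sum(axis=0), n_mk, n_kj, alpha, beta, gamma)
```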
After the SLDA-TC model has been generated by training, the latent topic of each word in the d-th new document is inferred according to formula (13),
where w̃_d denotes the word vector of the d-th new document and z̃_d its topic vector; ñ_{k,¬i}^{(t)} denotes the number of times topic k is assigned to word t when the i-th item of the new document (i.e. its i-th word) is excluded, and ñ_{d,¬i}^{(k)} denotes the number of times the d-th new document is assigned to topic k when its i-th item is excluded; the meanings of the other symbols are as in formula (2).
Formula (13) gives the latent topic label of each word of the new document d; the probability that d belongs to each topic is then computed as in formula (14):

$\tilde\theta_{d,k}=\dfrac{\tilde n_d^{(k)}+\alpha_k}{\sum_{z=1}^{K}\left(\tilde n_d^{(z)}+\alpha_z\right)}$ (14)

This yields the topic probability distribution θ̃_d of the new document d.
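For inference on a new document, a simplified sketch can keep the trained phi fixed and Gibbs-sample only the new document's topic assignments. The resampling probability used here (trained phi times the document-topic count) is a common folding-in approximation assumed for the example, not necessarily the exact formula (13) of the disclosure.

```python
import numpy as np

def infer_theta(doc, phi, K, alpha=0.01, iters=20):
    """Infer the topic distribution of a new document (folding-in style).
    doc: list of word ids; phi: trained K x V topic-word distribution."""
    z = np.random.randint(K, size=len(doc))
    n_dk = np.bincount(z, minlength=K).astype(float)

    for _ in range(iters):
        for i, t in enumerate(doc):
            n_dk[z[i]] -= 1                    # the "not-i" exclusion
            p = phi[:, t] * (n_dk + alpha)     # word factor x document factor
            z[i] = np.random.choice(K, p=p / p.sum())
            n_dk[z[i]] += 1

    return (n_dk + alpha) / (len(doc) + K * alpha)   # formula (14)-style normalization
```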
S130: input the text to be classified into the trained SLDA-TC text classification model, infer the topics of the text and then predict its category.
Given the trained SLDA-TC model, let w̃_d be the word vector of the d-th new document, z̃_d its topic vector and ỹ_d the category prediction for the new document d; the classification of the new document is computed as follows.
It is assumed here that the test sample set and the training sample set follow the same distribution, i.e. they agree on the latent topic-category distribution; therefore p(y | z̃_d) can be replaced by p(y | z), which is revealed by the parameter δ of the SLDA-TC model, and θ̃_d is the topic probability distribution of the new document d. From formulas (12), (14) and (15) one obtains:

$\tilde y_d=\arg\max_{j\in[1,C]}\sum_{k=1}^{K}\tilde\theta_{d,k}\,\delta_{k,j}$ (16)
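Classification of the new document then combines its inferred topic distribution with the topic-category parameter delta, as in the argmax of formula (16); the sketch below assumes the infer_theta helper from the previous example.

```python
import numpy as np

def predict_category(doc, phi, delta, K, alpha=0.01):
    """Predict the class of a new document: argmax_j sum_k theta_dk * delta_kj."""
    theta_d = infer_theta(doc, phi, K, alpha)
    scores = theta_d @ delta     # delta is K x C, so scores has one entry per class
    return int(np.argmax(scores))
```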
The similarity between topics is evaluated with the JS divergence (Jensen-Shannon divergence). The JS divergence, also called the JS distance, is a variant of the KL divergence (Kullback-Leibler divergence) and is computed as in formula (17). Unlike the KL divergence, the JS divergence is symmetric and satisfies the triangle distance property.
JS(p_i || p_j) = 0.5 * KL(p_i || (p_i + p_j)/2) + 0.5 * KL(p_j || (p_i + p_j)/2) (17)
where p_i and p_j denote the word probability distributions of topic i and topic j respectively; the value range of the JS divergence is [0, 1], where 0 means that p_i and p_j have identical distributions and 1 means the opposite.
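Formula (17) can be computed directly; the sketch below is a straightforward implementation with a small epsilon to keep the logarithms finite (an implementation detail assumed here).

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """Kullback-Leibler divergence with log base 2."""
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    return float(np.sum(p * np.log2(p / q)))

def js(p, q):
    """Jensen-Shannon divergence, formula (17); lies in [0, 1] with log base 2."""
    m = (np.asarray(p, float) + np.asarray(q, float)) / 2.0
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```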
The semantic relevance between topics and categories is measured by the parameter δ of the SLDA-TC model, computed as shown in formula (12), where δ_{k,j} is the probability that topic k belongs to category j.
In one or more embodiments, the SLDA-TC model used for classification is generated by training over multiple iterations; when the iterations finish, the similarity between topics is evaluated with the JS divergence and the semantic relevance between topics and categories is evaluated with the topic-category distribution parameters of SLDA-TC. The text classification results of the SLDA-TC model are evaluated with the macro-averaged precision (Macro-Precision), the macro-averaged recall (Macro-Recall) and the macro-averaged F1 value (Macro-F1).
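The macro-averaged metrics can be computed per class and then averaged, as in the sketch below; scikit-learn's precision_recall_fscore_support with average='macro' would give the same numbers.

```python
import numpy as np

def macro_scores(y_true, y_pred, num_classes):
    """Macro-averaged precision, recall and F1 over all classes."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    precisions, recalls, f1s = [], [], []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p); recalls.append(r); f1s.append(f)
    return np.mean(precisions), np.mean(recalls), np.mean(f1s)
```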
Experimental analysis and verification:
Three data subsets of the English 20newsgroup dataset (rec, sci and talk) were selected, together with a subset of the Sogou Chinese corpus containing five categories: IT, military, education, tourism and finance. In each data subset the ratio of training samples to test samples is 8:2; the subsets are described in Table 1. The Chinese and English datasets are tokenized with jieba, English stemming uses nltk.stem, and after stop-word removal TF-IDF is used for feature selection; 60% of the feature words are retained in the experiments.
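The preprocessing pipeline described above (jieba tokenization, nltk stemming for English, stop-word removal, TF-IDF based selection keeping 60% of the feature words) might be sketched as follows; the stop-word handling and the aggregation of TF-IDF scores are assumptions made for the example.

```python
import jieba
import numpy as np
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()

def tokenize(text, chinese=True, stopwords=frozenset()):
    """jieba for Chinese, whitespace split plus Porter stemming for English."""
    tokens = jieba.lcut(text) if chinese else text.split()
    tokens = [w.lower() for w in tokens if w.strip() and w not in stopwords]
    return tokens if chinese else [stemmer.stem(w) for w in tokens]

def select_features(texts, keep_ratio=0.6, chinese=True):
    """TF-IDF feature selection keeping the top keep_ratio of feature words."""
    vec = TfidfVectorizer(tokenizer=lambda s: tokenize(s, chinese), lowercase=False)
    X = vec.fit_transform(texts)
    scores = np.asarray(X.sum(axis=0)).ravel()      # aggregate TF-IDF per word
    n_keep = int(len(scores) * keep_ratio)
    keep = np.argsort(scores)[::-1][:n_keep]
    return set(np.array(vec.get_feature_names_out())[keep])
```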
Table 1. Dataset description
To verify the effectiveness of the proposed method, the three algorithms SLDA-TC, LDA-TC and SVM were compared experimentally. SLDA-TC is the algorithm proposed in the present disclosure, LDA-TC performs classification directly on the traditional LDA model, and SVM is an SVM classification algorithm that uses the K topics of the LDA model as features.
The goal of topic inference in the SLDA-TC model is to establish a mapping between topics and categories. The number of topics K is related to the number of categories C of the labeled training set; experiments show that K only needs to take a value slightly larger than C for SLDA-TC to reach very good classification accuracy. In the experiments we analyzed the JS distances between the topics generated by the SLDA-TC topic model, the probability distributions of the top-10 feature words of each topic, and the relevance between classes and topics. Tables 2 to 4 describe the experimental results of the SLDA-TC topic model generated on the sogou data subset with C=5 and K=8, and Tables 5 to 7 describe the experimental results on the 20news-talk data subset; α, β and γ of the SLDA-TC topic model are set to 0.01.
Table 2. JS divergence between topics (SLDA-TC, sogou, C=5, K=8)
Table 3. Relevance between topics and classes (SLDA-TC, sogou, C=5, K=8)
Table 4. Probability distribution of the top-10 words of each topic (SLDA-TC, sogou, C=5, K=8)
As shown in Table 2, the JS divergence between topics 2, 5 and 7 is 0, indicating that they are identically distributed topics; Table 4 shows that the probability distributions of the top-10 feature words of these three topics are also identical. The JS divergence between the other five topics lies between 0.45 and 0.58, indicating five distinct, well-separated topics. Tables 3 and 4 show that topic 0 maps to the category "IT", topic 1 to "tourism", topic 3 to "finance", topic 4 to "military" and topic 6 to "education", each with a relevance above 99%, whereas topics 2, 5 and 7 are unrelated to any category and are called "useless" topics.
Table 5. JS divergence between topics (SLDA-TC, 20news-talk, C=3, K=6)
Table 6. Relevance between classes and topics (SLDA-TC, 20news-talk, C=3, K=6)
Table 7. Probability distribution of the top-10 words of each topic (SLDA-TC, 20news-talk, C=3, K=6)
Tables 5 to 7 show the experimental results on 20news-talk: topics 1, 3 and 4 are identically distributed "useless" topics, while the meaningful topic 0 corresponds to the category talk.politics.guns, topic 2 to talk.politics.misc and topic 5 to talk.politics.mideast.
Extensive experiments show that when K - C ≥ 2, K - C "useless" topics with pairwise JS divergence 0 are produced; therefore K only needs to be chosen slightly larger than C, so the K of the SLDA-TC topic model is easy to determine. The SLDA-TC topic model can filter out the K - C "useless" topics and establish an accurate mapping between topics and categories. At the same time, K is only slightly larger than C and far below the K value required by LDA, which significantly reduces the training time of the model.
FIGS. 4(a) to 7(c) compare the classification results of SLDA-TC, LDA-TC and SVM on the sogou Chinese dataset and the three 20newsgroup sub-datasets, where 60% of the feature words are retained after TF-IDF feature selection and the SLDA and LDA models are generated with different values of the number of topics K over 10 iterations.
In FIGS. 4(a) to 7(c), the abscissa represents the number of iterations.
FIGS. 4(a) to 7(c) show the comparison of the classification results of SLDA-TC, LDA-TC and SVM on the four datasets when the number of topics K takes different values.
(1) For SLDA-TC the number of topics K only needs to be slightly larger than the number of categories C, and its classification results are better than those of LDA-TC and SVM.
As shown in FIGS. 4(a)-4(c), on the 20news-rec dataset with C=4 and K=8, SLDA-TC reaches 95.10%, 94.99% and 94.98% on the Macro-Precision, Macro-Recall and Macro-F1 metrics, while LDA-TC reaches 63.76%, 60.91% and 60.33% and SVM reaches 68.82%, 68.33% and 68.08%. As K increases, LDA-TC and SVM improve; at K=80 LDA-TC reaches at most 71.85%, 71.38% and 71.41% and SVM at most 83.90%, 83.70% and 83.62%, still below SLDA-TC. Moreover, at K=80 the training time of the topic model is far higher than that of the SLDA-TC model at K=8.
As shown in FIGS. 5(a)-5(c), on the sogou dataset with C=5 and K=8, SLDA-TC reaches 92.80%, 92.73% and 92.70% on the Macro-Precision, Macro-Recall and Macro-F1 metrics, while LDA-TC reaches 72.67%, 68.89% and 67.48% and SVM reaches 80.69%, 80.40% and 80.28%. As K increases, the classification metrics of SVM gradually improve; when K exceeds 60 the three metrics rise to 89.26%, 89.95% and 89.24%, still below the 92.80%, 92.73% and 92.70% of SLDA-TC at K=8, and the SVM with K=60 (using LDA topics as features) incurs a much higher time cost than SLDA-TC with K=8.
(2) A larger K is not better for SLDA-TC: when K is very large, the three classification metrics of SLDA-TC actually decrease, which shows that K only needs to be slightly larger than C.
As shown in FIGS. 6(a)-6(c), on the 20news-sci dataset with C=4 the classification results of SLDA-TC are poor when K=90. The reason is that with C=4 only 4 of the 90 generated topics are related to the categories, while the remaining 86 are "useless" topics that do not help classification but instead introduce interference and degrade the results. The same holds for the 20news-talk dataset with C=3 and K=90 shown in FIGS. 7(a)-7(c). Extensive experimental results show that, for the SLDA-TC algorithm, a K slightly larger than C is enough to achieve high classification accuracy.
Table 8 compares the time performance and classification results of SLDA-TC, LDA-TC and SVM on the different datasets.
Table 8. Comparison of the time performance and classification results of SLDA-TC, LDA-TC and SVM
The generation time of a topic model is proportional to the number of topics K: the larger K, the higher the time cost. For the SLDA-TC model the number of topics K only needs to be slightly larger than the number of categories C to obtain very good classification results, whereas LDA-TC and the SVM algorithm using LDA topics as features need K to reach several tens or even hundreds before better classification results are obtained, as can also be seen from the experimental results shown in FIGS. 4(a)-7(c).
As shown in Table 8, on 20news-rec the ratio of the LDA to SLDA-TC model generation times is 4.86, i.e. SLDA is 4.86 times faster than LDA, because their K values are 200 and 8 respectively. On the 20news-sci, 20news-talk and sogou datasets SLDA is 4.78, 5.16 and 4.90 times faster than LDA respectively. At the same time, in terms of the Macro-Precision, Macro-Recall and Macro-F1 metrics, the SLDA-TC algorithm clearly outperforms the LDA-TC and SVM algorithms: on the four datasets SLDA-TC is 3.10%-9.30% higher than SVM and 7.10%-34.08% higher than LDA-TC.
In summary, the number of topics K of the SLDA-TC model only needs to take a value slightly larger than the number of categories C; the model can identify the topics closely related to the categories, and it clearly outperforms LDA-TC and the SVM algorithm using LDA topics as features in both classification accuracy and time performance.
Addressing the problems of LDA in text classification, the present disclosure proposes the SLDA-TC text classification model based on a supervised topic model and the SLDA-TC-Gibbs parameter estimation algorithm: each time one latent component z_i = k of the word w_i = t is sampled, the other components z_{¬i} of z are kept unchanged and y_{¬m} is also kept unchanged, i.e. latent topics are sampled only from the other training documents whose document class label is the same as that of the word's document, because the topic distributions of documents of the same category are similar; a theoretical proof is given. The SLDA-TC model introduces the parameter δ for the topic-category probability distribution and, through the φ, θ and δ probability distributions, extracts the latent semantic mappings between words and topics, documents and topics, and topics and categories. In addition, the number of SLDA-TC topics K only needs to be slightly larger than the number of categories. Experiments show that the SLDA-TC model can significantly improve classification accuracy and time efficiency.
One or more embodiments of the present disclosure further provide a text classification system, including a text input device, a controller and a display device, where the controller includes a memory and a processor, the memory stores a computer program, and the program, when executed by the processor, implements the following steps shown in FIG. 1:
(1) Construct the SLDA-TC text classification model. Unlike the unsupervised LDA model, every document in the training document set of the SLDA-TC text classification model carries a category label; the parameters to be estimated include not only the text-topic probability distribution and the topic-word probability distribution but also the topic-category probability distribution, and the text-topic, topic-word and topic-category probability distributions all follow Dirichlet distributions.
(2) Train the SLDA-TC text classification model and estimate the SLDA-TC model parameters.
Specifically, in the process of training the SLDA-TC text classification model, the number of topics K is first set to a value slightly larger than the number of categories C; the latent topic of each word is then sampled according to the SLDA-TC-Gibbs algorithm, and latent topics are sampled only from other training texts whose category label is the same as that of the text containing the word. After the latent topic of every word has been determined, the text-topic, topic-word and topic-category probability distributions are computed from the topic-word, document-topic and topic-category frequency counts, and an accurate mapping between topics and categories is established.
(3) Topic inference and classification of the text to be classified. The text to be classified is input into the trained SLDA-TC text classification model; first the latent topic of each word of the document is sampled, then the topic probability distribution of the text is inferred, and the category label of the text is output from its topic distribution and the topic-category distribution of the SLDA-TC model.
(4) Evaluation of the SLDA-TC model and of the classification results. For the SLDA-TC model generated by multiple iterations of training, the similarity between topics is evaluated with the JS divergence and the semantic relevance between topics and categories is evaluated with the topic-category distribution parameters of SLDA-TC; the text classification results of the SLDA-TC model are evaluated with the macro-averaged precision (Macro-Precision), macro-averaged recall (Macro-Recall) and macro-averaged F1 value (Macro-F1).
The text classification method and system of the present disclosure construct and train an SLDA-TC text classification model and use the text-topic probability distribution, the topic-word probability distribution and the topic-category probability distribution to extract the latent semantic mappings between words and topics, documents and topics, and topics and categories, which improves text classification accuracy and time efficiency.
Those skilled in the art should understand that the embodiments of the present disclosure may be provided as a method, a system or a computer program product. Therefore, the present disclosure may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, optical storage and the like) containing computer-usable program code.
The present disclosure is described with reference to flowcharts and/or block diagrams of methods, devices (systems) and computer program products according to embodiments of the invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps is executed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
Those of ordinary skill in the art can understand that all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware; the program may be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM) or the like.
Although the specific embodiments of the present disclosure have been described above with reference to the accompanying drawings, they do not limit the protection scope of the present disclosure. Those skilled in the art should understand that, on the basis of the technical solutions of the present disclosure, various modifications or variations that can be made without creative effort still fall within the protection scope of the present disclosure.
Claims (8)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811398232.1A CN109408641B (en) | 2018-11-22 | 2018-11-22 | A text classification method and system based on supervised topic model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109408641A true CN109408641A (en) | 2019-03-01 |
CN109408641B CN109408641B (en) | 2020-06-02 |
Family
ID=65474659
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811398232.1A Active CN109408641B (en) | 2018-11-22 | 2018-11-22 | A text classification method and system based on supervised topic model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109408641B (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102662960A (en) * | 2012-03-08 | 2012-09-12 | 浙江大学 | On-line supervised theme-modeling and evolution-analyzing method |
CN103810500A (en) * | 2014-02-25 | 2014-05-21 | 北京工业大学 | Place image recognition method based on supervised learning probability topic model |
CN103903164A (en) * | 2014-03-25 | 2014-07-02 | 华南理工大学 | Semi-supervised automatic aspect extraction method and system based on domain information |
CN105069143A (en) * | 2015-08-19 | 2015-11-18 | 百度在线网络技术(北京)有限公司 | Method and device for extracting keywords from document |
Non-Patent Citations (4)
Title |
---|
BLEI D M: "Supervised Topic Models", Advances in Neural Information Processing Systems *
SHIBIN ZHOU: "Text Categorization Based on Topic Model", International Journal of Computational Intelligence Systems *
LI WENBO: "A New Text Classification Algorithm Based on the Labeled-LDA Model" (基于Labeled-LDA模型的文本分类新算法), Chinese Journal of Computers (计算机学报) *
WANG DANDAN: "Text Classification Based on Macro-Feature Fusion" (基于宏特征融合的文本分类), Journal of Chinese Information Processing (中文信息学报) *
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111723198A (en) * | 2019-03-18 | 2020-09-29 | 北京京东尚科信息技术有限公司 | Text emotion recognition method and device and storage medium |
CN111723198B (en) * | 2019-03-18 | 2023-09-01 | 北京汇钧科技有限公司 | Text emotion recognition method, device and storage medium |
CN110135592A (en) * | 2019-05-16 | 2019-08-16 | 腾讯科技(深圳)有限公司 | Classifying quality determines method, apparatus, intelligent terminal and storage medium |
CN110135592B (en) * | 2019-05-16 | 2023-09-19 | 腾讯科技(深圳)有限公司 | Classification effect determining method and device, intelligent terminal and storage medium |
CN110321434A (en) * | 2019-06-27 | 2019-10-11 | 厦门美域中央信息科技有限公司 | A kind of file classification method based on word sense disambiguation convolutional neural networks |
CN110569270A (en) * | 2019-08-15 | 2019-12-13 | 中国人民解放军国防科技大学 | A Bayesian-based LDA topic label calibration method, system and medium |
CN110569270B (en) * | 2019-08-15 | 2022-07-05 | 中国人民解放军国防科技大学 | Bayesian-based LDA topic label calibration method, system and medium |
CN110795564B (en) * | 2019-11-01 | 2022-02-22 | 南京稷图数据科技有限公司 | Text classification method lacking negative cases |
CN110795564A (en) * | 2019-11-01 | 2020-02-14 | 南京稷图数据科技有限公司 | Text classification method lacking negative cases |
CN110825850A (en) * | 2019-11-07 | 2020-02-21 | 哈尔滨工业大学(深圳) | Natural language theme classification method and device |
CN110825850B (en) * | 2019-11-07 | 2022-07-08 | 哈尔滨工业大学(深圳) | Natural language theme classification method and device |
CN111368532A (en) * | 2020-03-18 | 2020-07-03 | 昆明理工大学 | A Disambiguation Method and System for Subject Word Embedding Based on LDA |
CN111368532B (en) * | 2020-03-18 | 2022-12-09 | 昆明理工大学 | Topic word embedding disambiguation method and system based on LDA |
CN112733542B (en) * | 2021-01-14 | 2022-02-08 | 北京工业大学 | Theme detection method and device, electronic equipment and storage medium |
CN112733542A (en) * | 2021-01-14 | 2021-04-30 | 北京工业大学 | Theme detection method and device, electronic equipment and storage medium |
CN113032573A (en) * | 2021-04-30 | 2021-06-25 | 《中国学术期刊(光盘版)》电子杂志社有限公司 | Large-scale text classification method and system combining theme semantics and TF-IDF algorithm |
CN113032573B (en) * | 2021-04-30 | 2024-01-23 | 同方知网数字出版技术股份有限公司 | Large-scale text classification method and system combining topic semantics and TF-IDF algorithm |
CN113591462A (en) * | 2021-07-28 | 2021-11-02 | 咪咕数字传媒有限公司 | Bullet screen reply generation method and device and electronic equipment |
CN114550749A (en) * | 2022-01-10 | 2022-05-27 | 山东师范大学 | Student behavior log generation method and system based on audio scene recognition |
CN114610576A (en) * | 2022-03-15 | 2022-06-10 | 中国银行股份有限公司 | Log generation monitoring method and device |
CN118333637A (en) * | 2024-06-13 | 2024-07-12 | 中南大学 | Product recall prediction method and system based on topic model |
Also Published As
Publication number | Publication date |
---|---|
CN109408641B (en) | 2020-06-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109408641B (en) | A text classification method and system based on supervised topic model | |
Xu | Understanding graph embedding methods and their applications | |
CN106383877A (en) | On-line short text clustering and topic detection method of social media | |
Zhang et al. | Hypergraph based information-theoretic feature selection | |
CN108959305A (en) | A kind of event extraction method and system based on internet big data | |
CN114860930A (en) | A text classification method, device and storage medium | |
CN112269874A (en) | Text classification method and system | |
CN105955975A (en) | Knowledge recommendation method for academic literature | |
Vidyashree et al. | An improvised sentiment analysis model on twitter data using stochastic gradient descent (SGD) optimization algorithm in stochastic gate neural network (SGNN) | |
CN108470025A (en) | Partial-Topic probability generates regularization own coding text and is embedded in representation method | |
CN111209402A (en) | A text classification method and system integrating transfer learning and topic model | |
Budhiraja et al. | A supervised learning approach for heading detection | |
CN110674293A (en) | A text classification method based on semantic transfer | |
Mylonas et al. | Zero-shot classification of biomedical articles with emerging mesh descriptors | |
Chemchem et al. | Deep learning and data mining classification through the intelligent agent reasoning | |
Chen et al. | Quantifying similarity between relations with fact distribution | |
Subeno et al. | Optimisation towards Latent Dirichlet Allocation: Its Topic Number and Collapsed Gibbs Sampling Inference Process. | |
Kalangi et al. | Sentiment analysis using machine learning | |
Najar et al. | On smoothing and scaling language model for sentiment based information retrieval | |
Yazdi et al. | Generalized probabilistic clustering projection models for discrete data | |
Zhang et al. | Probabilistic verb selection for data-to-text generation | |
CN103793491B (en) | Chinese news story segmentation method based on flexible semantic similarity measurement | |
Saratha et al. | A novel approach for improving the accuracy using word embedding on deep neural networks for software requirements classification | |
Kim et al. | Variable selection for latent dirichlet allocation | |
US20250103813A1 (en) | Generating an improved named entity recognition model using noisy data with a self-cleaning discriminator model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |