CN109408641A - A text classification method and system based on a supervised topic model - Google Patents
A text classification method and system based on a supervised topic model
- Publication number
- CN109408641A CN201811398232.1A
- Authority
- CN
- China
- Prior art keywords
- topic
- text
- slda
- model
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
Abstract
The present disclosure provides a text classification method and system based on a supervised topic model. The text classification method based on a supervised topic model includes: constructing an SLDA-TC text classification model; during training of the SLDA-TC text classification model, sampling the latent topic of each word according to the SLDA-TC-Gibbs algorithm, where latent topics are sampled only from other training texts that carry the same category label as the text containing the word; after the latent topic of every word has been determined, computing the text-topic probability distribution, the topic-word probability distribution and the topic-category probability distribution from the frequency counts; establishing an accurate mapping between topics and categories; and inputting the text to be classified into the trained SLDA-TC text classification model, inferring its topics and then predicting its category.
Description
Technical Field
The present disclosure relates to the field of data classification, and in particular to a text classification method and system based on a supervised topic model.
Background
The statements in this section merely provide background information related to the present disclosure and do not necessarily constitute prior art.
Text representation is an important step in text mining, and the most widely used representation is the bag-of-words (BOW) model. BOW treats a text as a collection of words, assumes that each word occurs independently of the others, and ignores word order, syntax and similar information. Under BOW, a text is represented by an n-dimensional vector in which each dimension corresponds to a word, usually weighted by a function of that word's frequency; this is the widely used vector space model (VSM). Because of the complexity of natural language, such text representations suffer from the "curse of dimensionality", "sparsity" and "loss of semantics". Since BOW discards word order and syntax, the semantic information of words is hard to extract and quantify, and semantic representation of text remains very difficult.
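As an illustration of the representation discussed above, the following Python sketch builds a TF-IDF weighted vector space model with scikit-learn; the toy corpus and the choice of library are only assumptions for the example, not part of the disclosure.

```python
# Minimal sketch of a bag-of-words / TF-IDF vector space model,
# assuming scikit-learn is available; the toy corpus is illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the topic model maps texts to topics",
    "support vector machines classify texts",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)    # one n-dimensional row vector per text
print(vectorizer.get_feature_names_out())
print(X.toarray())                      # high-dimensional, sparse representation
```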
The word2vec model proposed by Mikolov et al. is a method for training word vectors: it uses a word's context to map the word to a low-dimensional real-valued vector, so that more similar words lie closer together in the vector space. Training word2vec yields a vector for each word, and the vectors of all words of a text form the text representation. Word vectors trained with word2vec have been fed into deep neural networks and used successfully for Chinese word segmentation, POS tagging, sentiment classification, syntactic dependency parsing and other tasks. word2vec alleviates the "sparsity" problem, and although it can quantify word-to-word similarity, it does not solve the "loss of semantics" and "curse of dimensionality" problems of text representation.
A topic model is one way to address the "curse of dimensionality" and "sparsity", and it can extract the semantic information of words to some extent. Topic models originate from Latent Semantic Indexing (LSI) and the probabilistic Latent Semantic Indexing (pLSI) proposed by Hofmann. Building on pLSI, Blei et al. proposed the LDA (Latent Dirichlet Allocation) topic model. In LDA a topic is viewed as a probability distribution over words; semantically related words are linked through latent topics, semantic information can be extracted from text, and the text representation is transformed from the high-dimensional word space into a low-dimensional topic space. Topic models are used, directly or in extended form, in natural language processing tasks such as clustering and classification, word sense disambiguation and sentiment analysis, and in image processing tasks such as object discovery and localization and image segmentation.
The LDA topic model transforms the text representation from the high-dimensional word space into the low-dimensional topic space, after which algorithms such as KNN, Naive Bayes or SVM are applied directly for classification; the results are not good. The reason is that LDA is unsupervised learning: it does not consider the category of a text and therefore fails to exploit the important information carried by the category labels of the training texts.
Among existing improvements, Li et al. proposed the Labeled-LDA model. The inventors found that this model trains one LDA model per document class, so the number of parameters to be estimated grows many times over, which increases the complexity of the model.
Summary
According to one aspect of one or more embodiments of the present disclosure, a text classification method based on a supervised topic model is provided, which can identify the semantic relationship between topics and categories and establish an accurate mapping between topics and categories.
One or more embodiments of the present disclosure provide a text classification method based on a supervised topic model, including:
Constructing an SLDA-TC text classification model, where every document in the training document set of the SLDA-TC text classification model carries a category label, and the parameters to be estimated in the SLDA-TC text classification model include not only the text-topic probability distribution and the topic-word probability distribution but also the topic-category probability distribution;
Training the SLDA-TC text classification model and estimating the SLDA-TC model parameters with the SLDA-TC-Gibbs algorithm, where the parameter estimation proceeds as follows: the latent topic of each word is sampled, and latent topics are sampled only from other training texts whose category label is the same as that of the text containing the word; after the latent topic of every word has been determined, the text-topic, topic-word and topic-category probability distributions are computed from the topic-word, document-topic and topic-category frequency counts, thereby establishing an accurate mapping between topics and categories;
Topic inference and classification of the text to be classified: the text to be classified is input into the trained SLDA-TC text classification model; first the latent topic of each word of the document is sampled, then the topic probability distribution of the text is inferred, and finally the category label of the text is output from its topic distribution and the topic-category distribution of the SLDA-TC model.
In one or more embodiments, the text-topic probability distribution, the topic-word probability distribution and the topic-category probability distribution all follow Dirichlet distributions.
In one or more embodiments, the SLDA-TC model used for classification is generated by training over multiple iterations; when the iterations finish, the similarity between topics is evaluated with the JS divergence and the semantic relevance between topics and categories is evaluated with the topic-category distribution parameters of SLDA-TC.
In one or more embodiments, the evaluation metrics for the classification results of the SLDA-TC text classification model include the macro-averaged precision (Macro-Precision), the macro-averaged recall (Macro-Recall) and the macro-averaged F1 value (Macro-F1).
本公开的一个或多个实施例,还提供了一种文本分类系统,包括文本输入装 置、控制器和显示装置,所述控制器包括存储器和处理器,所述存储器存储有计 算机程序,所述程序被处理器执行时能够实现以下步骤:One or more embodiments of the present disclosure further provide a text classification system, including a text input device, a controller and a display device, the controller includes a memory and a processor, the memory stores a computer program, the A program can perform the following steps when executed by a processor:
构建SLDA-TC文本分类模型,SLDA-TC文本分类模型的训练文档集的每 个文档带有类别标签;SLDA-TC文本分类模型中需要估计的参数不仅包括文本- 主题概率分布、主题-词概率分布,还包括主题-类别概率分布;Build the SLDA-TC text classification model. Each document in the training document set of the SLDA-TC text classification model has a class label; the parameters that need to be estimated in the SLDA-TC text classification model include not only text-topic probability distribution, topic-word probability distributions, including topic-category probability distributions;
训练SLDA-TC文本分类模型,按照SLDA-TC-Gibbs算法进行SLDA-TC模 型参数估计;其中,按照SLDA-TC-Gibbs算法进行SLDA-TC模型参数估计的 过程为:对每个词的隐含主题进行采样,且只从与该词所在文本类别标签相同的 其它训练文本中进行隐含主题采样;在确定每个词的隐含主题之后,通过统计主 题-词、文档-主题、主题-类别的频次,计算得到文本-主题概率分布、主题-词概 率分布和主题-类别概率分布,进而建立出主题与类别之间的准确映射;The SLDA-TC text classification model is trained, and the parameters of the SLDA-TC model are estimated according to the SLDA-TC-Gibbs algorithm. The topic is sampled, and only the implicit topic is sampled from other training texts with the same text category label as the word; after determining the implicit topic of each word, by statistical topic-word, document-topic, topic-category The frequency of text-topic, topic-word probability distribution and topic-category probability distribution are calculated, and then an accurate mapping between topics and categories is established;
待测文本主题推断和分类;将待测文本输入至训练完成的SLDA-TC文本分 类模型,首先对待测文档每个词进行隐含主题采样;然后推断待测文本的主题概 率分布;根据待测文档的主题分布和SLDA-TC模型的主题-类别分布,输出待测 文本的类别标签。Inference and classification of the subject of the text to be tested; input the text to be tested into the trained SLDA-TC text classification model, first perform implicit topic sampling for each word of the document to be tested; then infer the subject probability distribution of the text to be tested; The topic distribution of the document and the topic-category distribution of the SLDA-TC model, output the category label of the text to be tested.
The beneficial effects of the present disclosure are:
The text classification method and system of the present disclosure construct and train an SLDA-TC text classification model and use the text-topic, topic-word and topic-category probability distributions to extract the latent semantic mappings between words and topics, documents and topics, and topics and categories. Moreover, the number of topics K only needs to be slightly larger than the number of categories C, which not only improves the text classification accuracy but also improves time efficiency.
Brief Description of the Drawings
The accompanying drawings, which form a part of the present disclosure, are provided for further understanding of the disclosure; the exemplary embodiments of the disclosure and their descriptions are used to explain the disclosure and do not constitute an improper limitation of the disclosure.
FIG. 1 is a flowchart of an SLDA-TC text classification method of the present disclosure.
FIG. 2 shows the LDA topic model.
FIG. 3 shows the SLDA-TC text classification model.
FIG. 4(a) compares the Macro-Precision of the classification results of SLDA-TC, LDA-TC and SVM on the 20news-rec dataset with C=4, K=8.
FIG. 4(b) compares the Macro-Recall of the classification results of SLDA-TC, LDA-TC and SVM on the 20news-rec dataset with C=4, K=8.
FIG. 4(c) compares the Macro-F1 of the classification results of SLDA-TC, LDA-TC and SVM on the 20news-rec dataset with C=4, K=8.
FIG. 5(a) compares the Macro-Precision of the classification results of SLDA-TC, LDA-TC and SVM on the sogou dataset with C=5, K=8.
FIG. 5(b) compares the Macro-Recall of the classification results of SLDA-TC, LDA-TC and SVM on the sogou dataset with C=5, K=8.
FIG. 5(c) compares the Macro-F1 of the classification results of SLDA-TC, LDA-TC and SVM on the sogou dataset with C=5, K=8.
FIG. 6(a) compares the Macro-Precision of the classification results of SLDA-TC, LDA-TC and SVM on the 20news-sci dataset with C=4, when K=90.
FIG. 6(b) compares the Macro-Recall of the classification results of SLDA-TC, LDA-TC and SVM on the 20news-sci dataset with C=4, when K=90.
FIG. 6(c) compares the Macro-F1 of the classification results of SLDA-TC, LDA-TC and SVM on the 20news-sci dataset with C=4, when K=90.
FIG. 7(a) compares the Macro-Precision of the classification results of SLDA-TC, LDA-TC and SVM on the 20news-talk dataset with C=3, K=90.
FIG. 7(b) compares the Macro-Recall of the classification results of SLDA-TC, LDA-TC and SVM on the 20news-talk dataset with C=3, K=90.
FIG. 7(c) compares the Macro-F1 of the classification results of SLDA-TC, LDA-TC and SVM on the 20news-talk dataset with C=3, K=90.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the present disclosure. Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It should be noted that the terminology used herein is only for describing specific embodiments and is not intended to limit the exemplary embodiments of the present disclosure. As used herein, unless the context clearly indicates otherwise, the singular forms are intended to include the plural forms as well; furthermore, it should be understood that when the terms "comprising" and/or "including" are used in this specification, they indicate the presence of features, steps, operations, devices, components and/or combinations thereof.
Terminology:
Dirichlet distribution: a family of continuous multivariate probability distributions, the multivariate generalization of the Beta distribution; the Dirichlet distribution is often used as the prior in Bayesian statistics.
Gibbs Sampling: an algorithm based on Markov chain Monte Carlo (MCMC), used to approximately draw a sequence of samples from a multivariate probability distribution when direct sampling is difficult.
LDA-TC: the method of first extracting a given number of topics with the LDA topic model and then classifying texts according to the LDA model.
SVM: Support Vector Machine, a common discriminative method; in machine learning it is a supervised learning model commonly used for pattern recognition, classification and regression analysis.
Macro-Precision: macro-averaged classification precision.
Macro-Recall: macro-averaged recall.
Macro-F1: macro-averaged F1 value.
FIG. 1 is a flowchart of a text classification method based on a supervised topic model according to the present disclosure.
As shown in FIG. 1, a text classification method of this embodiment includes:
S110: construct the SLDA-TC text classification model. Unlike the unsupervised LDA model, every document in the training document set of the SLDA-TC text classification model carries a category label; the parameters to be estimated in the SLDA-TC text classification model include the text-topic probability distribution, the topic-word probability distribution and the topic-category probability distribution.
As shown in FIG. 2, the LDA topic model is defined over a text collection, where M denotes the number of texts and K the number of topics. The model has two sets of parameters, θ_m and φ_k, where θ_m is the topic probability distribution of the m-th text and φ_k is the word probability distribution of topic k; w_m is the bag-of-words vector of the m-th text, N_m is the length of the m-th text, w_{m,n} is the n-th word of the m-th text, and z_{m,n} is the topic assigned to w_{m,n}. θ_m and φ_k follow Dirichlet distributions and, used as multinomial parameters, generate the topics and the words respectively; α and β are the prior parameters of the corresponding Dirichlet distributions. The LDA topic model does not consider the category of each document.
As shown in FIG. 3, unlike LDA, in the training text set of the SLDA-TC model every text w_m has an observable category label y_m ∈ [1, C], where C is the number of categories; the category label is assumed to follow a multinomial distribution related to the topic probabilities of the text. w_{m,n} and y_m are observable, and z_{m,n} is a latent topic. In addition to the parameters φ_k and θ_m, the present disclosure introduces a new parameter δ_k, which denotes the category probability distribution of the k-th topic. θ_m, φ_k and δ_k follow Dirichlet distributions and, used as multinomial parameters, generate the topics, the words and the categories respectively; α, β and γ are the prior parameters of the corresponding Dirichlet distributions.
S120: estimate the parameters of the SLDA-TC topic model, train the SLDA-TC text classification model and establish the mapping between topics and categories.
The parameters estimated by LDA with the Gibbs Sampling algorithm are φ and θ; the parameters the SLDA-TC model needs to estimate are φ, θ and δ. On the basis of Gibbs Sampling, the present disclosure proposes the SLDA-TC-Gibbs algorithm: instead of computing θ_m, φ_k and δ_k directly, it first samples the latent topic of each word; once the latent topic of every word has been determined, θ_m, φ_k and δ_k can be computed from the frequency counts.
In each step, the Gibbs Sampling algorithm samples one latent component z_i = k for a word w while keeping the other components z_{¬i} of z unchanged; the sampling probability is computed as in formula (1):

$p(z_i=k\mid \mathbf z_{\neg i},\mathbf w)\propto \dfrac{n_{k,\neg i}^{(t)}+\beta_t}{\sum_{v=1}^{V}\left(n_{k,\neg i}^{(v)}+\beta_v\right)}\cdot\left(n_{m,\neg i}^{(k)}+\alpha_k\right)$ (1)

In each step, the SLDA-TC-Gibbs algorithm samples one latent component z_i = k of the word w_i = t while keeping the other components z_{¬i} of z unchanged and also keeping y_{¬m} unchanged; the sampling probability is computed as in formula (2):

$p(z_i=k\mid \mathbf z_{\neg i},\mathbf w,\mathbf y)\propto \dfrac{n_{k,\neg i}^{(t)}+\beta_t}{\sum_{v=1}^{V}\left(n_{k,\neg i}^{(v)}+\beta_v\right)}\cdot\left(n_{m,\neg i}^{(k)}+\alpha_k\right)\cdot\dfrac{n_{k,\neg m}^{(j)}+\gamma_j}{\sum_{c=1}^{C}\left(n_{k,\neg m}^{(c)}+\gamma_c\right)}$ (2)

Here w = (w_1, ..., w_M) is the word vector of all documents, z = (z_1, ..., z_V) is the topic vector (V is the size of the word dictionary), and y = (y_1, ..., y_M) is the category vector. Suppose the i-th word is w_i = t; z_i denotes the topic variable of the i-th word, and z_{¬i} denotes the vector z with its i-th item removed. Suppose further that the document containing the i-th word w_i = t is the m-th document w_m and that its class label is y_m = j, j ∈ [1, C]; then y_{¬m} denotes the vector y with its m-th item removed. n_k^{(v)} denotes the number of times topic k is assigned to word v, and β_v is the Dirichlet prior of word v; n_m^{(z)} denotes the number of times document m is assigned to topic z, and α_z is the Dirichlet prior of topic z; n_{k,¬i}^{(t)} denotes the number of times topic k is assigned to word t when the i-th item of z (i.e. the i-th word w_i = t) is excluded, and β_t is the Dirichlet prior of word t; n_{k,¬m}^{(j)} denotes the number of documents of category j assigned to topic k when the m-th item (the m-th document w_m, with class label y_m = j) is excluded, and γ_j is the Dirichlet prior parameter of category j.
Compared with formula (1), formula (2) introduces y on the left-hand side and adds one factor, (n_{k,¬m}^{(j)} + γ_j) / Σ_c (n_{k,¬m}^{(c)} + γ_c), on the right-hand side, which restricts the latent topic sampling for the words of the m-th document: latent topics are sampled only from the other training documents of the same category as the m-th document. Because the topic distributions of documents of the same category are similar, formula (2) is more reasonable when category labels are available. The present disclosure generates only one SLDA-TC model and only needs to estimate one set of parameters θ, φ and δ. The derivation of formula (2) is proved as follows:
Proof: Given the training document set, let w = (w_1, ..., w_M), y = (y_1, ..., y_M) and z = (z_1, ..., z_V). The joint distribution of the SLDA-TC probabilistic model is given by formula (3):
p(w, z, y | α, β, γ) = p(w | z, β) p(z | α) p(y | z, γ) (3)
From the Dirichlet distributions, the three factors can be written in closed form, where Γ(·) is the Gamma function.
According to the SLDA-TC-Gibbs algorithm, the conditional sampling probability of z_i follows from this joint distribution.
For the word w_i = t of the m-th document, the factors that do not involve z_i are constant; hence,
from (3)-(6) and (8), formula (2) is obtained.
Thus formula (2) is proved.
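To illustrate the class-constrained sampling step of formula (2), a possible Python sketch is given below. It is an illustrative reading of the algorithm, not the patented implementation; the count arrays (n_kt_class, n_mk, n_kj) and the simplification of counting topic-class assignments per word rather than per document are assumptions introduced here.

```python
import numpy as np

def sample_topic(i, t, m, j, z_m, n_kt_class, n_mk, n_kj, alpha, beta, gamma):
    """One class-constrained Gibbs step in the spirit of formula (2).

    i, t, m, j : word position, word id, document index, class label of document m
    z_m        : current topic assignments of document m
    n_kt_class : (C, K, V) topic-word counts restricted to documents of each class
    n_mk       : (M, K)    document-topic counts
    n_kj       : (K, C)    topic-class counts (counted per word assignment here,
                 a simplification of the per-document count in the description)
    """
    C, K, V = n_kt_class.shape
    old_k = z_m[i]

    # remove the current assignment (the "not-i" exclusion)
    n_kt_class[j, old_k, t] -= 1
    n_mk[m, old_k] -= 1
    n_kj[old_k, j] -= 1

    # word factor restricted to same-class documents, document factor, class factor
    word_term = (n_kt_class[j, :, t] + beta) / (n_kt_class[j].sum(axis=1) + V * beta)
    doc_term = n_mk[m, :] + alpha
    class_term = (n_kj[:, j] + gamma) / (n_kj.sum(axis=1) + C * gamma)

    p = word_term * doc_term * class_term
    new_k = np.random.choice(K, p=p / p.sum())

    # add the new assignment back into the counts
    n_kt_class[j, new_k, t] += 1
    n_mk[m, new_k] += 1
    n_kj[new_k, j] += 1
    z_m[i] = new_k
    return new_k
```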
After the topic label z of every word w has been obtained, the parameters φ_k, θ_m and δ_k of the SLDA-TC model are computed as follows:

$\varphi_{k,t}=\dfrac{n_k^{(t)}+\beta_t}{\sum_{v=1}^{V}\left(n_k^{(v)}+\beta_v\right)}$ (10), $\theta_{m,k}=\dfrac{n_m^{(k)}+\alpha_k}{\sum_{z=1}^{K}\left(n_m^{(z)}+\alpha_z\right)}$ (11), $\delta_{k,j}=\dfrac{n_k^{(j)}+\gamma_j}{\sum_{c=1}^{C}\left(n_k^{(c)}+\gamma_c\right)}$ (12)

where φ_{k,t} is the probability that topic k generates word t, θ_{m,k} is the probability that document m is assigned to topic k, δ_{k,j} is the probability that topic k belongs to category j, n_k^{(t)} is the number of times topic k is assigned to word t, n_m^{(k)} is the number of times document m is assigned to topic k, and n_k^{(j)} is the number of documents of category j assigned to topic k, with k = 1..K, t = 1..V, m = 1..M.
With this, the parameters φ, θ and δ of the SLDA-TC model have been estimated.
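The count-to-distribution step of formulas (10)-(12) can be written as a short sketch; the array names are assumptions carried over from the previous sketch.

```python
import numpy as np

def estimate_parameters(n_kt, n_mk, n_kj, alpha, beta, gamma):
    """Turn the accumulated counts into the smoothed distributions
    phi (topic-word), theta (document-topic) and delta (topic-category)."""
    K, V = n_kt.shape
    C = n_kj.shape[1]

    phi = (n_kt + beta) / (n_kt.sum(axis=1, keepdims=True) + V * beta)      # K x V
    theta = (n_mk + alpha) / (n_mk.sum(axis=1, keepdims=True) + K * alpha)  # M x K
    delta = (n_kj + gamma) / (n_kj.sum(axis=1, keepdims=True) + C * gamma)  # K x C
    return phi, theta, delta
```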
Description of the SLDA-TC-Gibbs algorithm:
Algorithm: SLDA-TC-Gibbs.
Input: the document vectors and their category labels, hyperparameters α, β, γ, the number of topics K, the number of iterations T.
Output: the topic assignments z and the parameters θ and δ.
Initialization: the count variables are initialized to 0.
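Tying the steps together, the training procedure described by the algorithm (initialize the counts, run T Gibbs sweeps, then estimate the parameters) might be organized as below. This is only a sketch that reuses the hypothetical sample_topic and estimate_parameters functions given earlier, not the patented implementation.

```python
import numpy as np

def train_slda_tc(docs, labels, K, C, V, alpha=0.01, beta=0.01, gamma=0.01, T=10):
    """docs: list of lists of word ids; labels: list of class ids in [0, C)."""
    M = len(docs)
    n_kt_class = np.zeros((C, K, V))   # class-restricted topic-word counts
    n_mk = np.zeros((M, K))            # document-topic counts
    n_kj = np.zeros((K, C))            # topic-category counts
    z = [np.random.randint(K, size=len(d)) for d in docs]

    # initialize the counts from the random topic assignments
    for m, (doc, j) in enumerate(zip(docs, labels)):
        for n, t in enumerate(doc):
            k = z[m][n]
            n_kt_class[j, k, t] += 1
            n_mk[m, k] += 1
            n_kj[k, j] += 1

    for _ in range(T):                 # T Gibbs sweeps
        for m, (doc, j) in enumerate(zip(docs, labels)):
            for n, t in enumerate(doc):
                # class-constrained step sketched after formula (2)
                z[m][n] = sample_topic(n, t, m, j, z[m], n_kt_class, n_mk, n_kj,
                                       alpha, beta, gamma)

    # count-normalization sketch for formulas (10)-(12); the topic-word counts
    # are summed over classes to obtain one global phi; returns (phi, theta, delta)
    return estimate_parameters(n_kt_class.sum(axis=0), n_mk, n_kj, alpha, beta, gamma)
```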
After the SLDA-TC model has been generated by training, the latent topic of each word in the d-th new document is inferred according to formula (13),
where w̃_d denotes the word vector of the d-th new document and z̃_d its topic vector; ñ_{k,¬i}^{(t)} denotes the number of times topic k is assigned to word t when the i-th item of the new document (i.e. its i-th word) is excluded, and ñ_{d,¬i}^{(k)} denotes the number of times the d-th new document is assigned to topic k when its i-th item is excluded; the meanings of the other symbols are as in formula (2).
Formula (13) gives the latent topic label of each word of the new document d; the probability that d belongs to each topic is then computed as in formula (14):

$\tilde\theta_{d,k}=\dfrac{\tilde n_d^{(k)}+\alpha_k}{\sum_{z=1}^{K}\left(\tilde n_d^{(z)}+\alpha_z\right)}$ (14)

This yields the topic probability distribution θ̃_d of the new document d.
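For inference on a new document, a simplified sketch can keep the trained phi fixed and Gibbs-sample only the new document's topic assignments. The resampling probability used here (trained phi times the document-topic count) is a common folding-in approximation assumed for the example, not necessarily the exact formula (13) of the disclosure.

```python
import numpy as np

def infer_theta(doc, phi, K, alpha=0.01, iters=20):
    """Infer the topic distribution of a new document (folding-in style).
    doc: list of word ids; phi: trained K x V topic-word distribution."""
    z = np.random.randint(K, size=len(doc))
    n_dk = np.bincount(z, minlength=K).astype(float)

    for _ in range(iters):
        for i, t in enumerate(doc):
            n_dk[z[i]] -= 1                    # the "not-i" exclusion
            p = phi[:, t] * (n_dk + alpha)     # word factor x document factor
            z[i] = np.random.choice(K, p=p / p.sum())
            n_dk[z[i]] += 1

    return (n_dk + alpha) / (len(doc) + K * alpha)   # formula (14)-style normalization
```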
S130: input the text to be classified into the trained SLDA-TC text classification model, infer the topics of the text and then predict its category.
Given the trained SLDA-TC model, let w̃_d be the word vector of the d-th new document, z̃_d its topic vector and ỹ_d the category prediction for the new document d; the classification of the new document is computed as follows.
It is assumed here that the test sample set and the training sample set follow the same distribution, i.e. they agree on the latent topic-category distribution; therefore p(y | z̃_d) can be replaced by p(y | z), which is revealed by the parameter δ of the SLDA-TC model, and θ̃_d is the topic probability distribution of the new document d. From formulas (12), (14) and (15) one obtains:

$\tilde y_d=\arg\max_{j\in[1,C]}\sum_{k=1}^{K}\tilde\theta_{d,k}\,\delta_{k,j}$ (16)
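Classification of the new document then combines its inferred topic distribution with the topic-category parameter delta, as in the argmax of formula (16); the sketch below assumes the infer_theta helper from the previous example.

```python
import numpy as np

def predict_category(doc, phi, delta, K, alpha=0.01):
    """Predict the class of a new document: argmax_j sum_k theta_dk * delta_kj."""
    theta_d = infer_theta(doc, phi, K, alpha)
    scores = theta_d @ delta     # delta is K x C, so scores has one entry per class
    return int(np.argmax(scores))
```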
The similarity between topics is evaluated with the JS divergence (Jensen-Shannon divergence). The JS divergence, also called the JS distance, is a variant of the KL divergence (Kullback-Leibler divergence) and is computed as in formula (17). Unlike the KL divergence, the JS divergence is symmetric and satisfies the triangle distance property.
JS(p_i || p_j) = 0.5 * KL(p_i || (p_i + p_j)/2) + 0.5 * KL(p_j || (p_i + p_j)/2) (17)
where p_i and p_j denote the word probability distributions of topic i and topic j respectively; the value range of the JS divergence is [0, 1], where 0 means that p_i and p_j have identical distributions and 1 means the opposite.
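Formula (17) can be computed directly; the sketch below is a straightforward implementation with a small epsilon to keep the logarithms finite (an implementation detail assumed here).

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """Kullback-Leibler divergence with log base 2."""
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    return float(np.sum(p * np.log2(p / q)))

def js(p, q):
    """Jensen-Shannon divergence, formula (17); lies in [0, 1] with log base 2."""
    m = (np.asarray(p, float) + np.asarray(q, float)) / 2.0
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```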
The semantic relevance between topics and categories is measured by the parameter δ of the SLDA-TC model, computed as shown in formula (12), where δ_{k,j} is the probability that topic k belongs to category j.
In one or more embodiments, the SLDA-TC model used for classification is generated by training over multiple iterations; when the iterations finish, the similarity between topics is evaluated with the JS divergence and the semantic relevance between topics and categories is evaluated with the topic-category distribution parameters of SLDA-TC. The text classification results of the SLDA-TC model are evaluated with the macro-averaged precision (Macro-Precision), the macro-averaged recall (Macro-Recall) and the macro-averaged F1 value (Macro-F1).
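The macro-averaged metrics can be computed per class and then averaged, as in the sketch below; scikit-learn's precision_recall_fscore_support with average='macro' would give the same numbers.

```python
import numpy as np

def macro_scores(y_true, y_pred, num_classes):
    """Macro-averaged precision, recall and F1 over all classes."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    precisions, recalls, f1s = [], [], []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p); recalls.append(r); f1s.append(f)
    return np.mean(precisions), np.mean(recalls), np.mean(f1s)
```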
Experimental analysis and verification:
Three data subsets of the English 20newsgroup dataset (rec, sci and talk) were selected, together with a subset of the Sogou Chinese corpus containing five categories: IT, military, education, tourism and finance. In each data subset the ratio of training samples to test samples is 8:2; the subsets are described in Table 1. The Chinese and English datasets are tokenized with jieba, English stemming uses nltk.stem, and after stop-word removal TF-IDF is used for feature selection; 60% of the feature words are retained in the experiments.
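The preprocessing pipeline described above (jieba tokenization, nltk stemming for English, stop-word removal, TF-IDF based selection keeping 60% of the feature words) might be sketched as follows; the stop-word handling and the aggregation of TF-IDF scores are assumptions made for the example.

```python
import jieba
import numpy as np
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()

def tokenize(text, chinese=True, stopwords=frozenset()):
    """jieba for Chinese, whitespace split plus Porter stemming for English."""
    tokens = jieba.lcut(text) if chinese else text.split()
    tokens = [w.lower() for w in tokens if w.strip() and w not in stopwords]
    return tokens if chinese else [stemmer.stem(w) for w in tokens]

def select_features(texts, keep_ratio=0.6, chinese=True):
    """TF-IDF feature selection keeping the top keep_ratio of feature words."""
    vec = TfidfVectorizer(tokenizer=lambda s: tokenize(s, chinese), lowercase=False)
    X = vec.fit_transform(texts)
    scores = np.asarray(X.sum(axis=0)).ravel()      # aggregate TF-IDF per word
    n_keep = int(len(scores) * keep_ratio)
    keep = np.argsort(scores)[::-1][:n_keep]
    return set(np.array(vec.get_feature_names_out())[keep])
```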
Table 1. Dataset description
To verify the effectiveness of the proposed method, the three algorithms SLDA-TC, LDA-TC and SVM were compared experimentally. SLDA-TC is the algorithm proposed in the present disclosure, LDA-TC performs classification directly on the traditional LDA model, and SVM is an SVM classification algorithm that uses the K topics of the LDA model as features.
The goal of topic inference in the SLDA-TC model is to establish a mapping between topics and categories. The number of topics K is related to the number of categories C of the labeled training set; experiments show that K only needs to take a value slightly larger than C for SLDA-TC to reach very good classification accuracy. In the experiments we analyzed the JS distances between the topics generated by the SLDA-TC topic model, the probability distributions of the top-10 feature words of each topic, and the relevance between classes and topics. Tables 2 to 4 describe the experimental results of the SLDA-TC topic model generated on the sogou data subset with C=5 and K=8, and Tables 5 to 7 describe the experimental results on the 20news-talk data subset; α, β and γ of the SLDA-TC topic model are set to 0.01.
Table 2. JS divergence between topics (SLDA-TC, sogou, C=5, K=8)
Table 3. Relevance between topics and classes (SLDA-TC, sogou, C=5, K=8)
Table 4. Probability distribution of the top-10 words of each topic (SLDA-TC, sogou, C=5, K=8)
As shown in Table 2, the JS divergence between topics 2, 5 and 7 is 0, indicating that they are identically distributed topics; Table 4 shows that the probability distributions of the top-10 feature words of these three topics are also identical. The JS divergence between the other five topics lies between 0.45 and 0.58, indicating five distinct, well-separated topics. Tables 3 and 4 show that topic 0 maps to the category "IT", topic 1 to "tourism", topic 3 to "finance", topic 4 to "military" and topic 6 to "education", each with a relevance above 99%, whereas topics 2, 5 and 7 are unrelated to any category and are called "useless" topics.
Table 5. JS divergence between topics (SLDA-TC, 20news-talk, C=3, K=6)
Table 6. Relevance between classes and topics (SLDA-TC, 20news-talk, C=3, K=6)
Table 7. Probability distribution of the top-10 words of each topic (SLDA-TC, 20news-talk, C=3, K=6)
Tables 5 to 7 show the experimental results on 20news-talk: topics 1, 3 and 4 are identically distributed "useless" topics, while the meaningful topic 0 corresponds to the category talk.politics.guns, topic 2 to talk.politics.misc and topic 5 to talk.politics.mideast.
Extensive experiments show that when K - C ≥ 2, K - C "useless" topics with pairwise JS divergence 0 are produced; therefore K only needs to be chosen slightly larger than C, so the K of the SLDA-TC topic model is easy to determine. The SLDA-TC topic model can filter out the K - C "useless" topics and establish an accurate mapping between topics and categories. At the same time, K is only slightly larger than C and far below the K value required by LDA, which significantly reduces the training time of the model.
FIGS. 4(a) to 7(c) compare the classification results of SLDA-TC, LDA-TC and SVM on the sogou Chinese dataset and the three 20newsgroup sub-datasets, where 60% of the feature words are retained after TF-IDF feature selection and the SLDA and LDA models are generated with different values of the number of topics K over 10 iterations.
In FIGS. 4(a) to 7(c), the abscissa represents the number of iterations.
FIGS. 4(a) to 7(c) show the comparison of the classification results of SLDA-TC, LDA-TC and SVM on the four datasets when the number of topics K takes different values.
(1) For SLDA-TC the number of topics K only needs to be slightly larger than the number of categories C, and its classification results are better than those of LDA-TC and SVM.
As shown in FIGS. 4(a)-4(c), on the 20news-rec dataset with C=4 and K=8, SLDA-TC reaches 95.10%, 94.99% and 94.98% on the Macro-Precision, Macro-Recall and Macro-F1 metrics, while LDA-TC reaches 63.76%, 60.91% and 60.33% and SVM reaches 68.82%, 68.33% and 68.08%. As K increases, LDA-TC and SVM improve; at K=80 LDA-TC reaches at most 71.85%, 71.38% and 71.41% and SVM at most 83.90%, 83.70% and 83.62%, still below SLDA-TC. Moreover, at K=80 the training time of the topic model is far higher than that of the SLDA-TC model at K=8.
As shown in FIGS. 5(a)-5(c), on the sogou dataset with C=5 and K=8, SLDA-TC reaches 92.80%, 92.73% and 92.70% on the Macro-Precision, Macro-Recall and Macro-F1 metrics, while LDA-TC reaches 72.67%, 68.89% and 67.48% and SVM reaches 80.69%, 80.40% and 80.28%. As K increases, the classification metrics of SVM gradually improve; when K exceeds 60 the three metrics rise to 89.26%, 89.95% and 89.24%, still below the 92.80%, 92.73% and 92.70% of SLDA-TC at K=8, and the SVM with K=60 (using LDA topics as features) incurs a much higher time cost than SLDA-TC with K=8.
(2) A larger K is not better for SLDA-TC: when K is very large, the three classification metrics of SLDA-TC actually decrease, which shows that K only needs to be slightly larger than C.
As shown in FIGS. 6(a)-6(c), on the 20news-sci dataset with C=4 the classification results of SLDA-TC are poor when K=90. The reason is that with C=4 only 4 of the 90 generated topics are related to the categories, while the remaining 86 are "useless" topics that do not help classification but instead introduce interference and degrade the results. The same holds for the 20news-talk dataset with C=3 and K=90 shown in FIGS. 7(a)-7(c). Extensive experimental results show that, for the SLDA-TC algorithm, a K slightly larger than C is enough to achieve high classification accuracy.
Table 8 compares the time performance and classification results of SLDA-TC, LDA-TC and SVM on the different datasets.
Table 8. Comparison of the time performance and classification results of SLDA-TC, LDA-TC and SVM
The generation time of a topic model is proportional to the number of topics K: the larger K, the higher the time cost. For the SLDA-TC model the number of topics K only needs to be slightly larger than the number of categories C to obtain very good classification results, whereas LDA-TC and the SVM algorithm using LDA topics as features need K to reach several tens or even hundreds before better classification results are obtained, as can also be seen from the experimental results shown in FIGS. 4(a)-7(c).
As shown in Table 8, on 20news-rec the ratio of the LDA to SLDA-TC model generation times is 4.86, i.e. SLDA is 4.86 times faster than LDA, because their K values are 200 and 8 respectively. On the 20news-sci, 20news-talk and sogou datasets SLDA is 4.78, 5.16 and 4.90 times faster than LDA respectively. At the same time, in terms of the Macro-Precision, Macro-Recall and Macro-F1 metrics, the SLDA-TC algorithm clearly outperforms the LDA-TC and SVM algorithms: on the four datasets SLDA-TC is 3.10%-9.30% higher than SVM and 7.10%-34.08% higher than LDA-TC.
In summary, the number of topics K of the SLDA-TC model only needs to take a value slightly larger than the number of categories C; the model can identify the topics closely related to the categories, and it clearly outperforms LDA-TC and the SVM algorithm using LDA topics as features in both classification accuracy and time performance.
Addressing the problems of LDA in text classification, the present disclosure proposes the SLDA-TC text classification model based on a supervised topic model and the SLDA-TC-Gibbs parameter estimation algorithm: each time one latent component z_i = k of the word w_i = t is sampled, the other components z_{¬i} of z are kept unchanged and y_{¬m} is also kept unchanged, i.e. latent topics are sampled only from the other training documents whose document class label is the same as that of the word's document, because the topic distributions of documents of the same category are similar; a theoretical proof is given. The SLDA-TC model introduces the parameter δ for the topic-category probability distribution and, through the φ, θ and δ probability distributions, extracts the latent semantic mappings between words and topics, documents and topics, and topics and categories. In addition, the number of SLDA-TC topics K only needs to be slightly larger than the number of categories. Experiments show that the SLDA-TC model can significantly improve classification accuracy and time efficiency.
One or more embodiments of the present disclosure further provide a text classification system, including a text input device, a controller and a display device, where the controller includes a memory and a processor, the memory stores a computer program, and the program, when executed by the processor, implements the following steps shown in FIG. 1:
(1) Construct the SLDA-TC text classification model. Unlike the unsupervised LDA model, every document in the training document set of the SLDA-TC text classification model carries a category label; the parameters to be estimated include not only the text-topic probability distribution and the topic-word probability distribution but also the topic-category probability distribution, and the text-topic, topic-word and topic-category probability distributions all follow Dirichlet distributions.
(2) Train the SLDA-TC text classification model and estimate the SLDA-TC model parameters.
Specifically, in the process of training the SLDA-TC text classification model, the number of topics K is first set to a value slightly larger than the number of categories C; the latent topic of each word is then sampled according to the SLDA-TC-Gibbs algorithm, and latent topics are sampled only from other training texts whose category label is the same as that of the text containing the word. After the latent topic of every word has been determined, the text-topic, topic-word and topic-category probability distributions are computed from the topic-word, document-topic and topic-category frequency counts, and an accurate mapping between topics and categories is established.
(3) Topic inference and classification of the text to be classified. The text to be classified is input into the trained SLDA-TC text classification model; first the latent topic of each word of the document is sampled, then the topic probability distribution of the text is inferred, and the category label of the text is output from its topic distribution and the topic-category distribution of the SLDA-TC model.
(4) Evaluation of the SLDA-TC model and of the classification results. For the SLDA-TC model generated by multiple iterations of training, the similarity between topics is evaluated with the JS divergence and the semantic relevance between topics and categories is evaluated with the topic-category distribution parameters of SLDA-TC; the text classification results of the SLDA-TC model are evaluated with the macro-averaged precision (Macro-Precision), macro-averaged recall (Macro-Recall) and macro-averaged F1 value (Macro-F1).
The text classification method and system of the present disclosure construct and train an SLDA-TC text classification model and use the text-topic probability distribution, the topic-word probability distribution and the topic-category probability distribution to extract the latent semantic mappings between words and topics, documents and topics, and topics and categories, which improves text classification accuracy and time efficiency.
Those skilled in the art should understand that the embodiments of the present disclosure may be provided as a method, a system or a computer program product. Therefore, the present disclosure may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, optical storage and the like) containing computer-usable program code.
The present disclosure is described with reference to flowcharts and/or block diagrams of methods, devices (systems) and computer program products according to embodiments of the invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps is executed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
Those of ordinary skill in the art can understand that all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware; the program may be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM) or the like.
Although the specific embodiments of the present disclosure have been described above with reference to the accompanying drawings, they do not limit the protection scope of the present disclosure. Those skilled in the art should understand that, on the basis of the technical solutions of the present disclosure, various modifications or variations that can be made without creative effort still fall within the protection scope of the present disclosure.
Claims (8)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811398232.1A CN109408641B (en) | 2018-11-22 | 2018-11-22 | A text classification method and system based on supervised topic model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109408641A true CN109408641A (en) | 2019-03-01 |
CN109408641B CN109408641B (en) | 2020-06-02 |
Family
ID=65474659
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811398232.1A Active CN109408641B (en) | 2018-11-22 | 2018-11-22 | A text classification method and system based on supervised topic model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109408641B (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102662960A (en) * | 2012-03-08 | 2012-09-12 | 浙江大学 | On-line supervised theme-modeling and evolution-analyzing method |
CN103810500A (en) * | 2014-02-25 | 2014-05-21 | 北京工业大学 | Place image recognition method based on supervised learning probability topic model |
CN103903164A (en) * | 2014-03-25 | 2014-07-02 | 华南理工大学 | Semi-supervised automatic aspect extraction method and system based on domain information |
CN105069143A (en) * | 2015-08-19 | 2015-11-18 | 百度在线网络技术(北京)有限公司 | Method and device for extracting keywords from document |
Non-Patent Citations (4)
Title |
---|
BLEI D M: "Supervised Topic Models", Advances in Neural Information Processing Systems *
SHIBIN ZHOU: "Text Categorization Based on Topic Model", International Journal of Computational Intelligence Systems *
LI WENBO: "A New Text Classification Algorithm Based on the Labeled-LDA Model" (基于Labeled-LDA模型的文本分类新算法), Chinese Journal of Computers (计算机学报) *
WANG DANDAN: "Text Classification Based on Macro-Feature Fusion" (基于宏特征融合的文本分类), Journal of Chinese Information Processing (中文信息学报) *
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111723198A (en) * | 2019-03-18 | 2020-09-29 | 北京京东尚科信息技术有限公司 | Text emotion recognition method and device and storage medium |
CN111723198B (en) * | 2019-03-18 | 2023-09-01 | 北京汇钧科技有限公司 | Text emotion recognition method, device and storage medium |
CN110135592A (en) * | 2019-05-16 | 2019-08-16 | 腾讯科技(深圳)有限公司 | Classifying quality determines method, apparatus, intelligent terminal and storage medium |
CN110135592B (en) * | 2019-05-16 | 2023-09-19 | 腾讯科技(深圳)有限公司 | Classification effect determining method and device, intelligent terminal and storage medium |
CN110321434A (en) * | 2019-06-27 | 2019-10-11 | 厦门美域中央信息科技有限公司 | A kind of file classification method based on word sense disambiguation convolutional neural networks |
CN110569270A (en) * | 2019-08-15 | 2019-12-13 | 中国人民解放军国防科技大学 | A Bayesian-based LDA topic label calibration method, system and medium |
CN110569270B (en) * | 2019-08-15 | 2022-07-05 | 中国人民解放军国防科技大学 | Bayesian-based LDA topic label calibration method, system and medium |
CN110795564B (en) * | 2019-11-01 | 2022-02-22 | 南京稷图数据科技有限公司 | Text classification method lacking negative cases |
CN110795564A (en) * | 2019-11-01 | 2020-02-14 | 南京稷图数据科技有限公司 | Text classification method lacking negative cases |
CN110825850A (en) * | 2019-11-07 | 2020-02-21 | 哈尔滨工业大学(深圳) | Natural language theme classification method and device |
CN110825850B (en) * | 2019-11-07 | 2022-07-08 | 哈尔滨工业大学(深圳) | Natural language theme classification method and device |
CN111368532A (en) * | 2020-03-18 | 2020-07-03 | 昆明理工大学 | A Disambiguation Method and System for Subject Word Embedding Based on LDA |
CN111368532B (en) * | 2020-03-18 | 2022-12-09 | 昆明理工大学 | Topic word embedding disambiguation method and system based on LDA |
CN112733542B (en) * | 2021-01-14 | 2022-02-08 | 北京工业大学 | Theme detection method and device, electronic equipment and storage medium |
CN112733542A (en) * | 2021-01-14 | 2021-04-30 | 北京工业大学 | Theme detection method and device, electronic equipment and storage medium |
CN113032573A (en) * | 2021-04-30 | 2021-06-25 | 《中国学术期刊(光盘版)》电子杂志社有限公司 | Large-scale text classification method and system combining theme semantics and TF-IDF algorithm |
CN113032573B (en) * | 2021-04-30 | 2024-01-23 | 同方知网数字出版技术股份有限公司 | Large-scale text classification method and system combining topic semantics and TF-IDF algorithm |
CN113591462A (en) * | 2021-07-28 | 2021-11-02 | 咪咕数字传媒有限公司 | Bullet screen reply generation method and device and electronic equipment |
CN114550749A (en) * | 2022-01-10 | 2022-05-27 | 山东师范大学 | Student behavior log generation method and system based on audio scene recognition |
CN114610576A (en) * | 2022-03-15 | 2022-06-10 | 中国银行股份有限公司 | Log generation monitoring method and device |
CN118333637A (en) * | 2024-06-13 | 2024-07-12 | 中南大学 | Product recall prediction method and system based on topic model |
Also Published As
Publication number | Publication date |
---|---|
CN109408641B (en) | 2020-06-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109408641B (en) | A text classification method and system based on supervised topic model | |
Xu | Understanding graph embedding methods and their applications | |
CN106383877A (en) | On-line short text clustering and topic detection method of social media | |
Zhang et al. | Hypergraph based information-theoretic feature selection | |
CN108959305A (en) | A kind of event extraction method and system based on internet big data | |
CN114860930A (en) | A text classification method, device and storage medium | |
CN112269874A (en) | Text classification method and system | |
CN105955975A (en) | Knowledge recommendation method for academic literature | |
Vidyashree et al. | An improvised sentiment analysis model on twitter data using stochastic gradient descent (SGD) optimization algorithm in stochastic gate neural network (SGNN) | |
CN108470025A (en) | Partial-Topic probability generates regularization own coding text and is embedded in representation method | |
CN111209402A (en) | A text classification method and system integrating transfer learning and topic model | |
Budhiraja et al. | A supervised learning approach for heading detection | |
CN110674293A (en) | A text classification method based on semantic transfer | |
Mylonas et al. | Zero-shot classification of biomedical articles with emerging mesh descriptors | |
Chemchem et al. | Deep learning and data mining classification through the intelligent agent reasoning | |
Chen et al. | Quantifying similarity between relations with fact distribution | |
Subeno et al. | Optimisation towards Latent Dirichlet Allocation: Its Topic Number and Collapsed Gibbs Sampling Inference Process. | |
Kalangi et al. | Sentiment analysis using machine learning | |
Najar et al. | On smoothing and scaling language model for sentiment based information retrieval | |
Yazdi et al. | Generalized probabilistic clustering projection models for discrete data | |
Zhang et al. | Probabilistic verb selection for data-to-text generation | |
CN103793491B (en) | Chinese news story segmentation method based on flexible semantic similarity measurement | |
Saratha et al. | A novel approach for improving the accuracy using word embedding on deep neural networks for software requirements classification | |
Kim et al. | Variable selection for latent dirichlet allocation | |
US20250103813A1 (en) | Generating an improved named entity recognition model using noisy data with a self-cleaning discriminator model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |