
1 Introduction

Latent Tree Analysis (LTA) describes the correlations among a set of observed variables using a tree-structured model called a Latent Tree Model (LTM) [1, 2]. In the model, leaf nodes represent observed variables, internal nodes represent latent variables, and the dependence between two observed variables is explained by the path between them. In recent years, LTA has been widely used in academic research and has led to effective new approaches such as cluster analysis [3, 4], topic detection [5], and deep probabilistic modelling [6]. Among these applications, topic detection on text data shows the best results. Liu et al. modelled the words in a text collection using the word co-occurrence matrix and soft-partitioned the documents [5]; as a result, each document may belong to several partitions, and the collection of documents in a partition is interpreted as a topic. In addition, the LTM organizes the learned latent variables into multiple levels, which leads to another approach to hierarchical topic detection, Hierarchical Latent Tree Analysis (HLTA). HLTA has been shown to produce better topics and topic hierarchies than previous state-of-the-art methods based on latent Dirichlet allocation [7].

HLTA depends on the EM algorithm to estimate parameters, so there is still room for improvement in efficiency. Chen et al. use progressive EM (PEM) to speed up the parameter estimation of the intermediate models in HLTA [8]. In each step, PEM maximizes only the local likelihood of submodels; that is, EM is run on a model that involves only 3 or 4 observed variables. This improvement of PEM acts on the E-step of the EM algorithm. This paper adopts an acceleration optimization method and improves PEM from the M-step, thereby further enhancing the computational efficiency of the PEM algorithm.

2 Background

The LTM is a tree-structured Bayesian network in which leaf nodes represent observed variables and internal nodes represent latent variables [3, 9]. In general, an LTM has n observed variables \( {\mathbf{X}} = \left\{ {x_{1} ,x_{2} , \ldots , x_{n} } \right\} \) and m latent variables \( {\mathbf{Z}} = \left\{ {z_{1} ,z_{2} , \ldots ,z_{m} } \right\} \). The parent of a variable \( Y \) is denoted \( pa\left( Y \right) \); if \( Y \) is the root, \( pa\left( Y \right) \) is empty. The LTM defines the joint distribution of all observed and latent variables as \( p\left( {x_{1} , \ldots ,x_{n} ,z_{1} , \ldots ,z_{m} } \right) = \prod\nolimits_{{Y \in {\mathbf{X}} \cup {\mathbf{Z}}}} {P\left( {Y |pa\left( Y \right)} \right)} \).
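To make the factorization concrete, the following is a minimal sketch (not taken from the cited implementations) that evaluates this product for a toy LTM with one binary latent root and two binary observed leaves; the tree structure, variable names and conditional probability values are illustrative assumptions.

```python
# Minimal sketch of the LTM factorization P(X, Z) = prod_Y P(Y | pa(Y)).
# The structure and CPT numbers below are illustrative assumptions, not
# values from the paper.

# Structure: latent root z1 with observed children x1, x2 (all binary).
cpts = {
    # P(z1): the root has an empty parent, indexed here by the dummy value 0.
    "z1": {0: [0.6, 0.4]},
    # P(x1 | z1) and P(x2 | z1), indexed by the parent's value.
    "x1": {0: [0.9, 0.1], 1: [0.2, 0.8]},
    "x2": {0: [0.8, 0.2], 1: [0.3, 0.7]},
}
parent = {"z1": None, "x1": "z1", "x2": "z1"}

def joint(assignment):
    """P(x1, x2, z1) as the product of P(Y | pa(Y)) over all variables."""
    p = 1.0
    for var, cpt in cpts.items():
        pa = parent[var]
        pa_val = 0 if pa is None else assignment[pa]
        p *= cpt[pa_val][assignment[var]]
    return p

# Marginal P(x1=1, x2=1), obtained by summing the latent variable out.
print(sum(joint({"z1": z, "x1": 1, "x2": 1}) for z in (0, 1)))
```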

Liu et al. proposed a method for analyzing text data with models based on the LTM [5]. The observed variables at the bottom layer are binary and indicate the presence or absence of words in a document. Above the bottom layer are several layers of latent variables; each latent variable captures the probabilistic co-occurrence of the variables below it and is used to explain word co-occurrence patterns. As a result, the topics obtained at low levels have specific meanings, while the topics captured at high levels have more abstract meanings.

2.1 Preprocessing

Prior to analysis, the n terms with the highest average TF-IDF values are selected [5, 8, 10]. For a document set D, the average TF-IDF of a term t is \( {\text{TF}}{\rm - }{\text{IDF}}\left( {t,D} \right) = \frac{{\mathop \sum \nolimits_{d \in D} tf\left( {t,d} \right) \cdot idf\left( {t,D} \right)}}{\left| D \right|} \), where \( \left| D \right| \) is the total number of documents in the corpus and \( tf\left( {t,d} \right) \) is the frequency of term t in document d. \( idf\left( {t,D} \right) = log\left( {\left| D \right|/\left| {\left\{ {d \in D:t \in d} \right\}} \right|} \right) \) is the inverse document frequency of term \( t \) in document set \( D \). Traditional TF-IDF treats terms as mutually independent, but within a document the terms are related to each other through the current situation, context and semantics. To compensate for the fact that only word frequency is usually considered when topic words are extracted, this paper combines part of speech with traditional TF-IDF (Pos Weight TF-IDF, PW_TF-IDF) [11] to weight the terms of a document. The PW_TF-IDF value aims to optimize term selection and thereby improve topic coherence.
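As an illustration of the selection step, the following sketch computes the average TF-IDF of each term exactly as in the formula above; the toy corpus and the use of the natural logarithm are assumptions, not details from the cited papers.

```python
import math
from collections import Counter

def avg_tf_idf(docs):
    """Average TF-IDF of each term over a corpus, following
    TF-IDF(t, D) = sum_{d in D} tf(t, d) * idf(t, D) / |D|."""
    n_docs = len(docs)
    df = Counter()                       # document frequency of each term
    for d in docs:
        df.update(set(d))
    scores = Counter()
    for d in docs:
        tf = Counter(d)                  # raw term frequency within document d
        for t, f in tf.items():
            scores[t] += f * math.log(n_docs / df[t])
    return {t: s / n_docs for t, s in scores.items()}

# Toy corpus (an assumption); in practice the terms with the highest
# average TF-IDF would be kept as the vocabulary.
docs = [["latent", "tree", "model"], ["tree", "topic"], ["topic", "model", "model"]]
top = sorted(avg_tf_idf(docs).items(), key=lambda kv: -kv[1])
print(top[:3])
```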

2.2 PEM

The EM algorithm is a statistical algorithm commonly used for parameter estimation. In a latent tree model m, let \( {\mathbf{X}} \) and \( {\mathbf{H}} \) denote the sets of observed and latent variables, respectively, and \( {\mathbf{V}} = {\mathbf{X}} \cup {\mathbf{H}} \). Assume that a latent variable is selected as the root and all edges are directed away from the root. For any non-root variable \( v \) in \( {\mathbf{V}} \), the parent \( {\text{pa}}\left( v \right) \) of \( v \) is a latent variable taking the value “0” or “1”. When v is the root, \( {\text{pa}}\left( v \right) \) is a virtual variable with only one possible value. List all the variables as \( v_{1} ,v_{2} , \ldots ,v_{\text{n}} \). The parameters of m are \( \theta_{ijk} = P(v_{i} = k|{\text{pa}}\left( {v_{i} } \right) = j) \),

where \( {\text{i}} \in \left\{ {1, \ldots ,{\text{n}}} \right\} \), \( {\text{k}} \) is a value of \( v_{i} \), and \( {\text{j}} \) is a value of \( {\text{pa}}\left( {v_{\text{i}} } \right) \). \( \theta \) is the vector of all parameters. For a given data set \( {\text{D}} \), the log-likelihood function of θ is \( l\left( {\theta |D} \right) = \mathop \sum \limits_{d \in D} log\mathop \sum \limits_{{\mathbf{H}}} P(d,{\mathbf{H}}|\theta ) \). The maximum likelihood estimate of \( \theta \) is the value that maximizes the log-likelihood. EM starts from an initial estimate \( \theta^{\left( 0 \right)} \) and generates a sequence of estimates \( \left\{ {\theta^{\left( 1 \right)} ,\theta^{\left( 2 \right)} , \ldots } \right\} \). Given the current estimate \( \theta^{{\left( {\text{t}} \right)}} \), the next estimate \( \theta^{{\left( {{\text{t}} + 1} \right)}} \) is obtained through the E-step and the M-step. For the latent tree model, the two steps of the EM algorithm are as follows:

$$ {\text{n}}_{ijk}^{{\left( {\text{t}} \right)}} = \sum\nolimits_{d \in D} {P(v_{i} = k, pa\left( {v_{i} } \right) = j|d,m,\theta^{t} )} $$
(1)
$$ \theta_{ijk}^{t + 1} = \frac{{n_{ijk}^{t} }}{{\mathop \sum \nolimits_{k} n_{ijk}^{t} }} $$
(2)
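As a minimal sketch of how Eqs. (1) and (2) translate into code for a single variable \( v_{i} \) with a binary parent: the E-step posterior is abstracted behind an assumed `posterior` callback (in practice it would be obtained by message passing on the tree), so this is only an illustration, not the HLTA implementation.

```python
def em_update(data, posterior, n_states=2, n_parent_states=2):
    """One EM iteration for the parameters theta_ijk = P(v_i = k | pa(v_i) = j)
    of a single variable, following Eqs. (1) and (2).

    `posterior(d, j, k)` is assumed to return P(v_i = k, pa(v_i) = j | d, m, theta_t),
    i.e. the quantity inferred in the E-step.
    """
    # E-step: expected counts n_ijk (Eq. 1).
    n = [[0.0] * n_states for _ in range(n_parent_states)]
    for d in data:
        for j in range(n_parent_states):
            for k in range(n_states):
                n[j][k] += posterior(d, j, k)
    # M-step: normalize the counts to obtain theta_ijk^{t+1} (Eq. 2).
    # (Assumes each parent state receives a nonzero expected count.)
    theta = [[n[j][k] / sum(n[j]) for k in range(n_states)]
             for j in range(n_parent_states)]
    return theta
```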

The PEM submodels are shown in Fig. 1. Suppose that Y is selected as the root and all parameters of the model are to be estimated. First, run EM on the shaded submodel in Fig. 1(a) to estimate P(Y), P(A|Y), P(B|Y), and P(D|Y); then run EM on the shaded submodel in Fig. 1(b), fixing P(Y), P(B|Y) and P(D|Y), to estimate P(Z|Y), P(C|Z) and P(E|Z). The shortcoming of the EM algorithm is its high computational cost and slow convergence on relatively large data sets. Various methods have been proposed to accelerate EM, such as the incremental EM algorithm, the lazy EM algorithm, and hybrid EM algorithms. The computational improvement of the PEM algorithm of Chen et al. [8] lies mainly in the E-step, while the M-step is not considered; applying an acceleration to the M-step and combining it with the improved E-step can further increase the computational efficiency of the EM algorithm.

Fig. 1. PEM submodel
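Schematically, the progressive scheme of Fig. 1 can be written as follows; the helper `run_em` and the string labels for the distributions are assumptions used purely for illustration, not the actual PEM-HLTA code.

```python
# Schematic of the progressive estimation in Fig. 1.
# `run_em(targets, data, fixed)` is an assumed helper that runs EM to estimate
# the distributions named in `targets` while holding those in `fixed` unchanged,
# and returns the estimates as a dict.

def progressive_estimate(data, run_em):
    params = {}
    # Step (a): EM on the shaded submodel of Fig. 1(a).
    params.update(run_em(["P(Y)", "P(A|Y)", "P(B|Y)", "P(D|Y)"], data, fixed={}))
    # Step (b): EM on the shaded submodel of Fig. 1(b), with the distributions
    # already estimated in step (a) held fixed.
    params.update(run_em(["P(Z|Y)", "P(C|Z)", "P(E|Z)"], data,
                         fixed={k: params[k] for k in ("P(Y)", "P(B|Y)", "P(D|Y)")}))
    return params
```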

3 Research Methods

3.1 Word Selection Based on PW_TF-IDF

There are two main aspects of keyword selection: word weighting and topic model selection. This article adds part-of-speech information to the frequency-based word weights in order to select words that are better suited to the subsequent topic model. Han et al. studied the contribution of different part-of-speech features in texts and verified the influence of nouns, verbs, adjectives and adverbs, and their combinations, on Chinese and English texts [12]. Their experimental results show that these four parts of speech are the most important for characterizing the content of a text. This paper counts the percentages of nouns, verbs, adjectives and adverbs after word segmentation and assigns a different weight coefficient to each of these four parts of speech; other parts of speech are still weighted according to traditional TF-IDF. \( PW\_TF{\rm - }IDF = k *TF{\rm - }IDF \), where TF-IDF is the value obtained by the traditional calculation and the coefficient \( k \) is the weight coefficient of the four parts of speech. By randomly sampling 1000 documents from the NIPS and Reuters data sets, the percentages of nouns, verbs, adjectives and adverbs were obtained as part-of-speech weight coefficients; they are shown in Table 1.

Table 1. Weight coefficient of part of speech
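A minimal sketch of the PW_TF-IDF weighting is given below; the POS tag names and coefficient values are placeholders, the actual coefficients being those of Table 1.

```python
def pw_tf_idf(pos_tag, tf_idf_value, pos_weights):
    """PW_TF-IDF = k * TF-IDF, where k is the part-of-speech weight coefficient.
    Terms whose tag is not a noun, verb, adjective or adverb keep k = 1."""
    return pos_weights.get(pos_tag, 1.0) * tf_idf_value

# Hypothetical coefficients; the real values are the Table 1 statistics.
pos_weights = {"noun": 1.3, "verb": 1.1, "adj": 0.9, "adv": 0.7}
print(pw_tf_idf("noun", 0.42, pos_weights))   # 0.42 scaled by the noun weight
```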

3.2 Improved Aitken Accelerated PEM

The Aitken acceleration method constructs a new iterative function from the iterative function of a simple fixed-point iteration. Theorem [13]: Let the sequence \( \left\{ {p_{n} |n \in \left[ {0,\infty } \right)} \right\} \) converge linearly to the limit \( p \) with \( p - p_{n} \ne 0 \) and \( \mathop {\lim }\limits_{{{\text{n}} \to \infty }} \frac{{p - p_{n + 1} }}{{p - p_{n} }} = A\;(\left| A \right| < 1) \). Then the sequence \( \{ r_{n} |n \in \left[ {0,\infty } \right)\} \) defined by \( r_{n} = p_{n} - \frac{{\left( {p_{n + 1} - p_{n} } \right)^{2} }}{{2p_{n + 2} - 3p_{n + 1} + p_{n} }} \) also converges to \( p \), and faster than the sequence \( \left\{ {p_{n} } \right\} \); its terms are closer to the true value of \( p \), that is, \( \mathop {\lim }\limits_{{{\text{n}} \to \infty }} \frac{{p - r_{n} }}{{p - p_{n} }} = 0 \). This method is applied to the sequence \( \left\{ {\theta^{\left( t \right)} } \right\} \) produced by the PEM algorithm: Aitken acceleration based on the above theorem is performed on each submodel of the latent tree model.

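For illustration, the following sketch applies the classical Aitken Δ² step (the form the theorem of [13] refers to) to three consecutive scalar estimates; in PEM it would be applied to the parameter estimates of each submodel. The convergence guard `eps` is an added assumption.

```python
def aitken(p_n, p_n1, p_n2, eps=1e-12):
    """Classical Aitken delta-squared step on three consecutive estimates:
    r_n = p_n - (p_{n+1} - p_n)^2 / (p_{n+2} - 2 p_{n+1} + p_n)."""
    denom = p_n2 - 2 * p_n1 + p_n
    if abs(denom) < eps:          # sequence has (nearly) converged; keep latest value
        return p_n2
    return p_n - (p_n1 - p_n) ** 2 / denom

# Example: p_n = 1 - 0.8**n converges linearly to 1; one Aitken step recovers
# the limit (up to rounding) from the first three terms.
seq = [1 - 0.8 ** n for n in range(3)]
print(aitken(*seq))               # approximately 1.0
```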

4 Experimental Results

This work seeks to optimize PEM-HLTA; therefore, the optimization methods proposed here are compared with PEM-HLTA. It is not necessary to compare against LDA, HLDA or nHDP, because PEM-HLTA has already been shown to outperform them in the literature [5, 8].

Experimental environment: Windows 7 64-bit / CPU i5 3.2 GHz / RAM 12 GB / Java 1.8. All experimental parameters were the same as those in reference [8].

4.1 Data Sources

The NIPS and Reuters data sets used in the experiments differ from the NIPS data set of Liu et al. [5] and Chen et al. [8]. We use the NIPS data set from Kaggle, which covers the conferences from 1987 to 2016 and contains 6560 documents. The NIPS data are used in two experiments. Experiment 1 selects 1955 documents in the same way as Chen et al. [8]; Experiment 2 uses all documents. Each experiment uses TF-IDF values and PW_TF-IDF values to select vocabularies of size 1000, 3000, 5000, 7000, and 10000, denoted Nips-1k, Nips-3k, Nips-5k, Nips-7k, and Nips-10k. After preprocessing, the two sets of NIPS data were compared using PWA-PEM-HLTA and PEM-HLTA. Experiment 3 uses exactly the same configuration as Experiment 1; the only difference is that it uses the Reuters data set, in order to verify that the method has the same effect on a different type of data set.

4.2 Topic Coherence Evaluation

The topic semantic coherence score was calculated using the method of [14]. The coherence score of a topic t is defined as:

$$ C\left( {t,W^{\left( t \right)} } \right) = \sum\nolimits_{m = 2}^{M} {\sum\nolimits_{l = 1}^{m - 1} {log} } \frac{{D\left( {w_{m}^{\left( t \right)} ,w_{l}^{\left( t \right)} } \right) + 1}}{{D\left( {w_{l}^{\left( t \right)} } \right)}} $$
(3)

Where \( \varvec{W}^{{\left( {\text{t}} \right)}} = \left\{ {w_{1}^{{\left( {\text{t}} \right)}} , \ldots ,w_{\text{M}}^{{\left( {\text{t}} \right)}} } \right\} \) is the list of the M words used to describe topic t, \( {\text{D}}\left( {{\text{w}}_{\text{i}} } \right) \) is the document frequency of word \( {\text{w}}_{\text{i}} \), and \( {\text{D}}\left( {{\text{w}}_{\text{i}} , {\text{w}}_{\text{j}} } \right) \) is the co-document frequency of words \( {\text{w}}_{\text{i}} \) and \( {\text{w}}_{\text{j}} \), i.e. the number of documents containing both words. Given two sets of topics, the set with the higher average topic coherence is considered better.
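A direct sketch of Eq. (3) is shown below; the toy document collection is an assumption used only to demonstrate the computation.

```python
import math

def coherence(top_words, docs):
    """Topic coherence of Eq. (3): sum over word pairs (w_m, w_l), l < m,
    of log((D(w_m, w_l) + 1) / D(w_l)), where D(.) is document frequency."""
    doc_sets = [set(d) for d in docs]

    def D(*words):
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    for m in range(1, len(top_words)):      # m runs over the 2nd..M-th word
        for l in range(m):                  # l runs over the words before w_m
            score += math.log((D(top_words[m], top_words[l]) + 1)
                              / D(top_words[l]))
    return score

# Toy collection (an assumption): each inner list is one document's words.
docs = [["topic", "model", "tree"], ["topic", "model"], ["tree", "latent"]]
print(coherence(["topic", "model", "tree"], docs))
```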

4.3 Experiment 1

Table 2 shows the runtime statistics. The improved method clearly outperforms the PEM-HLTA method, with an average efficiency improvement of around 5 times. The rates of improvement are shown in Table 3: execution efficiency increases by 4.9 times on average and by 6 times in the best case. From Fig. 2(a) it can be seen that the A-PEM-HLTA and PWA-PEM-HLTA methods show no obvious difference below Nips-10k. When the vocabulary reaches 10k, PWA-PEM-HLTA performs best, taking 58 min, whereas PEM-HLTA takes 431 min. Comparing runs with and without part-of-speech weights also suggests that part-of-speech weights select words that are more closely related to the topics, which reduces the number of PEM iterations and improves EM efficiency.

Table 2. nips-1955 runtime/min
Table 3. nips-1955 comparison of improvement multiples (multiple = pre-improvement time / post-improvement time − 1)
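To illustrate the definition of the improvement multiple with purely hypothetical numbers: a run whose time is reduced from 300 min to 60 min has a multiple of

$$ \frac{300}{60} - 1 = 4, $$

i.e. a four-fold speed-up.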
Fig. 2. Comparison of nips-1955 running time and consistency score

Table 4 shows the average topic coherence scores of the topics generated by the improved algorithms. Comparing PW-PEM-HLTA (part-of-speech optimization) with PEM-HLTA, and A-PEM-HLTA (Aitken acceleration) with PWA-PEM-HLTA (part-of-speech weighting with Aitken acceleration), shows that nips-5k is a watershed: when the vocabulary exceeds 5k, the part-of-speech weights give some advantage in term selection. When the number of selected terms is small, the top-ranked words are already covered by the plain TF-IDF selection; when the selection range expands to the middle and lower parts of the ranking, the advantage of part-of-speech weighting in term selection becomes apparent. When the variants with Aitken acceleration (A-PEM-HLTA and PWA-PEM-HLTA) are compared with PEM-HLTA and PW-PEM-HLTA, the average topic coherence score drops noticeably, by around 0.81 ± 0.2 on average. In Fig. 2(b), the PW-PEM-HLTA coherence score is best when the vocabulary reaches 5k; however, as the number of selected terms increases and most terms are selected, its scores converge to those of PEM-HLTA. The overall trends of the average topic coherence scores of A-PEM-HLTA and PWA-PEM-HLTA are consistent, and when the vocabulary exceeds 5k the latter is slightly better than the former. At the same time, Aitken acceleration can lower the overall coherence score, because it is based on a simple fixed-point iteration and oscillates around the convergence value, which does not guarantee a stable increase of the EM likelihood.

Table 4. nips-1955 average thematic consistency score

Planned solution: switch back to the original EM algorithm or to the ECM algorithm when the sequence oscillates near the convergence value, so that the likelihood can again increase steadily.

4.4 Experiment 2

Comparing the results of Experiment 2 in Tables 5, 6 and 7 with those of Experiment 1 in Tables 2, 3 and 4, the average improvement in computational efficiency is 4.97-fold, approximately the same as in Experiment 1. The part-of-speech weighting shows the same tendencies in computational efficiency and in average topic coherence score. The experimental results show that the improved method is equally effective on small and relatively large data sets.

Table 5. nips-6560 running time/min
Table 6. nips-6560 average theme consistency score
Table 7. Reuters-2000 running time/min

4.5 Experiment 3

The results are shown in Tables 7 and 8. Under the same environmental conditions, the results on the Reuters news data set show the same trend as on the NIPS data set, but the efficiency gain is smaller. A statistical analysis of the data sets shows that the Reuters news data contain 289,759 word tokens in total, an average of 144.8795 words per document, whereas the NIPS data set contains 8,515,607 word tokens, an average of 4257.8035 words per document. Under the bag-of-words representation, the Reuters documents are therefore much sparser than the NIPS documents, so the improvement in computational efficiency is not as large as on the NIPS data set.
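These per-document averages are simply the total token counts divided by the number of documents; for example, assuming the 2000 Reuters documents suggested by the “Reuters-2000” label,

$$ \frac{289{,}759}{2000} = 144.8795\;{\text{words per document}}. $$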

Table 8. Reuters-2000 average theme consistency score

5 Conclusions

Based on the state-of-the-art hierarchical topic detection method HLTA, we improved the PEM-HLTA method to reduce computation time, so that a single machine can handle relatively large data sets without simply adding more computing resources. The empirical results show that PWA-PEM-HLTA significantly improves execution efficiency: on a personal computer, a 6k-document data set with a 10k-word vocabulary can be processed within 12 h, and a 6k-document data set with a 5k-word vocabulary within 2 h.

In the future, we plan to further study HLTA’s multi-category assignment of words and to improve the topic semantic coherence scores. Another direction is research on a distributed implementation of HLTA.