
1 Introduction

Latent Tree Analysis (LTA) describes the correlations among a set of observed variables using a tree-structured model called a Latent Tree Model (LTM) [1, 2]. In the model, leaf nodes represent observed variables, internal nodes represent latent variables, and the dependence between two observed variables is explained by the path between them. In recent years, LTA has been widely used in academic research and has led to effective new approaches such as cluster analysis [3, 4], topic detection [5], and deep probabilistic modelling [6]. Among these applications, topic detection on text data shows the best results. Liu et al. modelled the words in a text collection using the word co-occurrence matrix and soft-partitioned the documents [5]; as a result, each document may belong to several partitions, and the collection of documents in a partition is interpreted as a topic. In addition, the LTM organizes the learned latent variables into multiple levels, which leads to another approach to hierarchical topic detection, Hierarchical Latent Tree Analysis (HLTA). HLTA has been shown to produce better topics and topic hierarchies than previous state-of-the-art methods based on latent Dirichlet allocation [7].

HLTA depends on the EM algorithm to estimate parameters, so there is still room for improvement in efficiency. Chen et al. use progressive EM (PEM) to speed up the parameter estimation of the intermediate models in HLTA [8]. In each step, PEM maximizes only the local likelihood of submodels; that is, EM is run on a model that involves only 3 or 4 observed variables. This improvement of PEM acts on the E-step of the EM algorithm. This paper adopts an acceleration optimization method and improves PEM from the M-step, thereby further enhancing the computational efficiency of the PEM algorithm.

2 Background

The LTM is a tree-structured Bayesian network in which leaf nodes represent observed variables and internal nodes represent latent variables [3, 9]. In general, an LTM has n observed variables \( {\mathbf{X}} = \left\{ {x_{1} ,x_{2} , \ldots , x_{n} } \right\} \) and m latent variables \( {\mathbf{Z}} = \left\{ {z_{1} ,z_{2} , \ldots ,z_{m} } \right\} \). The parent of a variable \( Y \) is denoted \( pa\left( Y \right) \); if \( Y \) is the root, \( pa\left( Y \right) \) is empty. The LTM defines the joint distribution of all observed and latent variables as \( p\left( {x_{1} , \ldots ,x_{n} ,z_{1} , \ldots ,z_{m} } \right) = \prod\nolimits_{{Y \in {\mathbf{X}} \cup {\mathbf{Z}}}} {P\left( {Y |pa\left( Y \right)} \right)} \).
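To make the factorization concrete, the following is a minimal sketch (not taken from the cited implementations) that evaluates this product for a toy LTM with one binary latent root and two binary observed leaves; the tree structure, variable names and conditional probability values are illustrative assumptions.

```python
# Minimal sketch of the LTM factorization P(X, Z) = prod_Y P(Y | pa(Y)).
# The structure and CPT numbers below are illustrative assumptions, not
# values from the paper.

# Structure: latent root z1 with observed children x1, x2 (all binary).
cpts = {
    # P(z1): the root has an empty parent, indexed here by the dummy value 0.
    "z1": {0: [0.6, 0.4]},
    # P(x1 | z1) and P(x2 | z1), indexed by the parent's value.
    "x1": {0: [0.9, 0.1], 1: [0.2, 0.8]},
    "x2": {0: [0.8, 0.2], 1: [0.3, 0.7]},
}
parent = {"z1": None, "x1": "z1", "x2": "z1"}

def joint(assignment):
    """P(x1, x2, z1) as the product of P(Y | pa(Y)) over all variables."""
    p = 1.0
    for var, cpt in cpts.items():
        pa = parent[var]
        pa_val = 0 if pa is None else assignment[pa]
        p *= cpt[pa_val][assignment[var]]
    return p

# Marginal P(x1=1, x2=1), obtained by summing the latent variable out.
print(sum(joint({"z1": z, "x1": 1, "x2": 1}) for z in (0, 1)))
```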

Liu et al. proposed a method for analyzing text data with models based on the LTM [5]. The observed variables at the bottom layer are binary and indicate the presence or absence of words in a document. Above the bottom layer are several layers of latent variables; each latent variable captures the probabilistic co-occurrence of the variables below it and is used to explain word co-occurrence patterns. As a result, the topics obtained at low levels have specific meanings, while the topics captured at high levels have more abstract meanings.

2.1 Preprocessing

Prior to analysis, the n terms with the highest average TF-IDF values are selected [5, 8, 10]. For a document set D, the average TF-IDF of a term t is \( {\text{TF}}{\rm - }{\text{IDF}}\left( {t,D} \right) = \frac{{\mathop \sum \nolimits_{d \in D} tf\left( {t,d} \right) \cdot idf\left( {t,D} \right)}}{\left| D \right|} \), where \( \left| D \right| \) is the total number of documents in the corpus and \( tf\left( {t,d} \right) \) is the frequency of term t in document d. \( idf\left( {t,D} \right) = log\left( {\left| D \right|/\left| {\left\{ {d \in D:t \in d} \right\}} \right|} \right) \) is the inverse document frequency of term \( t \) in document set \( D \). Traditional TF-IDF treats terms as mutually independent, but within a document the terms are related to each other through the current situation, context and semantics. To compensate for the fact that only word frequency is usually considered when topic words are extracted, this paper combines part of speech with traditional TF-IDF (Pos Weight TF-IDF, PW_TF-IDF) [11] to weight the terms of a document. The PW_TF-IDF value aims to optimize term selection and thereby improve topic coherence.
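As an illustration of the selection step, the following sketch computes the average TF-IDF of each term exactly as in the formula above; the toy corpus and the use of the natural logarithm are assumptions, not details from the cited papers.

```python
import math
from collections import Counter

def avg_tf_idf(docs):
    """Average TF-IDF of each term over a corpus, following
    TF-IDF(t, D) = sum_{d in D} tf(t, d) * idf(t, D) / |D|."""
    n_docs = len(docs)
    df = Counter()                       # document frequency of each term
    for d in docs:
        df.update(set(d))
    scores = Counter()
    for d in docs:
        tf = Counter(d)                  # raw term frequency within document d
        for t, f in tf.items():
            scores[t] += f * math.log(n_docs / df[t])
    return {t: s / n_docs for t, s in scores.items()}

# Toy corpus (an assumption); in practice the terms with the highest
# average TF-IDF would be kept as the vocabulary.
docs = [["latent", "tree", "model"], ["tree", "topic"], ["topic", "model", "model"]]
top = sorted(avg_tf_idf(docs).items(), key=lambda kv: -kv[1])
print(top[:3])
```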

2.2 PEM

The EM algorithm is a statistical algorithm commonly used for parameter estimation. In a latent tree model m, let \( {\mathbf{X}} \) and \( {\mathbf{H}} \) denote the sets of observed and latent variables, respectively, and \( {\mathbf{V}} = {\mathbf{X}} \cup {\mathbf{H}} \). Assume that a latent variable is selected as the root and all edges are directed away from the root. For any non-root variable \( v \) in \( {\mathbf{V}} \), the parent \( {\text{pa}}\left( v \right) \) of \( v \) is a latent variable taking the value “0” or “1”. When v is the root, \( {\text{pa}}\left( v \right) \) is a virtual variable with only one possible value. List all the variables as \( v_{1} ,v_{2} , \ldots ,v_{\text{n}} \). The parameters of m are \( \theta_{ijk} = P(v_{i} = k|{\text{pa}}\left( {v_{i} } \right) = j) \),

where \( {\text{i}} \in \left\{ {1, \ldots ,{\text{n}}} \right\} \), \( {\text{k}} \) is a value of \( v_{i} \), and \( {\text{j}} \) is a value of \( {\text{pa}}\left( {v_{\text{i}} } \right) \). \( \theta \) is the vector of all parameters. For a given data set \( {\text{D}} \), the log-likelihood function of θ is \( l\left( {\theta |D} \right) = \mathop \sum \limits_{d \in D} log\mathop \sum \limits_{{\mathbf{H}}} P(d,{\mathbf{H}}|\theta ) \). The maximum likelihood estimate of \( \theta \) is the value that maximizes the log-likelihood. EM starts from an initial estimate \( \theta^{\left( 0 \right)} \) and generates a sequence of estimates \( \left\{ {\theta^{\left( 1 \right)} ,\theta^{\left( 2 \right)} , \ldots } \right\} \). Given the current estimate \( \theta^{{\left( {\text{t}} \right)}} \), the next estimate \( \theta^{{\left( {{\text{t}} + 1} \right)}} \) is obtained through the E-step and the M-step. For the latent tree model, the two steps of the EM algorithm are as follows:

$$ {\text{n}}_{ijk}^{{\left( {\text{t}} \right)}} = \sum\nolimits_{d \in D} {P(v_{i} = k, pa\left( {v_{i} } \right) = j|d,m,\theta^{t} )} $$
(1)
$$ \theta_{ijk}^{t + 1} = \frac{{n_{ijk}^{t} }}{{\mathop \sum \nolimits_{k} n_{ijk}^{t} }} $$
(2)
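As a minimal sketch of how Eqs. (1) and (2) translate into code for a single variable \( v_{i} \) with a binary parent: the E-step posterior is abstracted behind an assumed `posterior` callback (in practice it would be obtained by message passing on the tree), so this is only an illustration, not the HLTA implementation.

```python
def em_update(data, posterior, n_states=2, n_parent_states=2):
    """One EM iteration for the parameters theta_ijk = P(v_i = k | pa(v_i) = j)
    of a single variable, following Eqs. (1) and (2).

    `posterior(d, j, k)` is assumed to return P(v_i = k, pa(v_i) = j | d, m, theta_t),
    i.e. the quantity inferred in the E-step.
    """
    # E-step: expected counts n_ijk (Eq. 1).
    n = [[0.0] * n_states for _ in range(n_parent_states)]
    for d in data:
        for j in range(n_parent_states):
            for k in range(n_states):
                n[j][k] += posterior(d, j, k)
    # M-step: normalize the counts to obtain theta_ijk^{t+1} (Eq. 2).
    # (Assumes each parent state receives a nonzero expected count.)
    theta = [[n[j][k] / sum(n[j]) for k in range(n_states)]
             for j in range(n_parent_states)]
    return theta
```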

The PEM submodels are shown in Fig. 1. Suppose that Y is selected as the root and all parameters of the model are to be estimated. First, run EM on the shaded submodel in Fig. 1(a) to estimate P(Y), P(A|Y), P(B|Y), and P(D|Y); then run EM on the shaded submodel in Fig. 1(b), fixing P(Y), P(B|Y) and P(D|Y), to estimate P(Z|Y), P(C|Z) and P(E|Z). The shortcoming of the EM algorithm is its high computational cost and slow convergence on relatively large data sets. Various methods have been proposed to accelerate EM, such as the incremental EM algorithm, the lazy EM algorithm, and hybrid EM algorithms. The computational improvement of the PEM algorithm of Chen et al. [8] lies mainly in the E-step, while the M-step is not considered; applying an acceleration to the M-step and combining it with the improved E-step can further increase the computational efficiency of the EM algorithm.

Fig. 1. PEM submodel
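Schematically, the progressive scheme of Fig. 1 can be written as follows; the helper `run_em` and the string labels for the distributions are assumptions used purely for illustration, not the actual PEM-HLTA code.

```python
# Schematic of the progressive estimation in Fig. 1.
# `run_em(targets, data, fixed)` is an assumed helper that runs EM to estimate
# the distributions named in `targets` while holding those in `fixed` unchanged,
# and returns the estimates as a dict.

def progressive_estimate(data, run_em):
    params = {}
    # Step (a): EM on the shaded submodel of Fig. 1(a).
    params.update(run_em(["P(Y)", "P(A|Y)", "P(B|Y)", "P(D|Y)"], data, fixed={}))
    # Step (b): EM on the shaded submodel of Fig. 1(b), with the distributions
    # already estimated in step (a) held fixed.
    params.update(run_em(["P(Z|Y)", "P(C|Z)", "P(E|Z)"], data,
                         fixed={k: params[k] for k in ("P(Y)", "P(B|Y)", "P(D|Y)")}))
    return params
```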

3 Research Methods

3.1 Word Selection Based on PW_TF-IDF

There are two main aspects of keyword selection: word weighting and topic model selection. This article adds part-of-speech information to the frequency-based word weights in order to select words that are better suited to the subsequent topic model. Han et al. studied the contribution of different part-of-speech features in texts and verified the influence of nouns, verbs, adjectives and adverbs, and their combinations, on Chinese and English texts [12]. Their experimental results show that these four parts of speech are the most important for characterizing the content of a text. This paper counts the percentages of nouns, verbs, adjectives and adverbs after word segmentation and assigns a different weight coefficient to each of these four parts of speech; other parts of speech are still weighted according to traditional TF-IDF. \( PW\_TF{\rm - }IDF = k *TF{\rm - }IDF \), where TF-IDF is the value obtained by the traditional calculation and the coefficient \( k \) is the weight coefficient of the four parts of speech. By randomly sampling 1000 documents from the NIPS and Reuters data sets, the percentages of nouns, verbs, adjectives and adverbs were obtained as part-of-speech weight coefficients; they are shown in Table 1.

Table 1. Weight coefficient of part of speech
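A minimal sketch of the PW_TF-IDF weighting is given below; the POS tag names and coefficient values are placeholders, the actual coefficients being those of Table 1.

```python
def pw_tf_idf(pos_tag, tf_idf_value, pos_weights):
    """PW_TF-IDF = k * TF-IDF, where k is the part-of-speech weight coefficient.
    Terms whose tag is not a noun, verb, adjective or adverb keep k = 1."""
    return pos_weights.get(pos_tag, 1.0) * tf_idf_value

# Hypothetical coefficients; the real values are the Table 1 statistics.
pos_weights = {"noun": 1.3, "verb": 1.1, "adj": 0.9, "adv": 0.7}
print(pw_tf_idf("noun", 0.42, pos_weights))   # 0.42 scaled by the noun weight
```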

3.2 Improved Aitken Accelerated PEM

The Aitken acceleration method constructs a new iterative function from the iterative function of a simple fixed-point iteration. Theorem [13]: Let the sequence \( \left\{ {p_{n} |n \in \left[ {0,\infty } \right)} \right\} \) converge linearly to the limit \( p \) with \( p - p_{n} \ne 0 \) and \( \mathop {\lim }\limits_{{{\text{n}} \to \infty }} \frac{{p - p_{n + 1} }}{{p - p_{n} }} = A\;(\left| A \right| < 1) \). Then the sequence \( \{ r_{n} |n \in \left[ {0,\infty } \right)\} \) defined by \( r_{n} = p_{n} - \frac{{\left( {p_{n + 1} - p_{n} } \right)^{2} }}{{2p_{n + 2} - 3p_{n + 1} + p_{n} }} \) also converges to \( p \), and faster than the sequence \( \left\{ {p_{n} } \right\} \); its terms are closer to the true value of \( p \), that is, \( \mathop {\lim }\limits_{{{\text{n}} \to \infty }} \frac{{p - r_{n} }}{{p - p_{n} }} = 0 \). This method is applied to the sequence \( \left\{ {\theta^{\left( t \right)} } \right\} \) produced by the PEM algorithm: Aitken acceleration based on the above theorem is performed on each submodel of the latent tree model.

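For illustration, the following sketch applies the classical Aitken Δ² step (the form the theorem of [13] refers to) to three consecutive scalar estimates; in PEM it would be applied to the parameter estimates of each submodel. The convergence guard `eps` is an added assumption.

```python
def aitken(p_n, p_n1, p_n2, eps=1e-12):
    """Classical Aitken delta-squared step on three consecutive estimates:
    r_n = p_n - (p_{n+1} - p_n)^2 / (p_{n+2} - 2 p_{n+1} + p_n)."""
    denom = p_n2 - 2 * p_n1 + p_n
    if abs(denom) < eps:          # sequence has (nearly) converged; keep latest value
        return p_n2
    return p_n - (p_n1 - p_n) ** 2 / denom

# Example: p_n = 1 - 0.8**n converges linearly to 1; one Aitken step recovers
# the limit (up to rounding) from the first three terms.
seq = [1 - 0.8 ** n for n in range(3)]
print(aitken(*seq))               # approximately 1.0
```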

4 Experimental Results

This work seeks to optimize PEM-HLTA; therefore, the optimization methods proposed here are compared with PEM-HLTA. It is not necessary to compare against LDA, HLDA or nHDP, because PEM-HLTA has already been shown to outperform them in the literature [5, 8].

Experimental environment: Windows 7 64-bit / CPU i5 3.2 GHz / RAM 12 GB / Java 1.8. All experimental parameters were the same as those in reference [8].

4.1 Data Sources

The NIPS and Reuters data sets used in the experiments differ from the NIPS data set of Liu et al. [5] and Chen et al. [8]. We use the NIPS data set from Kaggle, which covers the conferences from 1987 to 2016 and contains 6560 documents. The NIPS data are used in two experiments. Experiment 1 selects 1955 documents in the same way as Chen et al. [8]; Experiment 2 uses all documents. Each experiment uses TF-IDF values and PW_TF-IDF values to select vocabularies of size 1000, 3000, 5000, 7000, and 10000, denoted Nips-1k, Nips-3k, Nips-5k, Nips-7k, and Nips-10k. After preprocessing, the two sets of NIPS data were compared using PWA-PEM-HLTA and PEM-HLTA. Experiment 3 uses exactly the same configuration as Experiment 1; the only difference is that it uses the Reuters data set, in order to verify that the method has the same effect on a different type of data set.

4.2 Topic Coherence Evaluation

The topic semantic coherence score was calculated using the method of [14]. The coherence score of a topic t is defined as:

$$ C\left( {t,W^{\left( t \right)} } \right) = \sum\nolimits_{m = 2}^{M} {\sum\nolimits_{l = 1}^{m - 1} {log} } \frac{{D\left( {w_{m}^{\left( t \right)} ,w_{l}^{\left( t \right)} } \right) + 1}}{{D\left( {w_{l}^{\left( t \right)} } \right)}} $$
(3)

Where \( \varvec{W}^{{\left( {\text{t}} \right)}} = \left\{ {w_{1}^{{\left( {\text{t}} \right)}} , \ldots ,w_{\text{M}}^{{\left( {\text{t}} \right)}} } \right\} \) is the list of the M words used to describe topic t, \( {\text{D}}\left( {{\text{w}}_{\text{i}} } \right) \) is the document frequency of word \( {\text{w}}_{\text{i}} \), and \( {\text{D}}\left( {{\text{w}}_{\text{i}} , {\text{w}}_{\text{j}} } \right) \) is the co-document frequency of words \( {\text{w}}_{\text{i}} \) and \( {\text{w}}_{\text{j}} \), i.e. the number of documents containing both words. Given two sets of topics, the set with the higher average topic coherence is considered better.
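A direct sketch of Eq. (3) is shown below; the toy document collection is an assumption used only to demonstrate the computation.

```python
import math

def coherence(top_words, docs):
    """Topic coherence of Eq. (3): sum over word pairs (w_m, w_l), l < m,
    of log((D(w_m, w_l) + 1) / D(w_l)), where D(.) is document frequency."""
    doc_sets = [set(d) for d in docs]

    def D(*words):
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    for m in range(1, len(top_words)):      # m runs over the 2nd..M-th word
        for l in range(m):                  # l runs over the words before w_m
            score += math.log((D(top_words[m], top_words[l]) + 1)
                              / D(top_words[l]))
    return score

# Toy collection (an assumption): each inner list is one document's words.
docs = [["topic", "model", "tree"], ["topic", "model"], ["tree", "latent"]]
print(coherence(["topic", "model", "tree"], docs))
```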

4.3 Experiment 1

Table 2 shows the runtime statistics. The improved method clearly outperforms the PEM-HLTA method, with an average efficiency improvement of around 5 times. The rates of improvement are shown in Table 3: execution efficiency increases by 4.9 times on average and by 6 times in the best case. From Fig. 2(a) it can be seen that the A-PEM-HLTA and PWA-PEM-HLTA methods show no obvious difference below Nips-10k. When the vocabulary reaches 10k, PWA-PEM-HLTA performs best, taking 58 min, whereas PEM-HLTA takes 431 min. Comparing runs with and without part-of-speech weights also suggests that part-of-speech weights select words that are more closely related to the topics, which reduces the number of PEM iterations and improves EM efficiency.

Table 2. nips-1955 runtime/min
Table 3. nips-1955 comparison of improvement multiples (multiple = pre-improvement time / post-improvement time − 1)
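To illustrate the definition of the improvement multiple with purely hypothetical numbers: a run whose time is reduced from 300 min to 60 min has a multiple of

$$ \frac{300}{60} - 1 = 4, $$

i.e. a four-fold speed-up.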
Fig. 2. Comparison of nips-1955 running time and consistency score

Table 4 shows the average topic coherence scores of the topics generated by the improved algorithms. Comparing PW-PEM-HLTA (part-of-speech optimization) with PEM-HLTA, and A-PEM-HLTA (Aitken acceleration) with PWA-PEM-HLTA (part-of-speech weighting with Aitken acceleration), shows that nips-5k is a watershed: when the vocabulary exceeds 5k, the part-of-speech weights give some advantage in term selection. When the number of selected terms is small, the top-ranked words are already covered by the plain TF-IDF selection; when the selection range expands to the middle and lower parts of the ranking, the advantage of part-of-speech weighting in term selection becomes apparent. When the variants with Aitken acceleration (A-PEM-HLTA and PWA-PEM-HLTA) are compared with PEM-HLTA and PW-PEM-HLTA, the average topic coherence score drops noticeably, by around 0.81 ± 0.2 on average. In Fig. 2(b), the PW-PEM-HLTA coherence score is best when the vocabulary reaches 5k; however, as the number of selected terms increases and most terms are selected, its scores converge to those of PEM-HLTA. The overall trends of the average topic coherence scores of A-PEM-HLTA and PWA-PEM-HLTA are consistent, and when the vocabulary exceeds 5k the latter is slightly better than the former. At the same time, Aitken acceleration can lower the overall coherence score, because it is based on a simple fixed-point iteration and oscillates around the convergence value, which does not guarantee a stable increase of the EM likelihood.

Table 4. nips-1955 average thematic consistency score

Planned solution: switch back to the original EM algorithm or to the ECM algorithm when the sequence oscillates near the convergence value, so that the likelihood can again increase steadily.

4.4 Experiment 2

Comparing the results of Experiment 2 in Tables 5, 6 and 7 with those of Experiment 1 in Tables 2, 3 and 4, the average improvement in computational efficiency is 4.97-fold, approximately the same as in Experiment 1. The part-of-speech weighting shows the same tendencies in computational efficiency and in average topic coherence score. The experimental results show that the improved method is equally effective on small and relatively large data sets.

Table 5. nips-6560 running time/min
Table 6. nips-6560 average theme consistency score
Table 7. Reuters-2000 running time/min

4.5 Experiment 3

The results are shown in Tables 7 and 8. Under the same environmental conditions, the results on the Reuters news data set show the same trend as on the NIPS data set, but the efficiency gain is smaller. A statistical analysis of the data sets shows that the Reuters news data contain 289,759 word tokens in total, an average of 144.8795 words per document, whereas the NIPS data set contains 8,515,607 word tokens, an average of 4257.8035 words per document. Under the bag-of-words representation, the Reuters documents are therefore much sparser than the NIPS documents, so the improvement in computational efficiency is not as large as on the NIPS data set.
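These per-document averages are simply the total token counts divided by the number of documents; for example, assuming the 2000 Reuters documents suggested by the “Reuters-2000” label,

$$ \frac{289{,}759}{2000} = 144.8795\;{\text{words per document}}. $$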

Table 8. Reuters-2000 average theme consistency score

5 Conclusions

Based on the state-of-the-art hierarchical topic detection method HLTA, we improved the PEM-HLTA method to reduce computation time, so that a single machine can handle relatively large data sets without simply adding more computing resources. The empirical results show that PWA-PEM-HLTA significantly improves execution efficiency: on a personal computer, a 6k-document data set with a 10k-word vocabulary can be processed within 12 h, and a 6k-document data set with a 5k-word vocabulary within 2 h.

In the future, we plan to further study HLTA’s multi-category assignment of words and to improve the topic semantic coherence scores. Another direction is research on a distributed implementation of HLTA.