Abstract
This paper presents a novel statistical method for factor analysis of binary and count data that is closely related to a technique known as Latent Semantic Analysis. In contrast to the latter, which stems from linear algebra and performs a Singular Value Decomposition of co-occurrence tables, the proposed technique uses a generative latent class model to perform a probabilistic mixture decomposition. This results in a more principled approach with a solid foundation in statistical inference. More precisely, we propose to fit the model with a temperature-controlled version of the Expectation Maximization algorithm, which has shown excellent performance in practice. Probabilistic Latent Semantic Analysis has many applications, most prominently in information retrieval, natural language processing, machine learning from text, and related areas. The paper presents perplexity results for different types of text and linguistic data collections and discusses an application in automated document indexing. The experiments indicate substantial and consistent improvements of the probabilistic method over standard Latent Semantic Analysis.
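To make the fitting procedure concrete, the following is a minimal NumPy sketch of the aspect model underlying Probabilistic Latent Semantic Analysis, P(d, w) = Σ_z P(z) P(d|z) P(w|z), fitted with a tempered E-step in which the posteriors are raised to an inverse-temperature power β. The function name plsa_tempered_em, the random initialization, and the use of a fixed β are illustrative assumptions, not taken from the paper; the paper's tempered EM additionally anneals β and controls it on held-out data, which this sketch omits.

```python
import numpy as np

def plsa_tempered_em(counts, n_topics, beta=0.8, n_iter=100, seed=0):
    """Fit the aspect model P(d, w) = sum_z P(z) P(d|z) P(w|z) to a
    document-word count matrix with a tempered E-step (beta = 1 recovers
    standard EM; beta < 1 smooths the posteriors). Illustrative sketch,
    not the paper's full annealing schedule.
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape

    # Random normalized initialization of the model parameters.
    p_z = np.full(n_topics, 1.0 / n_topics)        # P(z)
    p_d_z = rng.random((n_topics, n_docs))         # P(d|z)
    p_d_z /= p_d_z.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words))        # P(w|z)
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # Tempered E-step: q(z|d,w) proportional to [P(z) P(d|z) P(w|z)]^beta.
        joint = (p_z[:, None, None] * p_d_z[:, :, None]
                 * p_w_z[:, None, :]) ** beta      # shape (z, d, w)
        joint /= joint.sum(axis=0, keepdims=True) + 1e-12

        # M-step: re-estimate parameters from expected counts n(d,w) q(z|d,w).
        weighted = counts[None, :, :] * joint      # shape (z, d, w)
        p_d_z = weighted.sum(axis=2)
        p_d_z /= p_d_z.sum(axis=1, keepdims=True)
        p_w_z = weighted.sum(axis=1)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z = weighted.sum(axis=(1, 2))
        p_z /= p_z.sum()

    return p_z, p_d_z, p_w_z
```

With β = 1 the update reduces to standard EM for the latent class model; β < 1 flattens the posteriors in the E-step, which acts as a regularizer against overfitting and is the mechanism behind the tempered variant discussed in the paper.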
Cite this article
Hofmann, T. Unsupervised Learning by Probabilistic Latent Semantic Analysis. Machine Learning 42, 177–196 (2001). https://doi.org/10.1023/A:1007617005950