Abstract
In certain classification problems there is a strong a asymmetry between the number of labeled examples available for each of the classes involved. In an extreme case, there may be a complete lack of labeled data for one of the classes while, at the same time, there are adequate labeled examples for the others, accompanied by a large body of unlabeled data. Since most classification algorithms require some information about all classes involved, label estimation for the un-represented class is desired. An important representative of this group of problems is that of user interest/preference modeling where there may be a large number of examples of what the user likes with essentially no counterexamples.
Recently, there has been much interest in applying the EM algorithm to incomplete data problems in the area of text retrieval and categorization. We adapt this approach to the asymmetric case of modeling user interests in news articles, where only labeled positive training data are available, with access to a large corpus of unlabeled documents. User modeling is here equivalent to that of user-specific document ranking. EM is used in conjunction with the Naive Bayes model while its output is also utilized by a Support Vector Machine and Rocchio's technique.
Our findings demonstrate that the EM algorithm can be quite effective in modeling the negative class under a number of different initialization schemes. Although primarily just the negative training examples are needed, a natural question is whether using all of the estimated labels (i.e., positive and negative) would be more (or less) beneficial. This is important considering that, in this context, the initialization of the negative class for EM is likely not to be very accurate. Experimental results suggest that EM output should be limited to negative label estimates only.
Article PDF
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Avoid common mistakes on your manuscript.
References
Billsus D and Pazzani M (1999) A hybrid user model for news story classification. In: Seventh International Conference on User Modeling (UM '99). http://www.ics.uci.edu/?pazzani/Publications/um99.ps.
Blum A and Mitchell T (1998) Combining labeled and unlabeled data with co-training. In: Proceedings of the 1998 Conference on Computational Learning Theory.
Buckley C and Salton G (1995) Optimization of relevance feedback weights. In: Proceedings of 18th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 351–357.
Buckley C, Salton G, Allan J and Singhal A (1995) Automatic query expansion using SMART: TREC-3. In: Harman DK, Ed., Proceedings of the 3rd Text Retrieval Conference (TREC-3) (NIST SP 500-225).
Claypool M, Gokhale A, Miranda T, Murnikov P, Netes D and Sartin M (1999) Combining contentbased and collaborative filters in an online newspaper. ACM SIGIR Recommender Systems Workshop.http://www.cs.wpi.edu/?claypool/papers/content-collab/content-collab.ps.
Cohn D, Atlas Land Ladner R(1994) Improving generalization with active learning. Machine Learning, 15(2):201–221.
Croft W and Harper D (1979) Using probabilistic models of document retrieval without relevance information.Journal of Documentation, 35:285–295.
Dempster AP, Laird NM and Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm.Journal of the Royal Statistical Society, Series B, 39:1–38.
Domingos P (1999) MetaCost: A general method for making classifiers cost-sensitive. In: Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining (KDD-99).
Foltz PW and Dumais ST (1992) Personalized information delivery: An analysis of information filtering methods.Communications of the ACM, 35(12):51–60. http://www-psych.nmsu.edu/?pfoltz/cacm/cacm.html.
Frakes W Band Baeza-Yates R (1992) Eds., Information Retrieval: Data Structures and Algorithms. Prentice-Hall, Boston, MA.
Harman D (1992) Ranking algorithms. In: Frakes WB and Baeza-Yates R, Eds., Information Retrieval: Data Structures and Algorithms. Prentice-Hall, Boston, MA, pp. 363–392.
Iyengar V, Apte C and Zhang T (2000) Active learning using adaptive resampling. In: Proceedings of ACM SIGKDD 2000.
Japkowicz N (2000) The class imbalance problem: Significance and strategies. In: Proceedings of the 2000 International Conference on Artificial Intelligence. http://www.cs.dal.ca/?nat/Papers/ic-ai-2000.ps.
Jennings A, Higichi H and Liu H (1993) A user model neural network for a personal news service. Australian Telecommunication Research, 27(1):1–12.
Joachims T (1997) Text categorization with support vector machines: Learning with many relevant features.Technical Report LS-8/23, University of Dortmund.
Joachims T (1999) Making large-scale svm learning practical. In: Schoelkopf B, Burges C and Smola A, Eds., Advances in Kernel Methods-Support Vector Learning. MIT Press.
Joachims T, Freitag D and Mitchell T (1997)Webwatcher: A tour guide for the world wide web. In: Proceedings of the International Joint Conference on Artificial Intelligence. http://www.cs.cmu.edu/afs/cs/project/theo-6/webagent/ www/ijcai97.ps.
Kamba T, Sahagami H and Koseki Y (1997) ANATANAGONOMY: A personalized newspaper on the world wide web. Int. J. Human-Compuer Studies, 46:789–803.
Kubat M, Holte R and Matwin S (1997) Learning when negative examples abound. In: Proceedings of the European Conference on Machine Learning, ECML'97, pp. 146–153. http://www.cacs.louisiana.edu/? mkubat/publications/imbalanced.ps.
Kubat M and Matwin S (1997) Addressing the curse of imbalanced training sets: One-sided selection. In: Proceedings of the 14th International Conference on Machine Learning, ICML'97, pp. 179–186. http://www.cacslouisiana.edu/?mkubat/publications/sampling.ps.
Lang K (1995) NewsWeeder: Learning to filter NetNews. In: Proceedings of the 12th International Conference on Machine Learning: ICML-95, pp. 331–339.
Lewis DD (1998) Naive (bayes) at forty: The independence assumption in information retrieval. In: Proceedings of the 10th European Conference on Machine Learning, pp. 4–15. http://www.research.att.com/ ?lewis/papers/lewis98b.ps.
Lewis DDand Gale WA(1994)Asequential algorithm for training text classifiers. In: Croft WBand van Rijsbergen CJ, Eds., SIGIR 94: Proceedings of the Seventeenth Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval. Springer-Verlag, Dublin, Ireland, pp. 3–12.
Lieberman H (1995) Letizia: An agent that assists web browsing. In: Proceedings of the International Joint Conference on Artificial Intelligence, pp. 924–929. http://lieber.www.media.mit.edu/people/lieber/Lieberary/ Letizia/Letizia-AAAI/Letizia.html.
Losee RM (1998) Text Retrieval & Filtering: Analytic Models of Performance. Kluwer Academic Publishers, New York.
McCallum AK and Nigam K (1997) Employing EM in pool-based active learning for text classification.In: Proceedings of the 1998 International Machine Learning Conference, pp. 25–32. http://www.cs.cmu.edu/?mccallum/papers/emactive-icm198.ps.gz.
McCallum AK and Nigam K (1998) A comparison of event models for naive bayes text classification. AAAI-98 Workshop on Learning for Text Categorization. http://www.cs.cmu.edu/?mccallum/papers/multinomialaaai98w.ps.
McLachlan GJ and Krishnan T (1996) The EM Algorithm and Extensions. JohnWiley & Sons, Philadelphia, PA.
Mitra M, Singhal Aand Buckley C(1997) Learning queries in a query zone. In: Proceedings of the 20th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 25–32.
Morik K, Imboff M, Brockhausen P, Joachims T and Gather U (2000) Knowledge discovery and knowledge validation in intensive care. Artificial Intelligence in Medicine, 19(3):225–249.
Morita Mand Shinoda Y (1994) Information filtering based on user behavior analysis and best match text retrieval.In: Proceedings of 17th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 272–281.
Nickerson A, Japkowicz N and Milios E (2001) Using unsupervised learning to guide resampling in imbalanced data sets. In: Proceedings of the Eighth International Workshop on Artificial Intelligence and Statistics.
Nigam K and Ghani R (2000) Analyzing the effectiveness and applicability of co-training. In: Proceedings of the Ninth International Conference on Information and Knowledge Management.
Nigam K, McCallum AK, Thrun S and Mitchell T (2000) Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2):103–134.
Pazzani M and Billsus D (1997) Learning and revising user profiles: The identification of interesting web sites.Machine Learning, 27:313–331. http://www.ics.uci.edu/?pazzani/publications/SW-MLJ.pdf.
Pazzani M, Merz C, Murphy, P, Ali K, Hume T and Brunk C (1994) Reducing misclassification costs. In: 11th International Conference of Machine Learning, pp. 217–225. http://www.ics.uci.edu/?pazzani/publications/ MLC94.pdf.
Pednault E, Rosen BK and Apte C (2000) Handling imbalanced data sets in insurance risk modeling. Technical Report RC-21731, IBM.
Platt J (2000) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods.In: Smola A, Bartlett P, Scholkopf B and Schuurmans D, Eds., Advances in Large-Margin Classifiers (Neural Information Processing), MIT Press
Porter M (1980) An algorithm for suffix stripping. Program (Automated Library and Information Systems), 14(3):130–137.
Provost F (2000) Machine learning from imbalanced data sets 101. In: Japkowicz N, Ed., Learning from Imbalanced Data Sets-Papers from the AAAI Workshop. AAAI Press, Austin, TX, pp. 1–3.
Robertson SE and Sparck-Jones K (1976) Relevance weighting of search terms. JASIS pp. 129–176.
Rocchio J (1971) Relevance feedback in information retrieval. In: Salton G, Ed., The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall, pp. 313–323.
Salton G (1971) The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall, Melbourne, Australia.
Schapire RE, Singer Yand Singhal, A(1998) Boosting and Rocchio applied to text filtering. In: Proceedings of 21st International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 215–223.http://www.research.att.com/?schapire/cgi-bin/uncompress-papers/SchapireSiSi98.ps.
Schohn G and Cohn D (2000) Less is more: Active learning with support vector machines. In: Proceedings of the Seventeenth International Conference on Machine Learning.
Seung HS, Opper M and Sompolinsky H (1992) Query by committee. In: Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, pp. 287–294.
Stevens C (1992) Automating the creation of information filters. Communications of the ACM, 35(12):48.
van Rijsbergen CJ (1979) Information Retrieval. Buttersworth, London.
Vapnik VN (1998) Statistical Learning Theory. John Wiley, New York.
Yang Y and Liu X (1999) A re-examination of text categorization methods. In: Proceedings of the 22nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 42–49.http://www.cs.cmu.edu/?yiming/papers.yy/sigir99.ps.
Author information
Authors and Affiliations
Rights and permissions
About this article
Cite this article
Kołcz, A., Alspector, J. Asymmetric Missing-data Problems: Overcoming the Lack of Negative Data in Preference Ranking. Information Retrieval 5, 5–40 (2002). https://doi.org/10.1023/A:1012714523368
Issue Date:
DOI: https://doi.org/10.1023/A:1012714523368