Asymmetric Missing-data Problems: Overcoming the Lack of Negative Data in Preference Ranking

Aleksander Kołcz¹ &
Joshua Alspector²

392 Accesses
6 Citations
Explore all metrics

Abstract

In certain classification problems there is a strong a asymmetry between the number of labeled examples available for each of the classes involved. In an extreme case, there may be a complete lack of labeled data for one of the classes while, at the same time, there are adequate labeled examples for the others, accompanied by a large body of unlabeled data. Since most classification algorithms require some information about all classes involved, label estimation for the un-represented class is desired. An important representative of this group of problems is that of user interest/preference modeling where there may be a large number of examples of what the user likes with essentially no counterexamples.

Recently, there has been much interest in applying the EM algorithm to incomplete data problems in the area of text retrieval and categorization. We adapt this approach to the asymmetric case of modeling user interests in news articles, where only labeled positive training data are available, with access to a large corpus of unlabeled documents. User modeling is here equivalent to that of user-specific document ranking. EM is used in conjunction with the Naive Bayes model while its output is also utilized by a Support Vector Machine and Rocchio's technique.

Our findings demonstrate that the EM algorithm can be quite effective in modeling the negative class under a number of different initialization schemes. Although primarily just the negative training examples are needed, a natural question is whether using all of the estimated labels (i.e., positive and negative) would be more (or less) beneficial. This is important considering that, in this context, the initialization of the negative class for EM is likely not to be very accurate. Experimental results suggest that EM output should be limited to negative label estimates only.

Article PDF

Learning to Rank

On Incomplete Label Ranking with IF-sets

Discover the latest articles, news and stories from top researchers in related subjects.

Artificial Intelligence

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

References

Billsus D and Pazzani M (1999) A hybrid user model for news story classification. In: Seventh International Conference on User Modeling (UM '99). http://www.ics.uci.edu/?pazzani/Publications/um99.ps.
Blum A and Mitchell T (1998) Combining labeled and unlabeled data with co-training. In: Proceedings of the 1998 Conference on Computational Learning Theory.
Buckley C and Salton G (1995) Optimization of relevance feedback weights. In: Proceedings of 18th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 351–357.
Buckley C, Salton G, Allan J and Singhal A (1995) Automatic query expansion using SMART: TREC-3. In: Harman DK, Ed., Proceedings of the 3rd Text Retrieval Conference (TREC-3) (NIST SP 500-225).
Claypool M, Gokhale A, Miranda T, Murnikov P, Netes D and Sartin M (1999) Combining contentbased and collaborative filters in an online newspaper. ACM SIGIR Recommender Systems Workshop.http://www.cs.wpi.edu/?claypool/papers/content-collab/content-collab.ps.
Cohn D, Atlas Land Ladner R(1994) Improving generalization with active learning. Machine Learning, 15(2):201–221.
Google Scholar
Croft W and Harper D (1979) Using probabilistic models of document retrieval without relevance information.Journal of Documentation, 35:285–295.
Google Scholar
Dempster AP, Laird NM and Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm.Journal of the Royal Statistical Society, Series B, 39:1–38.
Google Scholar
Domingos P (1999) MetaCost: A general method for making classifiers cost-sensitive. In: Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining (KDD-99).
Foltz PW and Dumais ST (1992) Personalized information delivery: An analysis of information filtering methods.Communications of the ACM, 35(12):51–60. http://www-psych.nmsu.edu/?pfoltz/cacm/cacm.html.
Google Scholar
Frakes W Band Baeza-Yates R (1992) Eds., Information Retrieval: Data Structures and Algorithms. Prentice-Hall, Boston, MA.
Google Scholar
Harman D (1992) Ranking algorithms. In: Frakes WB and Baeza-Yates R, Eds., Information Retrieval: Data Structures and Algorithms. Prentice-Hall, Boston, MA, pp. 363–392.
Google Scholar
Iyengar V, Apte C and Zhang T (2000) Active learning using adaptive resampling. In: Proceedings of ACM SIGKDD 2000.
Japkowicz N (2000) The class imbalance problem: Significance and strategies. In: Proceedings of the 2000 International Conference on Artificial Intelligence. http://www.cs.dal.ca/?nat/Papers/ic-ai-2000.ps.
Jennings A, Higichi H and Liu H (1993) A user model neural network for a personal news service. Australian Telecommunication Research, 27(1):1–12.
Google Scholar
Joachims T (1997) Text categorization with support vector machines: Learning with many relevant features.Technical Report LS-8/23, University of Dortmund.
Joachims T (1999) Making large-scale svm learning practical. In: Schoelkopf B, Burges C and Smola A, Eds., Advances in Kernel Methods-Support Vector Learning. MIT Press.
Joachims T, Freitag D and Mitchell T (1997)Webwatcher: A tour guide for the world wide web. In: Proceedings of the International Joint Conference on Artificial Intelligence. http://www.cs.cmu.edu/afs/cs/project/theo-6/webagent/ www/ijcai97.ps.
Kamba T, Sahagami H and Koseki Y (1997) ANATANAGONOMY: A personalized newspaper on the world wide web. Int. J. Human-Compuer Studies, 46:789–803.
Google Scholar
Kubat M, Holte R and Matwin S (1997) Learning when negative examples abound. In: Proceedings of the European Conference on Machine Learning, ECML'97, pp. 146–153. http://www.cacs.louisiana.edu/? mkubat/publications/imbalanced.ps.
Kubat M and Matwin S (1997) Addressing the curse of imbalanced training sets: One-sided selection. In: Proceedings of the 14th International Conference on Machine Learning, ICML'97, pp. 179–186. http://www.cacslouisiana.edu/?mkubat/publications/sampling.ps.
Lang K (1995) NewsWeeder: Learning to filter NetNews. In: Proceedings of the 12th International Conference on Machine Learning: ICML-95, pp. 331–339.
Lewis DD (1998) Naive (bayes) at forty: The independence assumption in information retrieval. In: Proceedings of the 10th European Conference on Machine Learning, pp. 4–15. http://www.research.att.com/ ?lewis/papers/lewis98b.ps.
Lewis DDand Gale WA(1994)Asequential algorithm for training text classifiers. In: Croft WBand van Rijsbergen CJ, Eds., SIGIR 94: Proceedings of the Seventeenth Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval. Springer-Verlag, Dublin, Ireland, pp. 3–12.
Google Scholar
Lieberman H (1995) Letizia: An agent that assists web browsing. In: Proceedings of the International Joint Conference on Artificial Intelligence, pp. 924–929. http://lieber.www.media.mit.edu/people/lieber/Lieberary/ Letizia/Letizia-AAAI/Letizia.html.
Losee RM (1998) Text Retrieval & Filtering: Analytic Models of Performance. Kluwer Academic Publishers, New York.
Google Scholar
McCallum AK and Nigam K (1997) Employing EM in pool-based active learning for text classification.In: Proceedings of the 1998 International Machine Learning Conference, pp. 25–32. http://www.cs.cmu.edu/?mccallum/papers/emactive-icm198.ps.gz.
McCallum AK and Nigam K (1998) A comparison of event models for naive bayes text classification. AAAI-98 Workshop on Learning for Text Categorization. http://www.cs.cmu.edu/?mccallum/papers/multinomialaaai98w.ps.
McLachlan GJ and Krishnan T (1996) The EM Algorithm and Extensions. JohnWiley & Sons, Philadelphia, PA.
Google Scholar
Mitra M, Singhal Aand Buckley C(1997) Learning queries in a query zone. In: Proceedings of the 20th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 25–32.
Morik K, Imboff M, Brockhausen P, Joachims T and Gather U (2000) Knowledge discovery and knowledge validation in intensive care. Artificial Intelligence in Medicine, 19(3):225–249.
Google Scholar
Morita Mand Shinoda Y (1994) Information filtering based on user behavior analysis and best match text retrieval.In: Proceedings of 17th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 272–281.
Nickerson A, Japkowicz N and Milios E (2001) Using unsupervised learning to guide resampling in imbalanced data sets. In: Proceedings of the Eighth International Workshop on Artificial Intelligence and Statistics.
Nigam K and Ghani R (2000) Analyzing the effectiveness and applicability of co-training. In: Proceedings of the Ninth International Conference on Information and Knowledge Management.
Nigam K, McCallum AK, Thrun S and Mitchell T (2000) Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2):103–134.
Google Scholar
Pazzani M and Billsus D (1997) Learning and revising user profiles: The identification of interesting web sites.Machine Learning, 27:313–331. http://www.ics.uci.edu/?pazzani/publications/SW-MLJ.pdf.
Google Scholar
Pazzani M, Merz C, Murphy, P, Ali K, Hume T and Brunk C (1994) Reducing misclassification costs. In: 11th International Conference of Machine Learning, pp. 217–225. http://www.ics.uci.edu/?pazzani/publications/ MLC94.pdf.
Pednault E, Rosen BK and Apte C (2000) Handling imbalanced data sets in insurance risk modeling. Technical Report RC-21731, IBM.
Platt J (2000) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods.In: Smola A, Bartlett P, Scholkopf B and Schuurmans D, Eds., Advances in Large-Margin Classifiers (Neural Information Processing), MIT Press
Porter M (1980) An algorithm for suffix stripping. Program (Automated Library and Information Systems), 14(3):130–137.
Google Scholar
Provost F (2000) Machine learning from imbalanced data sets 101. In: Japkowicz N, Ed., Learning from Imbalanced Data Sets-Papers from the AAAI Workshop. AAAI Press, Austin, TX, pp. 1–3.
Google Scholar
Robertson SE and Sparck-Jones K (1976) Relevance weighting of search terms. JASIS pp. 129–176.
Rocchio J (1971) Relevance feedback in information retrieval. In: Salton G, Ed., The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall, pp. 313–323.
Salton G (1971) The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall, Melbourne, Australia.
Google Scholar
Schapire RE, Singer Yand Singhal, A(1998) Boosting and Rocchio applied to text filtering. In: Proceedings of 21st International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 215–223.http://www.research.att.com/?schapire/cgi-bin/uncompress-papers/SchapireSiSi98.ps.
Schohn G and Cohn D (2000) Less is more: Active learning with support vector machines. In: Proceedings of the Seventeenth International Conference on Machine Learning.
Seung HS, Opper M and Sompolinsky H (1992) Query by committee. In: Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, pp. 287–294.
Stevens C (1992) Automating the creation of information filters. Communications of the ACM, 35(12):48.
Google Scholar
van Rijsbergen CJ (1979) Information Retrieval. Buttersworth, London.
Google Scholar
Vapnik VN (1998) Statistical Learning Theory. John Wiley, New York.
Google Scholar
Yang Y and Liu X (1999) A re-examination of text categorization methods. In: Proceedings of the 22nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 42–49.http://www.cs.cmu.edu/?yiming/papers.yy/sigir99.ps.

Download references

Author information

Authors and Affiliations

Personalogy, Inc., 24 South Weber Suite 325, Colorado Springs, CO, 80903, USA
Aleksander Kołcz
Department of Electrical and Computer Engineering, University of Colorado at Colorado Springs, 1420 Austin Bluffs Pkwy., Colorado Springs, CO, 80918, USA
Joshua Alspector

Authors

Aleksander Kołcz
View author publications
You can also search for this author in PubMed Google Scholar
Joshua Alspector
View author publications
You can also search for this author in PubMed Google Scholar

Rights and permissions

Reprints and permissions

About this article

Cite this article

Kołcz, A., Alspector, J. Asymmetric Missing-data Problems: Overcoming the Lack of Negative Data in Preference Ranking. Information Retrieval 5, 5–40 (2002). https://doi.org/10.1023/A:1012714523368

Download citation

Issue Date: January 2002
DOI: https://doi.org/10.1023/A:1012714523368

Asymmetric Missing-data Problems: Overcoming the Lack of Negative Data in Preference Ranking

Abstract

Article PDF

Similar content being viewed by others

Learning to Rank

Learning to Rank

On Incomplete Label Ranking with IF-sets

References

Author information

Authors and Affiliations

Rights and permissions

About this article

Cite this article

Navigation

Asymmetric Missing-data Problems: Overcoming the Lack of Negative Data in Preference Ranking

Abstract

Article PDF

Similar content being viewed by others

Learning to Rank

Learning to Rank

On Incomplete Label Ranking with IF-sets

Explore related subjects

References

Author information

Authors and Affiliations

Rights and permissions

About this article

Cite this article

Share this article

Search

Navigation