Training query filtering for semi-supervised learning to rank with pseudo labels

Xin Zhang¹,², Ben He³ & Tiejian Luo³

Abstract

Semi-supervised learning is a machine learning paradigm that can create pseudo labels from unlabeled data for learning a ranking model when only limited or no training examples are available. However, the effectiveness of semi-supervised learning in information retrieval (IR) can be hindered by low-quality pseudo labels, hence the need for training query filtering, which removes low-quality queries. In this paper, we consider two application scenarios that differ in the availability of human labels. First, for applications without any labeled data, a clustering-based approach is proposed to select high-quality training queries. This approach follows the empirical observation that the relevant documents of high-quality training queries are highly coherent. Second, for applications with limited labeled data, a classification-based approach is proposed. This approach learns a weak classifier that uses query features to predict the retrieval performance gain of a given training query. The queries with high predicted performance gains are selected for the subsequent transduction process, which creates the pseudo labels for learning to rank algorithms. Experimental results on the standard LETOR dataset show that our proposed approaches outperform strong baselines.
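The coherence observation behind the clustering-based approach can be illustrated with a minimal sketch. The abstract does not specify the clustering algorithm or the coherence measure, so everything here is an assumption made for illustration: coherence is taken to be the mean pairwise cosine similarity of a query's pseudo-relevant document vectors, and the names `coherence`, `filter_queries`, and the threshold of 0.8 are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def coherence(doc_vectors):
    """Mean pairwise cosine similarity of one query's pseudo-relevant documents.

    Assumption: this stands in for whatever coherence measure the paper uses.
    """
    pairs = list(combinations(doc_vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def filter_queries(pseudo_relevant, threshold=0.8):
    """Keep the queries whose pseudo-relevant documents are coherent enough.

    The threshold is a hypothetical tuning parameter, not a value from the paper.
    """
    return [q for q, docs in pseudo_relevant.items() if coherence(docs) >= threshold]


# Toy data: each query maps to the feature vectors of its pseudo-relevant documents.
pseudo_relevant = {
    "q1": [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2]],  # tightly grouped documents
    "q2": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],  # scattered documents
}
selected = filter_queries(pseudo_relevant)
```

In this toy setting only `q1` passes the filter, since its documents point in nearly the same direction while those of `q2` do not. A real system would need document representations and a threshold chosen for the collection at hand.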

Acknowledgments

This work is supported in part by the National Natural Science Foundation of China (61103131/61472391), Beijing Natural Science Foundation (4142050) and SRF for ROCS, SEM.

Author information

Authors and Affiliations

University of Chinese Academy of Sciences, Beijing, China
Xin Zhang
China Electronics Technology Group Corporation No.38 Research Institute, Hefei, China
Xin Zhang
School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing, China
Ben He & Tiejian Luo

Corresponding author

Correspondence to Ben He.

About this article

Cite this article

Zhang, X., He, B. & Luo, T. Training query filtering for semi-supervised learning to rank with pseudo labels. World Wide Web 19, 833–864 (2016). https://doi.org/10.1007/s11280-015-0363-z

Received: 15 January 2015
Revised: 28 April 2015
Accepted: 01 July 2015
Published: 26 July 2015
Issue Date: September 2016
DOI: https://doi.org/10.1007/s11280-015-0363-z
