Tutorial: Supervised Learning for Prevalence Estimation

Alejandro Moreo¹⁴ &
Fabrizio Sebastiani¹⁴

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11529)

Included in the following conference series:

International Conference on Flexible Query Answering Systems

Abstract

Quantification is the task of estimating, given a set $\sigma$ of unlabelled items and a set of classes $\mathcal{C}$, the relative frequency (or “prevalence”) $p(c_{i})$ of each class $c_{i}\in \mathcal{C}$. Quantification is important in many disciplines (e.g., market research, political science, the social sciences, and epidemiology) that usually deal with aggregate (as opposed to individual) data. In these contexts, classifying individual unlabelled instances is usually not a primary goal, while estimating the prevalence of the classes of interest in the data is. Quantification may in principle be solved via classification, i.e., by classifying each item in $\sigma$ and counting, for all $c_{i}\in \mathcal{C}$, how many such items have been labelled with $c_{i}$. However, a multitude of works have shown that this “classify and count” (CC) method yields suboptimal quantification accuracy, one of the reasons being that most classifiers are optimized for classification accuracy, not for quantification accuracy. As a result, quantification is no longer considered a mere byproduct of classification, and has evolved into a task in its own right, devoted to designing methods and algorithms that deliver better prevalence estimates than CC. The goal of this tutorial is to introduce the main supervised learning techniques that have been proposed for solving quantification, the metrics used to evaluate them, and the most promising directions for further research.
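The “classify and count” baseline described in the abstract can be illustrated with a minimal sketch. The function name `classify_and_count`, the threshold-based `toy_classifier`, and the sample scores are all hypothetical and introduced here only for illustration; they are not taken from the chapter, and any trained classifier could stand in for the toy one.

```python
from collections import Counter
from typing import Callable, Dict, Hashable, Iterable, Sequence


def classify_and_count(
    items: Iterable,
    classes: Sequence[Hashable],
    classify: Callable[[object], Hashable],
) -> Dict[Hashable, float]:
    """Estimate the prevalence of each class by classifying every item
    and dividing the per-class label counts by the number of items."""
    items = list(items)
    if not items:
        raise ValueError("cannot estimate prevalence on an empty set")
    counts = Counter(classify(x) for x in items)
    # Classes that were never predicted get prevalence 0.0.
    return {c: counts[c] / len(items) for c in classes}


def toy_classifier(score: float) -> str:
    """Hypothetical stand-in for a trained binary classifier."""
    return "positive" if score >= 0.5 else "negative"


# Hypothetical unlabelled set: each item is represented by a single score.
scores = [0.9, 0.2, 0.7, 0.4, 0.1, 0.6, 0.3, 0.2]
estimate = classify_and_count(scores, ["positive", "negative"], toy_classifier)
print(estimate)
```

The estimate is only as good as the classifier's individual decisions: any systematic imbalance between its false positives and false negatives carries over directly into the estimated prevalences, which is one way to see why CC can be suboptimal as a quantifier.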

Author information

Authors and Affiliations

Istituto di Scienza e Tecnologie dell’Informazione, Consiglio Nazionale delle Ricerche, 56124, Pisa, Italy
Alejandro Moreo & Fabrizio Sebastiani

Authors

Alejandro Moreo
Fabrizio Sebastiani

Corresponding author

Correspondence to Fabrizio Sebastiani.

Editor information

Editors and Affiliations

University of Calabria, Rende, Italy
Alfredo Cuzzocrea
University of Calabria, Rende, Italy
Sergio Greco
Legind Technologies, Esbjerg, Denmark
Henrik Legind Larsen
University of Calabria, Rende, Italy
Domenico Saccà
Roskilde University, Roskilde, Denmark
Troels Andreasen
Roskilde University, Roskilde, Denmark
Henning Christiansen

About this paper

Cite this paper

Moreo, A., Sebastiani, F. (2019). Tutorial: Supervised Learning for Prevalence Estimation. In: Cuzzocrea, A., Greco, S., Larsen, H., Saccà, D., Andreasen, T., Christiansen, H. (eds) Flexible Query Answering Systems. FQAS 2019. Lecture Notes in Computer Science, vol 11529. Springer, Cham. https://doi.org/10.1007/978-3-030-27629-4_3

DOI: https://doi.org/10.1007/978-3-030-27629-4_3
Published: 12 September 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-27628-7
Online ISBN: 978-3-030-27629-4
eBook Packages: Computer Science, Computer Science (R0)