Abstract
Field observations form the basis of many scientific studies, especially in the ecological and social sciences. Despite efforts to conduct such surveys in a standardized way, observations can be prone to systematic measurement errors. Removing the systematic variability introduced by the observation process, where possible, can greatly increase the value of these data. Existing non-parametric techniques for correcting such errors assume linear additive noise models. This assumption leads to biased estimates when the techniques are applied to generalized linear models (GLMs). We present an approach based on residual functions to address this limitation. We then demonstrate its effectiveness on synthetic data and show that it reduces systematic detection variability in moth surveys.
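To make the linear additive setting mentioned in the abstract concrete, the following is a minimal sketch in the spirit of sibling regression for that baseline case only. It is not the residual-function method proposed in the paper. The synthetic data, the variable names (`true_a`, `true_b`, `nuisance`), and the use of a single sibling with ordinary least squares are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Hypothetical setup: two "sibling" quantities observed on the same n surveys.
# Their true signals are independent, but the same systematic observation
# effect (e.g. observer effort) is added to both measurements.
true_a = rng.normal(size=n)
true_b = rng.normal(size=n)
nuisance = rng.normal(scale=2.0, size=n)

obs_a = true_a + nuisance
obs_b = true_b + nuisance

# Regress obs_a on obs_b with ordinary least squares (intercept + slope).
X = np.column_stack([np.ones(n), obs_b])
coef, *_ = np.linalg.lstsq(X, obs_a, rcond=None)

# Because true_a is independent of obs_b, the fitted part can only capture
# shared (systematic) variation; subtracting it gives a denoised estimate.
denoised_a = obs_a - X @ coef

# Compare how well the raw and denoised observations track the true signal.
corr_raw = np.corrcoef(obs_a, true_a)[0, 1]
corr_denoised = np.corrcoef(denoised_a, true_a)[0, 1]
print(f"correlation with truth: raw={corr_raw:.2f}, denoised={corr_denoised:.2f}")
```

With a single sibling the correction is only partial, since the sibling's own signal leaks into the estimate. The sketch also relies on the noise entering additively, which is the assumption the abstract identifies as problematic for GLMs.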
Notes
- 1. These properties can be obtained by applying the aforementioned exponential family properties.
- 2. Details in the Appendix.
- 3.
- 4. The global model is included as a rough guide for the best possible generalization performance, even though it does not solve the task of denoising data within each year.
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Shankar, S., Sheldon, D. (2021). Sibling Regression for Generalized Linear Models. In: Oliver, N., Pérez-Cruz, F., Kramer, S., Read, J., Lozano, J.A. (eds) Machine Learning and Knowledge Discovery in Databases. Research Track. ECML PKDD 2021. Lecture Notes in Computer Science, vol 12976. Springer, Cham. https://doi.org/10.1007/978-3-030-86520-7_48
DOI: https://doi.org/10.1007/978-3-030-86520-7_48
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86519-1
Online ISBN: 978-3-030-86520-7
eBook Packages: Computer Science (R0)