Sibling Regression for Generalized Linear Models

Shiv Shankar & Daniel Sheldon

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12976)

Included in the following conference series:

Joint European Conference on Machine Learning and Knowledge Discovery in Databases

Abstract

Field observations form the basis of many scientific studies, especially in the ecological and social sciences. Despite efforts to conduct such surveys in a standardized way, observations can be prone to systematic measurement errors. Removing the systematic variability introduced by the observation process, where possible, can greatly increase the value of the data. Existing non-parametric techniques for correcting such errors assume linear additive noise models, which leads to biased estimates when they are applied to generalized linear models (GLMs). We present an approach based on residual functions to address this limitation. We then demonstrate its effectiveness on synthetic data and show that it reduces systematic detection variability in moth surveys.

Notes

1. These properties can be obtained by applying the aforementioned exponential family properties.
2. Details in the Appendix.
3. https://www.discoverlife.org/moth.
4. The global model is included as a rough guide for the best possible generalization performance, even though it does not solve the task of denoising data within each year.

Author information

Authors and Affiliations

University of Massachusetts, Amherst, MA, 01003, USA
Shiv Shankar & Daniel Sheldon
Mount Holyoke College, South Hadley, MA, 01075, USA
Daniel Sheldon

Corresponding author

Correspondence to Shiv Shankar.

Editor information

Editors and Affiliations

ELLIS - The European Laboratory for Learning and Intelligent Systems, Alicante, Spain
Nuria Oliver
ETHZ and EPFL, Zürich, Switzerland
Fernando Pérez-Cruz
Johannes Gutenberg University of Mainz, Mainz, Germany
Stefan Kramer
École Polytechnique, Palaiseau, France
Jesse Read
Basque Center for Applied Mathematics, Bilbao, Spain
Jose A. Lozano

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 141 KB)

Supplementary material 2 (pdf 289 KB)

Supplementary material 3 (py 8 KB)

About this paper

Cite this paper

Shankar, S., Sheldon, D. (2021). Sibling Regression for Generalized Linear Models. In: Oliver, N., Pérez-Cruz, F., Kramer, S., Read, J., Lozano, J.A. (eds) Machine Learning and Knowledge Discovery in Databases. Research Track. ECML PKDD 2021. Lecture Notes in Computer Science, vol 12976. Springer, Cham. https://doi.org/10.1007/978-3-030-86520-7_48

DOI: https://doi.org/10.1007/978-3-030-86520-7_48
Published: 10 September 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86519-1
Online ISBN: 978-3-030-86520-7
eBook Packages: Computer Science (R0)
