Sibling Regression for Generalized Linear Models

  • Conference paper
Machine Learning and Knowledge Discovery in Databases. Research Track (ECML PKDD 2021)

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 12976))

Abstract

Field observations form the basis of many scientific studies, especially in the ecological and social sciences. Despite efforts to conduct such surveys in a standardized way, observations can be prone to systematic measurement errors. Removing the systematic variability introduced by the observation process, where possible, can greatly increase the value of these data. Existing non-parametric techniques for correcting such errors assume linear additive noise models. This assumption leads to biased estimates when the techniques are applied to generalized linear models (GLMs). We present an approach based on residual functions to address this limitation. We then demonstrate its effectiveness on synthetic data and show that it reduces systematic detection variability in moth surveys.
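As a rough illustration of the additive-noise setting that the abstract describes as the assumption of existing techniques, the sketch below regresses a target series on "sibling" series that share the same observation noise and subtracts the prediction. This is not the residual-function method of the paper, whose details are not given here. The data-generating process, the function name `additive_sibling_denoise`, and all parameter values are assumptions made purely for illustration.

```python
import numpy as np


def additive_sibling_denoise(y_target, y_siblings):
    """Subtract the part of the target that is linearly predictable
    from the sibling observations (additive-noise assumption)."""
    # Design matrix with an intercept column.
    X = np.column_stack([np.ones(len(y_target)), y_siblings])
    coef, *_ = np.linalg.lstsq(X, y_target, rcond=None)
    predicted = X @ coef
    # Keep the target's mean so the output stays on the original scale.
    return y_target - predicted + y_target.mean()


# Hypothetical synthetic data: every series is "own signal + shared noise".
rng = np.random.default_rng(0)
n = 2000
signal = rng.normal(size=n)                # quantity of interest
sibling_signals = rng.normal(size=(n, 3))  # independent of `signal`
noise = rng.normal(scale=2.0, size=n)      # shared systematic error

y_target = signal + noise
y_siblings = sibling_signals + noise[:, None]

denoised = additive_sibling_denoise(y_target, y_siblings)

# Correlation with the true signal before and after the correction.
corr_before = np.corrcoef(signal, y_target)[0, 1]
corr_after = np.corrcoef(signal, denoised)[0, 1]
print(f"before: {corr_before:.3f}, after: {corr_after:.3f}")
```

The subtraction step is only justified because the noise enters additively. With a non-identity link function, as in a Poisson or logistic GLM, the observation effect no longer separates from the signal in this way, which is the limitation the abstract points to.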

Notes

  1. These properties can be obtained by applying the aforementioned exponential family properties.

  2. Details in the Appendix.

  3. https://www.discoverlife.org/moth.

  4. The global model is included as a rough guide for the best possible generalization performance, even though it does not solve the task of denoising data within each year.

Author information

Correspondence to Shiv Shankar.

Copyright information

© 2021 Springer Nature Switzerland AG

Cite this paper

Shankar, S., Sheldon, D. (2021). Sibling Regression for Generalized Linear Models. In: Oliver, N., Pérez-Cruz, F., Kramer, S., Read, J., Lozano, J.A. (eds) Machine Learning and Knowledge Discovery in Databases. Research Track. ECML PKDD 2021. Lecture Notes in Computer Science, vol 12976. Springer, Cham. https://doi.org/10.1007/978-3-030-86520-7_48

  • DOI: https://doi.org/10.1007/978-3-030-86520-7_48

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86519-1

  • Online ISBN: 978-3-030-86520-7

  • eBook Packages: Computer Science, Computer Science (R0)
