8000 Incremental weighted mean and var by panpiort8 · Pull Request #15798 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

Incremental weighted mean and var #15798

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

panpiort8
Copy link
Contributor
@panpiort8 panpiort8 commented Dec 5, 2019

Reference Issues/PRs

Partially adresses: #15601

What does this implement/fix? Explain your changes.

Method partial_fit in StandardScaler can be used multiple times to incrementally update mean and variance, but there was no proper method to do it with sample_weights. This PR introduce _incremental_weighted_mean_and_var, which does the exact thing we want.

Any other comments?

  1. Basic equations I used can be found here, but I haven't found any (free) papers with derivation leading to batch version of them. It's basic math, but code may seem confusing without that sort of explanation. If it's really necessary I can provide brief paper.
  2. NaNs handling is somehow inelegant, but I haven't found any 'weighted' method similar to np.nanvar/np.nanmean.

@panpiort8 panpiort8 changed the title Inremental weighted mean and var [WIP] Inremental weighted mean and var Dec 5, 2019
@panpiort8 panpiort8 changed the title [WIP] Inremental weighted mean and var [WIP] Incremental weighted mean and var Dec 5, 2019
Copy link
Member
@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Do you think then that adding sample_weight support to _incremental_mean_and_var would be more difficult to maintain then creating a separate method?

@panpiort8
Copy link
Contributor Author

I see two possibiliites:

  1. Code within these two methods is quite different, so sample_weight support in _incremental_mean_and_var would be one big if clause conditioned on sample_weight == None. It still requires separate testing, so I don't see much gain in maintainance effort. Moreover that option allows some additional semantic nonsens: you can imagine sequence of calls in which each step input is dependent on previous step output. In each step you feed method with sensible sample_weight, except of the last one, when you accidentally feed method with sample_weight=None. No exception is thrown, but output is neither weighted nor unweighted mean and variance.
  2. Other possiblity is to treat _incremental_mean_and_var as special case of _incremental_weighted_mean_and_var with sample_weights=[1,...,1]. Then maintainance will be much simpler, but I don't believe that's great idea: calculating stats without weights is simpler, faster (I guess) and more accurate (again - I guess)

I would rather have separate methods, but I'm opened for arguments/propostitions.

@panpiort8 panpiort8 force-pushed the inremental_weighted_mean_and_var branch from bd4c1d8 to acd5afb Compare December 9, 2019 18:00
@panpiort8 panpiort8 changed the title [WIP] Incremental weighted mean and var Incremental weighted mean and var Dec 9, 2019
@panpiort8 panpiort8 force-pushed the inremental_weighted_mean_and_var branch 2 times, most recently from a046d24 to 3160df3 Compare December 10, 2019 08:36
@panpiort8 panpiort8 force-pushed the inremental_weighted_mean_and_var branch from 3160df3 to 9d3115f Compare December 10, 2019 09:40
@panpiort8
Copy link
Contributor Author

Testing is done and I believe code is ready to be reviewed.

@jnothman
Copy link
Member

It's really quite inconvenient that you close and reopen this issue. it provides no link between the old and new issues, means that the labels on the previous issue are removed, etc. Please avoid this in the future.

@panpiort8
Copy link
Contributor Author

True, it was misguided decision, mea culpa.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0