[MRG] IterativeImputer: n_iter->max_iter #13061
Conversation
Note that the new test …
I've pushed a proposed fix, but I don't know if it's statistically valid. Opinion appreciated.
So this early stopping criterion avoids dangerous chaos, but does not avoid unnecessary fits and possible minor perturbations when, say, the result has converged but is not about to blow out.
sklearn/impute.py
Outdated
'reached. Using result of round %d' % i_rnd)
break
else:
Xt_previous = np.copy(Xt)
I don't get why we're copying in both the if and the else here. I think maybe this one is needed if Xt keeps being updated, but the other is not.
You're right. We can do this in the `if` branch: `Xt = Xt_previous`.
sklearn/impute.py
Outdated
# stop early if difference between consecutive imputations goes up.
# if so, back off to previous imputation
if not self.sample_posterior:
norm_diff = np.linalg.norm(Xt - Xt_previous) \
Use parentheses rather than \ for line continuation.
Is this an obvious definition of "the difference between imputations"?
It's not obvious but it's documented here: https://academic.oup.com/bioinformatics/article/28/1/112/219101
I'll add it in a comment.
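For reference, here is a minimal sketch of that convergence statistic for continuous data as described in the missForest paper (illustrative helper names and toy data; this is not the code in this PR):

import numpy as np

def missforest_delta(X_new, X_old, missing_mask):
    # Stekhoven & Buehlmann (2012): sum of squared changes in the imputed
    # (originally missing) entries between consecutive rounds, normalized by
    # the sum of squares of the newly imputed values.
    diff = X_new[missing_mask] - X_old[missing_mask]
    return np.sum(diff ** 2) / np.sum(X_new[missing_mask] ** 2)

rng = np.random.RandomState(0)
mask = rng.rand(5, 3) < 0.4          # toy missingness pattern
X_old = rng.rand(5, 3)               # imputations from the previous round
X_new = X_old + 0.01 * rng.randn(5, 3)  # imputations from the current round
print(missforest_delta(X_new, X_old, mask))

missForest stops as soon as this statistic increases from one round to the next and returns the previous round's imputations.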
imputed_values[~good_sigmas] = mus[~good_sigmas]
mus = mus[good_sigmas]
sigmas = sigmas[good_sigmas]
# two types of problems: (1) non-positive sigmas, (2) mus outside
@jnothman @benlawson This is my proposed solution to the problem of having `mus` outside of the legal `min_value` and `max_value`. However, after implementing this, now this test fails: `test_iterative_imputer_truncated_normal_posterior`. So something is off here...
OK, what's happening is that the first prediction via `mus, sigmas = predictor.predict(X_test, return_std=True)` is outside of the min/max, so it never gets to the truncated draw, and thus the imputations from `imputations = np.array([imputer.transform(X)[0][0] for _ in range(1000)])` are all 0.5.
I changed the random seed, which allows the first draw to be between the min/max bounds and thus results in the test passing. I also turned down 1000 draws to 100 as it seemed unnecessarily slow.
I agree that …
One argument against this early stopping criterion is that it broke the …
But now, on my machine, early stopping happens after the 2nd iteration and you get:
Which is clearly wrong based on the training data. OK, I'm open to considering a …
Do you mind if we do some benchmarks? E.g. take Boston, insert some MCAR, and plot the convergence statistic over number of iterations? Could show both the L2 and L∞ norms?
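For concreteness, a rough sketch of one way such a benchmark could be run (hypothetical code, not the script used later in this thread; it substitutes the bundled diabetes dataset for Boston and assumes a scikit-learn version where `IterativeImputer` sits behind the experimental import):

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.RandomState(0)
X, _ = load_diabetes(return_X_y=True)

X_missing = X.copy()
X_missing[rng.rand(*X.shape) < 0.3] = np.nan  # 30% MCAR missingness

prev_fill = None
for max_iter in range(1, 11):
    # with a deterministic imputation_order and sample_posterior=False, the
    # first k rounds are identical across runs, so consecutive fits differ
    # only by the extra round
    imputer = IterativeImputer(max_iter=max_iter, tol=1e-9, random_state=0)
    cur_fill = imputer.fit_transform(X_missing)
    if prev_fill is not None:
        print(max_iter,
              np.linalg.norm(cur_fill - prev_fill),   # L2 norm of the change
              np.max(np.abs(cur_fill - prev_fill)))   # L-infinity norm
    prev_fill = cur_fill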
Do you think we should issue a ConvergenceWarning if we reach max_iter, or is it not really important for imputation?
I hope to give this a bigger look later/tomorrow.
@@ -511,7 +514,7 @@ class IterativeImputer(BaseEstimator, TransformerMixin):
``feat_idx`` is the current feature to be imputed,
``neighbor_feat_idx`` is the array of other features used to impute the
current feature, and ``predictor`` is the trained predictor used for
the imputation. Length is ``self.n_features_with_missing_ * n_iter``.
the imputation. Length is ``self.n_features_with_missing_ * max_iter``.

n_features_with_missing_ : int
Number of features with missing values.
Please document n_iter_
This should be another list item rather than a note in the above attribute description
sklearn/impute.py
Outdated
@@ -511,7 +514,7 @@ class IterativeImputer(BaseEstimator, TransformerMixin):
``feat_idx`` is the current feature to be imputed,
``neighbor_feat_idx`` is the array of other features used to impute the
current feature, and ``predictor`` is the trained predictor used for
the imputation. Length is ``self.n_features_with_missing_ * n_iter``.
the imputation. Length is ``self.n_features_with_missing_ * max_iter``.
max_iter -> n_iter_
So that's a problem of stopping too early rather than too late! Hmm... If it's stopping too early, that basically means that it needs more burn-in for the imputed values to find some equilibrium? So maybe what you want is not tol, but something more like n_iter_no_decrease.
Haha, now you've got me convinced about …
I'll try that for now - it's what you had originally: …
Setting tol as a default of 1e-3 fixes the …
Re: benchmarks. I'll do some tomorrow (it's 9pm here). Re: convergence warning. Can't hurt. I'll find an example and stick it in here.
Might want to have that max (which is an infty-norm so can maybe use the scipy norm implementation) divided by the norm of the non-missing values of the original X to be invariant to scale.
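A small self-contained sketch of that scale-invariant criterion (variable names mirror the snippets quoted elsewhere in this thread; the toy data is made up):

import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(20, 3) * 100                      # original data, arbitrary scale
mask_missing_values = rng.rand(20, 3) < 0.2    # True where a value is missing
Xt_previous = np.where(mask_missing_values, 50.0, X)   # round t-1 imputations
Xt = np.where(mask_missing_values, 50.01, X)           # round t imputations
tol = 1e-3

# scale the tolerance by the largest absolute observed value so that the
# stopping rule does not depend on the units of X
normalized_tol = tol * np.max(np.abs(X[~mask_missing_values]))

# maximum absolute change between consecutive rounds (an infinity norm over
# the flattened difference)
inf_norm = np.max(np.abs(Xt - Xt_previous))

stop_early = inf_norm < normalized_tol
print(inf_norm, normalized_tol, stop_early)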
It would be nice if the max_iter change here was pulled out from the rest..?
@@ -511,7 +514,7 @@ class IterativeImputer(BaseEstimator, TransformerMixin):
``feat_idx`` is the current feature to be imputed,
``neighbor_feat_idx`` is the array of other features used to impute the
current feature, and ``predictor`` is the trained predictor used for
the imputation. Length is ``self.n_features_with_missing_ * n_iter``.
the imputation. Length is ``self.n_features_with_missing_ * max_iter``.

n_features_with_missing_ : int
Number of features with missing values.
This should be another list item rather than a note in the above attribute description
sklearn/tests/test_impute.py
Outdated
verbose=1,
random_state=rng)
X_filled_100 = imputer.fit_transform(X_missing)
assert(len(imputer.imputation_sequence_) == d * 5)
So that the code is clear you can check n_iter_ directly
Can also check that tol=0 results in n_iter_ == max_iter?
and remove the outer brackets
sklearn/impute.py
Outdated
the imputation. Length is ``self.n_features_with_missing_ * n_iter``.
the imputation. Length is ``self.n_features_with_missing_ *
self.n_iter_``. ``self.n_iter_`` is the number of iteration rounds that
actually occurred, taking early stopping into account.
remove actually
sklearn/impute.py
Outdated
Maximum number of imputation rounds to perform before returning the
imputations computed during the final round. A round is a single
imputation of each feature with missing values. The stopping criterion
is met once abs(max(X_i - X_{i-1})) < tol. Note that early stopping is
Put the backticks around the formulation of the stopping criterion
sklearn/impute.py
Outdated
mus_too_high = mus > self._max_value
imputed_values[mus_too_high] = self._max_value
# the rest can be sampled without statistical issues
sample_flag = positive_sigmas & ~mus_too_low & ~mus_too_high
I would use the term mask instead of flag. Maybe inrange_mask
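To make the suggestion concrete, a self-contained sketch of that masked, truncated-normal sampling with `scipy.stats.truncnorm` (toy numbers; not the exact PR code):

import numpy as np
from scipy.stats import truncnorm

rng = np.random.RandomState(0)
min_value, max_value = 0.0, 1.0
mus = np.array([-0.2, 0.3, 0.7, 1.4])     # posterior means
sigmas = np.array([0.1, 0.0, 0.2, 0.1])   # posterior standard deviations

imputed_values = np.empty_like(mus)

positive_sigmas = sigmas > 0
mus_too_low = mus < min_value
mus_too_high = mus > max_value

# degenerate or out-of-range cases fall back to (clipped) posterior means
imputed_values[~positive_sigmas] = mus[~positive_sigmas]
imputed_values[mus_too_low] = min_value
imputed_values[mus_too_high] = max_value

# only the in-range entries are drawn from the truncated normal posterior
inrange_mask = positive_sigmas & ~mus_too_low & ~mus_too_high
a = (min_value - mus[inrange_mask]) / sigmas[inrange_mask]
b = (max_value - mus[inrange_mask]) / sigmas[inrange_mask]
imputed_values[inrange_mask] = truncnorm.rvs(
    a, b, loc=mus[inrange_mask], scale=sigmas[inrange_mask], random_state=rng)
print(imputed_values)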
sklearn/impute.py
Outdated
self.n_iter_ = self.max_iter
if not self.sample_posterior:
Xt_previous = np.copy(Xt)
for i_rnd in range(self.max_iter):
I would prefer a `while` loop: `while criterion > self.tol or self.n_iter_ < self.max_iter`. The reasoning is that we already see that we will have an early stopping mechanism.
You also need `if not self.sample_posterior:`, so it gets messier. I would prefer to leave it as is.
The idiom I have come to enjoy is:

for self.n_iter_ in range(1, self.max_iter + 1):
    update model
    if stopping condition is met:
        break
else:
    warnings.warn(...)
(We have something similar in label_propagation.py, but there the stopping condition applies at the start of the loop)
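A tiny standalone illustration of that for/else idiom (not the PR code itself; the loop body is a stand-in for the model update):

import warnings

max_iter = 10
tol = 1e-3
value, previous = 1.0, None

for n_iter_ in range(1, max_iter + 1):
    previous, value = value, value / 2   # stand-in for "update model"
    if abs(value - previous) < tol:      # stopping condition
        break
else:
    # only reached if the loop never hit `break`
    warnings.warn('Stopping criterion not reached within max_iter rounds.')

print(n_iter_)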
sklearn/impute.py
Outdated
for i_rnd in range(self.n_iter):
self.n_iter_ = self.max_iter
if not self.sample_posterior:
Xt_previous = np.copy(Xt)
`np.copy(Xt)` -> `Xt.copy()`
sklearn/impute.py
Outdated
print('[IterativeImputer] Early stopping criterion '
'reached.')
break
else:
you can remove the else:
I don't think that's true. I need to specify that `Xt_previous = Xt.copy()`, otherwise the subsequent diff will be 0. Try it out.
What I mean is:

if not self.sample_posterior:
    inf_norm = np.linalg.norm(Xt - Xt_previous, ord=np.inf,
                              axis=None)
    if inf_norm < normalized_tol:
        self.n_iter_ = i_rnd + 1
        if self.verbose > 0:
            print('[IterativeImputer] Early stopping criterion '
                  'reached.')
        break
    Xt_previous = Xt.copy()

You don't need `else` since you will `break` in the `if` branch.
Oh! I thought you meant that the entire else branch was not needed. Yup, you're right. I'll fix it.
sklearn/impute.py
Outdated
return Xt

imputations_per_round = len(self.imputation_sequence_) // self.n_iter
imps_per_round = len(self.imputation_sequence_) // self.n_iter_
Keep the full name, this is more explicit. Split the line into two lines.
sklearn/tests/test_impute.py
Outdated
verbose=1,
random_state=rng)
X_filled_5 = imputer.fit_transform(X_missing)
assert_allclose(X_filled_100, X_filled_5, atol=1e-7)
You can add an assert on `n_iter_`. The reason is that we were missing a (+1) in some estimators when reporting the number of iterations; it could be quite useful.
Sorry, what does this mean? I added this:
imputer = IterativeImputer(max_iter=100,
                           tol=0,
                           sample_posterior=False,
                           verbose=1,
                           random_state=rng)
imputer.fit(X_missing)
assert imputer.max_iter == imputer.n_iter_
Yes, that is what I meant. Invert the assert order; usually the expected value comes last:
assert imputer.n_iter_ == imputer.max_iter
@jnothman You asked for some convergence plots. I'm plotting the inf norm and the l2 norm of the difference between consecutive imputations. Here are two plots. Convergence is quite rapid in both cases, but there are minor improvements in the case of California, while Boston stays flat. The code that generates these is below, but you can't run it because I had to define custom class variables.
I don't think it's so weird with sample_posterior... it's maybe even converging on a reasonably stable distribution.
Great. I'll say this is ready for MRG reviews then.
This looks very good otherwise!
sklearn/impute.py
Outdated
Maximum number of imputation rounds to perform before returning the
imputations computed during the final round. A round is a single
imputation of each feature with missing values. The stopping criterion
is met once `abs(max(X_i - X_{i-1}))/abs(max(X[known_vals]))` < tol.
you might need to clarify that X_i is X at iteration i. Might be better to use t to index time?
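For example, the docstring could spell it out along these lines (one possible phrasing, not necessarily what was merged): `max(abs(X_t - X_{t-1})) / max(abs(X[known_vals])) < tol`, where `X_t` is the imputed data matrix at iteration `t` and `X[known_vals]` are the observed (non-missing) values.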
sklearn/impute.py
Outdated
self.n_iter_ = self.max_iter
if not self.sample_posterior:
Xt_previous = np.copy(Xt)
for i_rnd in range(self.max_iter):
The idiom I have come to enjoy is:

for self.n_iter_ in range(1, self.max_iter + 1):
    update model
    if stopping condition is met:
        break
else:
    warnings.warn(...)
(We have something similar in label_propagation.py, but there the stopping condition applies at the start of the loop)
@jnothman I've addressed your two comments. @glemaitre perhaps you'll take a second look?
I thought about that pattern and gave up thinking that the …
All good. Just three little tests for the coverage
@@ -879,7 +907,7 @@ def fit_transform(self, X, y=None):
self.initial_imputer_ = None
X, Xt, mask_missing_values = self._initial_imputation(X)

if self.n_iter == 0:
if self.max_iter == 0:
return Xt
This line is not covered by a test (as previously). Could you add this case to the unit test?
axis=None)
if inf_norm < normalized_tol:
if self.verbose > 0:
print('[IterativeImputer] Early stopping criterion '
We also do not test the verbose printing. Turning `verbose` to 1 could be enough for a couple of tests; we will ensure that printing does not trigger an error.
sklearn/impute.py
Outdated
@@ -942,10 +985,10 @@ def transform(self, X):

X, Xt, mask_missing_values = self._initial_imputation(X)

if self.n_iter == 0:
if self.n_iter_ == 0:
This is also not tested. Could we force `n_iter_` and check that we have `X == Xt`?
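A rough sketch of such a test (hypothetical, not necessarily the test that was added to the PR):

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.RandomState(0)
X = rng.rand(10, 3)
X[0, 0] = np.nan

imputer = IterativeImputer(max_iter=1, random_state=0).fit(X)
imputer.n_iter_ = 0  # force the zero-iterations path in transform
Xt = imputer.transform(X)

# with zero rounds, transform applies only the initial imputation, so all
# observed values pass through unchanged
observed = ~np.isnan(X)
assert np.allclose(Xt[observed], X[observed])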
@glemaitre I've added some tests to cover the cases you mentioned.
That's the error:
Weird this passed locally. Will investigate...
…On Fri, Feb 8, 2019, 10:36 AM Guillaume Lemaitre ***@***.*** wrote:
That's the error:
================================== FAILURES ===================================
_______________________ test_iterative_imputer_verbose ________________________

def test_iterative_imputer_verbose():
n = 10
d = 3
X = sparse_random_matrix(n, d, density=0.10).toarray()
imputer = IterativeImputer(missing_values=0, max_iter=1, verbose=1)
> imputer.fit(X)

X = array([[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.]])
d = 3
imputer = IterativeImputer(imputation_order='ascending', initial_strategy='mean',
max_iter=1, max_value=None, m...earest_features=None, predictor=None, random_state=None,
sample_posterior=False, tol=0.001, verbose=1)
n = 10

c:\python37-x64\lib\site-packages\sklearn\tests\test_impute.py:542:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
c:\python37-x64\lib\site-packages\sklearn\impute.py:1032: in fit
self.fit_transform(X)
c:\python37-x64\lib\site-packages\sklearn\impute.py:930: in fit_transform
normalized_tol = self.tol * np.max(np.abs(X[~mask_missing_values]))
c:\python37-x64\lib\site-packages\numpy\core\fromnumeric.py:2505: in amax
initial=initial)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

obj = array([], dtype=float64), ufunc = <ufunc 'maximum'>, method = 'max'
axis = None, dtype = None, out = None
kwargs = {'initial': <no value>, 'keepdims': <no value>}, passkwargs = {}

def _wrapreduction(obj, ufunc, method, axis, dtype, out, **kwargs):
passkwargs = {k: v for k, v in kwargs.items()
if v is not np._NoValue}

if type(obj) is not mu.ndarray:
try:
reduction = getattr(obj, method)
except AttributeError:
pass
else:
# This branch is needed for reductions like any which don't
# support a dtype.
if dtype is not None:
return reduction(axis=axis, dtype=dtype, out=out, **passkwargs)
else:
return reduction(axis=axis, out=out, **passkwargs)

> return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
E ValueError: zero-size array to reduction operation maximum which has no identity

axis = None
dtype = None
kwargs = {'initial': <no value>, 'keepdims': <no value>}
method = 'max'
obj = array([], dtype=float64)
out = None
passkwargs = {}
ufunc = <ufunc 'maximum'>
Fascinating - it accidentally made a test matrix that had no known values. I guess I should add a check for that and return without doing anything, because you can't make an inference if you have 0 info?
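One way to guard against that corner case (a sketch of the idea only, with a hypothetical helper name, not the exact fix that was committed):

import numpy as np

def max_abs_observed(X, mask_missing_values):
    """Largest absolute observed value, used to scale the tolerance.

    If every entry is missing there is nothing to learn from, so signal the
    caller to skip the iterative rounds instead of calling np.max on an
    empty array (which raises ValueError).
    """
    if np.all(mask_missing_values):
        return None  # caller should fall back to the initial imputation
    return np.max(np.abs(X[~mask_missing_values]))

X = np.full((10, 3), np.nan)
mask = np.isnan(X)
assert max_abs_observed(X, mask) is None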
@glemaitre I fixed that issue and made a few more very small changes to prevent other problems. Plus one more test.
Co-Authored-By: sergeyf <sergeyfeldman@gmail.com>
Co-Authored-By: sergeyf <sergeyfeldman@gmail.com>
Thanks, applied.
@jnothman @glemaitre I renamed …
Only one nitpick otherwise LGTM
d = 3
X = sparse_random_matrix(n, d, density=0.10).toarray()
X = sparse_random_matrix(n, d, density=0.10, random_state=rng).toarray()
imputer = IterativeImputer(missing_values=0, max_iter=1, verbose=1)
Could you put verbose at 2? I think that we have a case where we want verbose > 1.
I did this two lines down (545):
imputer.verbose = 2
imputer.transform(X)
Nevermind, I'll just do it twice.
@glemaitre Ready to merge!
Thanks once more @sergeyf
This PR addresses a number of discussions, most recently in #11977.
The main purpose of this PR is to provide automatic early stopping. We are going to be using the early stopping rule that is used in the missForest R package.
This is a sensible criterion, and has the added benefit of not needing to specify a `tol`. Note that this criterion is not applied when `sample_posterior=True` because there is no steady state.
This PR does a few more things:
- `sample_posterior=True` now uses a truncated normal (`truncnorm`) to sample from the posterior.