[MRG] Add Yeo-Johnson transform to PowerTransformer by NicolasHug · Pull Request #11520 · scikit-learn/scikit-learn · GitHub

Merged: 44 commits merged into scikit-learn:master from NicolasHug:yeojohnson on Jul 20, 2018

Conversation

@NicolasHug (Member) commented Jul 14, 2018

Reference Issues/PRs

Closes #10261

What does this implement/fix? Explain your changes.

This PR implements the Yeo-Johnson transform as part of the PowerTransformer class.

PowerTransformer currently only supports Box-Cox, which only works for strictly positive values; Yeo-Johnson works on the whole real line.

Original paper: link.
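For reference, the transform from the original Yeo & Johnson (2000) paper is defined piecewise as (LaTeX rendering):

\psi(x; \lambda) =
\begin{cases}
\bigl((x + 1)^{\lambda} - 1\bigr) / \lambda, & x \ge 0,\ \lambda \ne 0 \\
\log(x + 1), & x \ge 0,\ \lambda = 0 \\
-\bigl((1 - x)^{2 - \lambda} - 1\bigr) / (2 - \lambda), & x < 0,\ \lambda \ne 2 \\
-\log(1 - x), & x < 0,\ \lambda = 2
\end{cases}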

TODO:

  • Write transform
  • Fix lambda param estimation
  • Write inverse transform
  • Write docs
  • Write tests
  • Update examples

Any other comments?

The lambda parameter estimation is a bit tricky and currently does not work (should be OK now, see below). Unlike for Box-Cox, there's no scipy built-in that we can rely on. I'm having a hard time finding decent guidelines; I tried to implement likelihood maximization with the Brent optimizer (just like for Box-Cox) but ran into overflow issues.
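For the record, a minimal sketch (not the PR's code; the helper names, the bracketing interval and the test data are illustrative) of the likelihood maximization described above, using the log-likelihood from the original paper and scipy's Brent optimizer, the same optimizer scipy uses for Box-Cox:

import numpy as np
from scipy import optimize

def yeo_johnson(x, lmbda):
    # piecewise Yeo-Johnson transform (see formula above); x is a float array
    out = np.zeros_like(x)
    pos = x >= 0
    if abs(lmbda) < 1e-19:
        out[pos] = np.log1p(x[pos])
    else:
        out[pos] = ((x[pos] + 1) ** lmbda - 1) / lmbda
    if abs(lmbda - 2) < 1e-19:
        out[~pos] = -np.log1p(-x[~pos])
    else:
        out[~pos] = -((1 - x[~pos]) ** (2 - lmbda) - 1) / (2 - lmbda)
    return out

def neg_log_likelihood(lmbda, x):
    # log-likelihood of the transformed data under a fitted normal,
    # plus the Jacobian term, with constant terms dropped
    psi = yeo_johnson(x, lmbda)
    n = x.shape[0]
    loglike = -n / 2 * np.log(psi.var())
    loglike += (lmbda - 1) * (np.sign(x) * np.log1p(np.abs(x))).sum()
    return -loglike

x = np.random.RandomState(0).lognormal(size=1000) - 1  # skewed data with negatives
lmbda = optimize.brent(neg_log_likelihood, args=(x,), brack=(-2, 2))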

The transform code seems to work though:

[figure: Box-Cox and Yeo-Johnson transforms for several values of lambda]

which is a reproduction of a figure from Quantile regression via vector generalized additive models by Thomas W. Yee.

Code for figure (hacky):

import numpy as np
from sklearn.preprocessing import PowerTransformer
import matplotlib.pyplot as plt

yj = PowerTransformer(method='yeo-johnson', standardize=False)
bc = PowerTransformer(method='box-cox', standardize=False)

X = np.arange(-4, 4, .1).reshape(-1, 1)
fig, axes = plt.subplots(ncols=2)

for lmbda in (0, .5, 1, 1.5, 2):
    X_pos = X[X > 0].reshape(-1, 1)
    bc.fit(X_pos)
    bc.lambdas_ = [lmbda]
    X_trans = bc.transform(X_pos)
    axes[0].plot(X_pos, X_trans, label=r'$\lambda = {}$'.format(lmbda))
    axes[0].set_title('Box-Cox')

    yj.fit(X)
    yj.lambdas_ = [lmbda]
    X_trans = yj.transform(X)
    axes[1].plot(X, X_trans, label=r'$\lambda = {}$'.format(lmbda))
    axes[1].set_title('Yeo-Johnson')

for ax in axes:
    ax.set(xlim=[-4, 4], ylim=[-5, 5], aspect='equal')
    ax.legend()
    ax.grid()

plt.show()

The issue was due to an error in the log-likelihood function.
@NicolasHug (Member Author)

Lambda param estimation should be fixed now, thanks @amueller.

Replication of this example with Yeo-Johnson instead of Box-Cox:

[figure: the existing Box-Cox example reproduced with Yeo-Johnson]

# get rid of them to compute them.
_, lmbda = stats.boxcox(col[~np.isnan(col)], lmbda=None)
col_trans = boxcox(col, lmbda)
else: # neo-johnson
Member:

yeo?

@ogrisel (Member) Jul 15, 2018:

Follow the white rabbit.

Member:

this took me a while.

@amueller (Member) left a comment:

We think it's working now, right? So we need a test for the optimization, and then documentation and adding it to an example?

# when x >= 0
if lmbda < 1e-19:
out[pos] = np.log(x[pos] + 1)
else: #lmbda != 0
Member:

space after #

n = x.shape[0]

# Estimated mean and variance of the normal distribution
mu = psi.sum() / n
Member:

do we need from __future__ import division?

Member Author:

it's here already

Member:

thanks, was hard to see from the diff and I was lazy ;)

@@ -2076,7 +2078,7 @@ def test_power_transformer_strictly_positive_exception():
pt.fit, X_with_negatives)

assert_raise_message(ValueError, not_positive_message,
power_transform, X_with_negatives)
power_transform, X_with_negatives, 'box-cox')
Member:

why is this needed? The default value shouldn't change, right? Or do we want to start a cycle to change the default to yeo-johnson?

Member Author:

I find it clearer and more explicit.

I don't know if we'll change the default, but it should still be fine as PowerTransformer hasn't been released yet AFAIK.

Member:

good point. We should discuss before the release. I think yeo-johnson would make more sense.

Member:

The fact that Yeo-Johnson accepts negative values while Box-Cox does not makes me feel like we should use it by default. From a usability point of view, it's nicer to our users.

Member Author:

I have the same feeling. Plus, it is designed to be a generalization of Box-Cox, even though that's not strictly the case.

Member:

+1

Member Author:

shall I change the default then?

Member:

think so.

Member Author:

Done

pt = PowerTransformer(method=method, standardize=False)
pt.lambdas_ = [lmbda]
X_inv = pt.inverse_transform(X)
pt.lambdas_ = [9999] # just to make sure
Member:

Why not:

del pt.lambdas_

Member:

Alternatively, create a new pt object from scratch to make the motivation of the test easier to read:

ground_truth_transform = PowerTransformer(method=method, standardize=False)
ground_truth_transform.lambdas_ = [lmbda]
X_inv = ground_truth_transform.inverse_transform(X)

estimated_transform = PowerTransformer(method=method, standardize=False)
X_inv_trans = estimated_transform.fit_transform(X_inv) 

X_inv_trans = pt.fit_transform(X_inv)

assert_almost_equal(0, np.linalg.norm(X - X_inv_trans) / n_samples,
decimal=2)
Member:

Please also add an assertion that checks that X_inv_trans.mean(axis=0) is close to [0.] and X_inv_trans.std(axis=0) is close to [1.].
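A hypothetical, standalone sketch of the kind of assertion being requested (with standardize=True, the default, the output should be centered and scaled; numpy.testing is used here rather than the test module's helpers):

import numpy as np
from numpy.testing import assert_almost_equal
from sklearn.preprocessing import PowerTransformer

rng = np.random.RandomState(0)
X = rng.lognormal(size=(1000, 1))  # skewed data

X_trans = PowerTransformer(method='yeo-johnson').fit_transform(X)

# with standardize=True (the default), the output is centered and scaled
assert_almost_equal(X_trans.mean(axis=0), [0.], decimal=2)
assert_almost_equal(X_trans.std(axis=0), [1.], decimal=2)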


rng = np.random.RandomState(0)
n_samples = 1000
X = rng.normal(size=(n_samples, 1))
Member:

to make the test more explicit you can write: X = rng.normal(loc=0., scale=1., size=(n_samples, 1))

@amueller (Member)

If we want this to be the default then this is a blocker, right?

lmbda_no_nans = pt.lambdas_[0]

# concat nans at the end and check lambda stays the same
X = np.concatenate([X, np.full_like(X, np.nan)])
Member:

To make sure that the location of the NaNs does not impact the estimation:

from sklearn.utils import shuffle
...

X = np.concatenate([X, np.full_like(X, np.nan)])
X = shuffle(X, random_state=0)

Member Author:

Done



@pytest.mark.parametrize("method, lmbda", [('box-cox', .5),
('yeo-johnson', .1)])
Member:

Could you add more values for lmbda for each method? E.g.:

[
 ('box-cox', .1),
 ('box-cox', .5),
 ('yeo-johnson', .1),
 ('yeo-johnson', .5),
 ('yeo-johnson', 1.),
]

applied to six different probability distributions: Lognormal, Chi-squared,
Weibull, Gaussian, Uniform, and Bimodal.
The power transform is useful as a transformation in modeling problems where
homoscedasticity and normality are desired. Below are examples of Box-Cox and
@ogrisel (Member) Jul 15, 2018:

I don't understand what "modeling problems where homoscedasticity is desired" means in this context: to me, heteroscedasticity is a property of the noise of the output variable that is not the same for different regions of the input space of a conditional model.

It is not obvious how a power transform can improve homoscedasticity.

Member:

Actually, this statement seems to be correct:

http://article.sapub.org/10.5923.j.ajms.20180801.02.html

It might be interesting to try to come up with a good example to show this corrective effect in a (maybe synthetic) linear regression problem. However, this is probably outside of the scope of the current PR.
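A rough synthetic sketch of the kind of example being described (not part of this PR; the data-generating process and numbers are made up): a target with multiplicative, right-skewed noise whose spread grows with the feature, where a Yeo-Johnson transform of the target makes the conditional spread more uniform:

import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.RandomState(0)
x = np.repeat([2.0, 8.0], 1000)
# multiplicative lognormal noise: the spread of y grows with x
y = (2 * x + 1) * rng.lognormal(sigma=0.5, size=x.shape[0])

yt = PowerTransformer(method='yeo-johnson').fit_transform(y.reshape(-1, 1)).ravel()

for val in (2.0, 8.0):
    mask = x == val
    print('x = {}: std(y) = {:.2f}, std(transformed y) = {:.2f}'.format(
        val, y[mask].std(), yt[mask].std()))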

@amueller amueller added this to the 0.20 milestone Jul 15, 2018
@amueller (Member)

Tagged for 0.20 and added the blocker label. I don't like that we keep adding stuff, but if we want to make it the default we should do it now.

@glemaitre (Member) left a comment:

A couple of open comments.
If I am not wrong, we should have something in the common estimator_checks which forces the input to be positive to work with box-cox. We probably want to change this behavior when we change the default.

The power transform method. Currently, 'box-cox' (Box-Cox transform)
is the only option available.
method : str, (default='yeo-johnson')
The power transform method. Available methods are 'box-cox' and
Member:

We can maybe have a bullet point list for each method referring to the reference section.

@@ -2490,12 +2494,18 @@ def fit(self, X, y=None):
self.lambdas_ = []
transformed = []

opt_fun = {'box-cox': self._box_cox_optimize,
Member:

I would have expected func instead of fun :)

Member:

optim_function

opt_fun = {'box-cox': self._box_cox_optimize,
'yeo-johnson': self._yeo_johnson_optimize
}[self.method]
trans_fun = {'box-cox': boxcox,
Member:

probably transform_function is not too long a name to use


return x_inv

def _yeo_johnson_transform(self, x, lmbda):
Member:

can we not just define the forward transform and take 1 / _yeo_johnson_transform for the inverse?

Member Author:

The inverse here means f^{-1}, not 1 / f
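For context, the closed-form inverse follows from the piecewise definition; since the transform preserves the sign of x, the sign of the transformed value selects the branch:

\psi^{-1}(y; \lambda) =
\begin{cases}
(\lambda y + 1)^{1/\lambda} - 1, & y \ge 0,\ \lambda \ne 0 \\
\exp(y) - 1, & y \ge 0,\ \lambda = 0 \\
1 - \bigl(1 - (2 - \lambda) y\bigr)^{1/(2 - \lambda)}, & y < 0,\ \lambda \ne 2 \\
1 - \exp(-y), & y < 0,\ \lambda = 2
\end{cases}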

"""Return the negative log likelihood of the observed data x as a
function of lambda."""
psi = self._yeo_johnson_transform(x, lmbda)
n = x.shape[0]
Member:

n_samples instead

"""Return the negative log likelihood of the observed data x as a
function of lambda."""
psi = self._yeo_johnson_transform(x, lmbda)
n = x.shape[0]
Member:

Uhm missing x most probably

Member:

Oh I see, can we pass x as an argument as well as in the optimize function?

Member Author:

yes, this is a nested function so x is implicitly passed anyway


# Estimated mean and variance of the normal distribution
mu = psi.sum() / n
sig_sq = np.power(psi - mu, 2).sum() / n
Member:

Stupid question: is sig_sq the variance? If this is the case, you might want to call it var

Member Author:

I was following the paper's notation. Should I use mean (or mean_) also then?

@NicolasHug (Member Author) commented Jul 16, 2018

I made a quick example to illustrate the use of Yeo-Johnson vs. Box-Cox + offset.

As Box-Cox only accepts strictly positive data, one solution is to shift the data by a fixed offset so that it becomes positive (typically by -min(data) + eps):

[figure: raw data vs. offset + Box-Cox vs. Yeo-Johnson]
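Roughly, the comparison above amounts to the following sketch (the offset value and the data are illustrative):

import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.RandomState(0)
X = rng.normal(loc=0, scale=2, size=(1000, 1))  # contains negative values

# Box-Cox needs strictly positive input, so shift so the minimum becomes eps
eps = 1e-8
X_shifted = X - X.min() + eps

X_bc = PowerTransformer(method='box-cox', standardize=False).fit_transform(X_shifted)
X_yj = PowerTransformer(method='yeo-johnson', standardize=False).fit_transform(X)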

One thing we see is that the "after offset and Box-Cox" result isn't as symmetric as the Yeo-Johnson one, and most importantly the values are much higher.

Is it worth adding this as an example @amueller? TBH I wouldn't be able to mathematically or intuitively explain those results.

@GaelVaroquaux (Member)

+1 on comments by @glemaitre . Also, the tests are failing.

@TomDLT (Member) left a comment:

Thanks for the added tests.
We might want to add them as common tests at some point, but it might be for another pull-request.

self._scaler = StandardScaler()
if force_compute_transform:
transformed = self._scaler.fit_transform(transformed)
self._scaler = StandardScaler(copy=self.copy)
Member:

actually you should be able to use copy=False here, since a copy has already been done just before.

@glemaitre (Member) left a comment:

Couple of changes

"""Return inverse-transformed input x following Yeo-Johnson inverse
transform with parameter lambda.

Note
Member:

Notes

"""Return transformed input x following Yeo-Johnson transform with
parameter lambda.

Note
Member:

Notes

@@ -2566,7 +2720,8 @@ def _check_input(self, X, check_positive=False, check_shape=False,
X : array-like, shape (n_samples, n_features)

check_positive : bool
If True, check that all data is positive and non-zero.
If True, check that all data is positive and non-zero (only if
self.method is box-cox).
Member:

only if self.method=='box-cox'

@glemaitre (Member)

I am waiting to check the example in the documentation

@NicolasHug (Member Author) commented Jul 18, 2018

@glemaitre @ogrisel, I think the plot looks pretty OK now.

@jnothman (Member)

I'm finding the plots in plot_map_data_to_normal relatively hard to navigate intuitively. It's not a blocker, but I think it needs to look more tabular: at the moment it takes some effort to see that each row is a different transformation; a label on the left of the row would be more helpful.

Also, having the transformations go from left to right and the datasets from top to bottom doesn't look like it would be infeasible, and would be more familiar from plot_cluster_comparison etc.

@jnothman (Member) left a comment:

Can I clarify why plot_all_scaling still only shows box-cox?

@NicolasHug (Member Author)

Also, having the transformations go from left to right and the datasets from top to bottom doesn't look like it would be infeasible, and would be more familiar from plot_cluster_comparison etc.

Personally I find it easier to compare the transformations when they're stacked on top of each other, especially since the axis limits are uniform across the plots.

I don't have anything against having the transformation names on the left. It would also make sense to me to have one dataset per column (limiting the plot to 4 rows instead of 8), but that would make the plot wider, which can be annoying on mobile.

Thanks for mentioning plot_all_scaling, I missed that one.

@NicolasHug (Member Author) commented Jul 19, 2018

Looks like 14e7c32 broke plot_all_scaling on master:

Traceback (most recent call last):
  File "examples/preprocessing/plot_all_scaling.py", line 71, in <module>
    dataset = fetch_california_housing()
  File "/home/nico/dev/sklearn/sklearn/datasets/california_housing.py", line 128, in fetch_california_housing
    cal_housing = joblib.load(filepath)
  File "/home/nico/dev/sklearn/sklearn/externals/joblib/numpy_pickle.py", line 578, in load
    obj = _unpickle(fobj, filename, mmap_mode)
  File "/home/nico/dev/sklearn/sklearn/externals/joblib/numpy_pickle.py", line 508, in _unpickle
    obj = unpickler.load()
  File "/usr/lib64/python3.6/pickle.py", line 1050, in load
    dispatch[key[0]](self)
  File "/usr/lib64/python3.6/pickle.py", line 1338, in load_global
    klass = self.find_class(module, name)
  File "/usr/lib64/python3.6/pickle.py", line 1388, in find_class
    __import__(module, level=0)
ModuleNotFoundError: No module named 'sklearn.externals._joblib.numpy_pickle'

Should I open an issue for this? I'm not sure if this comes from my env (I created a new one from scratch, same error). Doesn't the CI check that all the examples are passing?

@amueller (Member)

@NicolasHug it's now "fixed" but you need to remove your scikit_learn_data folder in your home folder.

@NicolasHug (Member Author)

Thanks, just updated plot_all_scaling

@ogrisel (Member) commented Jul 20, 2018

The matplotlib rendering of the 2 examples is good enough for now:

https://29575-843222-gh.circle-artifacts.com/0/doc/auto_examples/index.html#preprocessing

Merging. Thanks @NicolasHug for this nice contribution!

@ogrisel ogrisel merged commit 2d232ac into scikit-learn:master Jul 20, 2018
@amueller (Member)

yay!

@ogrisel (Member) commented Jul 20, 2018

I agree with @jnothman (#11520 (comment)) that using a layout similar to the cluster comparison plot would improve the readability even further but I don't want to delay the release for this.

@NicolasHug NicolasHug deleted the yeojohnson branch July 20, 2018 20:35
@GaelVaroquaux (Member) commented Jul 20, 2018 via email
