
[MRG] Fixes to YeoJohnson transform #12522


Merged: 7 commits merged into scikit-learn:master on Nov 8, 2018

Conversation

NicolasHug (Member)

Reference Issues/PRs

What does this implement/fix? Explain your changes.

While implementing the Yeo-Johnson transformation in scipy (scipy/scipy#9305), I noticed a few mistakes I made in my first implementation here (#11520).

This PR:

  • fixes the comparison between lambda and 0/2; I was mistaken in thinking that lambda was necessarily in [0, 2], which isn't the case (see the sketch after this list)
  • uses the more appropriate numpy builtins log1p and spacing(1.), as advised during the scipy PR reviews
  • adds a sanity-check test from the original Yeo-Johnson paper
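
For context, here is a minimal sketch of the positive-input branch (illustrative only, not the exact scikit-learn code): the fix is to compare |lambda| against machine epsilon rather than assuming lambda is non-negative, and to use log1p for the lambda -> 0 limit.

import numpy as np

def yeo_johnson_positive_branch(x, lmbda):
    # Sketch of the x >= 0 branch of the Yeo-Johnson transform.
    # Compare |lambda| to eps: the optimizer may return negative lambdas,
    # so a one-sided check like `lmbda < 1e-19` is wrong.
    if abs(lmbda) < np.spacing(1.):
        return np.log1p(x)  # limit of ((x + 1)**lmbda - 1) / lmbda as lmbda -> 0
    return ((x + 1) ** lmbda - 1) / lmbda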

Any other comments?

There's a numerical instability issue in scipy's boxcox (scipy/scipy#6873) that is reproducible for Yeo-Johnson as well. I'm keeping an eye on this and will port the fix once it's merged in scipy.

@jnothman (Member) left a comment

Needs a what's new

@@ -2345,6 +2345,16 @@ def test_optimization_power_transformer(method, lmbda):
assert_almost_equal(1, X_inv_trans.std(), decimal=1)


def test_yeo_johnson_darwin_example():

Yes, we should try to insist on such tests where possible, but we often forget

@@ -148,6 +148,13 @@ Changelog
in version 0.23. A FutureWarning is raised when the default value is used.
:issue:`12317` by :user:`Eric Chang <chang>`.

- |Fix| Fixed bug in :class:`preprocessing.OrdinalEncoder` when passing

Merge weirdness?

@amueller added this to the 0.20.1 milestone Nov 7, 2018
pos = x >= 0

# when x >= 0
- if lmbda < 1e-19:
+ if abs(lmbda) < np.spacing(1.):
Member: ok so the abs here is the bugfix, right, and the rest is minor stability changes?

NicolasHug (Member Author): yes

@amueller (Member) left a comment

lgtm

@jnothman jnothman merged commit 042843a into scikit-learn:master Nov 8, 2018
@jnothman (Member) commented Nov 8, 2018

Thanks @NicolasHug

thoo added a commit to thoo/scikit-learn that referenced this pull request Nov 9, 2018
…ybutton

* upstream/master:
  FIX YeoJohnson transform lambda bounds (scikit-learn#12522)
  [MRG] Additional Warnings in case OpenML auto-detected a problem with dataset  (scikit-learn#12541)
  ENH Prefer threads for IsolationForest (scikit-learn#12543)
  joblib 0.13.0 (scikit-learn#12531)
  DOC tweak KMeans regarding cluster_centers_ convergence (scikit-learn#12537)
  DOC (0.21) Make sure plot_tree docs are generated and fix link in whatsnew (scikit-learn#12533)
  ALL Add HashingVectorizer to __all__ (scikit-learn#12534)
  BLD we should ensure continued support for joblib 0.11 (scikit-learn#12350)
  fix typo in whatsnew
  Fix dead link to numpydoc (scikit-learn#12532)
  [MRG] Fix segfault in AgglomerativeClustering with read-only mmaps (scikit-learn#12485)
  MNT (0.21) OPTiCS change the default `algorithm` to `auto` (scikit-learn#12529)
  FIX SkLearn `.score()` method generating error with Dask DataFrames (scikit-learn#12462)
  MNT KBinsDiscretizer.transform should not mutate _encoder (scikit-learn#12514)
thoo pushed a commit to thoo/scikit-learn that referenced this pull request Nov 14, 2018
jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Nov 14, 2018
@gammadistribution commented Jan 31, 2019

So, the Yeo-Johnson transformation parameter MUST be between 0 and 2 to ensure that the domain of the inverse transform is the whole real line. What was the reasoning for removing the constraints during the optimization for lambda?

@NicolasHug (Member Author)

What do you mean "on the real line"? How could the transformation give complex output if lambda is not in [0, 2]?

@NicolasHug (Member Author)

Also note that there never was any constraint on lambda to be in [0, 2]. This PR just fixes the transformations, not the optimization procedure for lambda.

The scipy brent optimizer accepts a bracket argument (here (-2, 2)), but as you can read in the docs, that doesn't mean the returned value will lie in that interval.
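
For illustration, a small sketch (not the scikit-learn internals) showing that scipy.optimize.brent can return a minimizer outside the bracket it was seeded with:

from scipy import optimize

# The minimum of this toy objective is at x = 10, well outside (-2, 2);
# the bracket only seeds the search, so brent still finds x ~= 10.
x_min = optimize.brent(lambda x: (x - 10) ** 2, brack=(-2, 2))
print(x_min)  # approximately 10.0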

@gammadistribution commented Jan 31, 2019

Ok got it,

x_trans = 2
lmbda = -0.9900990099009901

If x_trans = 2 and lmbda = -0.9900990099009901, then performing the inverse transform would mean raising a negative number to a non-integer negative power, which would be complex.
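
Concretely, plugging those numbers into the positive-branch inverse formula (a quick sketch, not the library code):

import numpy as np

x_trans = 2
lmbda = -0.9900990099009901

# inverse of the positive branch for lmbda != 0:
#   x = (x_trans * lmbda + 1) ** (1 / lmbda) - 1
base = x_trans * lmbda + 1            # about -0.98, i.e. negative
print(np.power(base, 1 / lmbda) - 1)  # nan: negative base, non-integer exponent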

@gammadistribution

Also, you could use fminbound to perform the optimization and respect the [0, 2] constraint (see the sketch below).
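
As a toy sketch of that suggestion (not the actual scikit-learn optimizer, and using a stand-in objective rather than the Yeo-Johnson log-likelihood): unlike brent with a bracket, scipy.optimize.fminbound performs bounded minimization, so the result stays inside the given interval.

from scipy import optimize

# Unconstrained minimum is at x = 10, outside [0, 2]; fminbound
# keeps the returned value inside the bounds (here, near lambda = 2).
lmbda_opt = optimize.fminbound(lambda lmbda: (lmbda - 10) ** 2, 0, 2)
print(lmbda_opt)  # approximately 2.0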

@NicolasHug (Member Author)

Would you have a reproducing example that produces such an x_trans and lambda?

@gammadistribution

Logit-transforming this data and then fitting the Yeo-Johnson transform produces a lambda value of -1.42272153. Trying to inverse_transform anything greater than -1 / -1.42272153 = 0.7028782364740063 will produce a nan, since the inverse operation will attempt to take a root of a negative value.

success_rate.txt

@NicolasHug (Member Author)

Do you have an actual code example? I cannot reproduce your results.

import pandas as pd
import numpy as np
from sklearn.preprocessing import PowerTransformer

# success_rate.txt is the file attached above
df = pd.read_csv('success_rate.txt')

X = np.array(df.success_rate).reshape(-1, 1)

t = PowerTransformer(method='yeo-johnson')
t.fit(X)
print(t.lambdas_)

X_trans = t.transform(X)
# number of transformed values above the -1 / lambda threshold
print((X_trans > (-1 / t.lambdas_)).sum())

# number of inverse-transformed values that come back complex
print(np.iscomplex(t.inverse_transform(X_trans)).sum())

Output:

[-1.42272153]
1327
0

Also logit-transforming does not give you the expected lambda

@NicolasHug (Member Author)

The reason I'm asking for actual code is that you might just be trying to inverse-transform values that aren't in the range of the transformation to begin with, in which case it's expected that you end up with complex or otherwise invalid values.

@gammadistribution commented Feb 1, 2019 via email

@NicolasHug (Member Author)

If you only get complex values when you try to inverse-transform values that shouldn't be inverse-transformed (that is, basically breaking the math), then I don't think we should be handling this use-case.

However, if it's possible to get complex values by inverse-transforming values that are in the range of possible outputs of transform, then we should take a closer look at this problem.

That being said, a simple workaround on your side would be, e.g.:

t.lambdas_ = np.clip(t.lambdas_, 0, 2)

@gammadistribution

I don't think that workaround works so simply when scaling is involved, since the scaler would have to be re-fit, IIRC. What I'm getting at is that if I want the range of my transformation to be the whole real line, so that I can take the inverse transformation of any real number, then I need 0 <= lambda <= 2.

If this is something you don't want to support, then I guess that's fine; I was just wondering about the rationale.

xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019