[MRG] Fixes to YeoJohnson transform #12522
Needs a what's new
```diff
@@ -2345,6 +2345,16 @@ def test_optimization_power_transformer(method, lmbda):
     assert_almost_equal(1, X_inv_trans.std(), decimal=1)
 
 
+def test_yeo_johnson_darwin_example():
```
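For context, `test_yeo_johnson_darwin_example` presumably checks the worked example from the original Yeo & Johnson (2000) paper (Darwin's plant-height data, for which the paper reports lambda ≈ 1.305). A sketch of what such a test could look like, assuming those values from the paper; this is not necessarily the exact test merged in this PR:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer


def test_yeo_johnson_darwin_example():
    # Data and expected lambda (~1.305) taken from the Yeo & Johnson
    # (2000) paper; values assumed here, not copied from this PR.
    X = [6.1, -8.4, 1.0, 2.0, 0.7, 2.9, 3.5, 5.1, 1.8, 3.6, 7.0, 3.0,
         9.3, 7.5, -6.0]
    X = np.array(X).reshape(-1, 1)
    lmbda = PowerTransformer(method='yeo-johnson').fit(X).lambdas_
    assert np.allclose(lmbda, 1.305, atol=1e-3)
```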
Yes, we should try to insist on such tests where possible, but we often forget.
doc/whats_new/v0.20.rst (Outdated)
```diff
@@ -148,6 +148,13 @@ Changelog
   in version 0.23. A FutureWarning is raised when the default value is used.
   :issue:`12317` by :user:`Eric Chang <chang>`.
 
+- |Fix| Fixed bug in :class:`preprocessing.OrdinalEncoder` when passing
```
Merge weirdness?
```diff
         pos = x >= 0
 
         # when x >= 0
-        if lmbda < 1e-19:
+        if abs(lmbda) < np.spacing(1.):
```
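The fix matters because the fitted lambda can be negative: the old check classified any negative lambda as "lambda is zero" and took the log branch. A minimal illustration with a hypothetical negative lambda:

```python
import numpy as np

lmbda = -0.5  # hypothetical fitted lambda; negative values are possible

# Old check: any negative lambda satisfies this, so the code wrongly
# selected the "lambda == 0" (log) branch.
print(lmbda < 1e-19)                # True

# Fixed check: only lambdas that are numerically zero pass.
print(abs(lmbda) < np.spacing(1.))  # False
```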
ok so the `abs` here is the bugfix, right, and the rest is minor stability changes?
yes
lgtm
Thanks @NicolasHug
…ybutton

* upstream/master:
  FIX YeoJohnson transform lambda bounds (scikit-learn#12522)
  [MRG] Additional Warnings in case OpenML auto-detected a problem with dataset (scikit-learn#12541)
  ENH Prefer threads for IsolationForest (scikit-learn#12543)
  joblib 0.13.0 (scikit-learn#12531)
  DOC tweak KMeans regarding cluster_centers_ convergence (scikit-learn#12537)
  DOC (0.21) Make sure plot_tree docs are generated and fix link in whatsnew (scikit-learn#12533)
  ALL Add HashingVectorizer to __all__ (scikit-learn#12534)
  BLD we should ensure continued support for joblib 0.11 (scikit-learn#12350)
  fix typo in whatsnew
  Fix dead link to numpydoc (scikit-learn#12532)
  [MRG] Fix segfault in AgglomerativeClustering with read-only mmaps (scikit-learn#12485)
  MNT (0.21) OPTiCS change the default `algorithm` to `auto` (scikit-learn#12529)
  FIX SkLearn `.score()` method generating error with Dask DataFrames (scikit-learn#12462)
  MNT KBinsDiscretizer.transform should not mutate _encoder (scikit-learn#12514)
So, the Yeo-Johnson transformation parameter MUST be between 0 and 2 to ensure that the domain of the inverse transform is the whole real line. What was the reasoning for removing the constraints during the optimization for lambda?
What do you mean "on the real line"? How could the transformation give complex output if lambda is not in [0, 2]?
Also note that there never was any constraint on lambda to be in [0, 2]. This PR just fixes the transformations, not the optimization procedure for lambda. The scipy brent optimizer accepts a bracket (here (-2, 2)), but as you can read in the docs, that doesn't mean the result will be in that interval.
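To illustrate with a toy objective (not the actual likelihood used in the transformer): `brack` only seeds the bracketing search, so the minimizer can land well outside (-2, 2).

```python
from scipy.optimize import brent

# Toy objective with its minimum at x = 10, far outside the bracket.
f = lambda x: (x - 10) ** 2
print(brent(f, brack=(-2, 2)))  # ~10.0, outside (-2, 2)
```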
Ok, got it. If x_trans = 2 and lambda = -0.9900990099009901, then performing the inverse transform you would be raising a negative number to a negative power, which would be complex.
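A quick check of that arithmetic, using the inverse formula for the x >= 0, lambda != 0 branch, x = (x_trans * lambda + 1)^(1/lambda) - 1:

```python
import numpy as np

x_trans, lmbda = 2, -0.9900990099009901
base = x_trans * lmbda + 1           # -0.9802..., negative

print(np.power(base, 1 / lmbda))     # nan: negative base, non-integer power
print(complex(base) ** (1 / lmbda))  # a complex root
```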
Also, you could use `fminbound` to perform the optimization and respect the constraints of [0, 2].
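For comparison with the `brent` sketch above, `fminbound` does enforce its bounds (again a toy objective, not the transformer's likelihood):

```python
from scipy.optimize import fminbound

f = lambda x: (x - 10) ** 2
print(fminbound(f, 0, 2))  # ~2.0: stays within the feasible interval
```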
Would you have a reproducing example that would produce such an x_trans and lambda?
Logit-transforming this data, then fitting the Yeo-Johnson transform, produces a lambda value of -1.42272153. Trying to inverse_transform anything greater than -1 / -1.42272153 = 0.7028782364740063 will produce a nan, since the inverse operation will attempt to take the root of a negative value.
Do you have an actual code example? I cannot reproduce your results.

```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import PowerTransformer

df = pd.read_csv('success_rate.txt')
X = np.array(df.success_rate).reshape(-1, 1)

t = PowerTransformer(method='yeo-johnson')
t.fit(X)
print(t.lambdas_)

X_trans = t.transform(X)
print((X_trans > (-1 / t.lambdas_)).sum())
print(np.iscomplex(t.inverse_transform(X_trans)).sum())
```
Also, logit-transforming does not give you the expected lambda.
The reason I'm asking for actual code is that you might just be trying to inverse-transform values that aren't in the range of the transformation to begin with. In which case it's just normal that you end up with complex / bad values.
I can send code, but you are right that with the fitted lambda I am taking the inverse of values outside the range. However, with the above constraints on lambda the range would be the real line and I wouldn't run into that issue. Wouldn't it make sense to include an option to restrict lambda to be between 0 and 2?
If you only get complex values when you try to inverse-transform values that shouldn't be inverse-transformed (that is, breaking the math, basically), then I don't think we should be handling this use-case. However, if it's possible to get complex values by inverse-transforming values that are in the range of possible values of the transform, then that would be a bug. That being said, a simple workaround on your side would be e.g. `t.lambdas_ = np.clip(t.lambdas_, 0, 2)`.
I don't think that workaround would work with scaling so simply, since the scaler would have to be re-fit IIRC. However, what I'm getting at is that if I want the range of my transformation to be the real line, then I would need 0 <= lambda <= 2, and I would want the range to be the real line so that I could take the inverse transformation of any number on the real line. If this is something that you don't want to be supported, then I guess that's fine; I was just wondering about the rationale.
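A sketch of that workaround without the re-fitting problem, assuming `standardize=False` so there is no internal scaler to keep in sync (an illustration of the idea, not an endorsed usage):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.RandomState(0)
X = rng.exponential(size=(100, 1))  # hypothetical data

# With standardize=False, transform/inverse_transform depend only on
# lambdas_, so clipping after fit() keeps them consistent (at the cost
# of a less optimal lambda).
t = PowerTransformer(method='yeo-johnson', standardize=False).fit(X)
t.lambdas_ = np.clip(t.lambdas_, 0, 2)

# With lambda in [0, 2], every real value has a real inverse image.
X_inv = t.inverse_transform(np.array([[5.0], [-5.0]]))
print(np.isnan(X_inv).sum())  # 0
```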
This reverts commit 2585410.
Reference Issues/PRs
What does this implement/fix? Explain your changes.
While implementing the Yeo-Johnson transformation in scipy (scipy/scipy#9305), I noticed a few mistakes I made in my first implementation here (#11520).
This PR:

- fixes the comparisons between `lambda` and 0/2 (I was mistaken in thinking that lambda was necessarily in [0, 2], which isn't the case)
- uses `log1p` and `spacing(1.)`, as advised during the scipy PR reviews

Any other comments?
There's a numerical instability issue in scipy's Box-Cox (scipy/scipy#6873) that is reproducible for Yeo-Johnson as well. I'm keeping an eye on it and will port the fix once it's merged in scipy.
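For concreteness, a sketch of what the fixed transform looks like with these changes applied (following the diff shown above; not necessarily the exact merged code):

```python
import numpy as np

def _yeo_johnson_transform(x, lmbda):
    """Apply the Yeo-Johnson transform to a 1d float array x for a given
    lambda, with the fixes described above: abs() in the lambda == 0 and
    lambda == 2 checks, and log1p for numerical stability."""
    out = np.zeros_like(x)
    pos = x >= 0

    # when x >= 0
    if abs(lmbda) < np.spacing(1.):
        out[pos] = np.log1p(x[pos])
    else:  # lmbda != 0
        out[pos] = ((x[pos] + 1) ** lmbda - 1) / lmbda

    # when x < 0
    if abs(lmbda - 2) > np.spacing(1.):
        out[~pos] = -((-x[~pos] + 1) ** (2 - lmbda) - 1) / (2 - lmbda)
    else:  # lmbda == 2
        out[~pos] = -np.log1p(-x[~pos])

    return out
```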