10000 "normalize" parameter in sklearn.linear_model should be "standardize" · Issue #16445 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

"normalize" parameter in sklearn.linear_model should be "standardize" #16445


Closed
PythonRSAS opened this issue Feb 14, 2020 · 8 comments

PythonRSAS commented Feb 14, 2020

Describe the issue linked to the documentation

In different sklearn.linear_model classes such as Ridge and RidgeCV, the normalize parameter actually means standardize. This misnomer can cause a lot of unnecessary confusion.

What normalize means in general is to rescale a vector so that its norm is 1. This is clearly not what ridge regression, lasso, or other regularized linear models do.
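To illustrate the distinction, here is a minimal sketch using sklearn.preprocessing on a made-up toy matrix: Normalizer rescales each sample to unit norm (the usual meaning of "normalize"), while StandardScaler centers each feature and scales it to unit variance, which is what this issue argues the estimators actually do.

import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])  # toy data

X_unit_norm = Normalizer(norm="l2").fit_transform(X)  # "normalize": each row gets L2 norm 1
X_standard = StandardScaler().fit_transform(X)        # "standardize": each column gets mean 0, std 1

print(np.linalg.norm(X_unit_norm, axis=1))               # -> [1. 1. 1.]
print(X_standard.mean(axis=0), X_standard.std(axis=0))   # -> [0. 0.] [1. 1.]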

Suggest a potential alternative/fix

Rename the parameter to standardize instead.

Please see the discussion here:
https://stackoverflow.com/questions/60216879/what-does-sklearn-linear-model-ridgecv-normalize-parameter-exactly-do/60233425#60233425

from sklearn.datasets import load_boston
from sklearn.linear_model import RidgeCV
import pandas as pd

dataset = load_boston()
X = dataset.data
y = dataset.target

clf = RidgeCV(normalize=True, alphas=[1e-3, 1e-2, 1e-1, 1]).fit(X, y)
print(clf.alpha_)
print(clf.score(X, y))
print(clf.coef_)
coef = pd.DataFrame(zip(dataset.feature_names, clf.coef_))  # match SAS
zhengruifeng (Contributor) commented

Agreed, the current normalize is misleading.

jnothman (Member) commented Feb 16, 2020 via email

rth (Member) commented Feb 16, 2020

Another solution is to deprecate and remove this parameter in favor of using a StandardScaler in a Pipeline (see #3020 (comment)), since it may also not be applied consistently at the moment.
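A minimal sketch of that alternative, reusing load_boston and RidgeCV from the snippet above (it is not guaranteed to give coefficients numerically identical to normalize=True):

from sklearn.datasets import load_boston
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_boston(return_X_y=True)

# Standardize inside a Pipeline instead of relying on normalize=True.
model = make_pipeline(StandardScaler(), RidgeCV(alphas=[1e-3, 1e-2, 1e-1, 1]))
model.fit(X, y)
print(model.named_steps["ridgecv"].coef_)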

glemaitre (Member) commented

@rth Not sure it is as straightforward. It might be less efficient (if a user does not know some internals), because there is some in-place operation in the current behaviour, and in the case of sparse input we don't remove the mean. When making a pipeline, one will have to set with_mean=False and copy=False to get similar behaviour.
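For reference, a rough sketch of those settings with sparse input; the Ridge estimator, alpha value, and random data here are just placeholders for the example:

import numpy as np
from scipy import sparse
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_sparse = sparse.random(100, 20, density=0.1, format="csr", random_state=0)
y = X_sparse @ np.ones(20)  # arbitrary target just for the example

# with_mean=False: sparse data cannot be centered without densifying it.
# copy=False: scale in place, avoiding an extra copy as in the current behaviour.
model = make_pipeline(StandardScaler(with_mean=False, copy=False), Ridge(alpha=1.0))
model.fit(X_sparse, y)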

I also find the documentation misleading when it introduces the StandardScaler.

Would we consider this normalize parameter a kind of preconditioner? I am just thinking about #15583, where we intend to add a parameter to do exactly this kind of standardization.

And a final question: do we have any linear model (regressor or classifier) which would not benefit from standardizing the data? If all models in the general use case would benefit from such preprocessing, and some models already have it, we could introduce it inside the base class so it is shared across all of them. I think there are cases where we should not scale (as with MNIST and logistic regression), so we should keep the option to turn it off.
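Purely as an illustration of that last idea (hypothetical; this is not an actual scikit-learn class or API, and the name standardize is just the rename proposed in this issue), a shared helper in a base class might look roughly like this:

import numpy as np

class StandardizingLinearModelBase:
    """Hypothetical base-class sketch: optional per-feature standardization."""

    def __init__(self, standardize=False):
        self.standardize = standardize  # hypothetical parameter name

    def _preprocess(self, X):
        # Center and scale each feature if requested; subclasses would call this
        # before running their solver and rescale coefficients afterwards.
        if not self.standardize:
            return X
        self.X_mean_ = X.mean(axis=0)
        self.X_scale_ = X.std(axis=0)
        self.X_scale_[self.X_scale_ == 0] = 1.0  # avoid dividing by zero
        return (X - self.X_mean_) / self.X_scale_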

glemaitre (Member) commented

Oh, I forgot to ping @agramfort.
Since this could be related to the preconditioning in LinearRegression, ping @amueller @ogrisel.

rth (Member) commented Feb 16, 2020

> Not sure it is as straightforward. It might be less efficient (if a user does not know some internals), because there is some in-place operation in the current behaviour, and in the case of sparse input we don't remove the mean. When making a pipeline, one will have to set with_mean=False and copy=False to get similar behaviour.

I think in most cases making one extra copy of the data will not matter much in computation time, as it is negligible compared to the optimizer run time. For the rare case where it is a problem (e.g. due to memory constraints), the user can specify copy=False in StandardScaler. We are making unnecessary copies as it is (#13988). Generally, using a separate scaler makes it more natural to switch to another scaler (e.g. one robust to outliers).

For the sparse input, yes, I also find the current defaults annoying as they are unusable with sparse data. Maybe we should make with_mean='auto' the default and only apply centering in the dense case? Granted, linear models are likely to be more frequently used on high-dimensional sparse input, but that issue also affects other use cases of StandardScaler.
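A rough sketch of what such an 'auto' behaviour could amount to (with_mean='auto' is not an existing StandardScaler option, and auto_scaler is just an illustrative helper name; this is only the dispatch a user could write manually today):

import scipy.sparse as sp
from sklearn.preprocessing import StandardScaler

def auto_scaler(X):
    # Center only when the data is dense; sparse input is scaled without centering.
    return StandardScaler(with_mean=not sp.issparse(X))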

> Would we consider this normalize parameter a kind of preconditioner? I am just thinking about #15583, where we intend to add a parameter to do exactly this kind of standardization.

For me, a preconditioner should be transparent to the user, as is the case in liblinear or as proposed in that PR if I remember correctly, i.e. it should not change the computed coefficients.

Generally, I outlined the motivation for this deprecation in #3020 (comment), the main one being that linear models are currently inconsistent. I'm not sure that adding this parameter to models that don't have it (linear or otherwise), or ensuring that it is applied consistently in combination with other parameters (e.g. fit_intercept), is worth the maintenance time.

agramfort (Member) commented

I suggested in the past renaming normalize to standardize.

I am also not worried about the memory copy that a pipeline with StandardScaler would do, as we make a copy in linear models in this case too. The problem with sparse input, where StandardScaler does not center sparse data, is for me the issue that is not easy to fix.

thomasjpfan (Member) commented

With the recent deprecation of normalize (#17772, #17743, etc.), I think we can close this issue now.
