
Add scaling to SGDClassifier #5248


Closed

amueller opened this issue Sep 10, 2015 · 22 comments

Comments

@amueller (Member)

SGDClassifier only really works well with scaled data. I think we should add some scaling to it by default.
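
(For illustration — a minimal sketch of the problem, with the dataset and settings assumed here rather than taken from the thread. Unscaled features cripple SGDClassifier with its default settings, while the same estimator behind a StandardScaler works out of the box:)

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Features in this dataset span very different ranges.
X, y = load_breast_cancer(return_X_y=True)

raw = SGDClassifier(random_state=0)
scaled = make_pipeline(StandardScaler(), SGDClassifier(random_state=0))

print(cross_val_score(raw, X, y).mean())     # typically much lower
print(cross_val_score(scaled, X, y).mean())  # competitive out of the box
```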

@andylamb

Would scaling by default be unexpected behavior?

@andylamb

Maybe add a scaler param that can be a function or object?

@amueller (Member, Author)

It's hard to know what users expect. It changes the meaning of the regularization parameter, but not much else.
@agramfort do you remember where the discussion of normalize in the linear models is? I currently only see #2601.

Having the user provide functions or objects is a bit inconvenient. Usually we would accept strings or objects, so that the most common options are easily accessible.
Other linear models have a boolean "normalize", which is not great. We could have a boolean "standardize", which would be slightly better.

@andylamb

With strings or objects (vs. a boolean), users could provide different scalers, like MinMaxScaler.
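
(Purely a hypothetical API sketch of the idea being discussed — no such scaling parameter was ever added to SGDClassifier:)

```python
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import MinMaxScaler

# Hypothetical parameter, for illustration only:
clf = SGDClassifier(scaling="standardize")    # string shortcut
clf = SGDClassifier(scaling=MinMaxScaler())   # or any scaler instance
```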

@amueller (Member, Author)

yeah that is why it would be nicer ;)

@amueller (Member, Author)

Also, it's unclear what to do with sparse vs. dense input. Not sure how that is handled in the other linear models.
Maybe sparse input is just not scaled? We could use MaxAbsScaler as the default...
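
(A quick check of the MaxAbsScaler idea, with a random sparse matrix assumed for illustration — it scales each column without centering, so sparse input stays sparse:)

```python
import scipy.sparse as sp
from sklearn.preprocessing import MaxAbsScaler

X = sp.random(100, 20, density=0.1, format="csr", random_state=0)
Xt = MaxAbsScaler().fit_transform(X)  # divides each column by its max absolute value

assert sp.issparse(Xt)  # no centering, so sparsity is preserved
```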

@andylamb

MinMaxScaler should work for sparse input. Do other linear models include an option to scale? Didn't find any.

@andylamb

One more question is how this would work with partial_fit. I suppose it might make sense, depending on how large the initial dataset was.

@MechCoder (Member)

do you remember where the discussion of the normalize in the linear_models is?

Currently in scikit-learn's linear models, the input data is normalized only if both fit_intercept and normalize are set to True. Also, the normalization scale is computed from (X - X_mean) ** 2, not (X - X_mean) ** 2 / n (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/base.py#L89). @jnothman and I attempted to fix this in #3005, but there were test failures.
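
(A rough sketch of the behavior described above, paraphrasing rather than copying the linked base.py code — columns are divided by their L2 norms, not their standard deviations:)

```python
import numpy as np

def center_and_normalize(X):
    # roughly what fit_intercept=True + normalize=True did for dense input
    X = X - X.mean(axis=0)
    scale = np.sqrt((X ** 2).sum(axis=0))  # column L2 norms -- no division by n
    scale[scale == 0] = 1.0
    return X / scale
```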

Not sure how that is done in the other linear models?

As for sparse input, it is currently just normalized and not centered.

Btw, scaling the data down by a fraction would be equivalent to scaling the reg parameter up by the same factor, right? (assuming just scaling and no centering)
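
(A quick numerical check of this, sketched with Ridge and fit_intercept=False for simplicity — for the squared L2 penalty the scaling factor enters squared, while for an L1 penalty it enters linearly:)

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X, y = rng.randn(50, 5), rng.randn(50)
f = 0.1  # scale the data down by this fraction

a = Ridge(alpha=1.0, fit_intercept=False).fit(X, y)
b = Ridge(alpha=1.0 * f ** 2, fit_intercept=False).fit(f * X, y)

# Identical predictions on correspondingly scaled inputs:
assert np.allclose(a.predict(X), b.predict(f * X))
```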

@MechCoder (Member)

Do other linear models include an option to scale?

Yes, via the fit_intercept and normalize options.

@andylamb

I think a scaler parameter is a little more transparent than fit_intercept and normalize. Turning it on by default for SGDClassifier breaks tests, though, so maybe don't scale by default but offer the option. Other linear models don't scale by default, correct?

@MechCoder (Member)

Sure. But is there any advantage over using a Pipeline with a StandardScaler that I'm overlooking?

@andylamb

No, seems like it would be mostly for convenience.

@MechCoder (Member)

Feel free to submit a pull request. Or if you are too busy, I can submit one and we can carry on the discussion from there.

@andylamb

Created a draft PR. Will work on test cases.

@larsmans (Member)

I would call scaling unexpected behavior. But I'm spoiled by TfidfVectorizer, which produces normalized data by default.
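
(Easy to verify — with the default norm='l2', every row of the TF-IDF matrix has unit length; the toy documents are assumed for illustration:)

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the dog barked", "cats and dogs"]
X = TfidfVectorizer().fit_transform(docs)  # norm='l2' is the default

row_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())
assert np.allclose(row_norms, 1.0)  # each document vector is unit length
```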

@amueller (Member, Author)

My issue is that SGDClassifier basically doesn't work without tuning when the data is not scaled properly.
We could also adapt the learning-rate scheme, but the current default doesn't make sense for unnormalized data.
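
(What "tuning" means in practice — a sketch of the kind of search needed when the data is not scaled; the grid values are arbitrary:)

```python
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "alpha": [1e-6, 1e-4, 1e-2],
    "learning_rate": ["constant", "invscaling"],
    "eta0": [1e-4, 1e-2, 1.0],  # initial learning rate
}
search = GridSearchCV(SGDClassifier(random_state=0), param_grid, cv=3)
# search.fit(X, y) would then find a workable learning-rate setting
```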

@agramfort (Member) commented Sep 24, 2015 via email

@amueller (Member, Author)

Well yeah we could do that too... probably should

@rth (Member) commented Feb 27, 2019

My issue is that SGDClassifier basically doesn't work without tuning when data is not scaled properly. [...] I think we should add some scaling to it by default.

Do we still want this? I think using a StandardScaler in a Pipeline is cleaner. The user can easily see that the model is not giving good results by looking at an evaluation metric, whereas it is easy to overlook an estimator parameter.

I would call scaling unexpected behavior. But I'm spoiled by TfidfVectorizer, which produces normalized data by default.

My experience as well, in the sense that a large number of users don't realize that TfidfVectorizer L2-normalizes its output.

There are also several drawbacks:

  • adding a scaling parameter that defaults to None (for backward compatibility) doesn't make much sense compared to using a pipeline;
  • if we instead run a deprecation cycle to change the default to scaler=StandardScaler, it is annoying for users who already have a Pipeline with a StandardScaler;
  • it would also make it harder to write code that works across multiple scikit-learn versions.

+1 for closing this (and the associated PR)

cc @amueller

@rth (Member) commented Jun 14, 2019

Do we still want this?

ping @amueller cf above comment #5248 (comment) and also #3020 (comment) (and the following discussion)

@amueller (Member, Author) commented Jul 14, 2019

@rth it's inconvenient but I'm ok to close it. You can't do partial fitting in a pipeline ;)
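
(For reference, the manual out-of-core alternative — both StandardScaler and SGDClassifier support partial_fit, so streaming scaling is possible outside a Pipeline; the synthetic mini-batches are assumed for illustration:)

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
scaler, clf = StandardScaler(), SGDClassifier()
classes = np.array([0, 1])

for _ in range(10):  # stream of mini-batches
    Xb = rng.randn(32, 5) * 100.0  # badly scaled features
    yb = rng.randint(0, 2, size=32)
    scaler.partial_fit(Xb)  # update running mean/variance estimates
    clf.partial_fit(scaler.transform(Xb), yb, classes=classes)
```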
