
Add scaling to SGDClassifier #5248


Closed

amueller opened this issue Sep 10, 2015 · 22 comments

Comments

@amueller (Member)

SGDClassifier only really works well with scaled data. I think we should add some scaling to it by default.
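
(For illustration — a minimal sketch of the problem, with the dataset and settings assumed here rather than taken from the thread. Unscaled features cripple SGDClassifier with its default settings, while the same estimator behind a StandardScaler works out of the box:)

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Features in this dataset span very different ranges.
X, y = load_breast_cancer(return_X_y=True)

raw = SGDClassifier(random_state=0)
scaled = make_pipeline(StandardScaler(), SGDClassifier(random_state=0))

print(cross_val_score(raw, X, y).mean())     # typically much lower
print(cross_val_score(scaled, X, y).mean())  # competitive out of the box
```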

@andylamb

Would scaling by default be unexpected behavior?

@andylamb

Maybe add a scaler param that can be a function or object?

@amueller (Member, Author)

It's hard to know what users expect. It changes the meaning of the regularization parameter, but not much else.
@agramfort do you remember where the discussion of normalize in the linear models is? I currently only see #2601.

Having the user provide functions or objects is a bit inconvenient. Usually we would accept strings or objects, so that the most common options are easily accessible.
Other linear models have a boolean "normalize", which is not great. We could have a boolean "standardize", which would be slightly better.

@andylamb

With strings or objects (vs. a boolean), users could provide different scalers, like MinMaxScaler.
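
(Purely a hypothetical API sketch of the idea being discussed — no such scaling parameter was ever added to SGDClassifier:)

```python
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import MinMaxScaler

# Hypothetical parameter, for illustration only:
clf = SGDClassifier(scaling="standardize")    # string shortcut
clf = SGDClassifier(scaling=MinMaxScaler())   # or any scaler instance
```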

@amueller (Member, Author)

yeah that is why it would be nicer ;)

@amueller (Member, Author)

Also, it's unclear what to do with sparse vs. dense input. Not sure how that is handled in the other linear models.
Maybe sparse input is just not scaled? We could use MaxAbsScaler as the default...
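
(A quick check of the MaxAbsScaler idea, with a random sparse matrix assumed for illustration — it scales each column without centering, so sparse input stays sparse:)

```python
import scipy.sparse as sp
from sklearn.preprocessing import MaxAbsScaler

X = sp.random(100, 20, density=0.1, format="csr", random_state=0)
Xt = MaxAbsScaler().fit_transform(X)  # divides each column by its max absolute value

assert sp.issparse(Xt)  # no centering, so sparsity is preserved
```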

@andylamb

MinMaxScaler should work for sparse input. Do other linear models include an option to scale? Didn't find any.

@andylamb

One more question is how this would work with partial_fit. I suppose it might make sense, depending on how large the initial dataset was.

@MechCoder (Member)

do you remember where the discussion of the normalize in the linear_models is?

Currently in scikit-learn's linear models, the input data is normalized only if both fit_intercept and normalize are set to True. Also, the normalization scale is computed from (X - X_mean) ** 2, not (X - X_mean) ** 2 / n (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/base.py#L89). @jnothman and I attempted to fix this in #3005, but there were test failures.
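
(A rough sketch of the behavior described above, paraphrasing rather than copying the linked base.py code — columns are divided by their L2 norms, not their standard deviations:)

```python
import numpy as np

def center_and_normalize(X):
    # roughly what fit_intercept=True + normalize=True did for dense input
    X = X - X.mean(axis=0)
    scale = np.sqrt((X ** 2).sum(axis=0))  # column L2 norms -- no division by n
    scale[scale == 0] = 1.0
    return X / scale
```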

Not sure how that is done in the other linear models?

As for sparse input, it is currently just normalized and not centered.

Btw, scaling the data down by a fraction would be equivalent to scaling the reg parameter up by the same factor, right? (assuming just scaling and no centering)
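
(A quick numerical check of this, sketched with Ridge and fit_intercept=False for simplicity — for the squared L2 penalty the scaling factor enters squared, while for an L1 penalty it enters linearly:)

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X, y = rng.randn(50, 5), rng.randn(50)
f = 0.1  # scale the data down by this fraction

a = Ridge(alpha=1.0, fit_intercept=False).fit(X, y)
b = Ridge(alpha=1.0 * f ** 2, fit_intercept=False).fit(f * X, y)

# Identical predictions on correspondingly scaled inputs:
assert np.allclose(a.predict(X), b.predict(f * X))
```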

@MechCoder (Member)

Do other linear models include an option to scale?

Yes, via the fit_intercept and normalize options.

@andylamb

I think a scaler parameter is a little more transparent than fit_intercept and normalize. Turning it on by default for SGDClassifier breaks tests, though, so maybe don't scale by default but offer the option. Other linear models don't scale by default, correct?

@MechCoder (Member)

Sure. But is there any advantage over using a Pipeline with a StandardScaler that I'm overlooking?

@andylamb

No, seems like it would be mostly for convenience.

@MechCoder (Member)

Feel free to submit a pull request. Or if you are too busy, I can submit one and we can carry on the discussion from there.

@andylamb

Created a draft PR. Will work on test cases.

@larsmans (Member)

I would call scaling unexpected behavior. But I'm spoiled by TfidfVectorizer, which produces normalized data by default.
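
(Easy to verify — with the default norm='l2', every row of the TF-IDF matrix has unit length; the toy documents are assumed for illustration:)

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the dog barked", "cats and dogs"]
X = TfidfVectorizer().fit_transform(docs)  # norm='l2' is the default

row_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())
assert np.allclose(row_norms, 1.0)  # each document vector is unit length
```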

@amueller (Member, Author)

My issue is that SGDClassifier basically doesn't work without tuning when the data is not scaled properly.
We could also adapt the learning-rate scheme, but the current default doesn't make sense for unnormalized data.
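
(What "tuning" means in practice — a sketch of the kind of search needed when the data is not scaled; the grid values are arbitrary:)

```python
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "alpha": [1e-6, 1e-4, 1e-2],
    "learning_rate": ["constant", "invscaling"],
    "eta0": [1e-4, 1e-2, 1.0],  # initial learning rate
}
search = GridSearchCV(SGDClassifier(random_state=0), param_grid, cv=3)
# search.fit(X, y) would then find a workable learning-rate setting
```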

@agramfort (Member) commented Sep 24, 2015 via email

@amueller (Member, Author)

Well yeah we could do that too... probably should

@rth (Member) commented Feb 27, 2019

My issue is that SGDClassifier basically doesn't work without tuning when data is not scaled properly. [...] I think we should add some scaling to it by default.

Do we still want this? I think using a StandardScaler in a Pipeline is cleaner. The user can easily see that the model is not giving good results by looking at an evaluation metric, whereas it is easy to overlook an estimator parameter.

I would call scaling unexpected behavior. But I'm spoiled by TfidfVectorizer, which produces normalized data by default.

My experience as well, in the sense that a large number of users don't realize that TfidfVectorizer L2-normalizes its output.

There are also several drawbacks:

  • adding a scaling parameter that defaults to None (for backward compatibility) doesn't make much sense compared to using a pipeline;
  • if we instead run a deprecation cycle to change the default to scaler=StandardScaler, it is annoying for users who already have a Pipeline with a StandardScaler;
  • it would also make it harder to write code that works across multiple scikit-learn versions.

+1 for closing this (and the associated PR)

cc @amueller

@rth (Member) commented Jun 14, 2019

Do we still want this?

ping @amueller cf above comment #5248 (comment) and also #3020 (comment) (and the following discussion)

@amueller (Member, Author) commented Jul 14, 2019

@rth it's inconvenient but I'm ok to close it. You can't do partial fitting in a pipeline ;)
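
(For reference, the manual out-of-core alternative — both StandardScaler and SGDClassifier support partial_fit, so streaming scaling is possible outside a Pipeline; the synthetic mini-batches are assumed for illustration:)

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
scaler, clf = StandardScaler(), SGDClassifier()
classes = np.array([0, 1])

for _ in range(10):  # stream of mini-batches
    Xb = rng.randn(32, 5) * 100.0  # badly scaled features
    yb = rng.randint(0, 2, size=32)
    scaler.partial_fit(Xb)  # update running mean/variance estimates
    clf.partial_fit(scaler.transform(Xb), yb, classes=classes)
```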
