Add scaling to SGDClassifier #5248
Comments
Would scaling by default be unexpected behavior?
Maybe add a
It's hard to know what users expect. It changes the meaning of the regularization parameter, but not much else. For the user to provide functions or objects is a bit inconvenient. Usually we would do strings or objects, so that the most common ways would be easily accessible.
With strings or objects (vs. a boolean), users could provide different scalers, like
yeah, that is why it would be nicer ;)
also it's unclear what to do with sparse / dense input. Not sure how that is done in the other linear models?
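For reference, a sketch of how existing scikit-learn scalers already deal with sparse input (this is context, not part of the proposed change): centering would densify a sparse matrix, so the sparse-friendly options only scale.

```python
from scipy import sparse
from sklearn.preprocessing import StandardScaler, MaxAbsScaler

X = sparse.random(100, 10, density=0.1, format="csr", random_state=0)

# StandardScaler must skip centering on sparse input to preserve sparsity.
Xs = StandardScaler(with_mean=False).fit_transform(X)

# MaxAbsScaler scales each feature to [-1, 1] and never centers,
# so it is safe for sparse data by construction.
Xm = MaxAbsScaler().fit_transform(X)

assert sparse.issparse(Xs) and sparse.issparse(Xm)
```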
One more question is how this would work with
Currently in the linear models of scikit-learn, the input data is normalized only if both fit_intercept and normalize are set to True. Also, the normalize isn't
As for sparse input, it is just normalized now and not centered. Btw, scaling the data down by a fraction would be equivalent to scaling up the regularization parameter by the same amount, right? (assuming just normalizing and not centering)
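The equivalence can be checked numerically. A sketch with plain NumPy, using ridge regression (squared loss, L2 penalty, no intercept) since it has a closed form; note that for an L2 penalty the factor on the regularization is the square of the data scaling, not the same value. All variable names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
alpha, c = 0.5, 10.0

def ridge_pred(X, y, alpha):
    # Closed-form ridge solution: w = (X^T X + alpha I)^-1 X^T y
    w = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)
    return X @ w

p_scaled_data = ridge_pred(X / c, y, alpha)      # data scaled down by c
p_scaled_reg = ridge_pred(X, y, alpha * c ** 2)  # penalty scaled up by c^2

# Same predictions: scaling the data is absorbed into the penalty.
assert np.allclose(p_scaled_data, p_scaled_reg)
```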
Yes, they are provided in the
I think using a
Sure. But is there any other advantage I'm overlooking, beyond what can already be done with a Pipeline using StandardScaler?
No, seems like it would be mostly for convenience.
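For the record, the Pipeline alternative discussed above takes only a couple of lines (dataset and seeds here are just for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

# Scaling happens inside the pipeline, so it is applied consistently
# at fit and predict time (and inside cross-validation).
clf = make_pipeline(StandardScaler(), SGDClassifier(random_state=0))
clf.fit(X, y)
print(clf.score(X, y))
```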
Feel free to submit a pull request. Or if you are too busy, I can submit one and we can carry on the discussion from there.
Created a draft PR. Will work on test cases.
I would call scaling unexpected behavior. But I'm spoiled by
My issue is that SGDClassifier basically doesn't work without tuning when data is not scaled properly. |
adagrad? :)
Well yeah we could do that too... probably should
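To make the AdaGrad suggestion concrete: per-feature adaptive step sizes reduce (though don't eliminate) sensitivity to feature scaling. A minimal NumPy sketch of a single AdaGrad update, with illustrative names; this is not scikit-learn's implementation.

```python
import numpy as np

def adagrad_step(w, grad, accum, lr=0.1, eps=1e-8):
    # Accumulate squared gradients per coordinate...
    accum += grad ** 2
    # ...and divide the step by their root, so large-gradient (large-scale)
    # features automatically get smaller effective learning rates.
    w -= lr * grad / (np.sqrt(accum) + eps)
    return w, accum

w = np.zeros(3)
accum = np.zeros(3)
grad = np.array([1.0, 0.01, 100.0])  # wildly different per-feature scales
w, accum = adagrad_step(w, grad, accum)
# On the first step every coordinate moves by roughly lr,
# regardless of the raw gradient magnitude.
```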
Do we still want this? I think using a StandardScaler in a Pipeline is cleaner, and the user should easily be able to see that the model is not giving good results by looking at some evaluation metric; however, it's easy to overlook an estimator parameter.
My experience as well, in the sense that a large number of users don't realize that
There are also several drawbacks,
+1 for closing this (and the associated PR) cc @amueller
ping @amueller cf above comment #5248 (comment) and also #3020 (comment) (and the following discussion)
@rth it's inconvenient but I'm ok to close it. You can't do partial fitting in a pipeline ;) |
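The point about partial fitting is that Pipeline has no partial_fit method, so out-of-core training has to scale batches by hand. A sketch under that constraint, with random batches standing in for data streamed from disk:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
scaler = StandardScaler()
clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # partial_fit needs all classes up front

for _ in range(5):  # stand-in for batches streamed from disk
    Xb = rng.normal(size=(50, 4))
    yb = rng.integers(0, 2, size=50)
    # Incrementally update the scaler's statistics, then transform the batch;
    # there is no Pipeline equivalent of this two-step partial_fit.
    Xb = scaler.partial_fit(Xb).transform(Xb)
    clf.partial_fit(Xb, yb, classes=classes)
```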
SGDClassifier only really works well with scaled data. I think we should add some scaling to it by default.