[WIP] Adding Implementation of SAG by dsullivan7 · Pull Request #3814 · scikit-learn/scikit-learn · GitHub


[WIP] Adding Implementation of SAG #3814


Closed
wants to merge 65 commits into from

Conversation

dsullivan7
Contributor

An implementation of SAG for #3805
TODO list:

  • Make correct 'auto' eta0 implementation (what should be the Lipschitz Constant for regressor vs. classifier)
  • Make test pass for checking infinity or NaN in weights
  • Add doc string for Classifier
  • remove dependency on BaseSGD?
  • enable coef init and intercept init
  • allow for warm starting
  • find difference between sparse and standard implementations
  • figure out way to duplicate c seeding for test
  • add support for class weighting
  • get doctests to pass
  • get all other tests to pass
  • make it work if there is only one class present
  • add pickling
  • add sparse decay

@dsullivan7
Contributor Author

This is going to fail all tests, but it is a proof of concept for a SAG implementation; I'd like to hear feedback on the correctness of the implementation. In the test_sgd.py test class, you'll find a naive implementation of SAG which can be used to compare against the scikit-learn implementation. The test can be run by executing:

nosetests sklearn/linear_model/tests/test_sgd.py:DenseSGDClassifierTestCase.test_sag_binary_computed_correctly

This will fail all tests because it hard-codes the computation of SAG, but this is a proof of concept
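For readers following along, a naive dense SAG update of the kind being compared against might look roughly like this (a minimal sketch with a squared-loss gradient and illustrative names, not the actual code in test_sgd.py):

```python
import numpy as np

def naive_sag(X, y, eta, alpha, n_iter=20, random_state=None):
    """Minimal dense SAG sketch with a squared-loss gradient (illustrative only)."""
    rng = np.random.RandomState(random_state)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    grad_memory = np.zeros((n_samples, n_features))  # last gradient seen for each sample
    grad_sum = np.zeros(n_features)                  # running sum of the stored gradients
    seen = np.zeros(n_samples, dtype=bool)
    for _ in range(n_iter * n_samples):
        i = rng.randint(n_samples)
        seen[i] = True
        # gradient of 0.5 * (x_i . w - y_i)^2 with respect to w
        grad_i = (np.dot(X[i], w) - y[i]) * X[i]
        # swap the stored gradient for sample i into the running sum
        grad_sum += grad_i - grad_memory[i]
        grad_memory[i] = grad_i
        # step along the average stored gradient plus the L2 penalty term
        w -= eta * (grad_sum / seen.sum() + alpha * w)
    return w
```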
@dsullivan7
Contributor Author

I just pushed code that should pass everything except the doctests. It allows users to toggle SAG via sag=True in the SGD constructor.
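For illustration, usage on this branch would look roughly like the following (a sketch; the sag flag exists only in this PR, and n_iter reflects the SGD API of the time):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=200, random_state=0)

# sag=True is the toggle described above; it exists only on this PR branch,
# not in released scikit-learn
clf = SGDClassifier(loss="log", penalty="l2", eta0=0.001,
                    learning_rate="constant", sag=True, n_iter=20)
clf.fit(X, y)
```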

@dsullivan7
Contributor Author

What I just committed is an external SAG class that randomly samples X. I've added an example that compares that implementation of SAG with the other linear classifiers in examples/linear_model/plot_sag_comparison.py. Note that it doesn't work for multi-class because that has yet to be implemented. I'll plot some more benchmarks and make the testing more robust. Also, I will revert the "sag" learning_rate that was added to SGDClassifier. Tagging @agramfort for feedback on the outline of the new class. Is it about what you had in mind?



class BaseSAGClassifier(BaseSGDClassifier):
    def __init__(self, loss="hinge", penalty='l2', alpha=0.0001, l1_ratio=0.15,
Member

SAG classifier should only support log loss and no l1_ratio (only pure L2 regularization).
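In other words, the constructor would pin the loss and drop the l1_ratio knob entirely; a rough sketch of what that restriction could look like (hypothetical signature, not code from this PR):

```python
class SAGClassifier(object):
    """Sketch of a SAG classifier restricted to log loss and pure L2 regularization."""
    def __init__(self, alpha=0.0001, eta0='auto', n_iter=5, random_state=None):
        # no loss or l1_ratio parameters: log loss and L2 are the only options
        self.alpha = alpha
        self.eta0 = eta0
        self.n_iter = n_iter
        self.random_state = random_state
```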

@agramfort
Member

I suspect the changes to the Cython code of SGD here should be reverted.

@dsullivan7
Contributor Author

SAG classifier should only support log loss and no l1_ratio (only pure L2 regularization).

@agramfort What loss should the SAGRegressor support? Should a SAGRegressor even be implemented?

@agramfort
Member

Use squared loss for regression

@dsullivan7
Contributor Author

I just added the regressor for SAG. I also added an example for the regressor in plot_sag_comparison.py. Obviously I'll need to add documentation and some of the checks that appear in stochastic_gradient_descent. If the implementation looks alright, though, I can start putting it into a sag_fast.pyx file and include sparse tricks to speed it up.

classifiers = [
    ("SGD", SGDClassifier(eta0=.001, learning_rate="constant", n_iter=20)),
    # ("ASGD", SGDClassifier(average=True)),
    ("SAG", SAGClassifier(eta0=.001, n_iter=20)),
Member

I would just compare SAG with SGD

can you share a plot?

@dsullivan7
Contributor Author

Here is the plot comparing SGD with SAG for both classification and regression. I used the digits dataset for classification and the Boston dataset for regression. I used a constant learning rate for both SGD and SAG, as I have yet to implement different learning rates for SAG. The score is mean squared error for regression and classification accuracy for classification. You can see that in both cases, and for all percentages of data omitted, SAG outperforms SGD.
Regression:
[screenshot: SGD vs. SAG regression benchmark]
Classification:
[screenshot: SGD vs. SAG classification benchmark]
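The classification side of this comparison boils down to something like the following (a rough sketch with illustrative settings; SAGClassifier exists only on this branch, and train_test_split lived in sklearn.cross_validation at the time):

```python
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in later releases
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import SAGClassifier  # only available on this PR branch

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=0)

# constant step size for both estimators, as described above
models = [
    ("SGD", SGDClassifier(eta0=.001, learning_rate="constant", n_iter=20)),
    ("SAG", SAGClassifier(eta0=.001, n_iter=20)),
]
for name, clf in models:
    clf.fit(X_train, y_train)
    print(name, clf.score(X_test, y_test))  # classification accuracy
```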

@dsullivan7
Contributor Author

Here are graphs of the loss and score of SAG compared to all solvers of LogisticRegression, including "newton-cg", "liblinear", and "lbfgs". You can see that SAG quickly converges to the optimum. The LogisticRegression solvers all converge at roughly the same speed for this dataset. The code I used to create these graphs is available here: https://github.com/dsullivan7/sag_test.

score:
[screenshot: score over time for SAG and the LogisticRegression solvers]

loss:
[screenshot: loss over time for SAG and the LogisticRegression solvers]

@agramfort
Member

Beautiful!

@amueller did we build a proper case?

@MechCoder
Member

Awesome!

@MechCoder
Member

With respect to LogisticRegression, I think there might be a slight improvement if fit_intercept is set to False, but that is beside the point.

@ogrisel
Member
ogrisel commented Dec 30, 2014

I think we should rename this class to SAGLinearClassifier while it's not too late, to avoid making the same mistake as with SGDClassifier / SGDRegressor.

@ogrisel
Member
ogrisel commented Dec 30, 2014

Could you please also benchmark on other datasets, e.g. covertype and 20 newsgroups (with bag of words features) and also compare to SGDClassifier (with the same loss / penalty)?

@amueller
Member

I think this looks promising. I would really love to see a comparison against the new version from this year's NIPS, SAGA (I think that was the one). It should be a one-line modification and looked pretty good.
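For context, the SAGA modification changes only the direction of the update relative to SAG; a sketch of the two rules (not code from this PR, and ignoring the intercept and sparse bookkeeping) is:

```python
def sag_like_step(w, grad_i, memory_i, grad_sum, eta, n_samples, saga=False):
    """One SAG or SAGA step for sample i; grad_i is the fresh gradient at w,
    memory_i the stored gradient for i, grad_sum the sum of stored gradients."""
    if saga:
        # SAGA: fresh-minus-stored difference plus the average stored gradient
        direction = grad_i - memory_i + grad_sum / n_samples
    else:
        # SAG: average of the stored gradients with the fresh one swapped in
        direction = (grad_sum - memory_i + grad_i) / n_samples
    w = w - eta * direction
    # both methods then update the gradient memory for sample i
    grad_sum = grad_sum + grad_i - memory_i
    return w, grad_i, grad_sum  # grad_i becomes the new memory_i
```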

@amueller
Member

One fundamental question: do we want separate classes for separate optimization schemes? From a user perspective, I feel that is not great. In many cases we have an algorithm parameter or similar.
If there are many tuning knobs per algorithm, it might make sense to separate them, like we did for the liblinear-based vs. the SGD-based estimators. From a usability point of view this seems questionable.

Also, the naming scheme is not obvious: LogisticRegression is liblinear (not really logistic regression) or L-BFGS. Why SAG (or SGD) is not included is not super clear.

Not sure this is a can of worms we want to open now, but SAGClassifier is better than liblinear in a variety of settings; why not make it more discoverable by making it an option in LogisticRegression?

@mblondel
Member
mblondel commented Jan 3, 2015

The current solution is nice if we want to support many different losses. However, I agree with @amueller that typical scikit-learn users just want to fit a logistic regression and don't care how. We should have an "auto" mode which chooses the best solver for them. The only problem is that some constructor parameters only make sense for certain solvers.

@mblondel
Member
mblondel commented Jan 3, 2015

liblinear might be a bit slow on dense data due to the conversion to sparse format. lightning's LinearSVC supports dense format natively and might be a bit faster.

@dsullivan7
Contributor Author

@ogrisel I'll plot some benchmarks shortly using the covertype and 20 newsgroups datasets. I'll also get rid of these merge conflicts.

@amueller I was looking at SAGA, which seems like a pretty simple change, but getting it to work with the sparse implementation might be a little trickier, so I'll have to look into it further.

@ogrisel
Member
ogrisel commented Jan 3, 2015

The current solution is nice if we want to support many different losses. However, I agree with @amueller that typical scikit-learn users just want to fit a logistic regression and don't care how. We should have an "auto" mode which chooses the best solver for them. The only problem is that some constructor parameters only make sense for certain solvers.

+1. I also think we should add solver="sag" to sklearn.linear_model.LogisticRegression, sklearn.svm.LinearSVC and sklearn.svm.LinearSVR rather than introducing new classes.

I think it's fine to have some penalty and loss parameters that are only available for some solvers and raise ValueError with an explicit error message otherwise as long as the default values for loss and penalty work for all solvers (e.g. penalty='l2').
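From the user's side the proposal would read roughly like this (a sketch of the suggested API only; the "sag" option does not exist at this point in the PR):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)

# 'sag' would sit next to the existing 'newton-cg', 'lbfgs' and 'liblinear'
# solvers, restricted to penalty='l2' as discussed above
clf = LogisticRegression(solver="sag", penalty="l2", C=1.0)
clf.fit(X, y)
```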

@agramfort
Member

SAG also has a step size parameter. There is a default value, but you can gain a lot by changing it. So it adds a parameter to all solvers, unless we don't let people change it. So I am +0.314159 on adding it to logistic regression.
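For reference, the default step size for SAG is usually tied to the Lipschitz constant mentioned in the TODO list; a common heuristic looks roughly like this (a sketch of one possible 'auto' rule, not the code in this PR, and the exact constants may differ):

```python
import numpy as np

def sag_auto_step_size(X, alpha, loss="log", fit_intercept=True):
    """Step size ~ 1/L, where L upper-bounds the per-sample Lipschitz constant
    of the regularized loss (a sketch; constants are illustrative)."""
    max_squared_sum = np.max(np.sum(X ** 2, axis=1))
    if loss == "log":
        # the logistic loss gradient is (1/4)-Lipschitz in the margin
        L = 0.25 * (max_squared_sum + int(fit_intercept)) + alpha
    else:
        # squared loss
        L = max_squared_sum + int(fit_intercept) + alpha
    return 1.0 / L
```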

@mblondel
Member
mblondel commented Jan 4, 2015

But if we start tuning the step size, we need to take into account the cross-validation time in order for the comparison to be fair with other solvers. So I'd rather use the default step size. The journal version of the SAG paper describes a line search for automatically finding a possibly better step size. Could be worth exploring (IIRC, this is also what they used in their experiments).

@agramfort
Member

But if we start tuning the step size, we need to take into account the cross-validation time in order for the comparison to be fair with other solvers. So I'd rather use the default step size.

ok. Then we have no param.

The journal version of the SAG paper describes a line search for automatically finding a possibly better step size. Could be worth exploring (IIRC, this is also what they used in their experiments).

ok but I would say it can be done later.

Let's add a "sag" solver to LogisticRegression and Ridge.

@amueller
Member
amueller commented Jan 6, 2015

So should we change the default solver to auto? And what would auto do? Sorry, I didn't really look at this code: is the multinomial logistic regression the true one or OvR?
(I would prefer the default sklearn one to be the "true" one, but that is a backward-incompatible change :-/ )

@agramfort
Member

It's binary log reg. Setting "auto", why not, but I suspect it's hard to set correctly across datasets; knowing n_samples and n_features will not be enough.

@amueller
Member
amueller commented Apr 6, 2015

Any updates on this? I'm now more convinced that it would be useful ;)

@dsullivan7
Contributor Author

I'll see if I can do some work on this sometime this week. I think it's almost at the finish line.

@agramfort
Member
agramfort commented Apr 6, 2015 via email

@dsullivan7
Contributor Author

Alright, I fixed the merge conflicts. Looking at the previous comments, it looks like the tide was shifting towards including this as a parameter to LogisticRegression and Ridge instead of having it as its own classifier/solver? If that's the case, what I might do is port the fit methods of LR and RR to the existing code and remove SAG from the API. Does that sound good?

@agramfort
Member
agramfort commented Apr 8, 2015 via email

@amueller
Member
amueller commented Apr 8, 2015

I'm +.5 for adding it as a solver if the options don't blow up (as @agramfort said).

@TomDLT
Member
TomDLT commented Sep 11, 2015

Can be closed

@agramfort agramfort closed this Sep 11, 2015