[WIP] Adding Implementation of SAG #3814
Conversation
This is going to fail all tests, but it is a proof of concept for a SAG implementation, because I'd like to hear feedback on the correctness of the implementation. In the …
This will fail all tests because it hard codes the computation of sag, but this is a proof of concept
I just pushed code that should pass except for the doctest. It allows users to toggle sag using a …
What I just committed is an external SAG class that randomly samples X. I've added an example that compares that implementation of SAG with the other linear classifiers.
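For context, here is a minimal NumPy sketch of a randomly-sampled SAG update of the kind described above; the function name, the plain-Python loop, and the defaults are illustrative, not code from this PR:

```python
import numpy as np

def sag_logistic(X, y, alpha=1e-4, eta=0.001, n_iter=20, rng=None):
    """Illustrative SAG for binary logistic regression with an L2 penalty.

    Expects y in {-1, +1}. Keeps one stored gradient per sample and steps
    the weights with the running average of the stored gradients.
    """
    rng = np.random.RandomState(0) if rng is None else rng
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    grad_memory = np.zeros(n_samples)   # last-seen d(loss)/d(margin) per sample
    grad_sum = np.zeros(n_features)     # sum of stored gradients in feature space
    seen = np.zeros(n_samples, dtype=bool)

    for _ in range(n_iter * n_samples):
        i = rng.randint(n_samples)      # sample an index uniformly at random
        seen[i] = True
        margin = y[i] * np.dot(X[i], w)
        new_grad = -y[i] / (1.0 + np.exp(margin))       # derivative of log(1 + e^-m)
        grad_sum += (new_grad - grad_memory[i]) * X[i]  # swap old contribution for new
        grad_memory[i] = new_grad
        w -= eta * (grad_sum / seen.sum() + alpha * w)
    return w
```

Averaging over the number of samples seen so far (rather than over n_samples) is one of the practical tweaks suggested in the SAG paper; the code in this PR may handle that detail differently.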
class BaseSAGClassifier(BaseSGDClassifier):
    def __init__(self, loss="hinge", penalty='l2', alpha=0.0001, l1_ratio=0.15,
SAG classifier should only support log loss and no l1_ratio (only pure L2 regularization).
I suspect the changes to the SGD Cython code here should be reverted.
@agramfort What loss should the SAGRegressor support? Should a SAGRegressor even be implemented?
Use squared loss for regression.
I just added the regressor for sag. I also added an example for the regressor in plot_sag_comparison.py. Obviously I'll need to add documentation and some of the checks that appear in stochastic_gradient_descent. If the implementation looks alright, though, I can start putting the implementation into a sag_fast.pyx file and include sparse tricks to speed it up.
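For the regressor, the only piece that changes is the per-sample gradient: with squared loss it is the residual times the feature vector. A hedged sketch of a single step, using the same storage scheme as the sketch earlier in the thread (the helper name and signature are made up for illustration):

```python
import numpy as np

def sag_squared_loss_step(w, X, y, i, grad_memory, grad_sum, n_seen,
                          eta=0.001, alpha=1e-4):
    """One illustrative SAG step with squared loss (not code from this PR)."""
    residual = np.dot(X[i], w) - y[i]               # d(0.5 * (x.w - y)^2) / d(x.w)
    grad_sum += (residual - grad_memory[i]) * X[i]  # refresh sample i's contribution
    grad_memory[i] = residual
    w -= eta * (grad_sum / n_seen + alpha * w)      # averaged gradient + L2 term
    return w
```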
classifiers = [
    ("SGD", SGDClassifier(eta0=.001, learning_rate="constant", n_iter=20)),
    # ("ASGD", SGDClassifier(average=True)),
    ("SAG", SAGClassifier(eta0=.001, n_iter=20)),
I would just compare SAG with SGD
can you share a plot?
…sparse trick still needed
…sparse implementation is used
Here are graphs of the loss and score of SAG compared to all solvers of LogisticRegression, including …
Beautiful! @amueller did we build a proper case?
Awesome!
With respect to LogisticRegression, I think there might be a slight improvement if …
I think we should rename this class to …
Could you please also benchmark on other datasets, e.g. covertype and 20 newsgroups (with bag of words features), and also compare to SGDClassifier (with the same loss / penalty)?
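A hedged sketch of how those benchmarks could be wired up; SAGClassifier is assumed to come from this branch (and to accept sparse input for the bag-of-words data), and the SGD settings are just a starting point for matching loss and penalty:

```python
from time import time
from sklearn.datasets import fetch_covtype, fetch_20newsgroups_vectorized
from sklearn.linear_model import SGDClassifier
# from this branch; the exact import path is an assumption
from sklearn.linear_model import SAGClassifier

datasets = [
    ("covertype", fetch_covtype(shuffle=True, random_state=0)),
    ("20 newsgroups", fetch_20newsgroups_vectorized(subset="train")),
]

for data_name, bunch in datasets:
    X, y = bunch.data, bunch.target
    for clf_name, clf in [("SGD", SGDClassifier(loss="log", penalty="l2", n_iter=5)),
                          ("SAG", SAGClassifier(n_iter=5))]:
        tic = time()
        clf.fit(X, y)
        print("%-13s %-4s %6.1fs  train accuracy %.3f"
              % (data_name, clf_name, time() - tic, clf.score(X, y)))
```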
I think this looks promising. I would really love to see a comparison against the new version from this NIPS, SAGA (I think that was the one). It should be a one-line modification and looked pretty good.
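For reference, the SAGA variant differs from SAG essentially in how the freshly computed gradient enters the step: SAGA applies the correction at full weight and adds the average of the old gradient table, which makes the update unbiased. A hedged sketch of the two updates side by side, reusing the storage names from the earlier sketch (nothing here is code from this PR):

```python
import numpy as np

def sag_or_saga_step(w, X, y, i, grad_memory, grad_sum, n_seen,
                     eta=0.001, alpha=1e-4, saga=False):
    """One illustrative log-loss step; saga=True switches to the SAGA update."""
    new_grad = -y[i] / (1.0 + np.exp(y[i] * np.dot(X[i], w)))
    correction = (new_grad - grad_memory[i]) * X[i]
    if saga:
        # SAGA: full-weight correction plus the *old* table average (unbiased)
        w -= eta * (correction + grad_sum / n_seen + alpha * w)
    else:
        # SAG: average of the table after refreshing entry i (biased, lower variance)
        w -= eta * ((grad_sum + correction) / n_seen + alpha * w)
    grad_sum += correction
    grad_memory[i] = new_grad
    return w
```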
One fundamental question: do we want separate classes for separate optimization schemes? From a user perspective, I feel that is not great. In many cases we have an … Also, the naming scheme is not obvious: the … Not sure this is a can of worms we want to open now, but SAGClassifier is better than LibLinear in a variety of settings, so why not make it more discoverable by being an option in …
The current solution is nice if we want to support many different losses. However, I agree with @amueller that typical scikit-learn users just want to fit a logistic regression and don't care how. We should have an "auto" mode which chooses the best solver for them. The only problem is that some constructor parameters only make sense for certain solvers.
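In terms of user-facing API, what is being proposed would presumably look something like this (hypothetical at this point in the discussion; the parameter values are assumptions):

```python
from sklearn.linear_model import LogisticRegression

# "sag" as one more value of the existing solver parameter, alongside
# "liblinear", "lbfgs" and "newton-cg"; "auto" would pick one of them
clf = LogisticRegression(solver="sag", C=1.0, max_iter=20)
```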
liblinear might be a bit slow on dense data due to the conversion to sparse format. lightning's …
@ogrisel I'll plot some benchmarks shortly using covertype and 20 newsgroups. I'll also get rid of these merge conflicts. @amueller I was looking at SAGA, which seems like a pretty simple implementation, but getting it to work with the sparse implementation might be a little trickier, so I'll have to look into it further.
+1. I also think we should rather add … I think it's fine to have some …
SAG also has a step size parameter. There is a default value, but you can …
But if we start tuning the step size, we need to take into account the cross-validation time in order for the comparison to be fair with other solvers. So I'd rather use the default step size. The journal version of the SAG paper describes a line search for automatically finding a possibly better step size. Could be worth exploring (IIRC, this is also what they used in their experiments).
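For the default step size, the SAG paper's recommendation is a step of 1/L, where L is a Lipschitz constant of the individual gradients; for L2-regularized log loss this can be bounded from the largest squared row norm. A hedged sketch of how such an "auto" eta0 could be computed for dense input (the constants scikit-learn finally adopts may differ):

```python
import numpy as np

def auto_step_size(X, alpha, loss="log", fit_intercept=True):
    """Illustrative 1/L step size following the SAG paper's bounds."""
    max_squared_sum = np.max(np.sum(X ** 2, axis=1))   # largest squared row norm
    if loss == "log":
        # per-sample log loss curvature is bounded by 1/4
        lipschitz = 0.25 * (max_squared_sum + int(fit_intercept)) + alpha
    else:
        # squared loss
        lipschitz = max_squared_sum + int(fit_intercept) + alpha
    return 1.0 / lipschitz
```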
ok. Then we have no param.
ok, but I would say it can be done later. Let's add a "sag" solver to LogisticRegression and Ridge.
So should we change the default solver to auto? And what would auto do? Sorry, I didn't really look at this code; is the multinomial logistic regression the true one or ovr?
It's binary log reg. Setting "auto", why not, but I suspect it's hard to set …
Any updates on this? I'm now more convinced that it would be useful ;)
I'll see if I can do some work on this sometime this week. I think it's almost at the finish line.
I have no manpower to finish this currently, but it would be great if Danny has some time.
Alright, I fixed the merge commits. Looking at the previous comments, it looks like the tide was shifting towards including this as a parameter to LogisticRegression and RidgeRegression instead of having it as its own classifier/solver. If that's the case, what I might do is port the fit methods of LR and RR to the existing code and remove SAG from the API. Does that sound good?
If that does not add parameters that are irrelevant for other solvers, fine with me. Or at least no objection...
I'm +.5 for adding it as a solver if the options don't blow up (as @agramfort said).
Can be closed.
An implementation of SAG for #3805
TODO list:
- Make correct 'auto' eta0 implementation (what should be the Lipschitz constant for regressor vs. classifier)
- Make test pass for checking infinity or NaN in weights
- Add doc string for Classifier
- remove dependency on BaseSGD?
- enable coef init and intercept init
- allow for warm starting
- find difference between sparse and standard implementations
- figure out way to duplicate c seeding for test
- add support for class weighting
- get doctests to pass
- get all other tests to pass
- make it work if there is only one class present
- add pickling
- add sparse decay