Description
Different choices of solver for sklearn's LogisticRegression optimize different cost functions. This is highly confusing behavior, and it is of particular concern if you want to publish which cost function you are using. In particular:
sklearn.LR(solver=liblinear) minimizes: L + lam*Rb
sklearn.LR(solver=others) minimizes: L + lam*R
statsmodels.GLM(binomial) minimizes: L/n + lam*Rb
where:
lam = 1/C
L = logloss
n = training sample size
R = square of the L2 norm of the feature weights
Rb = square of the L2 norm of the feature weights and the intercept
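To make the notation concrete, here is a rough sketch (mine, not from the issue) that evaluates the three expressions above for a given coefficient vector. The function name `objectives` and all variable names are made up for illustration, and exact constant factors (for example a possible 1/2 on the penalty term) may differ from the actual library internals:

```python
import numpy as np

def objectives(w, b, X, y, C):
    """Evaluate the three penalized objectives described above.

    w: (n_features,) weight vector, b: scalar intercept, y: labels in {0, 1}.
    This only spells out the expressions as written in the issue; the real
    implementations may include additional constant factors.
    """
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))                 # predicted P(y=1)
    L = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))   # unnormalized logloss
    n = X.shape[0]
    lam = 1.0 / C
    R = np.dot(w, w)      # squared L2 norm of the feature weights
    Rb = R + b ** 2       # same, but including the intercept
    return {
        "liblinear (L + lam*Rb)": L + lam * Rb,
        "other solvers (L + lam*R)": L + lam * R,
        "statsmodels GLM (L/n + lam*Rb)": L / n + lam * Rb,
    }
```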
I was a little surprised to find that the logloss is not normalized by the training set size. I think this is uncommon, and it means the effective C changes with the amount of training data. Good thing, bad thing? I'm not sure, but it does seem unusual; more importantly, what is minimized should be explicit. A small demonstration of the effective-C point follows.
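As a quick illustration of that point (my own sketch, not from the issue; dataset and settings are arbitrary): duplicating the training set doubles the unnormalized logloss term but leaves lam*R unchanged, so the same C yields different coefficients:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=5, random_state=0)
X2, y2 = np.vstack([X, X]), np.concatenate([y, y])   # identical data, stacked twice

a = LogisticRegression(C=0.1, solver="lbfgs", max_iter=10000).fit(X, y)
b = LogisticRegression(C=0.1, solver="lbfgs", max_iter=10000).fit(X2, y2)
print(a.coef_.round(4))
print(b.coef_.round(4))   # differs: the data term doubled, the penalty term did not
```

If the loss were normalized by n, the two fits would be identical.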
PS. #10001 --- excellent idea! The default liblinear cost function is just plain confusing.
Steps/Code to Reproduce
There's an example showing the different weights here:
https://stackoverflow.com/questions/47338695/why-does-the-choice-of-solver-result-in-different-weight-in-sklearn-logisticreg
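Roughly along the lines of the linked example (the exact code there may differ), a minimal sketch showing the fitted weights diverging across solvers; the dataset and settings below are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

for solver in ["liblinear", "lbfgs", "newton-cg", "sag"]:
    clf = LogisticRegression(C=0.1, solver=solver, max_iter=10000,
                             random_state=0).fit(X, y)
    print(solver, clf.coef_.round(4), clf.intercept_.round(4))

# liblinear's solution differs from the others, consistent with its objective
# also penalizing the intercept.
```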
Versions
1.19.1
allComputableThings changed the title to "Document what cost function LogisticRegression minimizes for each solver" on Nov 17, 2017
PR welcome!
Yes, this is very confusing indeed. Thank you for your summary. I think having a section in the user guide dedicated to it would help. Not scaling the regularization by n_samples is basically for consistency with LinearSVC (which is also implemented in liblinear). But IIRC it's also the parametrization that "makes sense", i.e. the one that is least dependent on the number of samples. We did extensive experiments a couple of years ago, and from what I remember we found that for L2, not scaling by n_samples is the right way to go, while for L1, scaling by n_samples is. We have the same scaling behavior in Ridge, btw.
Thanks for pulling out the example. The thing that the docs don't mention is the difference in penalizing the intercept, right? Otherwise they seem pretty complete.