# FIX LogisticRegression's handling of the `tol` parameter with `solver="lbfgs"` #27191
## Conversation
In particular with this PR, we get the expected behavior for the reproducer of #18074 (adapted below to explore even lower `tol` values):

```python
In [1]: %%time
   ...: import numpy as np
   ...: from sklearn.linear_model import LogisticRegression
   ...:
   ...: n_features = 1000
   ...: n_examples = 1500
   ...:
   ...: np.random.seed(0)
   ...: x = np.random.random((n_examples, n_features))
   ...: y = np.random.randint(2, size=n_examples)
   ...: max_iter = 10000
   ...: solver = 'lbfgs'
   ...:
   ...: for tol in [1e-2, 1e-4, 1e-6, 1e-8, 1e-10, 1e-12, 1e-14, 0]:
   ...:     np.random.seed(0)
   ...:     lr1 = LogisticRegression(solver=solver, max_iter=max_iter, tol=tol).fit(x, y)
   ...:
   ...:     np.random.seed(0)
   ...:     lr2 = LogisticRegression(solver=solver, max_iter=max_iter, tol=tol).fit(x[::-1], y[::-1])
   ...:
   ...:     print(f'tol={tol}')
   ...:     print(f'  Optimizer iterations, forward order: {lr1.n_iter_[0]}, reverse order: {lr2.n_iter_[0]}.')
   ...:     print(f'  Mean absolute diff in coefficients: {np.abs(lr1.coef_ - lr2.coef_).mean()}')
tol=0.01
  Optimizer iterations, forward order: 155, reverse order: 179.
  Mean absolute diff in coefficients: 0.001510650487731272
tol=0.0001
  Optimizer iterations, forward order: 392, reverse order: 429.
  Mean absolute diff in coefficients: 0.0002033278006014194
tol=1e-06
  Optimizer iterations, forward order: 495, reverse order: 484.
  Mean absolute diff in coefficients: 4.46362869259696e-05
tol=1e-08
  Optimizer iterations, forward order: 614, reverse order: 544.
  Mean absolute diff in coefficients: 3.279931927560337e-06
tol=1e-10
  Optimizer iterations, forward order: 687, reverse order: 555.
  Mean absolute diff in coefficients: 1.1604508054265435e-06
tol=1e-12
  Optimizer iterations, forward order: 687, reverse order: 593.
  Mean absolute diff in coefficients: 7.891610079340727e-07
tol=1e-14
  Optimizer iterations, forward order: 687, reverse order: 593.
  Mean absolute diff in coefficients: 7.891610079340727e-07
tol=0
  Optimizer iterations, forward order: 687, reverse order: 593.
  Mean absolute diff in coefficients: 7.891610079340727e-07
CPU times: user 14 s, sys: 1.15 s, total: 15.2 s
Wall time: 6.14 s
```

Note that the number of iterations keeps increasing when decreasing `tol` (until it plateaus for `tol <= 1e-12`). I still find the difference in coef space to be surprisingly large, but at least it's closer to the behavior of the other solvers.

EDIT: I updated this comment to reflect the new strategy implemented in b6b52a5 but the results are very similar.
Here are the convergence plots for b6b52a5 for the script of #24752:
Note that when comparing against lbfgs2 (…). Also note that the `lbfgs` curve is shifted to the right of the `lbfgs2` curve in terms of wall-clock time, but they approximately overlap in terms of per-iteration convergence: maybe there is extra overhead in the ….

EDIT: also note that the …
In scikit-learn, `sklearn/linear_model/_glm/glm.py`, lines 265 to 281 at 8357243:

This is a low value but still larger than the minimum value recommended by the scipy documentation (…).
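For context, here is a minimal sketch of the kind of L-BFGS-B configuration used at that spot, paraphrased from memory rather than quoted from the permalink above (in scipy, `ftol` corresponds to the legacy `factr` via `ftol = factr * numpy.finfo(float).eps`, and the docs describe `factr = 10` as "extremely high accuracy"):

```python
import numpy as np
import scipy.optimize


def lbfgs_fit(func, coef_init, tol, max_iter):
    """Sketch of a GLM-style lbfgs call (paraphrase, not the exact source).

    `func` returns `(loss, gradient)`, hence `jac=True`.
    """
    opt_res = scipy.optimize.minimize(
        func,
        coef_init,
        method="L-BFGS-B",
        jac=True,
        options={
            "maxiter": max_iter,
            # The user-facing tol only drives the gradient-based criterion.
            "gtol": tol,
            # A hard-coded, very small ftol (~1.4e-14): low, but still above
            # the 10 * eps minimum recommended by the scipy documentation.
            "ftol": 64 * np.finfo(np.float64).eps,
        },
    )
    return opt_res.x, opt_res.nit
```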
We might want to get the two uses of the lbfgs solver in scikit-learn linear models better aligned. However: …

As a result, it might be a good idea to reuse the interpolation strategy implemented in this PR as part of …
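To make that concrete, here is a hypothetical sketch of what a shared `tol`-to-`(gtol, ftol)` mapping could look like; the function name and the exact clipping rule are illustrative only (the PR's actual strategy lives in b6b52a5):

```python
import numpy as np


def lbfgs_tolerances(tol, max_iter):
    # Hypothetical sketch, not the PR's exact rule: forward tol to gtol as
    # before, and let ftol follow tol downward instead of leaving it at
    # scipy's loose default (factr=1e7, i.e. ftol ~ 2.2e-9). Clip so that we
    # never ask L-BFGS-B for better than machine precision.
    eps = np.finfo(np.float64).eps
    return {"maxiter": max_iter, "gtol": tol, "ftol": max(tol, 2 * eps)}
```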
```diff
 )
-def test_logistic_regression_solvers_multiclass():
+@pytest.mark.parametrize("C", [1e-6, 1, 1e6])
+def test_logistic_regression_solvers_multinomial(C, global_random_seed):
```
Note: I decided to repurpose this test to focus on the multinomial loss instead of the OVR loss. The latter is just a bunch of binary classification problems: this is redundant with `test_logistic_regression_solvers_binary`.
I refactored … While doing so I found some problems that are not directly related to the purpose of this PR. I marked them with …
I pushed an additional non-regression test. It is not as comprehensive as the snippet used in #27191 (comment) originally proposed by @geraschenko in the bug report (#18074). The reason is that I am not sure what would be an expected upper bound for the absolute difference in coef between the 2 models with reordered features.
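For illustration, here is a hypothetical sketch of the shape such a row-order-invariance test could take; the test name, problem sizes, and the `1e-3` bound are placeholders rather than the PR's actual test (picking a justified bound is exactly the open question above):

```python
import numpy as np

from sklearn.linear_model import LogisticRegression


def test_lbfgs_coef_invariant_to_row_order():
    # Placeholder sketch: with a tight tol, fitting the same data in forward
    # and reverse row order should converge to (nearly) the same coefficients.
    rng = np.random.RandomState(0)
    X = rng.random_sample((300, 50))
    y = rng.randint(2, size=300)
    common = dict(solver="lbfgs", tol=1e-10, max_iter=10_000)
    lr1 = LogisticRegression(**common).fit(X, y)
    lr2 = LogisticRegression(**common).fit(X[::-1], y[::-1])
    # 1e-3 is an arbitrary placeholder, not a derived guarantee.
    assert np.abs(lr1.coef_ - lr2.coef_).mean() < 1e-3
```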
By analysing the differences with …

So I propose to keep this PR as draft until I can find the time to conduct a deeper study with several datasets to estimate the relative robustness of either init / optimizer settings.
@ogrisel Thanks for your analysis. While the plots and tests show the improving impact of this PR, I'm still not convinced. It shows that `ftol` terminated the solver instead of `gtol`, which is preferred. I still think correctly scaling the objective (and grad) is the cleaner solution.
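To illustrate what objective scaling buys (my own toy sketch, not code from this PR): with a sum-formulated log-loss the gradient norm grows with `n_samples`, so a fixed `gtol` means a different effective tolerance at every dataset size, while a mean-formulated loss makes `gtol` comparable across sizes:

```python
import numpy as np


def log_loss_grad(w, X, y, scale_by_n):
    # Gradient of the binary log-loss w.r.t. w, optionally mean-scaled.
    p = 1.0 / (1.0 + np.exp(-X @ w))
    grad = X.T @ (p - y)  # sum formulation: norm grows with n_samples
    return grad / X.shape[0] if scale_by_n else grad


rng = np.random.default_rng(0)
w = rng.normal(size=20)
for n in (100, 10_000):
    X = rng.normal(size=(n, 20))
    y = (rng.random(n) < 0.5).astype(float)
    # The sum-based norm grows roughly linearly with n; the mean-based one
    # stays of the same order, so one gtol threshold fits all dataset sizes.
    print(n,
          np.linalg.norm(log_loss_grad(w, X, y, scale_by_n=False)),
          np.linalg.norm(log_loss_grad(w, X, y, scale_by_n=True)))
```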
I agree, but I think the main ingredient to make the `gtol`-based stopping condition work is proper init of the intercept.
I don't follow. How are the starting point of the intercept and `gtol` connected?
I suspect the intercept-related component of the gradient can dominate …
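A toy sketch of that suspicion (my own illustration, not from the PR): with a zero-initialized intercept on an imbalanced problem, the intercept component of the log-loss gradient grows like `n_samples` while centered feature components only grow like `sqrt(n_samples)`, so a `gtol`-based check effectively waits on the intercept. Initializing the intercept at the log-odds of the class frequencies would zero out this component:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 5))              # centered features
y = (rng.random(n) < 0.9).astype(float)  # imbalanced target, ~90% positives

# With coef and intercept initialized at zero, sigmoid(0) = 0.5 everywhere:
p = np.full(n, 0.5)
grad_coef = X.T @ (p - y)       # feature components: each ~O(sqrt(n))
grad_intercept = np.sum(p - y)  # intercept component: ~0.4 * n

print(np.abs(grad_coef).max())  # on the order of tens
print(abs(grad_intercept))      # ~400, dominating the gradient norm
```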
I'd say #26721 solved the issue.
I believe this fixes #18074.

This is a draft fix to set `ftol` while preserving the default behavior.

This PR is still a draft, here are some TODOs:

- check whether the interpolation between `gtol` and `ftol` is a good strategy: in particular, does it cause problems for very low values of `tol`? It does not seem to be the case from the experiment in the first comment of this PR.
- `_GeneralizedLinearRegressor`: FIX LogisticRegression's handling of the `tol` parameter with `solver="lbfgs"` #27191 (comment)
- … `tol` parameter with `solver="lbfgs"` #27191 (comment)

Note: this PR does not investigate the potential problem of scaling of the penalization term (#24752) but is probably a prerequisite to be able to conduct proper benchmarking with varying `tol` values.