
FIX LogisticRegression's handling of the tol parameter with solver="lbfgs" #27191


Closed

Conversation

ogrisel
Member
@ogrisel ogrisel commented Aug 28, 2023

I believe this fixes #18074.

This is a draft fix to set ftol while preserving the default behavior.

This PR is still a draft; a few TODOs remain.

Note: this PR does not investigate the potential problem of the scaling of the penalization term (#24752), but it is probably a prerequisite for conducting proper benchmarks with varying tol values.
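
For context, here is a minimal sketch of the kind of change this PR is about, assuming the fit ends up in a scipy.optimize.minimize call. The helper name and the exact formula are illustrative, not the PR's actual code:

import numpy as np
from scipy.optimize import minimize

def lbfgs_tolerances(tol):
    # Hypothetical helper (not the PR's exact code): forward the user-facing
    # tol to gtol, and derive ftol from tol as well, but never let it drop
    # below a small multiple of float64 machine precision so that tol=0
    # still terminates cleanly instead of relying on scipy's default
    # ftol (factr=1e7, i.e. ~2.22e-9), which can stop the solver too early.
    eps = np.finfo(np.float64).eps
    return {"gtol": tol, "ftol": max(tol, 64 * eps)}

# usage sketch, where func returns (loss, grad) and w0 is the start point:
# opt_res = minimize(func, w0, method="L-BFGS-B", jac=True,
#                    options={"maxiter": max_iter, **lbfgs_tolerances(tol)})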

@github-actions
github-actions bot commented Aug 28, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 01673c8.

@ogrisel
Member Author
ogrisel commented Aug 28, 2023

In particular with this PR, we get the expected behavior for the reproducer of #18074 (adapted below to explore even lower tol values):

In [1]: import numpy as np
   ...: from sklearn.linear_model import LogisticRegression
   ...: 
   ...: n_features = 1000
   ...: n_examples = 1500
   ...: 
   ...: np.random.seed(0)
   ...: x = np.random.random((n_examples, n_features))
   ...: y = np.random.randint(2, size=n_examples)
   ...: max_iter=10000
   ...: solver = 'lbfgs'
   ...: 
   ...: for tol in [1e-2, 1e-4, 1e-6, 1e-8, 1e-10, 1e-12, 1e-14, 0]:
   ...:   np.random.seed(0)
   ...:   lr1 = LogisticRegression(solver=solver, max_iter=max_iter, tol=tol).fit(x, y)
   ...: 
   ...:   np.random.seed(0)
   ...:   lr2 = LogisticRegression(solver=solver, max_iter=max_iter, tol=tol).fit(x[::-1], y[::-1])
   ...: 
   ...:   print(f'tol={tol}')
   ...:   print(f'  Optimizer iterations, forward order: {lr1.n_iter_[0]}, reverse order: {lr2.n_iter_[0]}.')
   ...:   print(f'  Mean absolute diff in coefficients: {np.abs(lr1.coef_ - lr2.coef_).mean()}')
   ...: 
tol=0.01
  Optimizer iterations, forward order: 155, reverse order: 179.
  Mean absolute diff in coefficients: 0.001510650487731272
tol=0.0001
  Optimizer iterations, forward order: 392, reverse order: 429.
  Mean absolute diff in coefficients: 0.0002033278006014194
tol=1e-06
  Optimizer iterations, forward order: 495, reverse order: 484.
  Mean absolute diff in coefficients: 4.46362869259696e-05
tol=1e-08
  Optimizer iterations, forward order: 614, reverse order: 544.
  Mean absolute diff in coefficients: 3.279931927560337e-06
tol=1e-10
  Optimizer iterations, forward order: 687, reverse order: 555.
  Mean absolute diff in coefficients: 1.1604508054265435e-06
tol=1e-12
  Optimizer iterations, forward order: 687, reverse order: 593.
  Mean absolute diff in coefficients: 7.891610079340727e-07
tol=1e-14
  Optimizer iterations, forward order: 687, reverse order: 593.
  Mean absolute diff in coefficients: 7.891610079340727e-07
tol=0
  Optimizer iterations, forward order: 687, reverse order: 593.
  Mean absolute diff in coefficients: 7.891610079340727e-07
CPU times: user 14 s, sys: 1.15 s, total: 15.2 s
Wall time: 6.14 s

Note that the number of iterations keeps increasing as tol decreases, until tol reaches a value a few orders of magnitude larger than machine precision.

I still find the difference in coef space to be surprisingly large, but at least it's closer to the behavior of the other solvers.

EDIT: I updated this comment to reflect the new strategy implemented in b6b52a5 but the results are very similar.

@ogrisel
Member Author
ogrisel commented Aug 29, 2023

Here are the convergence plots for b6b52a5 for the script of #24752:

  • per elapsed wall-clock time: (figure: lbfgs_wall_time)

  • per iteration: (figure: lbfgs_per_iter)

Note that when comparing against lbfgs2 (_GeneralizedLinearRegressor + HalfBinomialLoss), the gtol/ftol strategy implemented in this PR seems able to reach higher-precision models for the smallest values of tol (tol=1e-10): the suboptimality of the best lbfgs model is below 1e-11, which is not the case for _GeneralizedLinearRegressor + HalfBinomialLoss.

Also note that the lbfgs curve is shifted to the right of the lbfgs2 curve in terms of wall-clock time, but the two approximately overlap in terms of per-iteration convergence: maybe there is extra overhead in the LogisticRegression class compared to the _GeneralizedLinearRegressor counterpart. This would require confirmation with a profiler.

EDIT: also note that the newton-cholesky / lbfgs2 methods rely on (_GeneralizedLinearRegressor + HalfBinomialLoss) while lbfgs and newton-cg rely on the LogisticRegression class: as discussed in #24752, those two approaches do not use the same scale for l2 regularization. I would rather keep this as a separate issue outside the scope of this PR. On the above plot the penalty alpha is very low (1e-12) but still non-zero, which means that the lowest suboptimality values can be a bit misleading.

@ogrisel
Member Author
ogrisel commented Aug 29, 2023

In _GeneralizedLinearRegressor the lbfgs solver uses a fixed small value for ftol = 64 * np.finfo(np.float64).eps:

opt_res = scipy.optimize.minimize(
    func,
    coef,
    method="L-BFGS-B",
    jac=True,
    options={
        "maxiter": self.max_iter,
        "maxls": 50,  # default is 20
        "iprint": self.verbose - 1,
        "gtol": self.tol,
        # The constant 64 was found empirically to pass the test suite.
        # The point is that ftol is very small, but a bit larger than
        # machine precision for float64, which is the dtype used by lbfgs.
        "ftol": 64 * np.finfo(float).eps,
    },
    args=(X, y, sample_weight, l2_reg_strength, n_threads),
)

This is a low value but still larger than the minimum value recommended by the scipy documentation (ftol = 10 * np.finfo(np.float64).eps). Here are the links to the relevant parts of scipy's documentation:

https://github.com/scipy/scipy/blob/e7542f30e3a9bc37cae16fbd3fbd534e0dce20b4/scipy/optimize/_lbfgsb_py.py#L267-L271

The option ftol is exposed via the scipy.optimize.minimize interface,
but calling scipy.optimize.fmin_l_bfgs_b directly exposes factr. The
relationship between the two is ftol = factr * numpy.finfo(float).eps.
I.e., factr multiplies the default machine floating-point precision to
arrive at ftol.

https://github.com/scipy/scipy/blob/c3ec353c503d23c1357e78ebff750755b3def8c6/scipy/optimize/_lbfgsb_py.py#L82-L90

The iteration stops when (f^k - f^{k+1})/max{|f^k|,|f^{k+1}|,1} <= factr * eps,
where eps is the machine precision, which is automatically
generated by the code. Typical values for factr are: 1e12 for
low accuracy; 1e7 for moderate accuracy; 10.0 for extremely
high accuracy. See Notes for relationship to ftol, which is exposed
(instead of factr) by the scipy.optimize.minimize interface to
L-BFGS-B.
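
To make the ftol/factr relationship concrete, here is a small standalone check. It is not related to the PR code; the Rosenbrock problem is just a stand-in objective:

import numpy as np
from scipy.optimize import fmin_l_bfgs_b, minimize, rosen, rosen_der

eps = np.finfo(np.float64).eps
ftol = 64 * eps     # the value hard-coded in _GeneralizedLinearRegressor
factr = ftol / eps  # == 64.0, between "extremely high" (10) and "moderate" (1e7) accuracy

x0 = np.full(5, 1.5)

# minimize exposes ftol...
res = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B",
               options={"ftol": ftol, "gtol": 1e-12})
# ...while fmin_l_bfgs_b exposes factr; both drive the same Fortran routine.
x_opt, f_opt, info = fmin_l_bfgs_b(rosen, x0, fprime=rosen_der,
                                   factr=factr, pgtol=1e-12)

# Both runs apply the same stopping condition and should reach the same point.
print(np.allclose(res.x, x_opt), res.nit, info["nit"])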

@ogrisel
Copy link
Member Author
ogrisel commented Aug 29, 2023

We might want to bring the two uses of the lbfgs solver in scikit-learn linear models more in line with each other. However:

  • the strategy implemented in this PR seems to give users good control, while making it possible to reach higher-precision models when setting tol (very close to) 0.0;
  • we might not want to change the default behavior of LogisticRegression (with tol=1e-4), to avoid annoying users for whom the current code is fine.

As a result, it might be a good idea to reuse the interpolation strategy implemented in this PR as part of _GeneralizedLinearRegressor in a follow-up PR.

-def test_logistic_regression_solvers_multiclass():
+@pytest.mark.parametrize("C", [1e-6, 1, 1e6])
+def test_logistic_regression_solvers_multinomial(C, global_random_seed):
@ogrisel
Member Author

Note: I decided to repurpose this test to focus on the multinomial loss instead of the OVR loss. The latter is just a bunch of binary classification problems: this is redundant with test_logistic_regression_solvers_binary.

@ogrisel
Member Author
ogrisel commented Aug 29, 2023

I refactored test_logistic_regression_solvers_* tests to make them stronger by using lower tol values and stricter atol checks.

While doing so I found some problems that are not directly related to the purpose of this PR. I marked them with XXX comments for later investigation. I plan to open dedicated issues later.

@ogrisel
Member Author
ogrisel commented Aug 29, 2023

I pushed an additional non-regression test. It is not as comprehensive as the snippet used in #27191 (comment) originally proposed by @geraschenko in the bug report (#18074). The reason is that I am not sure what would be an expected upper bound for the absolute difference in coef between the 2 models with reordered features.
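
For reference, here is one possible shape for such a test. The dataset size, tol and the 1e-4 threshold below are illustrative guesses, not the values committed in this PR:

import numpy as np
from sklearn.linear_model import LogisticRegression

def test_lbfgs_converges_regardless_of_feature_order():
    # Non-regression sketch for gh-18074: with a tiny tol, lbfgs should run
    # to numerical convergence instead of stopping early on scipy's default
    # ftol, so fitting on the data in forward and reverse order should give
    # nearly identical coefficients.
    rng = np.random.RandomState(0)
    X = rng.random_sample((300, 50))
    y = rng.randint(2, size=300)

    params = dict(solver="lbfgs", max_iter=10_000, tol=1e-12)
    lr_fwd = LogisticRegression(**params).fit(X, y)
    lr_rev = LogisticRegression(**params).fit(X[::-1], y[::-1])

    assert lr_fwd.n_iter_[0] < params["max_iter"]
    assert lr_rev.n_iter_[0] < params["max_iter"]
    assert np.abs(lr_fwd.coef_ - lr_rev.coef_).mean() < 1e-4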

@ogrisel
Member Author
ogrisel commented Aug 30, 2023

By analysing the differences with _GeneralizedLinearRegressor I found more details that might matter. Here is a summary:

  • the setting of ftol;
  • different values for the maximum number of line search steps (20 vs 50);
  • more importantly, the initialization of the intercept parameter: LogisticRegression uses 0, which makes it impossible to use a fixed ftol as GLMR does. If we were to initialize the intercept smartly as done in the GLMR class, then we could probably set a fixed small ftol all the time and mostly rely on gtol to cleanly terminate lbfgs in most situations (see the sketch at the end of this comment).

So I propose to keep this PR as a draft until I can find the time to conduct a deeper study on several datasets to estimate the relative robustness of the different init / optimizer settings.
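
As a sketch of what a "smart" intercept initialization could look like in the binary case (this mirrors the idea used in the GLM code path, but the helper below is hypothetical, not scikit-learn API):

import numpy as np

def initial_intercept(y, eps=1e-12):
    # Start the intercept at the log-odds of the positive-class frequency:
    # this is the exact optimum of the binomial loss for an intercept-only
    # model, so the initial loss and gradient are already in a sensible range
    # and ftol/gtol based stopping behaves more predictably.
    p = np.clip(np.mean(y), eps, 1 - eps)
    return np.log(p / (1 - p))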

@lorentzenchr
Member

@ogrisel Thanks for your analysis. While the plots and tests show the improvement brought by this PR, I'm still not convinced. They show that ftol terminated the solver instead of gtol, which is the preferred criterion. I still think correctly scaling the objective (and gradient) is the cleaner solution.

@ogrisel
Member Author
ogrisel commented Aug 30, 2023

I agree, but I think the main ingredient to make the gtol-based stopping condition work is a proper init of the intercept.

@lorentzenchr
Member

I agree, but I think the main ingredient to make the gtol-based stopping condition work is a proper init of the intercept.

I don’t follow. How are the start point of the intercept and gtol connected?

@ogrisel
Member Author
ogrisel commented Sep 1, 2023

I don’t follow. How are the start point of the intercept and gtol connected?

I suspect the intercept-related component of the gradient can dominate the max(abs(gradient)) computation in that case, preventing the gtol criterion from triggering before the ftol criterion when both are set proportionally to tol. I would need to run experiments to confirm this.
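
A quick, purely illustrative way to check that suspicion on synthetic data (coef and intercept both initialized at zero, mean-reduced binomial loss):

import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 10_000, 20
X = rng.normal(scale=0.1, size=(n_samples, n_features))  # small, centered features
y = (rng.random(n_samples) < 0.05).astype(float)         # imbalanced target

# Gradient of the mean binomial log-loss at coef=0, intercept=0:
p = np.full(n_samples, 0.5)         # sigmoid(0)
grad_coef = X.T @ (p - y) / n_samples
grad_intercept = np.mean(p - y)     # == 0.5 - y.mean(), roughly 0.45 here

print(f"max |grad over coefs|:  {np.abs(grad_coef).max():.2e}")
print(f"|grad over intercept|: {abs(grad_intercept):.2e}")
# The intercept component dominates max(abs(gradient)) here, which is the
# quantity L-BFGS-B compares against gtol (projected gradient infinity norm).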

@lorentzenchr
Member

I'd say #26721 solved the issue.

@ogrisel ogrisel deleted the logistic-regression-lbfgs-tol branch October 24, 2023 15:30

Successfully merging this pull request may close these issues.

LogisticRegression with lbfgs solver terminates before convergence
2 participants