
FIX LogisticRegression's handling of the tol parameter with solver="lbfgs" #27191


Closed

Conversation

ogrisel
Member
@ogrisel ogrisel commented Aug 28, 2023

I believe this fixes #18074.

This is a draft fix to set ftol while preserving the default behavior.

This PR is still a draft; a few TODOs remain.

Note: this PR does not investigate the potential problem of the scaling of the penalization term (#24752), but it is probably a prerequisite for conducting proper benchmarks with varying tol values.
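
For context, here is a minimal sketch of the kind of change this PR is about, assuming the fit ends up in a scipy.optimize.minimize call. The helper name and the exact formula are illustrative, not the PR's actual code:

import numpy as np
from scipy.optimize import minimize

def lbfgs_tolerances(tol):
    # Hypothetical helper (not the PR's exact code): forward the user-facing
    # tol to gtol, and derive ftol from tol as well, but never let it drop
    # below a small multiple of float64 machine precision so that tol=0
    # still terminates cleanly instead of relying on scipy's default
    # ftol (factr=1e7, i.e. ~2.22e-9), which can stop the solver too early.
    eps = np.finfo(np.float64).eps
    return {"gtol": tol, "ftol": max(tol, 64 * eps)}

# usage sketch, where func returns (loss, grad) and w0 is the start point:
# opt_res = minimize(func, w0, method="L-BFGS-B", jac=True,
#                    options={"maxiter": max_iter, **lbfgs_tolerances(tol)})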

@github-actions
github-actions bot commented Aug 28, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 01673c8.

@ogrisel
Member Author
ogrisel commented Aug 28, 2023

In particular with this PR, we get the expected behavior for the reproducer of #18074 (adapted below to explore even lower tol values):

In [1]: import numpy as np
   ...: from sklearn.linear_model import LogisticRegression
   ...: 
   ...: n_features = 1000
   ...: n_examples = 1500
   ...: 
   ...: np.random.seed(0)
   ...: x = np.random.random((n_examples, n_features))
   ...: y = np.random.randint(2, size=n_examples)
   ...: max_iter=10000
   ...: solver = 'lbfgs'
   ...: 
   ...: for tol in [1e-2, 1e-4, 1e-6, 1e-8, 1e-10, 1e-12, 1e-14, 0]:
   ...:   np.random.seed(0)
   ...:   lr1 = LogisticRegression(solver=solver, max_iter=max_iter, tol=tol).fit(x, y)
   ...: 
   ...:   np.random.seed(0)
   ...:   lr2 = LogisticRegression(solver=solver, max_iter=max_iter, tol=tol).fit(x[::-1], y[::-1])
   ...: 
   ...:   print(f'tol={tol}')
   ...:   print(f'  Optimizer iterations, forward order: {lr1.n_iter_[0]}, reverse order: {lr2.n_iter_[0]}.')
   ...:   print(f'  Mean absolute diff in coefficients: {np.abs(lr1.coef_ - lr2.coef_).mean()}')
   ...: 
tol=0.01
  Optimizer iterations, forward order: 155, reverse order: 179.
  Mean absolute diff in coefficients: 0.001510650487731272
tol=0.0001
  Optimizer iterations, forward order: 392, reverse order: 429.
  Mean absolute diff in coefficients: 0.0002033278006014194
tol=1e-06
  Optimizer iterations, forward order: 495, reverse order: 484.
  Mean absolute diff in coefficients: 4.46362869259696e-05
tol=1e-08
  Optimizer iterations, forward order: 614, reverse order: 544.
  Mean absolute diff in coefficients: 3.279931927560337e-06
tol=1e-10
  Optimizer iterations, forward order: 687, reverse order: 555.
  Mean absolute diff in coefficients: 1.1604508054265435e-06
tol=1e-12
  Optimizer iterations, forward order: 687, reverse order: 593.
  Mean absolute diff in coefficients: 7.891610079340727e-07
tol=1e-14
  Optimizer iterations, forward order: 687, reverse order: 593.
  Mean absolute diff in coefficients: 7.891610079340727e-07
tol=0
  Optimizer iterations, forward order: 687, reverse order: 593.
  Mean absolute diff in coefficients: 7.891610079340727e-07
CPU times: user 14 s, sys: 1.15 s, total: 15.2 s
Wall time: 6.14 s

Note that the number of iterations keeps increasing as tol decreases, until tol reaches a value a few orders of magnitude larger than machine precision.

I still find the difference in coef space to be surprisingly large, but at least it's closer to the behavior of the other solvers.

EDIT: I updated this comment to reflect the new strategy implemented in b6b52a5 but the results are very similar.

@ogrisel
Member Author
ogrisel commented Aug 29, 2023

Here are the convergence plots for b6b52a5 for the script of #24752:

  • per elapsed wall-clock time: (figure: lbfgs_wall_time)

  • per iteration: (figure: lbfgs_per_iter)

Note that when comparing against lbfgs2 (_GeneralizedLinearRegressor + HalfBinomialLoss), the gtol/ftol strategy implemented in this PR seems able to reach higher-precision models for the smallest values of tol (tol=1e-10): the suboptimality of the best lbfgs model is below 1e-11, which is not the case for _GeneralizedLinearRegressor + HalfBinomialLoss.

Also note that the lbfgs curve is shifted to the right of the lbfgs2 curve in terms of wall-clock time, but the two approximately overlap in terms of per-iteration convergence: maybe there is extra overhead in the LogisticRegression class compared to the _GeneralizedLinearRegressor counterpart. This would require confirmation with a profiler.

EDIT: also note that the newton-cholesky / lbfgs2 methods rely on (_GeneralizedLinearRegressor + HalfBinomialLoss) while lbfgs and newton-cg rely on the LogisticRegression class: as discussed in #24752, those two approaches do not use the same scale for l2 regularization. I would rather keep this as a separate issue outside the scope of this PR. On the above plot the penalty alpha is very low (1e-12) but still non-zero, which means that the lowest suboptimality values can be a bit misleading.

@ogrisel
Member Author
ogrisel commented Aug 29, 2023

In _GeneralizedLinearRegressor the lbfgs solver uses a fixed small value for ftol = 64 * np.finfo(np.float64).eps:

opt_res = scipy.optimize.minimize(
    func,
    coef,
    method="L-BFGS-B",
    jac=True,
    options={
        "maxiter": self.max_iter,
        "maxls": 50,  # default is 20
        "iprint": self.verbose - 1,
        "gtol": self.tol,
        # The constant 64 was found empirically to pass the test suite.
        # The point is that ftol is very small, but a bit larger than
        # machine precision for float64, which is the dtype used by lbfgs.
        "ftol": 64 * np.finfo(float).eps,
    },
    args=(X, y, sample_weight, l2_reg_strength, n_threads),
)

This is a low value but still larger than the minimum value recommended by the scipy documentation (ftol = 10 * np.finfo(np.float64).eps). Here are the links to the relevant parts of scipy's documentation:

https://github.com/scipy/scipy/blob/e7542f30e3a9bc37cae16fbd3fbd534e0dce20b4/scipy/optimize/_lbfgsb_py.py#L267-L271

The option ftol is exposed via the scipy.optimize.minimize interface,
but calling scipy.optimize.fmin_l_bfgs_b directly exposes factr. The
relationship between the two is ftol = factr * numpy.finfo(float).eps.
I.e., factr multiplies the default machine floating-point precision to
arrive at ftol.

https://github.com/scipy/scipy/blob/c3ec353c503d23c1357e78ebff750755b3def8c6/scipy/optimize/_lbfgsb_py.py#L82-L90

The iteration stops when (f^k - f^{k+1})/max{|f^k|,|f^{k+1}|,1} <= factr * eps,
where eps is the machine precision, which is automatically
generated by the code. Typical values for factr are: 1e12 for
low accuracy; 1e7 for moderate accuracy; 10.0 for extremely
high accuracy. See Notes for relationship to ftol, which is exposed
(instead of factr) by the scipy.optimize.minimize interface to
L-BFGS-B.
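
To make the ftol/factr relationship concrete, here is a small standalone check. It is not related to the PR code; the Rosenbrock problem is just a stand-in objective:

import numpy as np
from scipy.optimize import fmin_l_bfgs_b, minimize, rosen, rosen_der

eps = np.finfo(np.float64).eps
ftol = 64 * eps     # the value hard-coded in _GeneralizedLinearRegressor
factr = ftol / eps  # == 64.0, between "extremely high" (10) and "moderate" (1e7) accuracy

x0 = np.full(5, 1.5)

# minimize exposes ftol...
res = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B",
               options={"ftol": ftol, "gtol": 1e-12})
# ...while fmin_l_bfgs_b exposes factr; both drive the same Fortran routine.
x_opt, f_opt, info = fmin_l_bfgs_b(rosen, x0, fprime=rosen_der,
                                   factr=factr, pgtol=1e-12)

# Both runs apply the same stopping condition and should reach the same point.
print(np.allclose(res.x, x_opt), res.nit, info["nit"])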

@ogrisel
Copy link
Member Author
ogrisel commented Aug 29, 2023

We might want to bring the two uses of the lbfgs solver in scikit-learn linear models more in line with each other. However:

  • the strategy implemented in this PR seems to give users good control, while making it possible to reach higher-precision models when setting tol (very close to) 0.0;
  • we might not want to change the default behavior of LogisticRegression (with tol=1e-4), to avoid annoying users for whom the current code is fine.

As a result, it might be a good idea to reuse the interpolation strategy implemented in this PR as part of _GeneralizedLinearRegressor in a follow-up PR.

-def test_logistic_regression_solvers_multiclass():
+@pytest.mark.parametrize("C", [1e-6, 1, 1e6])
+def test_logistic_regression_solvers_multinomial(C, global_random_seed):
@ogrisel
Member Author

Note: I decided to repurpose this test to focus on the multinomial loss instead of the OVR loss. The latter is just a bunch of binary classification problems: this is redundant with test_logistic_regression_solvers_binary.

@ogrisel
Member Author
ogrisel commented Aug 29, 2023

I refactored test_logistic_regression_solvers_* tests to make them stronger by using lower tol values and stricter atol checks.

While doing so I found some problems that are not directly related to the purpose of this PR. I marked them with XXX comments for later investigation. I plan to open dedicated issues later.

@ogrisel
Member Author
ogrisel commented Aug 29, 2023

I pushed an additional non-regression test. It is not as comprehensive as the snippet used in #27191 (comment) originally proposed by @geraschenko in the bug report (#18074). The reason is that I am not sure what would be an expected upper bound for the absolute difference in coef between the 2 models with reordered features.
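
For reference, here is one possible shape for such a test. The dataset size, tol and the 1e-4 threshold below are illustrative guesses, not the values committed in this PR:

import numpy as np
from sklearn.linear_model import LogisticRegression

def test_lbfgs_converges_regardless_of_feature_order():
    # Non-regression sketch for gh-18074: with a tiny tol, lbfgs should run
    # to numerical convergence instead of stopping early on scipy's default
    # ftol, so fitting on the data in forward and reverse order should give
    # nearly identical coefficients.
    rng = np.random.RandomState(0)
    X = rng.random_sample((300, 50))
    y = rng.randint(2, size=300)

    params = dict(solver="lbfgs", max_iter=10_000, tol=1e-12)
    lr_fwd = LogisticRegression(**params).fit(X, y)
    lr_rev = LogisticRegression(**params).fit(X[::-1], y[::-1])

    assert lr_fwd.n_iter_[0] < params["max_iter"]
    assert lr_rev.n_iter_[0] < params["max_iter"]
    assert np.abs(lr_fwd.coef_ - lr_rev.coef_).mean() < 1e-4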

@ogrisel
Member Author
ogrisel commented Aug 30, 2023

By analysing the differences with _GeneralizedLinearRegressor I found more details that might matter. Here is a summary:

  • the setting of ftol;
  • different values for the maximum number of line search steps (20 vs 50);
  • more importantly, the initialization of the intercept parameter: LogisticRegression uses 0, which makes it impossible to use a fixed ftol as GLMR does. If we were to initialize the intercept smartly as done in the GLMR class, then we could probably set a fixed small ftol all the time and mostly rely on gtol to cleanly terminate lbfgs in most situations (see the sketch at the end of this comment).

So I propose to keep this PR as a draft until I can find the time to conduct a deeper study on several datasets to estimate the relative robustness of the different init / optimizer settings.
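
As a sketch of what a "smart" intercept initialization could look like in the binary case (this mirrors the idea used in the GLM code path, but the helper below is hypothetical, not scikit-learn API):

import numpy as np

def initial_intercept(y, eps=1e-12):
    # Start the intercept at the log-odds of the positive-class frequency:
    # this is the exact optimum of the binomial loss for an intercept-only
    # model, so the initial loss and gradient are already in a sensible range
    # and ftol/gtol based stopping behaves more predictably.
    p = np.clip(np.mean(y), eps, 1 - eps)
    return np.log(p / (1 - p))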

@lorentzenchr
Member

@ogrisel Thanks for your analysis. While the plots and tests show the improvement brought by this PR, I'm still not convinced. They show that ftol terminated the solver instead of gtol, which is the preferred criterion. I still think correctly scaling the objective (and gradient) is the cleaner solution.

@ogrisel
Member Author
ogrisel commented Aug 30, 2023

I agree, but I think the main ingredient to make the gtol-based stopping condition work is a proper init of the intercept.

@lorentzenchr
Member

I agree, but I think the main ingredient to make the gtol-based stopping condition work is a proper init of the intercept.

I don’t follow. How are the start point of the intercept and gtol connected?

@ogrisel
Member Author
ogrisel commented Sep 1, 2023

I don’t follow. How are the start point of the intercept and gtol connected?

I suspect the intercept-related component of the gradient can dominate the max(abs(gradient)) computation in that case, preventing the gtol criterion from triggering before the ftol criterion when both are set proportionally to tol. I would need to run experiments to confirm this.
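
A quick, purely illustrative way to check that suspicion on synthetic data (coef and intercept both initialized at zero, mean-reduced binomial loss):

import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 10_000, 20
X = rng.normal(scale=0.1, size=(n_samples, n_features))  # small, centered features
y = (rng.random(n_samples) < 0.05).astype(float)         # imbalanced target

# Gradient of the mean binomial log-loss at coef=0, intercept=0:
p = np.full(n_samples, 0.5)         # sigmoid(0)
grad_coef = X.T @ (p - y) / n_samples
grad_intercept = np.mean(p - y)     # == 0.5 - y.mean(), roughly 0.45 here

print(f"max |grad over coefs|:  {np.abs(grad_coef).max():.2e}")
print(f"|grad over intercept|: {abs(grad_intercept):.2e}")
# The intercept component dominates max(abs(gradient)) here, which is the
# quantity L-BFGS-B compares against gtol (projected gradient infinity norm).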

@lorentzenchr
Member

I'd say #26721 solved the issue.

@ogrisel ogrisel deleted the logistic-regression-lbfgs-tol branch October 24, 2023 15:30

Successfully merging this pull request may close these issues.

LogisticRegression with lbfgs solver terminates before convergence
2 participants