Description
I did some follow-up work on the discussions about potential problems with `Ridge(normalize=False)` in #19426 and #17444 in the following gist:
https://gist.github.com/ogrisel/cf7ac8bca6725ec02d4cc8d03ce096a7
This gist compares the results of a minimal implementation of ridge with L-BFGS by @agramfort against scikit-learn's `Ridge(normalize=False, solver="cholesky")`, `Ridge(normalize=False, solver="lsqr")`, `Ridge(normalize=False, solver="svd")` and other solvers.
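For readers who do not want to open the gist, here is a hedged sketch of what a minimal ridge-with-L-BFGS solver can look like (this is not the gist's actual code; the function name is illustrative and intercept fitting is omitted):

```python
# Hedged sketch of a minimal ridge solver using scipy's L-BFGS-B:
# minimize ||y - X w||^2 + alpha * ||w||^2 (no intercept handling).
import numpy as np
from scipy.optimize import minimize


def ridge_lbfgs(X, y, alpha=1.0, tol=1e-12):
    n_features = X.shape[1]

    def objective_and_grad(w):
        residual = X @ w - y
        objective = residual @ residual + alpha * (w @ w)
        grad = 2.0 * (X.T @ residual + alpha * w)
        return objective, grad

    result = minimize(
        objective_and_grad,
        x0=np.zeros(n_features),
        method="L-BFGS-B",
        jac=True,  # objective_and_grad returns (value, gradient)
        tol=tol,
    )
    return result.x
```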
The main conclusion is that those models seem to converge to the same solution if we set a low value of `tol` and if we are not too strict on the comparison tolerance: see how I defined `atol` in the gist. I cannot get convergence to machine precision (only around 1e-6/1e-7 in float64), but maybe this is expected. I am not so sure: I tried to tweak `pgtol` on top of `factr` but it did not improve things.
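For context, the kind of comparison I mean looks roughly like the sketch below; the dataset, the solver pair and the `atol` value are illustrative and not the ones used in the gist:

```python
# Illustrative comparison: fit two solvers with a very small tol and only
# require agreement up to a modest absolute tolerance (not machine precision).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=1_000, n_features=100, random_state=0)

coef_svd = Ridge(alpha=1.0, solver="svd").fit(X, y).coef_
coef_lsqr = Ridge(alpha=1.0, solver="lsqr", tol=1e-12).fit(X, y).coef_

# Agreement around 1e-6/1e-7 in float64 rather than machine precision.
np.testing.assert_allclose(coef_lsqr, coef_svd, rtol=0, atol=1e-6)
```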
On large-ish problems, L-BFGS can be significantly faster than our always-accurate `svd` solver (but not always, it depends on the regularization), yet never faster than the `lsqr` solver (based on `scipy.sparse.linalg.lsqr`). The `cholesky` solver (our default for dense data) is sometimes faster and sometimes slower than `lsqr` and L-BFGS (depending on the regularization), but always faster than `svd` (at least with high regularization). The `sag`/`saga` solvers are not necessarily as fast as they could be: I noticed that they do not benefit from multi-threading on a multicore machine, which might be the reason. I have never seen them run faster than `lsqr`. Still, `sag`/`saga` can be significantly faster than L-BFGS despite using fewer cores on problems with n_samples=5e3 / n_features=1e3 and not too much regularization (alpha=1e-6).
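For reference, a rough sketch of the kind of timing comparison described above (not the gist's actual benchmark; the `tol` and `max_iter` values are illustrative and actual timings depend on the machine, the BLAS and the regularization):

```python
# Rough timing sketch on the problem shape mentioned above
# (n_samples=5e3, n_features=1e3, alpha=1e-6).
from time import perf_counter

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=5_000, n_features=1_000, random_state=0)

for solver in ["cholesky", "lsqr", "svd", "sag", "saga"]:
    model = Ridge(alpha=1e-6, solver=solver, tol=1e-10, max_iter=10_000)
    tic = perf_counter()
    model.fit(X, y)
    print(f"{solver}: {perf_counter() - tic:.3f} s")
```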
There are still problems with `normalize=True` as reported in #17444, but maybe we do not care so much because we are going to deprecate this option; see #17772 by @maikia, which I plan to review soon.
Proposed plan of action:

- proceed with the deprecation of `normalize=True` in #17772 (DEP Deprecate 'normalize' in ridge models) and therefore put #17444 (Ridge(normalize=True) fails check_sample_weights_invariance) on hold, or treat it as less of a priority. Edit: actually no, see #19616 ([MRG] FIX sample_weight invariance for linear models).
- improve the ridge tests in scikit-learn by explicitly evaluating the minimized objective function for the different solvers, maybe using a utility such as the `check_min` function of my gist (a sketch is given after this list). Tight tests (but without checking the minimum explicitly): #22910 (TST tight and clean tests for Ridge).
- maybe add the `ridge_bfgs` function as a reference to compare against in our test suite? Since it does not seem to always converge to a very high quality optimum, I am not so sure of its value.
- maybe we could conduct a more systematic evaluation on various datasets with different shapes, conditioning and regularization to select a better default solver than "cholesky" for dense data? "cholesky" is reputed to be numerically unstable, although I did not observe problems in my gist, maybe because of the regularization. `lsqr` seems very promising, both in terms of speed and accuracy of the solution (according to my `check_min` utility).
- I am not sure that our default value of `tol=1e-3` is safe. It is probably very solver dependent, but I doubt it is a good idea for the fast `lsqr` solver, for instance (to be benchmarked if we do the above evaluation). See #24465 (ENH change Ridge tol to 1e-4).
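Here is a minimal sketch of what such a `check_min` utility could look like (not the gist's actual implementation; the `slack` value is an arbitrary illustrative choice):

```python
# Sketch of a check_min-style helper: evaluate the ridge objective
# ||y - X w||^2 + alpha * ||w||^2 at the fitted coefficients and check that
# it is not meaningfully worse than at a reference solution.
import numpy as np


def ridge_objective(X, y, w, alpha):
    residual = y - X @ w
    return residual @ residual + alpha * (w @ w)


def check_min(X, y, coef, coef_ref, alpha, slack=1e-7):
    obj = ridge_objective(X, y, coef, alpha)
    obj_ref = ridge_objective(X, y, coef_ref, alpha)
    assert obj <= obj_ref * (1 + slack), (obj, obj_ref)
```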
Other ideas:

- maybe we should consider adding l-bfgs as an option? That would require doing more evaluation on sparse data / ill-conditioned data with various values of alpha. I am not sure of the benefit compared to `lsqr` or `cholesky` though: they seem to always be more accurate than l-bfgs and faster (especially `lsqr`), and l-bfgs can be very slow when alpha is small.
- and maybe add a QR-based solver that is supposed to be more numerically stable? Or is this what LAPACK is doing internally when we use the `lsqr` solver? (A sketch of the formulation a QR-based solver could use is given after this list.)
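For illustration, here is a hedged sketch of the augmented least-squares formulation that a QR-based solver could use; this is only one possible approach and says nothing about what the existing solvers actually do internally:

```python
# Ridge as ordinary least squares on an augmented design matrix, solved with
# a thin QR factorization: min_w ||[X; sqrt(alpha) I] w - [y; 0]||^2.
import numpy as np
from scipy.linalg import qr, solve_triangular


def ridge_qr(X, y, alpha):
    n_features = X.shape[1]
    X_aug = np.vstack([X, np.sqrt(alpha) * np.eye(n_features)])
    y_aug = np.concatenate([y, np.zeros(n_features)])
    Q, R = qr(X_aug, mode="economic")
    return solve_triangular(R, Q.T @ y_aug)
```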