BUG unpenalized Ridge does not give minimum norm solution #22947


Open
lorentzenchr opened this issue Mar 25, 2022 · 6 comments · May be fixed by #25948

Comments

@lorentzenchr (Member) commented Mar 25, 2022

Describe the bug

As noted in #22910, Ridge(alpha=0, fit_intercept=True) does not give the minimal norm solution for wide data, i.e. n_features > n_samples.

Note that we do not guarantee anywhere that we provide the minimum norm solution.

Edit: Same seems to hold for LinearRegression, see #26164.

Probable Cause

For wide X, the least squares problem reads a bit differently: $\min ||w||_2$ subject to $Xw = y$, with solution $w = X'(XX')^{-1} y$, see e.g. http://ee263.stanford.edu/lectures/min-norm.pdf.
With an explicit intercept $w_0$, this reads $w = X'(XX' + 1 1')^{-1} y$ and $w_0 = 1'(XX' + 1 1')^{-1} y$, where $1$ is a column vector of ones.
This is incompatible with our mean centering approach.
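
A minimal numerical sketch to check this closed form (assuming the augmented matrix [X, 1] has full row rank, so that XX' + 1 1' is invertible):

import numpy as np

rng = np.random.RandomState(0)
n_samples, n_features = 4, 12
X = rng.randn(n_samples, n_features)
y = rng.randn(n_samples)
ones = np.ones(n_samples)

# Gram matrix of the augmented design [X, 1]: XX' + 1 1'
G_inv = np.linalg.inv(X @ X.T + np.outer(ones, ones))
w = X.T @ G_inv @ y    # coefficients from the closed form
w0 = ones @ G_inv @ y  # intercept from the closed form

# Minimum norm solution via the pseudoinverse of the augmented matrix
expected = np.linalg.pinv(np.hstack([X, ones[:, None]])) @ y
np.testing.assert_allclose(np.r_[w, w0], expected)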

Example

import numpy as np
from numpy.testing import assert_allclose
from scipy import linalg
from sklearn.datasets import make_low_rank_matrix
from sklearn.linear_model import Ridge

n_samples, n_features = 4, 12  # wide data
k = min(n_samples, n_features)
rng = np.random.RandomState(42)
X = make_low_rank_matrix(
    n_samples=n_samples, n_features=n_features, effective_rank=k, random_state=rng
)
X[:, -1] = 1  # last column acts as intercept
U, s, Vt = linalg.svd(X)
assert np.all(s > 1e-3)  # to be sure X is not singular
U1, U2 = U[:, :k], U[:, k:]
Vt1, _ = Vt[:k, :], Vt[k:, :]
y = rng.uniform(low=-10, high=10, size=n_samples)
# w = X'(XX')^-1 y = V s^-1 U' y
coef_ols = Vt1.T @ np.diag(1 / s) @ U1.T @ y

model = Ridge(alpha=0, fit_intercept=True)
X = X[:, :-1]  # remove intercept
intercept = coef_ols[-1]
coef = coef_ols[:-1]
model.fit(X, y)

# Check that we have found a solution => residuals = 0
assert_allclose(model.predict(X), y)
# Check that `coef`, `intercept` also provide a valid solution
assert_allclose(X @ coef + intercept, y)

# Ridge does not give the minimum norm solution. (These two norms should be equal.)
np.linalg.norm(np.r_[model.intercept_, model.coef_]) > np.linalg.norm(
    np.r_[intercept, coef]
)

This last expression should evaluate to False, but it is True, which shows that Ridge does not give the minimum norm solution.

github-actions bot added the Needs Triage label Mar 25, 2022
@thomasjpfan added the Bug and module:linear_model labels and removed the Needs Triage label Mar 25, 2022
@ogrisel (Member) commented Oct 12, 2022

Related: apparently LinearRegression sometimes yields the minimal norm solution, in a platform dependent manner, as seen in this feedback from the MOOC:

So this behavior in the Ridge class might be either platform dependent or dependent on the version of scipy / LAPACK.

@Ahmedbgh (Contributor) commented Jan 4, 2023

I'm working on it :)

@ogrisel (Member) commented Mar 10, 2023

I think for Ridge, it's ok to offer the possibility to use solvers that do not converge to the minimum norm solution when alpha is 0, as long as:

  • the docstring for the solver parameter clearly documents which solvers are (numerically) safe to use without penalization,
  • we raise an informative ConvergenceWarning when calling Ridge with alpha < np.finfo(X.dtype).eps (or similar) and a value of solver that does not guarantee to recover the minimum norm solution (see the sketch below).
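
A rough sketch of such a check (the helper name and the set of "safe" solvers below are hypothetical, only to illustrate the idea):

import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning

def _warn_if_effectively_unpenalized(alpha, dtype, solver, safe_solvers=("svd",)):
    # Warn when alpha is numerically zero and the chosen solver is not known
    # to converge to the minimum norm solution of the unpenalized problem.
    if alpha < np.finfo(dtype).eps and solver not in safe_solvers:
        warnings.warn(
            f"Ridge(alpha={alpha}) with solver={solver!r} is not guaranteed to "
            "converge to the minimum norm solution of the unpenalized problem.",
            ConvergenceWarning,
        )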

For LinearRegression we should probably not expose solvers that are not guaranteed to converge to the minimum norm solution. This specific issue is being discussed in #22855.

@ogrisel (Member) commented Mar 11, 2023

Note that the default solver used in the snippet above will fall back to "svd" because the problem is singular.

I tried the snippet above with "lsqr" and unfortunately it does not converge to the minimum norm solution either.

As a sanity check I implemented the following naive solver:

import numpy as np
from sklearn.linear_model import LinearRegression


class PInvLinearRegression(LinearRegression):

    def __init__(self, fit_intercept=True):
        self.fit_intercept = fit_intercept

    def fit(self, X, y):
        if self.fit_intercept:
            # append a column of ones to fit the intercept explicitly
            X = np.hstack([X, np.ones((X.shape[0], 1))])
        # solve the normal equations with the pseudoinverse
        fitted_params = np.linalg.pinv(X.T @ X) @ X.T @ y
        if self.fit_intercept:
            self.coef_ = fitted_params[:-1]
            self.intercept_ = fitted_params[-1]
        else:
            self.coef_ = fitted_params
            self.intercept_ = 0
        return self

and I confirm that this approach recovers the expected minimal norm solution successfully, so the script in the description of this issue is correct.

This pinv-based solver is not suitable for large n_features in practice. Furthermore, if we were to implement it in scikit-learn (for the small n_features case) we would need to find a way to avoid the memory copy with hstack and add support for sample_weight (possibly via a linear operator?).

I suspect that to recover the minimum norm solution for iterative solvers such as "lbfgs" or "lsqr", we need to do a dumb init of all the parameters (including the intercept) at zero.

@ogrisel (Member) commented Mar 11, 2023

I suspect that to recover the minimum norm solution for iterative solvers such as "lbfgs" or "lsqr", we need to do a dumb init of all the parameters (including the intercept) at zero.

Actually, for lsqr, this is not really an init problem but rather a matter of how we pose the problem. We preprocess the data based on fit_intercept to mean center y, and then we use that assumption to simplify the handling of the sample weights:

y_offset = np.average(y, axis=0, weights=sample_weight)
y = y - y_offset

def _solve_lsqr(
    X,
    y,
    *,
    alpha,
    fit_intercept=True,
    max_iter=None,
    tol=1e-4,
    X_offset=None,
    X_scale=None,
    sample_weight_sqrt=None,
):
    """Solve Ridge regression via LSQR.

    We expect that y is always mean centered.
    If X is dense, we expect it to be mean centered such that we can solve
        ||y - Xw||_2^2 + alpha * ||w||_2^2
    If X is sparse, we expect X_offset to be given such that we can solve
        ||y - (X - X_offset)w||_2^2 + alpha * ||w||_2^2

    With sample weights S=diag(sample_weight), this becomes
        ||sqrt(S) (y - (X - X_offset) w)||_2^2 + alpha * ||w||_2^2
    and we expect y and X to already be rescaled, i.e. sqrt(S) @ y, sqrt(S) @ X. In
    this case, X_offset is the sample_weight weighted mean of X before scaling by
    sqrt(S). The objective then reads
        ||y - (X - sqrt(S) X_offset) w||_2^2 + alpha * ||w||_2^2
    """

If instead of mean centering y, we pass it raw to lsqr, we actually recover the minimum norm solution. Here is the proof of concept estimator I used to check this:

import numpy as np
from scipy.sparse import linalg as sparse_linalg
from sklearn.linear_model import LinearRegression


class LsqrLinearRegression(LinearRegression):

    def __init__(self, fit_intercept=True, tol=1e-5):
        self.fit_intercept = fit_intercept
        self.tol = tol

    def fit(self, X, y):
        if self.fit_intercept:
            # append a column of ones instead of mean centering X and y
            X = np.hstack([X, np.ones((X.shape[0], 1))])
        result = sparse_linalg.lsqr(
            X, y, damp=0, atol=self.tol, btol=self.tol, iter_lim=None
        )
        fitted_params = result[0]
        self.n_iter_ = result[2]
        if self.fit_intercept:
            self.coef_ = fitted_params[:-1]
            self.intercept_ = fitted_params[-1]
        else:
            self.coef_ = fitted_params
            self.intercept_ = 0
        return self
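
For instance, reusing the imports and the variables X, y, coef and intercept from the reproduction script in the issue description, a quick check along these lines should show matching parameter norms (unlike Ridge(alpha=0)):

lsqr_reg = LsqrLinearRegression().fit(X, y)
# residuals should be ~0, i.e. the model interpolates the training data
print(np.abs(lsqr_reg.predict(X) - y).max())
# the following two norms should approximately match
print(np.linalg.norm(np.r_[lsqr_reg.intercept_, lsqr_reg.coef_]))
print(np.linalg.norm(np.r_[intercept, coef]))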

So I think we should try to avoid doing the _preprocess_data trick and instead rely on some linear operators when needed to do the sample_weight handling on the fly, both for sparse and dense data.
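
To make that concrete, here is a rough sketch of on-the-fly sample_weight handling with a LinearOperator (the function name and API are only illustrative, not existing scikit-learn code):

import numpy as np
from scipy.sparse import linalg as sparse_linalg

def weighted_lsqr_with_intercept(X, y, sample_weight, tol=1e-8):
    # Solve min ||sqrt(S) (y - X w - w0)||_2^2 with S = diag(sample_weight),
    # without materializing the rescaled / augmented design matrix.
    n_samples, n_features = X.shape
    sw_sqrt = np.sqrt(sample_weight)

    def matvec(v):
        # sqrt(S) @ [X, 1] @ [w, w0]
        return sw_sqrt * (X @ v[:-1] + v[-1])

    def rmatvec(u):
        # [X, 1]' @ sqrt(S) @ u
        su = sw_sqrt * u
        return np.r_[X.T @ su, su.sum()]

    op = sparse_linalg.LinearOperator(
        shape=(n_samples, n_features + 1), matvec=matvec, rmatvec=rmatvec
    )
    params = sparse_linalg.lsqr(op, sw_sqrt * y, atol=tol, btol=tol)[0]
    return params[:-1], params[-1]  # coef, intercept

Since only X @ v and X.T @ u are used, the same operator would work unchanged for dense and sparse X.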

Still, we would also need to empirically study the speed of convergence, both in terms of wall clock time and n_iter_, and the impact in terms of test R^2, to decide which approach is best. Maybe the inductive bias induced by the current _preprocess_data is not that bad.

cc @agramfort @lorentzenchr.

@ogrisel (Member) commented Mar 11, 2023

Haha, I re-read @lorentzenchr's description of the issue:

Probable Cause

For wide X, the least squares problem reads a bit differently: min ||w||_2 subject to Xw = y, with solution w = X'(XX')^-1 y, see e.g. http://ee263.stanford.edu/lectures/min-norm.pdf.
With an explicit intercept w0, this reads w = X'(XX' + 1 1')^-1 y and w0 = 1'(XX' + 1 1')^-1 y, where 1 is a column vector of ones.
This is incompatible with our mean centering approach.

So this is confirmed by my quick experiments.

Projects
Status: In Progress
4 participants