BUG unpenalized Ridge does not give minimum norm solution #22947


Open
lorentzenchr opened this issue Mar 25, 2022 · 6 comments · May be fixed by #25948

Comments

@lorentzenchr (Member) commented Mar 25, 2022

Describe the bug

As noted in #22910, Ridge(alpha=0, fit_intercept=True) does not give the minimal norm solution for wide data, i.e. n_features > n_samples.

Note that we do not guarantee anywhere that we provide the minimum norm solution.

Edit: Same seems to hold for LinearRegression, see #26164.

Probable Cause

For wide X, the least squares problem reads a bit differently: $\min ||w||_2$ subject to $Xw = y$, with solution $w = X'(XX')^{-1} y$, see e.g. http://ee263.stanford.edu/lectures/min-norm.pdf.
With an explicit intercept $w_0$, this reads $w = X'(XX' + 1 1')^{-1} y$ and $w_0 = 1'(XX' + 1 1')^{-1} y$, where $1$ is a column vector of ones.
This is incompatible with our mean centering approach.
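
A minimal numerical sketch to check this closed form (assuming the augmented matrix [X, 1] has full row rank, so that XX' + 1 1' is invertible):

import numpy as np

rng = np.random.RandomState(0)
n_samples, n_features = 4, 12
X = rng.randn(n_samples, n_features)
y = rng.randn(n_samples)
ones = np.ones(n_samples)

# Gram matrix of the augmented design [X, 1]: XX' + 1 1'
G_inv = np.linalg.inv(X @ X.T + np.outer(ones, ones))
w = X.T @ G_inv @ y    # coefficients from the closed form
w0 = ones @ G_inv @ y  # intercept from the closed form

# Minimum norm solution via the pseudoinverse of the augmented matrix
expected = np.linalg.pinv(np.hstack([X, ones[:, None]])) @ y
np.testing.assert_allclose(np.r_[w, w0], expected)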

Example

import numpy as np
from numpy.testing import assert_allclose
from scipy import linalg
from sklearn.datasets import make_low_rank_matrix
from sklearn.linear_model import Ridge

n_samples, n_features = 4, 12  # wide data
k = min(n_samples, n_features)
rng = np.random.RandomState(42)
X = make_low_rank_matrix(
    n_samples=n_samples, n_features=n_features, effective_rank=k, random_state=rng
)
X[:, -1] = 1  # last column acts as intercept
U, s, Vt = linalg.svd(X)
assert np.all(s > 1e-3)  # to be sure X is not singular
U1, U2 = U[:, :k], U[:, k:]
Vt1, _ = Vt[:k, :], Vt[k:, :]
y = rng.uniform(low=-10, high=10, size=n_samples)
# w = X'(XX')^-1 y = V s^-1 U' y
coef_ols = Vt1.T @ np.diag(1 / s) @ U1.T @ y

model = Ridge(alpha=0, fit_intercept=True)
X = X[:, :-1]  # remove intercept
intercept = coef_ols[-1]
coef = coef_ols[:-1]
model.fit(X, y)

# Check that we have found a solution => residuals = 0
assert_allclose(model.predict(X), y)
# Check that `coef`, `intercept` also provide a valid solution
assert_allclose(X @ coef + intercept, y)

# Ridge does not give the minimum norm solution. (These two norms should be equal.)
np.linalg.norm(np.r_[model.intercept_, model.coef_]) > np.linalg.norm(
    np.r_[intercept, coef]
)

This last expression should evaluate to False, but it is True, which shows that Ridge does not give the minimum norm solution.

github-actions bot added the Needs Triage label Mar 25, 2022
@thomasjpfan added the Bug and module:linear_model labels and removed the Needs Triage label Mar 25, 2022
@ogrisel (Member) commented Oct 12, 2022

Related: apparently LinearRegression sometimes yields the minimal norm solution, in a platform dependent manner, as seen in this feedback from the MOOC:

So this behavior in the Ridge class might be either platform dependent or dependent on the version of scipy / LAPACK.

@Ahmedbgh (Contributor) commented Jan 4, 2023

I'm working on it :)

@ogrisel (Member) commented Mar 10, 2023

I think for Ridge, it's ok to offer the possibility to use solvers that do not converge to the minimum norm solution when alpha is 0, as long as:

  • the docstring for the solver parameter clearly documents which solvers are (numerically) safe to use without penalization,
  • we raise an informative ConvergenceWarning when calling Ridge with alpha < np.finfo(X.dtype).eps (or similar) and a value of solver that does not guarantee to recover the minimum norm solution (see the sketch below).
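
A rough sketch of such a check (the helper name and the set of "safe" solvers below are hypothetical, only to illustrate the idea):

import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning

def _warn_if_effectively_unpenalized(alpha, dtype, solver, safe_solvers=("svd",)):
    # Warn when alpha is numerically zero and the chosen solver is not known
    # to converge to the minimum norm solution of the unpenalized problem.
    if alpha < np.finfo(dtype).eps and solver not in safe_solvers:
        warnings.warn(
            f"Ridge(alpha={alpha}) with solver={solver!r} is not guaranteed to "
            "converge to the minimum norm solution of the unpenalized problem.",
            ConvergenceWarning,
        )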

For LinearRegression we should probably not expose solvers that are not guaranteed to converge to the minimum norm solution. This specific issue is being discussed in #22855.

@ogrisel (Member) commented Mar 11, 2023

Note that the default solver used in the snippet above will fall back to "svd" because the problem is singular.

I tried the snippet above with "lsqr" and unfortunately it does not converge to the minimum norm solution either.

As a sanity check I implemented the following naive solver:

import numpy as np
from sklearn.linear_model import LinearRegression


class PInvLinearRegression(LinearRegression):

    def __init__(self, fit_intercept=True):
        self.fit_intercept = fit_intercept

    def fit(self, X, y):
        if self.fit_intercept:
            # append a column of ones to fit the intercept explicitly
            X = np.hstack([X, np.ones((X.shape[0], 1))])
        # solve the normal equations with the pseudoinverse
        fitted_params = np.linalg.pinv(X.T @ X) @ X.T @ y
        if self.fit_intercept:
            self.coef_ = fitted_params[:-1]
            self.intercept_ = fitted_params[-1]
        else:
            self.coef_ = fitted_params
            self.intercept_ = 0
        return self

and I confirm that this approach recovers the expected minimal norm solution successfully, so the script in the description of this issue is correct.

This pinv-based solver is not suitable for large n_features in practice. Furthermore, if we were to implement it in scikit-learn (for the small n_features case) we would need to find a way to avoid the memory copy with hstack and add support for sample_weight (possibly via a linear operator?).

I suspect that to recover the minimum norm solution for iterative solvers such as "lbfgs" or "lsqr", we need to do a dumb init of all the parameters (including the intercept) at zero.

@ogrisel (Member) commented Mar 11, 2023

I suspect that to recover the minimum norm solution for iterative solvers such as "lbfgs" or "lsqr", we need to do a dumb init of all the parameters (including the intercept) at zero.

Actually, for lsqr, this is not really an init problem but rather a matter of how we pose the problem. We preprocess the data based on fit_intercept to mean center y, and then we use that assumption to simplify the handling of the sample weights:

y_offset = np.average(y, axis=0, weights=sample_weight)
y = y - y_offset

def _solve_lsqr(
    X,
    y,
    *,
    alpha,
    fit_intercept=True,
    max_iter=None,
    tol=1e-4,
    X_offset=None,
    X_scale=None,
    sample_weight_sqrt=None,
):
    """Solve Ridge regression via LSQR.

    We expect that y is always mean centered.
    If X is dense, we expect it to be mean centered such that we can solve
        ||y - Xw||_2^2 + alpha * ||w||_2^2
    If X is sparse, we expect X_offset to be given such that we can solve
        ||y - (X - X_offset)w||_2^2 + alpha * ||w||_2^2

    With sample weights S=diag(sample_weight), this becomes
        ||sqrt(S) (y - (X - X_offset) w)||_2^2 + alpha * ||w||_2^2
    and we expect y and X to already be rescaled, i.e. sqrt(S) @ y, sqrt(S) @ X. In
    this case, X_offset is the sample_weight weighted mean of X before scaling by
    sqrt(S). The objective then reads
        ||y - (X - sqrt(S) X_offset) w||_2^2 + alpha * ||w||_2^2
    """

If instead of mean centering y, we pass it raw to lsqr, we actually recover the minimum norm solution. Here is the proof of concept estimator I used to check this:

import numpy as np
from scipy.sparse import linalg as sparse_linalg
from sklearn.linear_model import LinearRegression


class LsqrLinearRegression(LinearRegression):

    def __init__(self, fit_intercept=True, tol=1e-5):
        self.fit_intercept = fit_intercept
        self.tol = tol

    def fit(self, X, y):
        if self.fit_intercept:
            # append a column of ones instead of mean centering X and y
            X = np.hstack([X, np.ones((X.shape[0], 1))])
        result = sparse_linalg.lsqr(
            X, y, damp=0, atol=self.tol, btol=self.tol, iter_lim=None
        )
        fitted_params = result[0]
        self.n_iter_ = result[2]
        if self.fit_intercept:
            self.coef_ = fitted_params[:-1]
            self.intercept_ = fitted_params[-1]
        else:
            self.coef_ = fitted_params
            self.intercept_ = 0
        return self
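
For instance, reusing the imports and the variables X, y, coef and intercept from the reproduction script in the issue description, a quick check along these lines should show matching parameter norms (unlike Ridge(alpha=0)):

lsqr_reg = LsqrLinearRegression().fit(X, y)
# residuals should be ~0, i.e. the model interpolates the training data
print(np.abs(lsqr_reg.predict(X) - y).max())
# the following two norms should approximately match
print(np.linalg.norm(np.r_[lsqr_reg.intercept_, lsqr_reg.coef_]))
print(np.linalg.norm(np.r_[intercept, coef]))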

So I think we should try to avoid doing the _preprocess_data trick and instead rely on some linear operators when needed to do the sample_weight handling on the fly, both for sparse and dense data.
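
To make that concrete, here is a rough sketch of on-the-fly sample_weight handling with a LinearOperator (the function name and API are only illustrative, not existing scikit-learn code):

import numpy as np
from scipy.sparse import linalg as sparse_linalg

def weighted_lsqr_with_intercept(X, y, sample_weight, tol=1e-8):
    # Solve min ||sqrt(S) (y - X w - w0)||_2^2 with S = diag(sample_weight),
    # without materializing the rescaled / augmented design matrix.
    n_samples, n_features = X.shape
    sw_sqrt = np.sqrt(sample_weight)

    def matvec(v):
        # sqrt(S) @ [X, 1] @ [w, w0]
        return sw_sqrt * (X @ v[:-1] + v[-1])

    def rmatvec(u):
        # [X, 1]' @ sqrt(S) @ u
        su = sw_sqrt * u
        return np.r_[X.T @ su, su.sum()]

    op = sparse_linalg.LinearOperator(
        shape=(n_samples, n_features + 1), matvec=matvec, rmatvec=rmatvec
    )
    params = sparse_linalg.lsqr(op, sw_sqrt * y, atol=tol, btol=tol)[0]
    return params[:-1], params[-1]  # coef, intercept

Since only X @ v and X.T @ u are used, the same operator would work unchanged for dense and sparse X.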

Still, we would also need to empirically study the speed of convergence, both in terms of wall clock time and n_iter_, and the impact in terms of test R^2, to decide which approach is best. Maybe the inductive bias induced by the current _preprocess_data is not that bad.

cc @agramfort @lorentzenchr.

@ogrisel (Member) commented Mar 11, 2023

Haha, I re-read @lorentzenchr's description of the issue:

Probable Cause

For wide X, the least squares problem reads a bit differently: min ||w||_2 subject to Xw = y, with solution w = X'(XX')^-1 y, see e.g. http://ee263.stanford.edu/lectures/min-norm.pdf.
With an explicit intercept w0, this reads w = X'(XX' + 1 1')^-1 y and w0 = 1'(XX' + 1 1')^-1 y, where 1 is a column vector of ones.
This is incompatible with our mean centering approach.

So this is confirmed by my quick experiments.

Projects
Status: In Progress
4 participants