fit_intercept in RidgeCV with sparse design matrix and gcv_mode='svd' · Issue #13325 · scikit-learn/scikit-learn
Closed
jeromedockes opened this issue Feb 28, 2019 · 9 comments · Fixed by #13350
Comments

@jeromedockes
Contributor

Currently, ridge with generalized cross-validation cannot use an SVD of the
design matrix (the best strategy when there are more samples than features)
when the design matrix is sparse.

This results in silently falling back to an eigendecomposition of the Gram
matrix when gcv_mode is 'auto':

if sparse.issparse(X) or n_features > n_samples or with_sw:

or in an error being raised when the user explicitly asks for 'svd':

raise TypeError("SVD not supported for sparse matrices")

Would it be better to:

  • allow sparse design matrices and compute the SVD using scipy.sparse.linalg.LinearOperator, or

  • warn the user that 'eigen' is being used, which can be very inefficient when
    n_samples is much larger than n_features, so that they can decide whether to
    stick with 'eigen', cast X to a dense matrix, or turn to another mode of
    cross-validation than the generalized CV?
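
For reference, here is a minimal sketch reproducing the two behaviors described above; it assumes a scikit-learn version from around the time of this issue (before #13350), so the exact behavior may differ on later releases.

from scipy import sparse
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

X, y = make_regression(n_samples=1000, n_features=5, random_state=0)
X_sp = sparse.csr_matrix(X)

# gcv_mode='auto' silently falls back to the eigendecomposition of the
# Gram matrix when X is sparse, even though n_samples >> n_features
RidgeCV(gcv_mode='auto').fit(X_sp, y)

# gcv_mode='svd' refuses sparse input outright
try:
    RidgeCV(gcv_mode='svd').fit(X_sp, y)
except TypeError as exc:
    print(exc)  # "SVD not supported for sparse matrices"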

@orausch
orausch commented Feb 28, 2019

Although scipy also uses an eigendecomposition to compute the SVD in the sparse case, they use a sparse eigensolver, which should be more performant. I'll write some stability and performance benchmarks and take a look.

@jeromedockes
Contributor Author

From what I understand
(https://github.com/scipy/scipy/blob/c3fa90dcfcaef71658744c73578e9e7d915c81e9/scipy/sparse/linalg/eigen/arpack/arpack.py#L1810),
they use an eigendecomposition of the smaller of X X^T and X^T X (in our case,
X^T X). When there are many samples this is more efficient than constructing the
Gram matrix and taking its eigendecomposition, as is done currently:

K = safe_sparse_dot(X, X.T, dense_output=True)

If I can be of assistance, I am willing to work on comparing the performance of
possible solutions and opening a PR if it turns out that _RidgeGCV can be
improved.
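
To make the comparison concrete, here is a rough sketch (not code from scikit-learn) of the two decompositions being discussed, using scipy.sparse.linalg.svds and a hypothetical sparse design matrix with many more samples than features:

import numpy as np
from scipy import sparse
from scipy.sparse.linalg import svds

rng = np.random.RandomState(0)
X_sp = sparse.random(2000, 50, density=0.1, format='csr', random_state=rng)

# sparse SVD: ARPACK works with the smaller of X X^T and X^T X (here the
# 50 x 50 matrix X^T X) and never materialises an n_samples x n_samples
# matrix; k must be strictly smaller than min(X.shape)
U, s, Vt = svds(X_sp, k=min(X_sp.shape) - 1)

# current 'eigen' path in _RidgeGCV: build the dense n_samples x n_samples
# Gram matrix and eigendecompose it
K = (X_sp @ X_sp.T).toarray()
eigvals, Q = np.linalg.eigh(K)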

@orausch
orausch commented Feb 28, 2019

Sure, if you'd like to take this up, you're welcome to.

@jeromedockes
Contributor Author

Thanks, I'll start working on a benchmark then.

@orausch
orausch commented Feb 28, 2019

I've spoken with some core devs, and the current behavior of not checking the dimensions is wrong. What remains to be seen from your benchmark is whether using the sparse eigensolver is always better than the dense solver (the matrix we get after multiplying is usually not sparse anymore, so this might not be the case).

@jeromedockes
Contributor Author

[WIP] Handle n_samples > n_features efficiently for sparse matricies in RidgeCV #13327

Actually, my point was that the method is silently switched to the dual formulation. I started a draft adding support for fitting an intercept when X is sparse and an SVD is used:

https://github.com/jeromedockes/scikit-learn/tree/ridge_gcv_svd_sparse_x

but it is only a draft and not yet reviewable.

@jeromedockes
Contributor Author
jeromedockes commented Feb 28, 2019

It also seems that the problem of no intercept being fitted happens in many places. For example, this script


import warnings
import numpy as np
from scipy import sparse

import sklearn
from sklearn.linear_model import LinearRegression, RidgeCV, Ridge
from sklearn.datasets import make_regression

warnings.simplefilter('ignore')


x, y, c = make_regression(n_features=5, random_state=0, coef=True, noise=17)
x += 10 * np.arange(x.shape[1])

print('dense:\n=======\n')

ridge = RidgeCV(fit_intercept=True).fit(x, y)
print('RidgeCV, gcv, dense: alpha: ', ridge.alpha_)
print('RidgeCV, gcv, dense: coef: ', ridge.coef_)
print('RidgeCV, gcv, dense: intercept: ', ridge.intercept_)
print()

ridge = RidgeCV(cv=10, fit_intercept=True).fit(x, y)
print('RidgeCV, k-fold, dense: alpha: ', ridge.alpha_)
print('RidgeCV, k-fold, dense: coef: ', ridge.coef_)
print('RidgeCV, k-fold, dense: intercept: ', ridge.intercept_)
print()

ridge = Ridge(alpha=1., fit_intercept=True).fit(x, y)
print('Ridge, dense: coef: ', ridge.coef_)
print('Ridge, dense: intercept: ', ridge.intercept_)
print()

reg = LinearRegression(fit_intercept=True).fit(x, y)
print('LinearRegression, dense: coef: ', reg.coef_)
print('LinearRegression, dense: intercept: ', reg.intercept_)
print()

xd = x.copy()
x = sparse.csr_matrix(x)
assert np.allclose(x.A, xd)

print('sparse:\n=======\n')

ridge = RidgeCV(fit_intercept=True).fit(x, y)
print('RidgeCV, gcv, sparse: alpha: ', ridge.alpha_)
print('RidgeCV, gcv, sparse: coef: ', ridge.coef_)
print('RidgeCV, gcv, sparse: intercept: ', ridge.intercept_)
print()

ridge = RidgeCV(cv=10, fit_intercept=True).fit(x, y)
print('RidgeCV, k-fold, sparse: alpha: ', ridge.alpha_)
print('RidgeCV, k-fold, sparse: coef: ', ridge.coef_)
print('RidgeCV, k-fold, sparse: intercept: ', ridge.intercept_)
print()

ridge = Ridge(alpha=1., fit_intercept=True).fit(x, y)
print('Ridge, sparse: coef: ', ridge.coef_)
print('Ridge, sparse: intercept: ', ridge.intercept_)
print()

ridge = Ridge(alpha=1., solver='sag', fit_intercept=True).fit(x, y)
print('Ridge, sag, sparse: coef: ', ridge.coef_)
print('Ridge, sag, sparse: intercept: ', ridge.intercept_)
print()

reg = LinearRegression(fit_intercept=True).fit(x, y)
print('LinearRegression, sparse: coef: ', reg.coef_)
print('LinearRegression, sparse: intercept: ', reg.intercept_)
print()

print('true coef: ', c)

print('scikit-learn version: ', sklearn.__version__)




produces this output for me:


dense:
=======

RidgeCV, gcv, dense: alpha:  0.1
RidgeCV, gcv, dense: coef:  [44.32951003 86.90409431 99.25893972 10.5733404  42.02057501]
RidgeCV, gcv, dense: intercept:  -4855.569697787837

RidgeCV, k-fold, dense: alpha:  0.1
RidgeCV, k-fold, dense: coef:  [44.32951003 86.90409431 99.25893972 10.5733404  42.02057501]
RidgeCV, k-fold, dense: intercept:  -4855.5696977877315

Ridge, dense: coef:  [43.89237086 86.25403124 98.23932835 10.6778846  41.50343272]
Ridge, dense: intercept:  -4811.296413488582

LinearRegression, dense: coef:  [44.37862835 86.97692638 99.37359873 10.56134858 42.0788127 ]
LinearRegression, dense: intercept:  -4860.541944266553

sparse:
=======

RidgeCV, gcv, sparse: alpha:  1.0
RidgeCV, gcv, sparse: coef:  [ 32.16135726  79.45461533  61.75569117 -33.16699471 -25.4457675 ]
RidgeCV, gcv, sparse: intercept:  -18.67194084012232

RidgeCV, k-fold, sparse: alpha:  10.0
RidgeCV, k-fold, sparse: coef:  [ 30.00723617  71.60350043  53.74750801 -26.59845034 -24.45253152]
RidgeCV, k-fold, sparse: intercept:  -18.723607701609872

Ridge, sparse: coef:  [ 32.03372303  77.51140523  58.86414654 -30.0355813  -25.88298155]
Ridge, sparse: intercept:  -18.7277843818323

Ridge, sag, sparse: coef:  [ 32.06900863  77.37481141  58.83314793 -29.94541578 -25.89642019]
Ridge, sag, sparse: intercept:  -18.726977098076624

LinearRegression, sparse: coef:  [44.37862835 86.97692638 99.37359873 10.56134858 42.0788127 ]
LinearRegression, sparse: intercept:  -4860.54194426655

true coef:  [45.70587613 85.71249175 97.99623263 11.73155642 42.37063535]
scikit-learn version:  0.21.dev0

I thought this issue had been fixed for Ridge by #5360, but it does not seem to
be the case?

@jeromedockes
Contributor Author

After some thought, it makes sense to refuse to use the SVD when X is sparse,
because it would create a dense matrix of the same size as X (the left singular
vectors). However, it would be useful to:

  • handle the intercept correctly in the "eigen" method when X is sparse (see
    the sketch below)
  • provide a clear warning message when using "eigen" if it may be inefficient,
    so that the user can decide whether they prefer to use "eigen" anyway, cast X
    to a dense matrix, or use some other form of cross-validation than GCV.

An attempt at those two things (work in progress) is in #13350.
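
As a rough illustration of the first point (this is a sketch, not the implementation in #13350): the Gram matrix of the column-centered X can be formed without densifying X, using only the column means, since X_c X_c^T = X X^T - (X mu) 1^T - 1 (X mu)^T + (mu · mu) 1 1^T.

import numpy as np
from scipy import sparse

def centered_gram(X_sp):
    """Gram matrix of the column-centered X, without forming X - mu densely."""
    mu = np.asarray(X_sp.mean(axis=0)).ravel()   # column means
    K = (X_sp @ X_sp.T).toarray()                # X X^T (dense, n x n)
    Xmu = X_sp @ mu                              # X mu, shape (n_samples,)
    return K - Xmu[:, None] - Xmu[None, :] + mu @ mu

# sanity check against explicit centering on a small dense example
rng = np.random.RandomState(0)
Xd = rng.rand(20, 5)
Xc = Xd - Xd.mean(axis=0)
assert np.allclose(centered_gram(sparse.csr_matrix(Xd)), Xc @ Xc.T)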

@jeromedockes
Contributor Author

Note that switching to 'eigen' when X is sparse and n_samples > n_features is not a good idea, as it allocates the Gram matrix, which is dense and (possibly much) larger than X.
