LinearRegression fits wrongly on csr sparse matrix with sample weights #19578
Closed
memeplex opened this issue Feb 26, 2021 · 6 comments · Fixed by #22891
Comments

@memeplex

Describe the bug

Fitting a model on a csr sparse matrix while passing sample_weight gives a very wrong fit.

This report is for LinearRegression, but I've been experiencing wrong behavior in a similar setup with Ridge.

Steps/Code to Reproduce

Taken from a small sample of a real-life dataset; just copy and paste:

import numpy as np
import scipy as sp
import scipy.sparse  # explicit import so sp.sparse is always available

from sklearn.linear_model import LinearRegression

X = np.array(  # a plain ndarray (np.matrix is deprecated)
       [[5.33558618e+00, 1.42240293e-03, 3.79195466e-07, 5.33558618e+00,
         1.42240293e-03, 3.79195466e-07, 5.33558618e+00, 1.42240293e-03,
         3.79195466e-07, 1.98842115e+00, 5.30089098e-04, 1.41315361e-07,
         1.98842115e+00, 5.30089098e-04, 1.41315361e-07, 1.98842115e+00,
         5.30089098e-04, 1.41315361e-07, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00],
        [4.02363338e+01, 2.14360316e-03, 1.14201123e-07, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 6.57402127e+00, 3.50233021e-04, 1.86587728e-08,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 4.02363338e+01, 2.14360316e-03,
         1.14201123e-07, 4.02363338e+01, 2.14360316e-03, 1.14201123e-07,
         6.57402127e+00, 3.50233021e-04, 1.86587728e-08, 6.57402127e+00,
         3.50233021e-04, 1.86587728e-08, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00],
        [4.73842584e+01, 6.78046657e-03, 9.70253169e-07, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 5.75883825e+00, 8.24062917e-04, 1.17919563e-07,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 4.73842584e+01, 6.78046657e-03,
         9.70253169e-07, 4.73842584e+01, 6.78046657e-03, 9.70253169e-07,
         5.75883825e+00, 8.24062917e-04, 1.17919563e-07, 5.75883825e+00,
         8.24062917e-04, 1.17919563e-07, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00],
        [6.73948024e+01, 1.51450045e-02, 3.40339540e-06, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 5.09430891e+00, 1.14479646e-03, 2.57259416e-07,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 6.73948024e+01, 1.51450045e-02,
         3.40339540e-06, 6.73948024e+01, 1.51450045e-02, 3.40339540e-06,
         5.09430891e+00, 1.14479646e-03, 2.57259416e-07, 5.09430891e+00,
         1.14479646e-03, 2.57259416e-07, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00],
        [2.20048734e+01, 4.36703062e-03, 8.66669673e-07, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 3.18056958e+00, 6.31207665e-04, 1.25267851e-07,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 2.20048734e+01, 4.36703062e-03,
         8.66669673e-07, 2.20048734e+01, 4.36703062e-03, 8.66669673e-07,
         3.18056958e+00, 6.31207665e-04, 1.25267851e-07, 3.18056958e+00,
         6.31207665e-04, 1.25267851e-07]])

y = np.array([ 0.        , -0.0003287 ,  0.        , -0.0011448 , -0.00063121])
w = np.array([50000.,  2500., 50000.,  2500.,  2500.])

lr = LinearRegression().fit(X, y, sample_weight=w)
print(
    "dense:",
    lr.score(X, y, sample_weight=w),
    lr.coef_[:5],
)

lr = LinearRegression().fit(X, y)
print(
    "dense (no weights):",
    lr.score(X, y),
    lr.coef_[:5],
)

X = sp.sparse.csr_matrix(X)
lr = LinearRegression().fit(X, y, sample_weight=w)
print(
    "sparse:",
    lr.score(X, y, sample_weight=w),
    lr.coef_[:5],
)

lr = LinearRegression().fit(X, y)
print(
    "sparse (no weights):",
    lr.score(X, y),
    lr.coef_[:5],
)

Expected Results

The weighted fits (dense and sparse) give the same result.

The unweighted fits (dense and sparse) give the same result.

Actual Results

Weighted results differ. The one for the sparse matrix is clearly wrong.

Unweighted results are the same.
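
For reference, the correct weighted solution can be cross-checked without using sample_weight at all. This is a sketch of mine, not part of the original report: weighted least squares is equivalent to plain OLS after rescaling each row of X and y by sqrt(w), with the intercept folded in as an explicit, equally rescaled column of ones. Its coefficients should match the dense weighted fit above.

# Reference fit via sqrt(w) rescaling: minimizing sum_i w_i*(y_i - x_i@b - c)^2
# equals OLS on sqrt(w_i)*[x_i, 1] against sqrt(w_i)*y_i with no intercept.
Xd = X.toarray() if sp.sparse.issparse(X) else np.asarray(X)
Xw = np.sqrt(w)[:, None] * np.hstack([Xd, np.ones((Xd.shape[0], 1))])
yw = np.sqrt(w) * y
ref = LinearRegression(fit_intercept=False).fit(Xw, yw)
print("reference:", ref.coef_[:5])  # should match the dense weighted coefficients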

Versions

System:
    python: 3.8.6 (default, Nov 20 2020, 23:57:10)  [Clang 12.0.0 (clang-1200.0.32.27)]
executable: /Users/carlos/Base/Venvs/Default/bin/python3
   machine: macOS-11.2-x86_64-i386-64bit

Python dependencies:
          pip: 21.0.1
   setuptools: 49.2.1
      sklearn: 0.24.1
        numpy: 1.20.1
        scipy: 1.6.0
       Cython: 0.29.21
       pandas: 1.2.2
   matplotlib: 3.1.2
       joblib: 1.0.0
threadpoolctl: 2.1.0

Built with OpenMP: True
@memeplex
Author

OTOH, here is a simple example that does work:

X = sp.sparse.csr_matrix(np.array([[1, 0]] * 100 + [[0, 1]] * 100))
y = np.array([1] * 80 + [0] * 20 + [1] * 50 + [0] * 50)
w = np.array([1] * 80 + [4] * 20 + [1] * 50 + [2] * 50)
LinearRegression(fit_intercept=False).fit(X, y, sample_weight=w).coef_

=> array([0.5       , 0.33333333])
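
As a sanity check of those values (my addition, not from the thread): with fit_intercept=False and non-overlapping indicator columns, weighted least squares decouples into a per-column weighted mean of y, which reproduces both coefficients.

# Coefficient j is the w-weighted mean of y over the rows where column j is 1.
print(np.average(y[:100], weights=w[:100]))  # 80 / 160 = 0.5
print(np.average(y[100:], weights=w[100:]))  # 50 / 150 = 0.3333...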

@jnothman
Member
jnothman commented Feb 27, 2021 via email

@memeplex
Author

No, it doesn't. Dividing the weights by 100 or 10000 still yields different results for the sparse/weighted case. Using an all-ones weight vector works fine, but as soon as I change some of the weights to, say, 2, the sparse/weighted case fails again.
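
For concreteness, here is the invariance being probed, as a sketch of mine reusing X, y, w from the reproduction above: for unregularized least squares, multiplying all sample weights by a positive constant should leave the fit unchanged.

# All three fits should produce identical coefficients; on the buggy
# sparse path they disagree with the correct dense result.
for scale in (1.0, 1 / 100, 1 / 10000):
    coef = LinearRegression().fit(X, y, sample_weight=w * scale).coef_
    print(scale, coef[:3])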

@memeplex
Author
memeplex commented Feb 27, 2021

Another fact to take into account is that the matrix X'X is very ill-conditioned, so if some kind of iterative solver is chosen for sparse input (I don't know whether that's the case), there are problems on both fronts: observations differ wildly in weight while features have very different "geometries", so it might be very difficult to accommodate a step schedule that converges.
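
(The ill-conditioning claim is easy to verify; this check is mine, reusing X from the reproduction above. X has only 5 rows and 66 columns, so X'X is rank-deficient by construction and its condition number is astronomically large.)

# Condition number of the normal-equations matrix X'X (66 x 66, rank <= 5).
Xd = X.toarray() if sp.sparse.issparse(X) else np.asarray(X)
print(np.linalg.cond(Xd.T @ Xd))  # enormous, i.e. numerically singular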

@cmarmo
Contributor
cmarmo commented Mar 5, 2021

Hi @memeplex, I believe your findings are similar to #15438, in particular as discussed in @lorentzenchr's comment.

@cmarmo cmarmo added Bug and removed Bug: triage labels Mar 5, 2021
@lorentzenchr
Member

If so, we can close?
