Describe the bug
Fitting a model from a csr sparse matrix while passing sample_weight gives a very wrong fit. This report is for LinearRegression, but I've been seeing similarly wrong behavior with Ridge in the same setup.
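For context on what the weighted fit should compute: minimizing the weighted squared error sum_i w_i * (y_i - x_i @ beta)^2 is equivalent to ordinary least squares on rows scaled by sqrt(w_i), including the intercept column. A minimal numpy-only sketch of that equivalence (synthetic data, not from this report):

```python
import numpy as np

# Weighted least squares via row scaling: minimizing
#   sum_i w_i * (y_i - [x_i, 1] @ beta)^2
# equals ordinary least squares after scaling every row by sqrt(w_i).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=20)
w = rng.uniform(0.5, 5.0, size=20)

A = np.hstack([X, np.ones((20, 1))])   # explicit intercept column
sw = np.sqrt(w)[:, None]
beta_scaled, *_ = np.linalg.lstsq(A * sw, y * np.sqrt(w), rcond=None)

# Direct solve of the weighted normal equations for comparison
beta_direct = np.linalg.solve(A.T @ (w[:, None] * A), A.T @ (w * y))

assert np.allclose(beta_scaled, beta_direct)
```

Both the dense and the sparse code paths in the estimator should agree with this reference solution.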
Steps/Code to Reproduce
Taken from a small sample of a real life dataset, just copy&paste:
import numpy as np
import scipy as sp
from sklearn.linear_model import LinearRegression
X = np.array(
[[5.33558618e+00, 1.42240293e-03, 3.79195466e-07, 5.33558618e+00,
1.42240293e-03, 3.79195466e-07, 5.33558618e+00, 1.42240293e-03,
3.79195466e-07, 1.98842115e+00, 5.30089098e-04, 1.41315361e-07,
1.98842115e+00, 5.30089098e-04, 1.41315361e-07, 1.98842115e+00,
5.30089098e-04, 1.41315361e-07, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00],
[4.02363338e+01, 2.14360316e-03, 1.14201123e-07, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 6.57402127e+00, 3.50233021e-04, 1.86587728e-08,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 4.02363338e+01, 2.14360316e-03,
1.14201123e-07, 4.02363338e+01, 2.14360316e-03, 1.14201123e-07,
6.57402127e+00, 3.50233021e-04, 1.86587728e-08, 6.57402127e+00,
3.50233021e-04, 1.86587728e-08, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00],
[4.73842584e+01, 6.78046657e-03, 9.70253169e-07, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 5.75883825e+00, 8.24062917e-04, 1.17919563e-07,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 4.73842584e+01, 6.78046657e-03,
9.70253169e-07, 4.73842584e+01, 6.78046657e-03, 9.70253169e-07,
5.75883825e+00, 8.24062917e-04, 1.17919563e-07, 5.75883825e+00,
8.24062917e-04, 1.17919563e-07, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00],
[6.73948024e+01, 1.51450045e-02, 3.40339540e-06, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 5.09430891e+00, 1.14479646e-03, 2.57259416e-07,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 6.73948024e+01, 1.51450045e-02,
3.40339540e-06, 6.73948024e+01, 1.51450045e-02, 3.40339540e-06,
5.09430891e+00, 1.14479646e-03, 2.57259416e-07, 5.09430891e+00,
1.14479646e-03, 2.57259416e-07, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00],
[2.20048734e+01, 4.36703062e-03, 8.66669673e-07, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 3.18056958e+00, 6.31207665e-04, 1.25267851e-07,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 2.20048734e+01, 4.36703062e-03,
8.66669673e-07, 2.20048734e+01, 4.36703062e-03, 8.66669673e-07,
3.18056958e+00, 6.31207665e-04, 1.25267851e-07, 3.18056958e+00,
6.31207665e-04, 1.25267851e-07]])
y = np.array([ 0. , -0.0003287 , 0. , -0.0011448 , -0.00063121])
w = np.array([50000., 2500., 50000., 2500., 2500.])
lr = LinearRegression().fit(X, y, sample_weight=w)
print(
"dense:",
lr.score(X, y, sample_weight=w),
lr.coef_[:5],
)
lr = LinearRegression().fit(X, y)
print(
"dense (no weights):",
lr.score(X, y),
lr.coef_[:5],
)
X = sp.sparse.csr_matrix(X)
lr = LinearRegression().fit(X, y, sample_weight=w)
print(
"sparse:",
lr.score(X, y, sample_weight=w),
lr.coef_[:5],
)
lr = LinearRegression().fit(X, y)
print(
"sparse (no weights):",
lr.score(X, y),
lr.coef_[:5],
)
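As a stopgap until this is fixed, densifying the csr matrix before a weighted fit avoids the wrong coefficients. A sketch on small synthetic data (not the matrix above), cross-checked against a manual weighted-least-squares solve with an explicit intercept column:

```python
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = sp.random(30, 4, density=0.5, random_state=42, format="csr")
y = rng.normal(size=30)
w = rng.uniform(1.0, 100.0, size=30)

Xd = X.toarray()                       # workaround: fit on the dense array
lr = LinearRegression().fit(Xd, y, sample_weight=w)

# Reference: weighted normal equations with an explicit intercept column.
A = np.hstack([Xd, np.ones((30, 1))])
beta = np.linalg.solve(A.T @ (w[:, None] * A), A.T @ (w * y))

assert np.allclose(lr.coef_, beta[:-1])
assert np.allclose(lr.intercept_, beta[-1])
```

Densifying is of course only viable while the matrix fits in memory; for genuinely large sparse data the manual sqrt(w) row-scaling shown above (with fit_intercept=False and an explicit ones column) should give the same fit.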
Expected Results
The weighted results are identical for dense and sparse input, and so are the unweighted results.
Actual Results
The weighted results differ; the coefficients from the sparse matrix are clearly wrong. The unweighted results are identical.
Versions
System:
python: 3.8.6 (default, Nov 20 2020, 23:57:10) [Clang 12.0.0 (clang-1200.0.32.27)]
executable: /Users/carlos/Base/Venvs/Default/bin/python3
machine: macOS-11.2-x86_64-i386-64bit
Python dependencies:
pip: 21.0.1
setuptools: 49.2.1
sklearn: 0.24.1
numpy: 1.20.1
scipy: 1.6.0
Cython: 0.29.21
pandas: 1.2.2
matplotlib: 3.1.2
joblib: 1.0.0
threadpoolctl: 2.1.0
Built with OpenMP: True