LinearRegression fits wrongly on csr sparse matrix with sample weights · Issue #19578 · scikit-learn/scikit-learn
Closed
@memeplex

Describe the bug

Fitting a model on a csr sparse matrix while passing sample_weight gives a very wrong fit.

This report is for LinearRegression, but I've been seeing wrong behavior in a similar setup with Ridge.

Steps/Code to Reproduce

Taken from a small sample of a real-life dataset; just copy and paste:

import numpy as np
import scipy as sp
import scipy.sparse  # explicitly load the submodule so sp.sparse is always available

from sklearn.linear_model import LinearRegression

X = np.array(  # plain ndarray; np.matrix is deprecated
       [[5.33558618e+00, 1.42240293e-03, 3.79195466e-07, 5.33558618e+00,
         1.42240293e-03, 3.79195466e-07, 5.33558618e+00, 1.42240293e-03,
         3.79195466e-07, 1.98842115e+00, 5.30089098e-04, 1.41315361e-07,
         1.98842115e+00, 5.30089098e-04, 1.41315361e-07, 1.98842115e+00,
         5.30089098e-04, 1.41315361e-07, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00],
        [4.02363338e+01, 2.14360316e-03, 1.14201123e-07, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 6.57402127e+00, 3.50233021e-04, 1.86587728e-08,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 4.02363338e+01, 2.14360316e-03,
         1.14201123e-07, 4.02363338e+01, 2.14360316e-03, 1.14201123e-07,
         6.57402127e+00, 3.50233021e-04, 1.86587728e-08, 6.57402127e+00,
         3.50233021e-04, 1.86587728e-08, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00],
        [4.73842584e+01, 6.78046657e-03, 9.70253169e-07, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 5.75883825e+00, 8.24062917e-04, 1.17919563e-07,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 4.73842584e+01, 6.78046657e-03,
         9.70253169e-07, 4.73842584e+01, 6.78046657e-03, 9.70253169e-07,
         5.75883825e+00, 8.24062917e-04, 1.17919563e-07, 5.75883825e+00,
         8.24062917e-04, 1.17919563e-07, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00],
        [6.73948024e+01, 1.51450045e-02, 3.40339540e-06, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 5.09430891e+00, 1.14479646e-03, 2.57259416e-07,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 6.73948024e+01, 1.51450045e-02,
         3.40339540e-06, 6.73948024e+01, 1.51450045e-02, 3.40339540e-06,
         5.09430891e+00, 1.14479646e-03, 2.57259416e-07, 5.09430891e+00,
         1.14479646e-03, 2.57259416e-07, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00],
        [2.20048734e+01, 4.36703062e-03, 8.66669673e-07, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 3.18056958e+00, 6.31207665e-04, 1.25267851e-07,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 2.20048734e+01, 4.36703062e-03,
         8.66669673e-07, 2.20048734e+01, 4.36703062e-03, 8.66669673e-07,
         3.18056958e+00, 6.31207665e-04, 1.25267851e-07, 3.18056958e+00,
         6.31207665e-04, 1.25267851e-07]])

y = np.array([ 0.        , -0.0003287 ,  0.        , -0.0011448 , -0.00063121])
w = np.array([50000.,  2500., 50000.,  2500.,  2500.])

lr = LinearRegression().fit(X, y, sample_weight=w)
print(
    "dense:",
    lr.score(X, y, sample_weight=w),
    lr.coef_[:5],
)

lr = LinearRegression().fit(X, y)
print(
    "dense (no weights):",
    lr.score(X, y),
    lr.coef_[:5],
)

# Same data in CSR format; the fits below should match the dense ones above.
X = sp.sparse.csr_matrix(X)
lr = LinearRegression().fit(X, y, sample_weight=w)
print(
    "sparse:",
    lr.score(X, y, sample_weight=w),
    lr.coef_[:5],
)

lr = LinearRegression().fit(X, y)
print(
    "sparse (no weights):",
    lr.score(X, y),
    lr.coef_[:5],
)

Expected Results

The weighted dense and sparse fits give the same score and coefficients.

The unweighted dense and sparse fits give the same score and coefficients.
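
To make "the same" concrete, here is a NumPy-only baseline for what a weighted fit with an intercept should compute. It relies on a standard identity: weighted least squares is equivalent to an ordinary least-squares fit on rows centered by the weighted means and scaled by sqrt(w). This is a sketch under that assumption, not scikit-learn's internal code; `weighted_lstsq` is a hypothetical helper, and the demo data is synthetic rather than the dataset above.

```python
import numpy as np


def weighted_lstsq(X, y, w):
    """Weighted least squares with an intercept (hypothetical reference helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    # Weighted means; they determine the intercept once the slopes are known.
    x_mean = np.average(X, axis=0, weights=w)
    y_mean = np.average(y, weights=w)
    # Scaling the centered rows by sqrt(w) turns the weighted problem
    # into an ordinary least-squares problem.
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(
        (X - x_mean) * sw[:, None], (y - y_mean) * sw, rcond=None
    )
    intercept = y_mean - x_mean @ coef
    return coef, intercept


# Synthetic, well-conditioned demo data (not the data from this report).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=20)
w_demo = rng.integers(1, 5, size=20).astype(float)
coef_demo, intercept_demo = weighted_lstsq(X_demo, y_demo, w_demo)
```

For integer weights the result should also match an unweighted fit on data where each row is repeated `w` times, which gives a second independent check.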

Actual Results

Weighted results differ. The one for the sparse matrix is clearly wrong.

Unweighted results are the same.
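
To put a number on "differ", the sketch below fits the same data as a dense array and as a csr matrix and returns the largest absolute difference between the two sets of predictions. `max_prediction_gap` is a hypothetical helper written for this report, and the demo data is synthetic. It compares predictions rather than coefficients because with more columns than rows (5 x 66 in the reproduction above) the coefficients need not be unique.

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import LinearRegression


def max_prediction_gap(X_dense, y, sample_weight=None):
    """Largest |dense prediction - sparse prediction| over the training rows."""
    X_dense = np.asarray(X_dense, dtype=float)
    dense_model = LinearRegression().fit(X_dense, y, sample_weight=sample_weight)
    sparse_model = LinearRegression().fit(
        sparse.csr_matrix(X_dense), y, sample_weight=sample_weight
    )
    # Both models are evaluated on the same dense rows so only the fit differs.
    diff = dense_model.predict(X_dense) - sparse_model.predict(X_dense)
    return float(np.max(np.abs(diff)))


# Synthetic demo data (not the data from this report).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 4))
y_demo = X_demo @ np.array([1.0, 2.0, -1.0, 0.5]) + 0.01 * rng.normal(size=30)
w_demo = rng.uniform(1.0, 10.0, size=30)

gap_unweighted = max_prediction_gap(X_demo, y_demo)
gap_weighted = max_prediction_gap(X_demo, y_demo, sample_weight=w_demo)
print("unweighted gap:", gap_unweighted)
print("weighted gap:", gap_weighted)
```

If the dense and sparse code paths agree, both gaps should be at the level of solver tolerance; a weighted gap that is much larger than the unweighted one points at the behavior described in this report.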

Versions

System:
    python: 3.8.6 (default, Nov 20 2020, 23:57:10)  [Clang 12.0.0 (clang-1200.0.32.27)]
executable: /Users/carlos/Base/Venvs/Default/bin/python3
   machine: macOS-11.2-x86_64-i386-64bit

Python dependencies:
          pip: 21.0.1
   setuptools: 49.2.1
      sklearn: 0.24.1
        numpy: 1.20.1
        scipy: 1.6.0
       Cython: 0.29.21
       pandas: 1.2.2
   matplotlib: 3.1.2
       joblib: 1.0.0
threadpoolctl: 2.1.0

Built with OpenMP: True
