Bug in sparse in Ridge with sample weights #15438
Comments
Same for |
Is anybody looking at this? It seems like linear regression not working for two years in the major python statistics package is not great. |
@murphyhopfensperger Can you still reproduce this issue with the latest version? |
Input:
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression

def Xyw(sparse=True):
    rng = np.random.default_rng(seed=0)
    n, d = 2 * 1_000, 2
    df = pd.DataFrame({"x1": rng.choice(["a", "b", "c"], size=n),})
    enc = OneHotEncoder(drop=["c"])
    X = enc.fit_transform(df)
    beta = np.array([2, 3])
    y = X @ beta + 10
    w = np.zeros(shape=n)
    w[: n // 2] = 1
    if not sparse:
        X = X.toarray()
    return {"X": X, "y": y, "sample_weight": w}

ols = LinearRegression()
ols.fit(**Xyw())
print(ols.coef_, ols.intercept_)
ols = LinearRegression()
ols.fit(**Xyw(sparse=False))
print(ols.coef_, ols.intercept_)

Output:
[0.38 1.38] 11.0449
[2. 3.] 10.000000000000004

Maybe I'm misunderstanding something (very possible!), but I thought the result should be the same whether or not |
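For reference, the expected answer can be cross-checked without scikit-learn. The following is a minimal NumPy sketch, not code from the thread: it builds a comparable one-hot design by hand (the variable names and the hand-rolled encoding are assumptions for illustration) and solves the weighted least-squares problem directly. With 0/1 weights, the weighted fit should match an ordinary fit on only the rows that have weight 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2_000

# Hand-rolled one-hot encoding of a 3-level category, dropping level "c"
# (an illustrative stand-in for the OneHotEncoder call in the report).
cats = rng.choice(["a", "b", "c"], size=n)
X = np.column_stack([cats == "a", cats == "b"]).astype(float)
y = X @ np.array([2.0, 3.0]) + 10.0

# Weight 1 on the first half of the rows, 0 on the second half.
w = np.zeros(n)
w[: n // 2] = 1.0

# Weighted least squares: append an intercept column, then scale every
# row of the design and the target by sqrt(weight).
A = np.column_stack([X, np.ones(n)])
sw = np.sqrt(w)
theta_weighted, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)

# Equivalent problem: ordinary least squares on the rows with nonzero weight.
keep = w > 0
theta_subset, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)

print("weighted fit (coef_a, coef_b, intercept):", theta_weighted)
print("subset fit   (coef_a, coef_b, intercept):", theta_subset)
```

Both fits should agree with each other and with the dense result reported above, which is what a sparse fit would also be expected to return.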
Yes, that certainly doesn't look good. If you are willing to investigate, could you check whether scikit-learn/sklearn/linear_model/_ridge.py Line 557 in 99f2dde produces the same output for the dense and sparse cases on your example? If not, the bug is there (but that's unlikely, since we would then have this issue for all linear models). Otherwise, it means the issue is with the solver. It would be interesting to know whether this happens for all Ridge solvers. For instance, we could try directly calling scikit-learn/sklearn/linear_model/_ridge.py Line 572 in 99f2dde or scikit-learn/sklearn/linear_model/_ridge.py Line 584 in 99f2dde on the rescaled dense/sparse data and checking the coefficients. |
I tried _rescale_data() to verify that it gives the same result for sparse and nonsparse.
X1, y1 = _rescale_data(**Xyw())
X1 = X1.toarray()
X2, y2 = _rescale_data(**Xyw(sparse=False))
np.array_equal(X1, X2), np.array_equal(y1, y2)
Gives the result
I probably won't test the others right away. |
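The same kind of check can be sketched without relying on a private scikit-learn helper, whose signature may differ between versions. In the sketch below, `rescale` is a hypothetical stand-in written for illustration (it is not `_rescale_data`), and it assumes SciPy is available. It scales each row of X and each entry of y by the square root of its sample weight, for dense and sparse input alike, and then compares the two results.

```python
import numpy as np
from scipy import sparse


def rescale(X, y, sample_weight):
    """Hypothetical helper: scale rows of X and entries of y by sqrt(weight)."""
    sw = np.sqrt(np.asarray(sample_weight, dtype=float))
    if sparse.issparse(X):
        # Left-multiplying by a diagonal matrix scales each row and keeps X sparse.
        X = sparse.diags(sw) @ X
    else:
        X = X * sw[:, None]
    return X, y * sw


rng = np.random.default_rng(0)
X_dense = rng.integers(0, 2, size=(6, 2)).astype(float)
X_sparse = sparse.csr_matrix(X_dense)
y = X_dense @ np.array([2.0, 3.0]) + 10.0
w = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])

Xd, yd = rescale(X_dense, y, w)
Xs, ys = rescale(X_sparse, y, w)

same_X = np.allclose(Xs.toarray(), Xd)
same_y = np.allclose(ys, yd)
print(same_X, same_y)
```

If the rescaled inputs agree, as they do in this sketch, any remaining dense/sparse discrepancy has to come from a later step such as centering or the solver.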
I can confirm the issue on at least LinearRegression and Ridge. It's probably wider since it comes from forgetting to scale the X_offset by the sample weights (on sparse, X is not actually centered, the centering is implicit). |
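To illustrate the kind of mistake described here, the following NumPy sketch mimics implicit centering on dense arrays. It is one reading of the comment above, not scikit-learn's actual code path; the names `X_ok` and `X_bad` are invented for the example. After rows are scaled by sqrt(weight), the centered design is sqrt(w) * (X - X_offset), so the offset has to be scaled row by row as well. Subtracting the unscaled offset gives a different matrix and therefore different coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.integers(0, 2, size=(n, 2)).astype(float)
beta_true, intercept_true = np.array([2.0, 3.0]), 10.0
y = X @ beta_true + intercept_true

# Non-trivial sample weights and their square roots.
w = rng.uniform(0.5, 2.0, size=n)
sw = np.sqrt(w)

# Weighted means play the role of the offsets used for centering.
X_offset = np.average(X, axis=0, weights=w)
y_offset = np.average(y, weights=w)
y_c = sw * (y - y_offset)

# Consistent version: the offset is scaled by sqrt(w) together with X.
X_ok = sw[:, None] * (X - X_offset)
coef_ok, *_ = np.linalg.lstsq(X_ok, y_c, rcond=None)
intercept_ok = y_offset - X_offset @ coef_ok

# Inconsistent version (illustrative): X is rescaled, the offset is not.
X_bad = sw[:, None] * X - X_offset
coef_bad, *_ = np.linalg.lstsq(X_bad, y_c, rcond=None)

print("offset scaled with the weights:", coef_ok, intercept_ok)
print("offset left unscaled:          ", coef_bad)
```

On dense input the centering is applied to X before rescaling, so the two steps stay consistent automatically; the mismatch can only arise when the offset is carried separately, as it is for sparse input.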
Jérémie, Thanks. |
@s-banach can you open a dedicated issue with a reproducible example and explaining what you'd expect ? |
Here is the issue: #22914 Thanks for your time. |
Description
Ridge with sample weights gives wrong results for sparse input.
Steps/Code to Reproduce
Expected Results
No AssertionError is thrown.
Actual Results
Last two assertion statements throw AssertionError.
Versions
System:
python: 3.7.2
sklearn: 0.22.dev0
commit 9caf835
Author: Thomas J Fan thomasjpfan@gmail.com
Date: Sun Oct 27 03:32:45 2019 -0400