No consistency between C-contiguous and F-contiguous arrays for LinearRegression()
At least for LinearRegression(): in some edge cases (when X is almost singular), there is a huge difference between the predictions obtained from C-contiguous and F-contiguous arrays.
These "edge cases" can actually be quite common in time-series prediction, where many auto-regressive features are easily correlated.
I would strongly advise converting all arrays to C-contiguous before fitting or predicting.
Please also note that fitting with F-contiguous vs C-contiguous arrays can also give different results.
The worst part is not that this happens, but that no warning is raised whatsoever.
Also, F-contiguous arrays are extremely common in pandas DataFrames, which is what a lot of developers are using in this context...
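As a sketch of the defensive workaround suggested above (not something scikit-learn does for you), you can inspect an array's memory layout via its `flags` and force C order with `np.ascontiguousarray` before calling `fit`/`predict`; the data itself is unchanged, only the layout:

```python
import numpy as np

# An F-contiguous array, like the blocks backing many pandas DataFrames
X_f = np.asfortranarray(np.random.rand(100, 10))
print(X_f.flags['C_CONTIGUOUS'])  # False
print(X_f.flags['F_CONTIGUOUS'])  # True

# Force C order before fitting/predicting; this copies only if needed
X_c = np.ascontiguousarray(X_f)
assert X_c.flags['C_CONTIGUOUS']
assert np.array_equal(X_c, X_f)  # same values, different memory layout
```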
Steps/Code to Reproduce
```python
import numpy as np; print(np.__version__)    # 1.23.5
import scipy; print(scipy.__version__)       # 1.10.0
import sklearn as sk; print(sk.__version__)  # 1.2.1
from sklearn.linear_model import LinearRegression
import pandas as pd

# Parameters
seed, N_obs, N_feat, mu_x, sigma_x, mu_y, sigma_y = 0, 100, 1000, 100, 0.1, 100, 1

# 1) Creating a weird edge-case X, y:
np.random.seed(seed)
s = pd.Series(np.random.normal(mu_x, sigma_x, N_obs))
X = np.stack([s.ewm(com=com).mean() for com in np.arange(N_feat)]).T
y = np.random.normal(mu_y, sigma_y, N_obs)

# 2) Showing that the results differ for C-contiguous vs F-contiguous arrays:
model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)
y_pred_c = model.predict(np.ascontiguousarray(X))

# Either just plot it and see:
import matplotlib.pyplot as plt
plt.scatter(y_pred, y_pred_c)

# Or look at the data:
np.var(y_pred)
np.var(y_pred - y_pred_c)
np.corrcoef(y_pred, y_pred_c)[0, 1]  # == 0.40295584536349216
# --> y_pred EXTREMELY different from y_pred_c
```
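The EWM features above are nearly collinear, which is what makes the normal equations ill-conditioned. A minimal NumPy-only sketch (with hypothetical data, not the EWM features from the report) shows how a design matrix of almost-identical columns becomes numerically near-singular:

```python
import numpy as np

rng = np.random.RandomState(0)
base = rng.normal(100, 0.1, size=100)

# Columns that are tiny perturbations of each other -> nearly singular design
X = np.stack([base + 1e-9 * rng.normal(size=100) for _ in range(10)]).T

# The condition number is enormous, so least-squares solutions are very
# sensitive to rounding differences (e.g. from BLAS paths chosen per layout)
print(np.linalg.cond(X))
```

With a condition number this large, two mathematically equivalent computations that round differently can produce visibly different predictions, which is consistent with the C- vs F-contiguous discrepancy reported here.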
Expected Results
We expect y_pred to be fully equal to y_pred_c.
Or at least np.corrcoef(y_pred, y_pred_c)[0,1] > .99
This problem goes away if we use a more stable numerical solver, such as "lsqr", for instance via Ridge.
Since there is already a plan to allow LinearRegression to accept different solvers and to make its default solver consistent with Ridge, I think this is the best way forward (unless you find cases where "lsqr" also fails).
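A sketch of the suggested workaround, using scikit-learn's documented Ridge API: routing the fit through the iterative "lsqr" solver with a negligible penalty approximates ordinary least squares while being more numerically robust (well-conditioned random data here, not the edge case from the report):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = np.asfortranarray(rng.normal(size=(100, 20)))  # F-contiguous input
y = rng.normal(size=100)

# Tiny alpha + 'lsqr' solver: a near-OLS fit via a more stable solver
model = Ridge(alpha=1e-8, solver='lsqr')
model.fit(X, y)

y_pred_f = model.predict(X)
y_pred_c = model.predict(np.ascontiguousarray(X))
print(np.allclose(y_pred_f, y_pred_c))
```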
Actual Results
np.corrcoef(y_pred, y_pred_c)[0,1] # == 0.40295584536349216
Versions
See the version prints at the top of the reproduction script: numpy 1.23.5, scipy 1.10.0, scikit-learn 1.2.1.