
Inconsistency between C-contiguous and F-contiguous arrays #26493


Open · JeanLescutMuller opened this issue Jun 2, 2023 · 1 comment

JeanLescutMuller commented Jun 2, 2023

No consistency between C-contiguous and F-contiguous arrays for LinearRegression()

At least for LinearRegression(): in some edge cases (when X is almost singular), there is a huge difference between the predictions obtained from C-contiguous and F-contiguous arrays.

  • This is due to the fact that matrix products give different results for C-contiguous and F-contiguous arrays (cf. this Stack Overflow question that I posted); see the sketch after this list.
  • These "edge cases" can actually be quite common in time-series prediction, where many auto-regressive features can easily be correlated.
  • I would strongly advise converting all arrays to C-contiguous before fitting/predicting.
  • Please also note that fitting with F-contiguous vs. C-contiguous arrays can also give different results.
  • The worst part is not that this happens, but that no warning whatsoever is raised.
  • Also, F-contiguous arrays are extremely common when taking the values of a pandas DataFrame, which is what a lot of developers are doing in this context...
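
A minimal sketch (my addition, not part of the original report) of the layout sensitivity: the same matrix product, computed on value-identical C-ordered and F-ordered copies, is typically not bitwise identical, because BLAS takes different code paths depending on memory layout.

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(500, 500))

A_c = np.ascontiguousarray(A)    # C-contiguous (row-major) copy
A_f = np.asfortranarray(A)       # F-contiguous (column-major) copy
assert np.array_equal(A_c, A_f)  # identical values, different layout

G_c = A_c.T @ A_c
G_f = A_f.T @ A_f

# Typically small but non-zero; the exact value depends on the BLAS build:
print(np.max(np.abs(G_c - G_f)))

On a well-conditioned matrix these last-bit differences are harmless; near singularity, the downstream least-squares solve can amplify them into the macroscopic prediction differences reported below.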

Steps/Code to Reproduce

import numpy as np; print(np.__version__) # 1.23.5
import scipy; print(scipy.__version__) # 1.10.0
import sklearn as sk; print(sk.__version__) # 1.2.1

from sklearn.linear_model import LinearRegression
import pandas as pd

# Parameters 
seed, N_obs, N_feat, mu_x, sigma_x, mu_y, sigma_y = 0, 100, 1000, 100, 0.1, 100, 1

# 1) Creating a weird edge-case X, y:
np.random.seed(seed)
s = pd.Series(np.random.normal(mu_x, sigma_x, N_obs))
X = np.stack([s.ewm(com=com).mean() for com in np.arange(N_feat)]).T
y = np.random.normal(mu_y, sigma_y, N_obs)

# 2) Showing that the results differ between C-contiguous and F-contiguous arrays:
model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)
y_pred_c = model.predict(np.ascontiguousarray(X))

# Either just plot it and see:
import matplotlib.pyplot as plt
plt.scatter(y_pred, y_pred_c)

# Or look at the numbers:
np.var(y_pred)
np.var(y_pred - y_pred_c)
np.corrcoef(y_pred, y_pred_c)[0,1] # == 0.40295584536349216
# --> y_pred is extremely different from y_pred_c
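
As a quick diagnostic (my addition, assuming the X built by the script above), the condition number confirms that the EWM features are nearly collinear, which is what lets tiny layout-dependent rounding differences blow up in the least-squares solve:

# Hypothetical diagnostic, not in the original report: a huge condition
# number means the OLS solution is extremely sensitive to floating-point
# noise, including layout-dependent rounding in BLAS.
print(np.linalg.cond(X))  # astronomically large for this X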

Expected Results

We expect y_pred to be exactly equal to y_pred_c, or at least np.corrcoef(y_pred, y_pred_c)[0,1] > 0.99.

Actual Results

np.corrcoef(y_pred, y_pred_c)[0,1] # == 0.40295584536349216

  • y_pred and y_pred_c are totally different.

Versions

System:
    python: 3.10.9 (main, Mar  1 2023, 18:23:06) [GCC 11.2.0]
executable: /opt/anaconda3/bin/python
   machine: Linux-5.10.0-23-cloud-amd64-x86_64-with-glibc2.31

Python dependencies:
      sklearn: 1.2.1
          pip: 22.3.1
   setuptools: 65.6.3
        numpy: 1.23.5
        scipy: 1.10.0
       Cython: None
       pandas: 1.5.3
   matplotlib: 3.7.0
       joblib: 1.1.1
threadpoolctl: 2.2.0

Built with OpenMP: True

threadpoolctl info:
       filepath: /opt/anaconda3/lib/libmkl_rt.so.1
         prefix: libmkl_rt
       user_api: blas
   internal_api: mkl
        version: 2021.4-Product
    num_threads: 64
threading_layer: intel

       filepath: /opt/anaconda3/lib/libiomp5.so
         prefix: libiomp
       user_api: openmp
   internal_api: openmp
        version: None
    num_threads: 128

       filepath: /opt/anaconda3/lib/libgomp.so.1.0.0
         prefix: libgomp
       user_api: openmp
   internal_api: openmp
        version: None
    num_threads: 128
JeanLescutMuller added the Bug and Needs Triage labels on Jun 2, 2023
ogrisel (Member) commented Jun 16, 2023

This problem goes away if we use a more stable numerical solver, such as "lsqr", for instance via Ridge.

Since there is already a plan to allow LinearRegression to support different solvers and to make its default solver consistent with Ridge, I think this is the best way forward (unless you find cases where "lsqr" also fails).

See #22855 (comment) for more details on the context.
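
A minimal sketch of that suggestion (my reading of the comment, reusing X and y from the reproduction script; alpha=1e-6 is an assumption meant to approximate unpenalized OLS):

# Sketch of the workaround: Ridge with the iterative "lsqr" solver is far
# less sensitive to memory layout than LinearRegression's default solver.
from sklearn.linear_model import Ridge
import numpy as np

model = Ridge(alpha=1e-6, solver="lsqr")  # tiny alpha ~ plain OLS (assumption)
model.fit(X, y)
y_pred_f = model.predict(X)                        # X is F-contiguous here
y_pred_c = model.predict(np.ascontiguousarray(X))  # C-contiguous copy
print(np.corrcoef(y_pred_f, y_pred_c)[0, 1])       # expected to be ~1.0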

ogrisel removed the Needs Triage label on Jun 16, 2023