
Inconsistency between C-contiguous and F-contiguous arrays #26493


Open · JeanLescutMuller opened this issue Jun 2, 2023 · 1 comment

JeanLescutMuller commented Jun 2, 2023

No consistency between C-contiguous and F-contiguous arrays for LinearRegression()

At least for LinearRegression(): in some edge cases (when X is almost singular), there is a huge difference between the predictions obtained from C-contiguous and F-contiguous arrays.

  • This is due to the fact that matrix products give different results for C-contiguous and F-contiguous arrays (cf. this Stack Overflow question that I posted); see the sketch after this list.
  • These "edge cases" can actually be quite common in time-series prediction, where many auto-regressive features can easily be correlated.
  • I would strongly advise converting all arrays to C-contiguous before fitting/predicting.
  • Please also note that fitting with F-contiguous vs. C-contiguous arrays can also give different results.
  • The worst part is not that this happens, but that no warning whatsoever is raised.
  • Also, F-contiguous arrays are extremely common when taking the values of a pandas DataFrame, which is what a lot of developers are doing in this context...
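
A minimal sketch (my addition, not part of the original report) of the layout sensitivity: the same matrix product, computed on value-identical C-ordered and F-ordered copies, is typically not bitwise identical, because BLAS takes different code paths depending on memory layout.

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(500, 500))

A_c = np.ascontiguousarray(A)    # C-contiguous (row-major) copy
A_f = np.asfortranarray(A)       # F-contiguous (column-major) copy
assert np.array_equal(A_c, A_f)  # identical values, different layout

G_c = A_c.T @ A_c
G_f = A_f.T @ A_f

# Typically small but non-zero; the exact value depends on the BLAS build:
print(np.max(np.abs(G_c - G_f)))

On a well-conditioned matrix these last-bit differences are harmless; near singularity, the downstream least-squares solve can amplify them into the macroscopic prediction differences reported below.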

Steps/Code to Reproduce

import numpy as np; print(np.__version__) # 1.23.5
import scipy; print(scipy.__version__) # 1.10.0
import sklearn as sk; print(sk.__version__) # 1.2.1

from sklearn.linear_model import LinearRegression
import pandas as pd

# Parameters 
seed, N_obs, N_feat, mu_x, sigma_x, mu_y, sigma_y = 0, 100, 1000, 100, 0.1, 100, 1

# 1) Creating a weird edge-case X, y:
np.random.seed(seed)
s = pd.Series(np.random.normal(mu_x, sigma_x, N_obs))
X = np.stack([s.ewm(com=com).mean() for com in np.arange(N_feat)]).T
y = np.random.normal(mu_y, sigma_y, N_obs)

# 2) Showing that the results differ between C-contiguous and F-contiguous arrays:
model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)
y_pred_c = model.predict(np.ascontiguousarray(X))

# Either just plot it and see:
import matplotlib.pyplot as plt
plt.scatter(y_pred, y_pred_c)

# Or look at the numbers:
np.var(y_pred)
np.var(y_pred - y_pred_c)
np.corrcoef(y_pred, y_pred_c)[0,1] # == 0.40295584536349216
# --> y_pred is extremely different from y_pred_c
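
As a quick diagnostic (my addition, assuming the X built by the script above), the condition number confirms that the EWM features are nearly collinear, which is what lets tiny layout-dependent rounding differences blow up in the least-squares solve:

# Hypothetical diagnostic, not in the original report: a huge condition
# number means the OLS solution is extremely sensitive to floating-point
# noise, including layout-dependent rounding in BLAS.
print(np.linalg.cond(X))  # astronomically large for this X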

Expected Results

We expect y_pred to be exactly equal to y_pred_c, or at least np.corrcoef(y_pred, y_pred_c)[0,1] > 0.99.

Actual Results

np.corrcoef(y_pred, y_pred_c)[0,1] # == 0.40295584536349216

  • y_pred and y_pred_c are totally different.

Versions

System:
    python: 3.10.9 (main, Mar  1 2023, 18:23:06) [GCC 11.2.0]
executable: /opt/anaconda3/bin/python
   machine: Linux-5.10.0-23-cloud-amd64-x86_64-with-glibc2.31

Python dependencies:
      sklearn: 1.2.1
          pip: 22.3.1
   setuptools: 65.6.3
        numpy: 1.23.5
        scipy: 1.10.0
       Cython: None
       pandas: 1.5.3
   matplotlib: 3.7.0
       joblib: 1.1.1
threadpoolctl: 2.2.0

Built with OpenMP: True

threadpoolctl info:
       filepath: /opt/anaconda3/lib/libmkl_rt.so.1
         prefix: libmkl_rt
       user_api: blas
   internal_api: mkl
        version: 2021.4-Product
    num_threads: 64
threading_layer: intel

       filepath: /opt/anaconda3/lib/libiomp5.so
         prefix: libiomp
       user_api: openmp
   internal_api: openmp
        version: None
    num_threads: 128

       filepath: /opt/anaconda3/lib/libgomp.so.1.0.0
         prefix: libgomp
       user_api: openmp
   internal_api: openmp
        version: None
    num_threads: 128
JeanLescutMuller added the Bug and Needs Triage labels on Jun 2, 2023
ogrisel (Member) commented Jun 16, 2023

This problem goes away if we use a more stable numerical solver, such as "lsqr", for instance via Ridge.

Since there is already a plan to allow LinearRegression to support different solvers and to make its default solver consistent with Ridge, I think this is the best way forward (unless you find cases where "lsqr" also fails).

See #22855 (comment) for more details on the context.
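
A minimal sketch of that suggestion (my reading of the comment, reusing X and y from the reproduction script; alpha=1e-6 is an assumption meant to approximate unpenalized OLS):

# Sketch of the workaround: Ridge with the iterative "lsqr" solver is far
# less sensitive to memory layout than LinearRegression's default solver.
from sklearn.linear_model import Ridge
import numpy as np

model = Ridge(alpha=1e-6, solver="lsqr")  # tiny alpha ~ plain OLS (assumption)
model.fit(X, y)
y_pred_f = model.predict(X)                        # X is F-contiguous here
y_pred_c = model.predict(np.ascontiguousarray(X))  # C-contiguous copy
print(np.corrcoef(y_pred_f, y_pred_c)[0, 1])       # expected to be ~1.0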

ogrisel removed the Needs Triage label on Jun 16, 2023