-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
Failing CI for check_sample_weight_equivalence_on_dense_data with LinearRegerssion on debian_32bit #31098
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
The pip version is probably not the culprit. Maybe this is an openblas problem when running in 32 bit only on some CPU models? However, both detect the BLAS kernels optimized for the Nehalem architecture. |
Also, the magnitude of the violations in the failing
|
The two failures involve the same checks on |
Confirming the same issue on PR #31031 This PR only modifies the positive implementation of Ridge _lbfgs solver. It does not touch the base linear regression class. |
As a possible culprit, |
There was a reimplementation of |
I'm not familiar with the handling of float32 vs float64 arrays in scipy, but it seems that the cython nnls is written for float64 arrays ? |
How could I miss the change in scipy versions... |
Thanks very much for spotting the root cause @antoinebaker. Could you please try to craft a minimal reproducer for scipy that does not rely on scikit-learn? Meanwhile I will change our test suite to mark this case as XFAIL. |
That should not be a problem. The |
To reproduce locally, you can
then you can access your host folder under the Off course you have to reinstall all the build dependencies (gcc and co) using |
I wonder why this did not fail in the lockfile update PR though... EDIT: this is because we install |
@lesteve @antoinebaker actually the pyodide build is running on an older scipy release (scipy 1.14.1) than the https://github.com/scikit-learn/scikit-learn/actions/runs/14163691264/job/39673230652#step:3:1863 So it's very likely that this will crash once the pyodide distribution gets updated. So performing a root cause analysis / minimal reproducer to report upstream is likely useful. |
I confirm that we can reproduce the bug on a container with 32bit debian and scipy 1.15.2, no bug for scipy 1.14.1. Currently the test is close to our sample weight test import numpy as np
from scipy.optimize import nnls
def preprocess(X, y, sw):
if sw is None:
sw = np.ones(len(y))
X_mean = np.average(X, weights=sw, axis=0)
y_mean = np.average(y, weights=sw)
X_scaled = np.sqrt(sw)[:, np.newaxis] * (X - X_mean)
y_scaled = np.sqrt(sw) * (y - y_mean)
return X_scaled, y_scaled
def linreg(X, y, sw):
X_scaled, y_scaled = preprocess(X, y, sw)
return nnls(X_scaled, y_scaled)[0]
def test_linreg_sample_weight_equivalence():
rng = np.random.RandomState(42)
n_samples = 15
X = rng.rand(n_samples, n_samples * 2)
y = rng.randint(0, 3, size=n_samples)
sw = rng.randint(0, 5, size=n_samples)
# repeated
X_repeated = X.repeat(repeats=sw, axis=0)
y_repeated = y.repeat(repeats=sw)
# compare
coef_weighted = linreg(X, y, sw)
coef_repeated = linreg(X_repeated, y_repeated, None)
assert np.allclose(coef_weighted, coef_repeated) Should I try to come up with a simpler test for the reproducer ? |
Nice!
Ideally yes? Maybe there is a way to show that the output of Out of curiosity: I guess you are compiling scipy from source to get 1.14.1 inside the Docker 32 bit image? |
It also looks like NNLS has been rewritten in C very recently scipy/scipy#22524 so I guess you may want to build scipy from the |
I used venv to pip install specific versions of scipy # Use a Debian base image
FROM debian:trixie
# Update the package list
RUN apt update
# Install scipy dependencies
RUN apt install -y build-essential gcc g++ gfortran libopenblas-dev liblapack-dev pkg-config python3-pip python3-dev python3-venv
# Install specific version of scipy
RUN python3 -m venv opt/scipy1.14.1
RUN opt/scipy1.14.1/bin/pip install scipy==1.14.1
RUN python3 -m venv opt/scipy1.15.2
RUN opt/scipy1.15.2/bin/pip install scipy==1.15.2 |
Unfortunately the problem remains on scipy |
As you know, I have been rewriting a lot of F77 SciPy code in C (essentially from one evil to a more familiar evil) and calling up-to-date LAPACK routines instead of the ancient hand-rolled versions of these routines. However an immediate downside of this is that the problems that have been hiding in the dark is now coming out as I call more routines from C directly. One thing, I am noticing is that the So I am willing to bet on the LAPACK backend that is the culprit. But I can't figure out how to prove this claim. In the meantime I'll add your example to our test suite and see how it fares in our 32bit run. |
Nevermind we got hit too. So this is on me. Sorry for the noise. |
Hi again, just merged a fix for the reproducer and hoping that it also fixes the actual issue in the test suite. Could you please retry with the new main and let me know? |
Here is the last scheduled run (from 1 day ago) that passed:
https://dev.azure.com/scikit-learn/scikit-learn/_build/results?buildId=75127&view=logs&j=86340c1f-3d76-5202-0821-7817a0f52092&t=a73eff7b-829e-5a65-7648-23ff8e83ea2d
and here is a more recent run that failed (all CI is failing today):
https://dev.azure.com/scikit-learn/scikit-learn/_build/results?buildId=75179&view=logs&j=86340c1f-3d76-5202-0821-7817a0f52092&t=a73eff7b-829e-5a65-7648-23ff8e83ea2d
Full failure log:
Looking at the software runtime info of each I only see two differences:
All other dependencies seem to match, including the openblas version inspected by threadpoolctl.
EDIT: this is wrong, the scipy version is not the same and I missed it.
The text was updated successfully, but these errors were encountered: