Inconsistent Results For Logistic Regressions across multiple computers #24615
Comments
Could you set the
@glemaitre Of course - I should've included that in the code example. Setting
Could you try with a smaller
I decreased it. The number of iterations is also not the same, and when it is the same, the results are different. When the regularization increases, this behaviour disappears. Note that this behaviour is also present with
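As a hedged illustration of the regularization effect described above (synthetic data and parameter values chosen arbitrarily, not taken from the issue): with strong regularization (small C) the optimum is well conditioned and runs tend to agree, while with very weak regularization (huge C) tiny numerical differences can shift both the iteration count and the final coefficients.

```python
# Illustrative sketch only (not the reporter's script): see how the iteration
# count and fitted coefficients react to the regularization strength C.
# With C=1e42 the solver may not fully converge and can emit a
# ConvergenceWarning, which is part of the point being made.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=5, random_state=0)

for C in (1e-2, 1.0, 1e4, 1e42):
    clf = LogisticRegression(C=C, solver="lbfgs", max_iter=10_000).fit(X, y)
    print(f"C={C:g}  n_iter={clf.n_iter_[0]}  coef={clf.coef_.ravel()}")
```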
What happens with C=np.inf instead of C=1e42?
It doesn't converge :) With LBFGS, since you have verbose output enabled, you can see small numerical differences (in terms of objective and gradient) at each iteration that lead the coefficients to diverge after a while.
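A minimal sketch of how such a per-iteration comparison could be set up (synthetic data; how much the solver prints per iteration depends on the scikit-learn/SciPy versions):

```python
# Sketch: enable verbose output so that SciPy's L-BFGS-B reports the objective
# and gradient norm as it iterates; comparing these printouts across machines
# shows where the two runs start to drift apart.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1_000, n_features=10, random_state=0)

clf = LogisticRegression(C=1e42, solver="lbfgs", max_iter=10_000, verbose=2)
clf.fit(X, y)  # per-iteration diagnostics go to stdout
print(clf.n_iter_, clf.coef_)
```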
@NickBrecht, since liblinear approximates the weights that minimize the cost function, getting different results is possible due to small round-off errors at the time of convergence. It also depends on
In the case of TensorFlow, we may use the logits directly in the cost function to avoid this type of round-off error.
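To illustrate the "use logits directly" point with a toy example (my own illustration, not from the thread): computing the binary log-loss from probabilities can blow up to infinity for large-magnitude logits, while the logits-based formulation stays finite.

```python
# Illustration only: naive log-loss from probabilities vs. a numerically
# stable logits-based form (max(z,0) - z*y + log(1 + exp(-|z|))).
import numpy as np
from scipy.special import expit  # sigmoid

z = np.array([-40.0, -5.0, 0.0, 5.0, 40.0])   # logits
y = np.array([1.0, 0.0, 1.0, 1.0, 0.0])       # labels

p = expit(z)
# For z=40, p rounds to exactly 1.0 in float64, so log1p(-p) = log(0) = -inf
# and the naive loss becomes inf (with a RuntimeWarning).
naive = -(y * np.log(p) + (1 - y) * np.log1p(-p))
stable = np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z)))

print(naive)
print(stable)
```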
@MujassimJamal thanks for the input. I'm still incredibly suspicious that differences in approximated weights and rounding could yield the nontrivial differences in the numbers above. If we saw a difference in the hundred-thousandths then sure, but a 0.1-0.2+ difference seems suspect. Do you feel otherwise? Iterations/learning rate were accounted for in many, many tests (and in @glemaitre's test, seemingly). I attempted to account for computational environments as stated in my original post. I'm aware that differences in the actual CPU architecture could yield some infinitesimally small differences, but these were the same conda, Python, BLAS, MKL, etc.
I can't test for myself right now. Let's exclude collinearity and perfect separability:
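A rough sketch of checks along those lines (the original snippet is not captured here; the dataset below is a stand-in):

```python
# Sketch: look for collinearity via the rank/condition number of the design
# matrix, and for perfect separability by testing whether an (almost)
# unregularized fit classifies every training sample correctly.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def check_collinearity(X):
    rank = np.linalg.matrix_rank(X)
    cond = np.linalg.cond(X)
    print(f"rank={rank} of {X.shape[1]} columns, condition number={cond:.3e}")

def check_separability(X, y):
    # A huge C means essentially no regularization; 100% training accuracy
    # suggests the classes may be (quasi-)separable.
    clf = LogisticRegression(C=1e12, max_iter=100_000).fit(X, y)
    acc = clf.score(X, y)
    print(f"training accuracy with ~no regularization: {acc:.4f}")

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
check_collinearity(X)
check_separability(X, y)
```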
Maybe the problem/target is separable, see also #18264.
Describe the bug
Hey all -
I'm teaching some students about logistic regression and noticed that different computers can produce slightly different intercepts/coefficients. At first I thought it was environment differences, but I have been able to reproduce the variance even after accounting for the various packages/interpreters.
I can create a new environment on computer 1 (Windows 10, Intel 8th-gen CPU), computer 2 (Windows 10, Intel 11th-gen CPU), and a coworker's M1 MacBook - all three produce different results. I thought minute differences in NumPy's OpenBLAS or MKL could be the culprit, but that yielded the same varying results. I've tried not splitting the data, different random states, different C values, different solvers...
For the purposes of reproducibility, I ran this exact code and uploaded the data to Google Sheets:
All versions of all other packages are held constant. I received these results across the three computers:
I have tried many other things that all produce slightly different results not captured in the table above. Multicollinearity is present in the data, but I would still expect consistent results.
I understand the results are effectively the same and this has little real-world impact. However, when a student is comparing results and sees 12.4 vs 12.6, that is a decently large difference. I think some of these differences are too large to simply be attributed to floating-point computations -- especially given the size of the data. I assume I'm missing something, but I'm reaching the end of my rope on troubleshooting. Thoughts / expected behavior?
Steps/Code to Reproduce
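The exact script and the Google Sheets data are not captured here. A hedged sketch of the kind of fit described above, with placeholder file and column names:

```python
# Placeholder reproduction sketch: "data.csv" and the "target" column stand in
# for the dataset shared via Google Sheets; settings are assumptions, not the
# reporter's exact configuration.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("data.csv")
X, y = df.drop(columns=["target"]), df["target"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = LogisticRegression(solver="lbfgs", max_iter=10_000).fit(X_train, y_train)
print("intercept:", clf.intercept_)
print("coefs:", clf.coef_)
```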
Expected Results
Actual Results
n/a
Versions