PoissonRegressor lbfgs solver giving coefficients of 0 and Runtime Warning #27016

Open
akaashp2000 opened this issue Aug 4, 2023 · 16 comments · May be fixed by #27332 or #29681

Comments

@akaashp2000
akaashp2000 commented Aug 4, 2023

Describe the bug

See the following Stack Exchange post (the solution to my original issue was to use the newton-cholesky solver).

When fitting a Poisson regression (without regularization) to some dummy data I encounter:

  • with the lbfgs solver, a RuntimeWarning, a non-zero intercept, and all coefficients equal to zero
  • with the newton-cholesky solver, coefficients as expected

Some people on Stack Exchange have mentioned it is worth submitting an issue (a similar one was faced with LogisticRegression).

Steps/Code to Reproduce

import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd
from sklearn.linear_model import PoissonRegressor
from sklearn.preprocessing import (
    OneHotEncoder,
)
data = sm.datasets.get_rdataset('Insurance', package='MASS').data
# Fit Poisson regression using formula interface
formula = "Claims ~ C(District, Treatment(1)) + C(Group, Treatment('<1l')) + C(Age, Treatment('<25')) + Holders"
model_smf = smf.poisson(formula=formula, data=data).fit()
print(type(model_smf))
print(model_smf.summary())
# one-hot encode the categorical columns with sklearn's OneHotEncoder,
# dropping the baseline category of each
X_train_ohe = OneHotEncoder(sparse_output=False, drop=[1, "<1l", "<25"]).fit(data[["District", "Group", "Age"]])
X_train_ohe = pd.DataFrame(X_train_ohe.transform(data[["District", "Group", "Age"]]), columns=X_train_ohe.get_feature_names_out())

X_train = pd.concat([X_train_ohe, data[["Holders"]]], axis=1)
y_train = data["Claims"]

# with the (default) lbfgs solver
model_sklearn_lbfgs = PoissonRegressor(alpha=0).fit(X_train, y_train)

print(model_sklearn_lbfgs.intercept_)
print(model_sklearn_lbfgs.coef_)
# with newton-cholesky solver

model_sklearn_nc = PoissonRegressor(alpha=0, solver='newton-cholesky').fit(X_train, y_train)

print(model_sklearn_nc.intercept_)
print(model_sklearn_nc.coef_)

Expected Results

Expected intercept/coefficients (from statsmodels and sklearn with newton-cholesky solver):

2.8456859090224444
[-4.47436243e-01 -9.30629212e-01 -1.46547942e+00  1.00603354e+00
  4.71328611e-01 -5.98213625e-01  5.69685530e-01  6.85301286e-01
  2.22504699e+00 -1.58161172e-05]

Actual Results

Result with lbfgs:

3.896592058397603
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
/home/akaash/Downloads/statsmodels-play/venv/lib/python3.10/site-packages/sklearn/linear_model/_linear_loss.py:290: RuntimeWarning: invalid value encountered in matmul
  grad[:n_features] = X.T @ grad_pointwise + l2_reg_strength * weights

Versions

System:
    python: 3.10.9 (main, Mar  1 2023, 18:23:06) [GCC 11.2.0]
executable: /home/akaash/Downloads/statsmodels-play/venv/bin/python
   machine: Linux-5.19.0-46-generic-x86_64-with-glibc2.35

Python dependencies:
      sklearn: 1.3.0
          pip: 23.2.1
   setuptools: 68.0.0
        numpy: 1.25.1
        scipy: 1.11.1
       Cython: None
       pandas: 2.0.3
   matplotlib: 3.7.2
       joblib: 1.3.1
threadpoolctl: 3.2.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 6
         prefix: libopenblas
       filepath: /home/akaash/Downloads/statsmodels-play/venv/lib/python3.10/site-packages/numpy.libs/libopenblas64_p-r0-7a851222.3.23.so
        version: 0.3.23
threading_layer: pthreads
   architecture: Haswell

       user_api: blas
   internal_api: openblas
    num_threads: 6
         prefix: libopenblas
       filepath: /home/akaash/Downloads/statsmodels-play/venv/lib/python3.10/site-packages/scipy.libs/libopenblasp-r0-23e5df77.3.21.dev.so
        version: 0.3.21.dev
threading_layer: pthreads
   architecture: Haswell

       user_api: openmp
   internal_api: openmp
    num_threads: 6
         prefix: libgomp
       filepath: /home/akaash/Downloads/statsmodels-play/venv/lib/python3.10/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
        version: None
akaashp2000 added the Bug and Needs Triage labels on Aug 4, 2023
@Micky774
Contributor

@lorentzenchr wondering if you may have some insight on this

@akaashp2000
Author
akaashp2000 commented Aug 11, 2023

Quick update on this:

It seems the issue is not scikit-learn specific, but rather specific to the lbfgs solver. Even with statsmodels, fitting a Poisson GLM using lbfgs gives errors:

# Fit Poisson regression using formula interface WITH lbfgs solver
import statsmodels.api as sm
import statsmodels.formula.api as smf

data = sm.datasets.get_rdataset('Insurance', package='MASS').data
formula = "Claims ~ C(District, Treatment(1)) + C(Group, Treatment('<1l')) + C(Age, Treatment('<25')) + Holders"

model_smf_lbfgs = smf.poisson(formula=formula, data=data).fit(method="lbfgs")
print(type(model_smf_lbfgs))
print(model_smf_lbfgs.summary())

gives the following output and warning:

At X0         0 variables are exactly at the bounds

At iterate    0    f=  3.19136D+01    |proj g|=  9.62165D+04

At iterate    1    f=  1.80465D+01    |proj g|=  6.63668D+03

At iterate    2    f=  1.79172D+01    |proj g|=  3.18678D+03

At iterate    3    f=  1.78811D+01    |proj g|=  2.51280D+02

At iterate    4    f=  1.78809D+01    |proj g|=  2.15869D+01

At iterate    5    f=  1.78809D+01    |proj g|=  2.15982D+01

At iterate    6    f=  1.78807D+01    |proj g|=  1.44105D+02

At iterate    7    f=  1.78802D+01    |proj g|=  3.22683D+02
This problem is unconstrained.
ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "

@lorentzenchr
Member

@akaashp2000 Could you run it with verbose=3 or higher? It seems like the coefficients get huge, nan or inf.

@akaashp2000
Author

@lorentzenchr sorry for the late response on this.

With the same X_train and y_train as in the original post, I ran with verbose = 10

Code:

# one-hot encode the categorical columns, and drop the baseline column
# with lbfgs solver

verbose = 10

model_sklearn_lbfgs_verbose = PoissonRegressor(alpha=0, verbose=verbose).fit(X_train, y_train)

print("verbose", verbose)
print(model_sklearn_lbfgs_verbose.intercept_)
print(model_sklearn_lbfgs_verbose.coef_)

Output:

RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =           11     M =           10

At X0         0 variables are exactly at the bounds

At iterate    0    f= -1.42612D+02    |proj g|=  4.30046D+04
verbose 10
3.896592058397603
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
   11      1      3      1     0     0   4.300D+04  -1.426D+02
  F =  -142.61189962516949     

CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH             

For different values of verbose (I tried 3 and 50 as well) the output was identical, aside from the line produced by print("verbose", verbose).

@ogrisel
Member
ogrisel commented Aug 25, 2023

Could you print the value of grad_pointwise (by editing /home/akaash/Downloads/statsmodels-play/venv/lib/python3.10/site-packages/sklearn/linear_model/_linear_loss.py:290) at each iteration?
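A quick way to do that (a debugging sketch, assuming you edit your local copy of _linear_loss.py; not part of the library) is a temporary print just before the matmul on that line:

# temporary debugging edit inside the local copy of
# sklearn/linear_model/_linear_loss.py, just before line 290
print("grad_pointwise:", grad_pointwise, "all finite:", np.all(np.isfinite(grad_pointwise)))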

@ogrisel
Member
ogrisel commented Aug 25, 2023

I tried the reproducer with verbose=100 and I get a different output:

>>> model_sklearn_lbfgs = PoissonRegressor(alpha=0, verbose=100).fit(X_train, y_train)
RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =           11     M =           10
 This problem is unconstrained.

At X0         0 variables are exactly at the bounds

At iterate    0    f= -1.42612D+02    |proj g|=  4.30046D+04


ITERATION     1

---------------- CAUCHY entered-------------------
 There are            0   breakpoints 

 GCP found in this segment
Piece      1 --f1, f2 at start point  -1.8494D+09  1.8494D+09
Distance to the stationary point =   1.0000D+00

---------------- exit CAUCHY----------------------

          11  variables are free at GCP            1
/Users/ogrisel/mambaforge/envs/tmp/lib/python3.11/site-packages/sklearn/linear_model/_linear_loss.py:290: RuntimeWarning: invalid value encountered in matmul
  grad[:n_features] = X.T @ grad_pointwise + l2_reg_strength * weights
 LINE SEARCH           1  times; norm of step =    0.0000000000000000     

At iterate    1    f= -1.42612D+02    |proj g|=  4.30046D+04

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
   11      1      3      1     0     0   4.300D+04  -1.426D+02
  F =  -142.61189962516949     

CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH

Note that the warning happens after the CAUCHY subroutine.

I printed the grad_pointwise values and there are indeed infinite values there.

ogrisel removed the Needs Triage label on Aug 25, 2023
@ogrisel
Member
ogrisel commented Aug 25, 2023

I have tried changing the penalty (alpha) and we still get non-finite gradients.

Not sure what the fix would be, but we could at least raise an error or a ConvergenceWarning with a more user-friendly message if intermediate computations produce non-finite gradient values.

@lorentzenchr
Copy link
Member

I suspect that the column "Holders", with a max value of 3e3, produces an overflow when exponentiated. If that's true, there is not much we can do about it.
A user can

  • use a pipeline with a StandardScaler (see the sketch below)
  • use a different solver
  • use a different package like glum (which defaults to a Newton-Cholesky-like solver)
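A minimal sketch of the first option, reusing X_train and y_train from the reproducer above and scaling only the "Holders" column via a ColumnTransformer (illustrative only, not a tested fix):

from sklearn.compose import ColumnTransformer
from sklearn.linear_model import PoissonRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# standardize the numeric "Holders" column, pass the one-hot columns through unchanged
scale_holders = ColumnTransformer(
    [("scale", StandardScaler(), ["Holders"])],
    remainder="passthrough",
)
model = make_pipeline(scale_holders, PoissonRegressor(alpha=0))
model.fit(X_train, y_train)
print(model[-1].intercept_, model[-1].coef_)

Note that the ColumnTransformer moves the scaled "Holders" column to the front, and its coefficient is on the standardized scale.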

@ogrisel
Member
ogrisel commented Aug 25, 2023

We could detect the non-finite gradient values and issue a ConvergenceWarning suggesting to scale the features (and target?) and/or to try alternative solvers.
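A rough sketch of that idea as a standalone helper (hypothetical naming; where exactly it would be called inside LinearModelLoss is still open):

import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning

def _warn_if_nonfinite_gradient(grad_pointwise):
    # hypothetical helper: warn when the pointwise gradient contains nan/inf
    if not np.all(np.isfinite(grad_pointwise)):
        warnings.warn(
            "Non-finite values encountered in the gradient. Consider scaling"
            " the features and/or the target, or using a different solver"
            " such as 'newton-cholesky'.",
            ConvergenceWarning,
        )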

@akaashp2000
Author

@ogrisel

  • yes, I am in fact getting the same output with verbose = 100 as you
  • ... and the same issue of infinite values for grad_pointwise

Following your suggestion to print the value at each iteration, I added a line print(grad_pointwise) at line 290 of _linear_loss.py. Without any verbose value specified, the first value printed had reasonable coordinates, the second had very large/infinite coordinates, the third reverted to the first value of grad_pointwise, and then the intercept (3.896...) and coefficients (all zeroes) were printed.

@lorentzenchr

I have now tried fitting the GLM with lbfgs on the same X_train, but with the "Holders" column scaled, i.e.

X_train_scaled = X_train.copy()
X_train_scaled["Holders"] = X_train_scaled["Holders"]/X_train_scaled["Holders"].max()

and it does indeed converge now, to the same coefficients as solver='newton-cholesky' (aside from the 'Holders' coefficient, which is rescaled because the feature was rescaled). Scaling the feature works for both statsmodels and scikit-learn.
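For reference, the "Holders" coefficient from the scaled fit can be mapped back to the original feature scale by dividing by the scaling factor (a sketch, assuming X_train_scaled and y_train from above):

model_scaled = PoissonRegressor(alpha=0).fit(X_train_scaled, y_train)
holders_scale = X_train["Holders"].max()
# "Holders" is the last column, so undo the feature scaling on its coefficient
print(model_scaled.coef_[-1] / holders_scale)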

@akaashp2000
Author

I would be happy to help by adding a ConvergenceWarning and/or a suggestion to scale features / use another solver, as @ogrisel suggested, should any infinite values occur in grad_pointwise.

My initial thought is that this could be done in the LinearModelLoss class's loss_gradient, immediately after the gradient is computed. What are your thoughts, or is there a better approach?

@lorentzenchr
Member

An explicit check for finiteness could impact the fit time. Catching the RuntimeWarning could be a way to go.
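A sketch of that approach (variable names as in _linear_loss.py around line 290; illustrative only, not a tested patch) would be to escalate the specific RuntimeWarning into an exception via the warnings filter and rewrap it:

import warnings

with warnings.catch_warnings():
    # turn e.g. "invalid value encountered in matmul" into an exception
    warnings.simplefilter("error", RuntimeWarning)
    try:
        grad[:n_features] = X.T @ grad_pointwise + l2_reg_strength * weights
    except RuntimeWarning:
        raise ValueError(
            "Non-finite values encountered in the gradient. Consider scaling"
            " the features and/or the target, or using a different solver."
        ) from None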

@ogrisel
Member
ogrisel commented Aug 26, 2023

We could indeed force numpy to raise an exception during the matmul and rewrap it with a ValueError with a more informative error message.

with np.errstate(all='raise'):
    # do the matmul here
    ...

https://numpy.org/doc/stable/reference/generated/numpy.errstate.html#numpy.errstate

@akaashp2000
Author

@ogrisel

I am trying the approach you suggested above. From what I understand, using np.errstate means that instead of the RuntimeWarning, a FloatingPointError exception is raised, so I have tried the following change: where the matmul was before, I now have:

with np.errstate(all="raise"):
    try:
        grad[:n_features] = X.T @ grad_pointwise + l2_reg_strength * weights
    except FloatingPointError:
        raise ValueError(
            "Overflow detected. Try scaling the target variable or"
            " features, or using a different solver"
        ) from None

(I believe the from None suppresses the original FloatingPointError, so only the ValueError is raised... below is the output)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[8], line 4
      1 # one-hot encode the categorical columns, and drop the baseline column
      2 # with lbfgs solver
----> 4 model_sklearn_lbfgs = PoissonRegressor(alpha=0).fit(X_train, y_train)
      6 print(model_sklearn_lbfgs.intercept_)
      7 print(model_sklearn_lbfgs.coef_)

File ~/Downloads/scikit-learn/sklearn/base.py:1215, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
   1208     estimator._validate_params()
   1210 with config_context(
   1211     skip_parameter_validation=(
   1212         prefer_skip_nested_validation or global_skip_validation
   1213     )
   1214 ):
-> 1215     return fit_method(estimator, *args, **kwargs)

File ~/Downloads/scikit-learn/sklearn/linear_model/_glm/glm.py:265, in _GeneralizedLinearRegressor.fit(self, X, y, sample_weight)
    262 if self.solver == "lbfgs":
    263     func = linear_loss.loss_gradient
--> 265     opt_res = scipy.optimize.minimize(
    266         func,
    267         coef,
    268         method="L-BFGS-B",
...
    297         ) from None
    298 if self.fit_intercept:
    299     grad[-1] = grad_pointwise.sum()

ValueError: Overflow detected. Try scaling the target variable or features, or using a different solver

Is this what you meant? Could something similar be done for the multiclass case too (i.e. when self.base_loss.is_multiclass is True)?

@lorentzenchr
Member

@akaashp2000 Yes, such an error message is much more informative. Could you also check that actually scaling first does solve the problem? If so, then a PR for the improved error message would be much appreciated.

@stes
Contributor
stes commented Aug 16, 2024

I gave this another shot and made a suggestion in #29681, based on @akaashp2000's PR, but including tests.

However, I found another edge case that is not detectable (see https://github.com/scikit-learn/scikit-learn/pull/29681/files#r1719809543), to repro:

import sys
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd
from sklearn.linear_model import PoissonRegressor
from sklearn.preprocessing import (
    OneHotEncoder,
)

data = sm.datasets.get_rdataset('Insurance', package='MASS').data

X_train_ohe = OneHotEncoder(sparse_output=False, drop=[1, "<1l", "<25"]).fit(data[["District", "Group", "Age"]])
X_train_ohe = pd.DataFrame(X_train_ohe.transform(data[["District", "Group", "Age"]]), columns=X_train_ohe.get_feature_names_out())

# NOTE: Use just the last column here
X_train = data[["Holders"]]
y_train = data["Claims"]

model_sklearn_lbfgs = PoissonRegressor(alpha=0, tol = 1e-8, max_iter = 10000, verbose = True).fit(X_train, y_train)
print(model_sklearn_lbfgs.intercept_)
print(model_sklearn_lbfgs.coef_)

# with newton-cholesky solver
model_sklearn_nc = PoissonRegressor(alpha=0, solver='newton-cholesky', tol = 1e-8, max_iter = 1000).fit(X_train, y_train)
print(model_sklearn_nc.intercept_)
print(model_sklearn_nc.coef_)

which gives

RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =            2     M =           10
 This problem is unconstrained.

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
    2      1      3      1     0     0   4.300D+04  -1.426D+02

CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH             
3.896592058397603
[0.]
3.2885810888136406
[0.00087977]

How to proceed?
