FEA add newton-lsmr solver to LogisticRegression and GLMs #25462
Conversation
@ogrisel @mathurinm @TomDLT @agramfort @rth might be interested as this seems to be new ground for GLM solvers, especially the multinomial logistic regression! It was a very stony path to arrive with all (added) tests green. Right now, I have no energy to do extensive benchmarking, but I hope that this work will become useful and find its way into scikit-learn in the end. I'm sure there are opportunities left for performance optimization.
Glad to see this! I just re-ran the previous benchmark for Poisson regression on the French MTPL dataset from the previous PR. Here are the results on my laptop (plots not reproduced): this looks very good.
I have adapted the above benchmark to turn it into an imbalanced multiclass classification problem by binning the target. Since 0 is overly represented, when choosing a large number of bins and the quantile strategy, many bins are collapsed to 0. Here is the code:

import warnings
from pathlib import Path
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.linear_model import PoissonRegressor, LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder
from sklearn.preprocessing import StandardScaler, KBinsDiscretizer
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model._linear_loss import LinearModelLoss
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from time import perf_counter
import pandas as pd
import joblib
m = joblib.Memory(location=".", verbose=0)
@m.cache
def prepare_data():
df = fetch_openml(data_id=41214, as_frame=True, parser='auto').frame
df["Frequency"] = df["ClaimNb"] / df["Exposure"]
log_scale_transformer = make_pipeline(
FunctionTransformer(np.log, validate=False), StandardScaler()
)
linear_model_preprocessor = ColumnTransformer(
[
("passthrough_numeric", "passthrough", ["BonusMalus"]),
(
"binned_numeric",
KBinsDiscretizer(n_bins=10, subsample=None),
["VehAge", "DrivAge"],
),
("log_scaled_numeric", log_scale_transformer, ["Density"]),
(
"onehot_categorical",
OneHotEncoder(),
["VehBrand", "VehPower", "VehGas", "Region", "Area"],
),
],
remainder="drop",
)
y = df["Frequency"]
w = df["Exposure"]
X = linear_model_preprocessor.fit_transform(df)
return X, y, w
X, y_orig, w = prepare_data()
print("binning the target...")
binner = KBinsDiscretizer(
n_bins=300, encode="ordinal", strategy="quantile", subsample=int(2e5), random_state=0
)
y = binner.fit_transform(y_orig.to_numpy().reshape(-1, 1)).ravel().astype(np.int32)
# X = X.toarray()
X_train, X_test, y_train, y_test, w_train, w_test = train_test_split(
X, y, w, train_size=10_000, test_size=10_000, random_state=0
)
print(f"{X_train.shape = }")
print("y_train.value_counts() :")
print(pd.Series(y_train).value_counts())
results = []
slow_solvers = set()
for tol in np.logspace(-1, -10, 10):
for solver in ["lbfgs", "newton-cg", "newton-lsmr"]:
if solver in slow_solvers:
# skip slow solvers to keep the benchmark runtime reasonable
continue
tic = perf_counter()
# with warnings.catch_warnings():
# warnings.filterwarnings("ignore", category=ConvergenceWarning)
clf = LogisticRegression(
C=1e12, solver=solver, tol=tol, max_iter=10000
).fit(X_train, y_train)
toc = perf_counter()
train_time = toc - tic
if train_time > 200:
# skip this solver from now on...
slow_solvers.add(solver)
# TODO: handle the regularization term...
train_loss = log_loss(y_train, clf.predict_proba(X_train))
n_iter = clf.n_iter_[0]
result = {
"solver": solver,
"tol": tol,
"train_loss": train_loss,
"train_time": train_time,
"train_score": clf.score(X_train, y_train),
"test_score": clf.score(X_test, y_test),
"n_iter": n_iter,
"converged": n_iter < clf.max_iter,
}
print(result)
results.append(result)
results = pd.DataFrame.from_records(results)
filepath = Path().resolve() / "bench_multinomial_logistic_regression_mtpl.csv"
results.to_csv(filepath)
import pandas as pd
from pathlib import Path
import matplotlib.pyplot as plt
results = pd.read_csv(filepath)
results["suboptimality"] = results["train_loss"] - results["train_loss"].min() + 1e-15
fig, ax = plt.subplots(figsize=(8, 6))
for label, group in results.groupby("solver"):
group.sort_values("tol").plot(
x="n_iter", y="suboptimality", loglog=True, marker="o", label=label, ax=ax
)
ax.set_ylabel("suboptimality")
ax.set_title("Suboptimality by iterations")
fig, ax = plt.subplots(figsize=(8, 6))
for label, group in results.groupby("solver"):
group.sort_values("tol").plot(
x="train_time", y="suboptimality", loglog=True, marker="o", label=label, ax=ax
)
ax.set_ylabel("suboptimality")
ax.set_title("Suboptimality by time")
plt.show()

DISCLAIMER: the plot above displays the unpenalized training log loss.

EDIT: I did another run with

This task is very challenging for all solvers and I had to decrease the number of samples to get it to run in a reasonable time on my laptop. I also stopped recording a solver when tol decreases to the point where a single fit would last more than a few minutes. Here are the resulting plots:

Note that the handling of the stopping criterion of LBFGS is not working properly for the
Note that for lower tolerance values, the above snippet can trigger:

/Users/ogrisel/code/scikit-learn/sklearn/linear_model/_linear_loss.py:867: RuntimeWarning: divide by zero encountered in divide
  fj = self.p[:, i] / (self.q[:, i - 1] + mask)
/Users/ogrisel/code/scikit-learn/sklearn/linear_model/_linear_loss.py:873: RuntimeWarning: invalid value encountered in add
  x[:, i] += fj * x[:, j]

for the

Finally, I think it would be interesting to adapt this benchmark to use benchopt and include it in this panel since it's quite challenging for most solvers yet still realistic enough.
Some early feedback after a quick glance at the code.
Out of curiosity, have you tried to profile this to pinpoint the bottlenecks for both the multinomial and non-multinomial cases?
I have the impression that multithreading is barely used (in the multiclass case), so it's probably not a few large BLAS calls as is the case for newton-cholesky.
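For reference, one quick way to run such a profile is plain cProfile around a fit call; this is only a sketch (the random toy data is made up, and the "newton-lsmr" option only exists on this branch):

import cProfile
import pstats
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(5_000, 50))
y = rng.integers(0, 10, size=5_000)  # 10 classes to exercise the multinomial path

clf = LogisticRegression(solver="newton-lsmr", max_iter=100)
with cProfile.Profile() as prof:
    clf.fit(X, y)
pstats.Stats(prof).sort_stats("cumulative").print_stats(15)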
@@ -516,3 +532,460 @@ def inner_solve(self, X, y, sample_weight):
            )
            self.use_fallback_lbfgs_solve = True
            return


class NewtonLSMRSolver(NewtonSolver):
I have the feeling the code would be simpler to follow if we had a special subclass to handle the self.linear_loss.base_loss.is_multiclass case.
That would require introducing a factory function to do the dispatching to the correct solver class based on the loss object, but I have the feeling that would be worth it.
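For illustration, a minimal sketch of what such a dispatch could look like. The factory name and the MultinomialNewtonLSMRSolver subclass are hypothetical (only NewtonLSMRSolver exists in this PR), and the classes below are just stand-ins:

class NewtonLSMRSolver:  # stand-in for the solver class added in this PR
    def __init__(self, linear_loss, **solver_params):
        self.linear_loss = linear_loss
        self.solver_params = solver_params

class MultinomialNewtonLSMRSolver(NewtonLSMRSolver):  # hypothetical multiclass variant
    pass

def make_newton_lsmr_solver(linear_loss, **solver_params):
    # Dispatch on the loss object: multiclass losses get the specialized subclass.
    if linear_loss.base_loss.is_multiclass:
        return MultinomialNewtonLSMRSolver(linear_loss, **solver_params)
    return NewtonLSMRSolver(linear_loss, **solver_params)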
For the multinomial/multiclass case, the profile clearly points to two functions. Edit: I was able to significantly speed up those 2 functions in e5f5f48. They are still the bottleneck, but much reduced (~2x).
Force-pushed 885b413 to d0eea42:
- keep dtype float32 after LSMR
- lower test precision in test_NewtonLSMRSolver_multinomial_A_b_on_3_classes

Force-pushed d0eea42 to 7ef5877.
With the latest improvements it looks a bit better.

Sparse X (as above)

Dense X

Conclusion

So this solver can be used for very fast but rough estimates or for high precision estimates :smirk:

Code for reproducibility:

import warnings
from pathlib import Path
import numpy as np
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.linear_model import PoissonRegressor, LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder
from sklearn.preprocessing import StandardScaler, KBinsDiscretizer
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model._linear_loss import LinearModelLoss
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from time import perf_counter
import pandas as pd
import joblib
m = joblib.Memory(location=".", verbose=0)
@m.cache
def prepare_data():
df = fetch_openml(data_id=41214, as_frame=True, parser='auto').frame
df["Frequency"] = df["ClaimNb"] / df["Exposure"]
log_scale_transformer = make_pipeline(
FunctionTransformer(np.log, validate=False), StandardScaler()
)
linear_model_preprocessor = ColumnTransformer(
[
("passthrough_numeric", "passthrough", ["BonusMalus"]),
(
"binned_numeric",
KBinsDiscretizer(n_bins=10, subsample=None),
["VehAge", "DrivAge"],
),
("log_scaled_numeric", log_scale_transformer, ["Density"]),
(
"onehot_categorical",
OneHotEncoder(),
["VehBrand", "VehPower", "VehGas", "Region", "Area"],
),
],
remainder="drop",
)
y = df["Frequency"]
w = df["Exposure"]
X = linear_model_preprocessor.fit_transform(df)
return X, y, w
X, y_orig, w = prepare_data()
print("binning the target...")
binner = KBinsDiscretizer(
n_bins=300, encode="ordinal", strategy="quantile", subsample=int(2e5), random_state=0
)
y = binner.fit_transform(y_orig.to_numpy().reshape(-1, 1)).ravel().astype(np.int32)
# X = X.toarray()
X_train, X_test, y_train, y_test, w_train, w_test = train_test_split(
X, y, w, train_size=10_000, test_size=10_000, random_state=0
)
print(f"{X_train.shape = }")
print(f"{sparse.issparse(X_train)=}")
print("y_train.value_counts() :")
print(pd.Series(y_train).value_counts())
results = []
slow_solvers = set()
for tol in np.logspace(-1, -10, 10):
for solver in ["lbfgs", "newton-cg", "newton-lsmr"]:
if solver in slow_solvers:
# skip slow solvers to keep the benchmark runtime reasonable
continue
tic = perf_counter()
# with warnings.catch_warnings():
# warnings.filterwarnings("ignore", category=ConvergenceWarning)
clf = LogisticRegression(
C=1e12, solver=solver, tol=tol, max_iter=10000
).fit(X_train, y_train)
toc = perf_counter()
train_time = toc - tic
if train_time > 200:
# skip this solver from now on...
slow_solvers.add(solver)
# TODO: handle the regularization term...
train_loss = log_loss(y_train, clf.predict_proba(X_train))
n_iter = clf.n_iter_[0]
result = {
"solver": solver,
"tol": tol,
"train_loss": train_loss,
"train_time": train_time,
"train_score": clf.score(X_train, y_train),
"test_score": clf.score(X_test, y_test),
"n_iter": n_iter,
"converged": n_iter < clf.max_iter,
}
print(result)
results.append(result)
results = pd.DataFrame.from_records(results)
filepath = Path().resolve() / "bench_multinomial_logistic_regression_mtpl.csv"
results.to_csv(filepath)
import pandas as pd
from pathlib import Path
import matplotlib.pyplot as plt
results = pd.read_csv(filepath)
results["suboptimality"] = results["train_loss"] - results["train_loss"].min() + 1e-15
fig, ax = plt.subplots(figsize=(8, 6))
for label, group in results.groupby("solver"):
group.sort_values("tol").plot(
x="n_iter", y="suboptimality", loglog=True, marker="o", label=label, ax=ax
)
ax.set_ylabel("suboptimality")
ax.set_title("Suboptimality by iterations")
fig, ax = plt.subplots(figsize=(8, 6))
for label, group in results.groupby("solver"):
group.sort_values("tol").plot(
x="train_time", y="suboptimality", loglog=True, marker="o", label=label, ax=ax
)
ax.set_ylabel("suboptimality")
ax.set_title("Suboptimality by time")
plt.show()
@lorentzenchr I am not sure if you saw, but your last push triggered a CI failure. I did not investigate myself, but I wanted to make sure that it does not go unnoticed.
Looking at the last plot, I wonder why the LSMR-based solver seems to slow down after the first 4 iterations, before it accelerates again in the last two. Perhaps the choice of
Force-pushed fa2586f to 0dc87cf.
The remaining CI error will be automatically fixed by setting scipy>=1.4, see scipy/scipy#7396. Note that the transpose is only taken in a few tests; the solver itself works fine with those older scipy versions.
For reference, the bump to scipy>=1.4 in
CI all 🟢 again.
Commit 83ce34f passes
When trying to trigger the LBFGS fallback by making the model extremely confident, so as to make the Hessian numerically degenerate, I triggered the following:
>>> import numpy as np
... from sklearn.linear_model import LogisticRegression
...
... x = np.array([-1e24] * 1 + [1e24] * 1)
... X = x.reshape(-1, 1)
... y = (x > 0).astype(np.int32)
...
... lr = LogisticRegression(solver="newton-lsmr", penalty=None, verbose=100).fit(X, y)
... lr.n_iter_
[...]
Newton iter=26
norm(gradient) = 4170877078229.1255
Traceback (most recent call last):
Cell In[2], line 8
lr = LogisticRegression(solver="newton-lsmr", penalty=None, verbose=100).fit(X, y)
File ~/code/scikit-learn/sklearn/base.py:1148 in wrapper
return fit_method(estimator, *args, **kwargs)
File ~/code/scikit-learn/sklearn/linear_model/_logistic.py:1321 in fit
fold_coefs_ = Parallel(n_jobs=self.n_jobs, verbose=self.verbose, prefer=prefer)(
File ~/code/scikit-learn/sklearn/utils/parallel.py:65 in __call__
return super().__call__(iterable_with_config)
File ~/mambaforge/envs/dev/lib/python3.11/site-packages/joblib/parallel.py:1085 in __call__
if self.dispatch_one_batch(iterator):
File ~/mambaforge/envs/dev/lib/python3.11/site-packages/joblib/parallel.py:901 in dispatch_one_batch
self._dispatch(tasks)
File ~/mambaforge/envs/dev/lib/python3.11/site-packages/joblib/parallel.py:819 in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File ~/mambaforge/envs/dev/lib/python3.11/site-packages/joblib/_parallel_backends.py:208 in apply_async
result = ImmediateResult(func)
File ~/mambaforge/envs/dev/lib/python3.11/site-packages/joblib/_parallel_backends.py:597 in __init__
self.results = batch()
File ~/mambaforge/envs/dev/lib/python3.11/site-packages/joblib/parallel.py:288 in __call__
return [func(*args, **kwargs)
File ~/mambaforge/envs/dev/lib/python3.11/site-packages/joblib/parallel.py:288 in <listcomp>
return [func(*args, **kwargs)
File ~/code/scikit-learn/sklearn/utils/parallel.py:127 in __call__
return self.function(*args, **kwargs)
File ~/code/scikit-learn/sklearn/linear_model/_logistic.py:485 in _logistic_regression_path
w0 = sol.solve(X=X, y=target, sample_weight=sample_weight)
File ~/code/scikit-learn/sklearn/linear_model/_glm/_newton_solver.py:426 in solve
self.inner_solve(X=X, y=y, sample_weight=sample_weight)
File ~/code/scikit-learn/sklearn/linear_model/_glm/_newton_solver.py:974 in inner_solve
atol=eta * norm_G / (self.A_norm * self.r_norm),
ZeroDivisionError: float division by zero

We might need a
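For illustration, one possible guard, purely as a sketch: the names eta, norm_G, A_norm and r_norm are taken from the traceback above, and the fallback value is an assumption, not the PR's behavior.

import numpy as np

def inner_atol(eta, norm_G, A_norm, r_norm, fallback=0.0):
    # If the running LSMR norm estimates are zero (or not finite), return a
    # fallback instead of dividing by zero in eta * norm_G / (A_norm * r_norm).
    denom = A_norm * r_norm
    if not np.isfinite(denom) or denom <= 0.0:
        return fallback
    return eta * norm_G / denom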
Also note that
Actually there is a problem with lbfgs on the above problem: it does not converge (the solver output is not reproduced here). I don't know if the failure of lbfgs here should be considered a bug or not, but since lbfgs is our robust fallback, this might be a problem :) Maybe LBFGS should try a simple gradient step when the line search fails?
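A minimal way to check the lbfgs behavior on the same degenerate data as in the snippet above (max_iter and the printout are arbitrary choices; inspect n_iter_ and any ConvergenceWarning):

import numpy as np
from sklearn.linear_model import LogisticRegression

x = np.array([-1e24] * 1 + [1e24] * 1)
X = x.reshape(-1, 1)
y = (x > 0).astype(np.int32)

lr = LogisticRegression(solver="lbfgs", penalty=None, max_iter=10_000).fit(X, y)
print(lr.n_iter_)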
I found a way to trigger the LBFGS fallback for the multiclass case:

import numpy as np
from sklearn.linear_model import LogisticRegression
x = np.array([-1e24] * 1 + [1e24] * 2)
X = x.reshape(-1, 1)
y = np.asarray([0, 1, 2])
lr = LogisticRegression(solver="newton-lsmr", penalty=None, verbose=100).fit(X, y)
I opened #26707 for investigating the inner solver stopping criterion and ran a lot of benchmarks. There is no clear winner. I have to leave it as is: either it is good enough in its current shape or someone else needs to dig deeper. My conclusion is that we have quite some room for improvement of the current solvers, like #24752. Also the
There are cases where it's indeed quite impressive, based on the last benchmarks that are now collapsed in the discussion. But I agree that fixing #24752 would be helpful to get a clearer picture. Also, based on benchopt, it seems that SAG & SAGA are better reference solvers for 20 newsgroups, see e.g.: https://benchopt.github.io/results/preprint_results_preprint_results_logreg_l2.html I have no intuition on why this should be the case. You'll have to switch the dataset in the menu on the left to see the results on 20 newsgroups. UPDATE: actually the results on this dataset are completely different with and without scaling.
Some comments I wrote a few months ago (they might not all be relevant).
if self.p.dtype == np.float32:
    eps = 2 * np.finfo(np.float32).resolution
else:
    eps = 2 * np.finfo(np.float64).resolution
Suggested change: replace the four lines above with

eps = 2 * np.finfo(self.p.dtype).resolution
I'm not 100% sure that proba and hence self.p is always either float32 or float64.
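If that turns out to be a concern, a defensive variant could look like the following sketch (machine_eps and the float64 fallback are made-up names/choices, not part of the PR):

import numpy as np

def machine_eps(p):
    # Use the dtype of p when it is a floating dtype, otherwise fall back to float64.
    dtype = p.dtype if np.issubdtype(p.dtype, np.floating) else np.dtype(np.float64)
    return 2 * np.finfo(dtype).resolution

print(machine_eps(np.ones(3, dtype=np.float32)))  # float32 resolution
print(machine_eps(np.arange(3)))                  # integer input -> float64 resolution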
elif solver == "newton-lsmr":
    clf.set_params(tol=1e-6)
This adaptation makes me think of a UX consideration: do we want the tolerance to depend on the choice of the solver?
What do you have in mind?
Reference Issues/PRs
Supersedes #23507.
Fixes #16634.
What does this implement/fix? Explain your changes.
This PR adds a further NewtonSolver: NewtonLSMRSolver.
This solver uses the iteratively reweighted least squares (IRLS) formulation of a Newton step. This means the inner solver uses the square root of the Hessian and solves the corresponding least squares problem (as opposed to solving the normal equation as "newton-cholesky" does) with the iterative LSMR solver.
This solver is therefore suited for dense and sparse X.
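To make the IRLS formulation concrete, here is a self-contained sketch of Newton-LSMR iterations for an L2-penalized binary logistic model. This is not the PR's code: the toy data and all names are made up, and the penalty is handled by stacking sqrt(lam) * I rows onto sqrt(W) X so that the normal equations of the least squares problem reproduce (X'WX + lam*I) d = -gradient.

import numpy as np
from scipy.sparse.linalg import lsmr

rng = np.random.default_rng(0)
n_samples, n_features, lam = 200, 5, 1e-1
X = rng.normal(size=(n_samples, n_features))
y = (X @ rng.normal(size=n_features) + rng.normal(size=n_samples) > 0).astype(float)

coef = np.zeros(n_features)
for it in range(20):
    proba = 1.0 / (1.0 + np.exp(-(X @ coef)))
    proba = np.clip(proba, 1e-12, 1 - 1e-12)   # keep the Hessian weights strictly positive
    u = proba - y                              # per-sample derivative of the loss
    grad = X.T @ u + lam * coef
    if np.linalg.norm(grad) < 1e-8:
        break
    sqrt_w = np.sqrt(proba * (1.0 - proba))    # square root of the diagonal of W
    # Least squares form of the Newton system (X'WX + lam*I) d = -grad.
    A = np.vstack([sqrt_w[:, None] * X, np.sqrt(lam) * np.eye(n_features)])
    b = np.concatenate([-u / sqrt_w, -np.sqrt(lam) * coef])
    d = lsmr(A, b, atol=1e-10, btol=1e-10)[0]
    coef += d                                  # full Newton step, no line search here

print(f"{it=}, ||grad||={np.linalg.norm(grad):.2e}")

For sparse X, A would be applied as a LinearOperator rather than materialized, which is what makes an iterative least squares solver like LSMR attractive here.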
Any other comments?
The multinomial/multiclass case deserves special attention as there are different ways to look at the Hessian $X' W X$. For the multinomial loss, $W$ is block diagonal with one block per sample, i.e. n_samples blocks of shape n_classes x n_classes (e.g. $3 \times 3$ for n_classes=3), each of the form $\operatorname{diag}(p) - p p'$ with $p$ the predicted class probabilities of that sample (the intermediate equations of the original description are not reproduced here). One can then use the LDL decomposition of this particular per-sample matrix to obtain a square root of the Hessian.
This is the chosen approach.
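As a quick numerical check of the block structure and the LDL' route, here is a sketch (the random probabilities and the eigendecomposition-based root are not from the PR; only the identities are verified):

import numpy as np
from scipy.linalg import ldl

rng = np.random.default_rng(0)
z = rng.normal(size=3)
p = np.exp(z) / np.exp(z).sum()   # softmax probabilities of one sample, n_classes=3
W = np.diag(p) - np.outer(p, p)   # per-sample block of W, symmetric positive semidefinite

# LDL' factorization of this particular matrix.
L, D, _ = ldl(W)
np.testing.assert_allclose(L @ D @ L.T, W, atol=1e-12)

# Any factor C with C C' = W is a usable "square root" for the least squares
# formulation; here a symmetric root via eigh, just to verify the identity.
w_eig, V = np.linalg.eigh(W)
C = V @ np.diag(np.sqrt(np.clip(w_eig, 0.0, None))) @ V.T
np.testing.assert_allclose(C @ C.T, W, atol=1e-12)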
Edit: For benchmarks, see #25462 (comment).