FEA add GLM Newton-LSMR Solver on top of Newton-Cholesky #23507
Conversation
Some benchmarks
As in #23314.
import numpy as np
from sklearn._loss import HalfPoissonLoss
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.linear_model._linear_loss import LinearModelLoss
from sklearn.linear_model import PoissonRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder
from sklearn.preprocessing import StandardScaler, KBinsDiscretizer
df = fetch_openml(data_id=41214, as_frame=True).frame
df["Frequency"] = df["ClaimNb"] / df["Exposure"]
log_scale_transformer = make_pipeline(
FunctionTransformer(np.log, validate=False), StandardScaler()
)
linear_model_preprocessor = ColumnTransformer(
[
("passthrough_numeric", "passthrough", ["BonusMalus"]),
("binned_numeric", KBinsDiscretizer(n_bins=10, subsample=None), ["VehAge", "DrivAge"]),
("log_scaled_numeric", log_scale_transformer, ["Density"]),
(
"onehot_categorical",
OneHotEncoder(),
["VehBrand", "VehPower", "VehGas", "Region", "Area"],
),
],
remainder="drop",
)
y = df["Frequency"]
w = df["Exposure"]
X = linear_model_preprocessor.fit_transform(df)
Note that:
%time PoissonRegressor(solver="lbfgs", alpha=1e-4, tol=1e-4, max_iter=1000).fit(X, y, sample_weight=w); None
Wall time: 5.01 s
%time PoissonRegressor(solver="newton-cholesky", alpha=1e-4, tol=1e-4).fit(X, y, sample_weight=w); None
Wall time: 1.98 s
%time PoissonRegressor(solver="newton-lsmr", alpha=1e-4, tol=1e-4).fit(X, y, sample_weight=w); None
Wall time: 1.1 s
On dense data with X = X.toarray() I get
They achieve different losses and gradients:
# loss
reg_lbfgs = PoissonRegressor(solver="lbfgs", alpha=1e-4, tol=1e-4, max_iter=1000).fit(X, y, sample_weight=w)
reg_cholesky = PoissonRegressor(solver="newton-cholesky", alpha=1e-4, tol=1e-4, max_iter=1000).fit(X, y, sample_weight=w)
reg_lsmr = PoissonRegressor(solver="newton-lsmr", alpha=1e-4, tol=1e-4, max_iter=1000).fit(X, y, sample_weight=w)
lml = LinearModelLoss(base_loss=HalfPoissonLoss(), fit_intercept=True)
for reg, solver in [(reg_lbfgs, "lbfgs"), (reg_cholesky, "cholesky"), (reg_lsmr, "lsmr")]:
this_loss = lml.loss(
coef=np.r_[reg.coef_, reg.intercept_],
X=X,
y=np.asarray(y),
sample_weight=np.asarray(w) / np.sum(w),
l2_reg_strength=1e-4,
)
this_grad = lml.gradient(
coef=np.r_[reg.coef_, reg.intercept_],
X=X,
y=np.asarray(y),
sample_weight=np.asarray(w) / np.sum(w),
l2_reg_strength=1e-4,
)
print(f"solver={solver:<8} has loss = {this_loss:0.10f} "
f"gradient 2-norm = {np.linalg.norm(this_grad):0.10f} "
f"gradient inf-norm = {np.linalg.norm(this_grad, ord=np.inf)}"
)
|
Interesting! Since the lsmr solver is the fastest but stops at a higher loss, maybe the stopping criteria are not directly comparable. Could you please run benchmarks with a grid of tolerance values to check if one solver is Pareto optimal (uniformly dominates the others on this task)? Also, you might want to contribute both those solvers and the Poisson regression task to the benchopt project. |
Solver profiles
Poisson Regression on French MTPL
As of a115afa.
import warnings
from pathlib import Path
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.linear_model import PoissonRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder
from sklearn.preprocessing import StandardScaler, KBinsDiscretizer
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model._linear_loss import LinearModelLoss
from sklearn.model_selection import train_test_split
from time import perf_counter
import pandas as pd
import joblib
def prepare_data():
df = fetch_openml(data_id=41214, as_frame=True).frame
df["Frequency"] = df["ClaimNb"] / df["Exposure"]
log_scale_transformer = make_pipeline(
FunctionTransformer(np.log, validate=False), StandardScaler()
)
linear_model_preprocessor = ColumnTransformer(
[
("passthrough_numeric", "passthrough", ["BonusMalus"]),
(
"binned_numeric",
KBinsDiscretizer(n_bins=10, subsample=None),
["VehAge", "DrivAge"],
),
("log_scaled_numeric", log_scale_transformer, ["Density"]),
(
"onehot_categorical",
OneHotEncoder(),
["VehBrand", "VehPower", "VehGas", "Region", "Area"],
),
],
remainder="drop",
)
y = df["Frequency"]
w = df["Exposure"]
X = linear_model_preprocessor.fit_transform(df)
return X, y, w
X, y, w = prepare_data()
# X = X.toarray()
X_train, X_test, y_train, y_test, w_train, w_test = train_test_split(
X, y, w, train_size=100_000, test_size=10_000, random_state=0
)
print(f"{X_train.shape = }")
results = []
loss_sw = np.full_like(y_train, fill_value=(1. / y_train.shape[0]))
# loss_sw = None
for tol in np.logspace(-1, -10, 10):
for solver in ["lbfgs", "newton-cholesky", "newton-lsmr"]:
tic = perf_counter()
with warnings.catch_warnings():
warnings.filterwarnings("ignore", category=ConvergenceWarning)
reg = PoissonRegressor(
alpha=1e-4, solver=solver, tol=tol, max_iter=10000
).fit(X_train, y_train)
toc = perf_counter()
train_time = toc - tic
train_loss = LinearModelLoss(
base_loss=reg._get_loss(), fit_intercept=reg.fit_intercept
).loss(
coef=np.r_[reg.coef_, reg.intercept_],
X=X_train,
y=y_train.values,
l2_reg_strength=reg.alpha,
sample_weight=loss_sw,
)
result = {
"solver": solver,
"tol": tol,
"train_loss": train_loss,
"train_time": train_time,
"train_score": reg.score(X_train, y_train),
"test_score": reg.score(X_test, y_test),
"n_iter": reg.n_iter_,
"converged": reg.n_iter_ < reg.max_iter,
}
print(result)
results.append(result)
results = pd.DataFrame.from_records(results)
results.to_csv(Path().resolve() / "bench_poisson_reg.csv")
Plotting
import pandas as pd
from pathlib import Path
import matplotlib.pyplot as plt
results = pd.read_csv(Path().resolve() / "bench_poisson_reg.csv")
results["suboptimality"] = results["train_loss"] - results["train_loss"].min() + 1e-15
fig, ax = plt.subplots(figsize=(8, 6))
for label, group in results.groupby("solver"):
group.sort_values("tol").plot(x="n_iter", y="suboptimality", loglog=True, marker="o", label=label, ax=ax)
ax.set_ylabel("suboptimality")
ax.set_title("Suboptimality by iterations")
fig, ax = plt.subplots(figsize=(8, 6))
for label, group in results.groupby("solver"):
group.sort_values("tol").plot(x="train_time", y="suboptimality", loglog=True, marker="o", label=label, ax=ax)
ax.set_ylabel("suboptimality")
ax.set_title("Suboptimality by time")
Binary Logistic Regression on Kicks Dataset
As of 9de52bf.
import warnings
from pathlib import Path
import numpy as np
from sklearn._loss import HalfBinomialLoss
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.impute import SimpleImputer
from sklearn.linear_model._glm.glm import _GeneralizedLinearRegressor
from sklearn.linear_model._linear_loss import LinearModelLoss
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from time import perf_counter
import pandas as pd
import joblib
class BinomialRegressor(_GeneralizedLinearRegressor):
def _get_loss(self):
return HalfBinomialLoss()
def prepare_data():
df = fetch_openml(data_id=41162, as_frame=True, parser="auto").frame
linear_model_preprocessor = ColumnTransformer(
[
(
"passthrough_numeric",
make_pipeline(SimpleImputer(), StandardScaler()),
[
"MMRAcquisitionAuctionAveragePrice",
"MMRAcquisitionAuctionCleanPrice",
"MMRCurrentAuctionAveragePrice",
"MMRCurrentAuctionCleanPrice",
"MMRCurrentRetailAveragePrice",
"MMRCurrentRetailCleanPrice",
"MMRCurrentRetailAveragePrice",
"MMRCurrentRetailCleanPrice",
"VehBCost",
"VehOdo",
"VehYear",
"VehicleAge",
"WarrantyCost",
],
),
(
"onehot_categorical",
OneHotEncoder(min_frequency=10),
[
"Auction",
"Color",
"IsOnlineSale",
"Make",
"Model",
"Nationality",
"Size",
"SubModel",
"Transmission",
"Trim",
"WheelType",
],
),
],
remainder="drop",
)
y = np.asarray(df["IsBadBuy"] == "1", dtype=float)
X = linear_model_preprocessor.fit_transform(df)
return X, y
X, y = prepare_data()
# X = X.toarray()
X_train, X_test, y_train, y_test = train_test_split(
X, y, train_size=0.9, random_state=0
)
print(f"{X_train.shape = }")
results = []
loss_sw = np.full_like(y_train, fill_value=(1. / y_train.shape[0]))
alpha = 1e-4
for tol in np.logspace(-1, -10, 10):
for solver in ["lbfgs", "newton-cholesky", "newton-lsmr", "logreg-lbfgs", "logreg-newton-cg"]:
tic = perf_counter()
with warnings.catch_warnings():
warnings.filterwarnings("ignore", category=ConvergenceWarning)
if solver in ["lbfgs", "newton-cholesky", "newton-lsmr"]:
reg = BinomialRegressor(
alpha=alpha, solver=solver, tol=tol, max_iter=10000
).fit(X_train, y_train)
else:
reg = LogisticRegression(
solver=solver[7:], C=1 / alpha / X.shape[0], tol=tol, max_iter=1000
).fit(X_train, y_train)
toc = perf_counter()
train_time = toc - tic
train_loss = LinearModelLoss(
base_loss=HalfBinomialLoss(), fit_intercept=reg.fit_intercept
).loss(
coef=np.r_[np.squeeze(reg.coef_), reg.intercept_],
X=X_train,
y=y_train,
l2_reg_strength=alpha,
sample_weight=loss_sw,
)
result = {
"solver": solver,
"tol": tol,
"train_loss": train_loss,
"train_time": train_time,
"train_score": reg.score(X_train, y_train),
"test_score": reg.score(X_test, y_test),
"n_iter": np.squeeze(reg.n_iter_),
"converged": np.squeeze(reg.n_iter_) < np.squeeze(reg.max_iter),
}
print(result)
results.append(result)
results = pd.DataFrame.from_records(results)
results.to_csv(Path().resolve() / "bench_logistic_reg.csv")
Plotting
import pandas as pd
from pathlib import Path
import matplotlib.pyplot as plt
results = pd.read_csv(Path().resolve() / "bench_logistic_reg.csv")
results["suboptimality"] = results["train_loss"] - results["train_loss"].min() + 1e-15
fig, ax = plt.subplots(figsize=(8, 6))
for label, group in results.groupby("solver"):
group.sort_values("tol").plot(x="n_iter", y="suboptimality", loglog=True, marker="o", label=label, ax=ax)
ax.set_ylabel("suboptimality")
ax.set_title("Suboptimality by iterations, sparse X")
fig, ax = plt.subplots(figsize=(8, 6))
for label, group in results.groupby("solver"):
group.sort_values("tol").plot(x="train_time", y="suboptimality", loglog=True, marker="o", label=label, ax=ax)
ax.set_ylabel("suboptimality")
ax.set_title("Suboptimality by time, sparse X")
Summary
|
BTW, thanks for the benchmarks; this solver also looks like a strong contender. It would be interesting to systematically compare all those solvers for various data shapes (n_samples, n_features, n_classes/n_outputs), duplicated/collinear features, and regularization strengths, and also to measure memory usage on dense and sparse data. |
Not sure what this means. It can handle collinear data as long as |
sklearn/linear_model/_glm/glm.py
Outdated
# Estimated Frobenius norm of A and norm of residual r for tolerance of
# next iteration.
self.A_norm = result[5]
self.r_norm = result[3]
We might want to issue a warning (and suggest increasing the regularization) if result[1] == 7 (istop), which means that lsmr has not converged with the default maxiter (which should be n_features by construction in our case), possibly because of ill-conditioning.
Instead of warning, we might even want to experiment with doing more iterations of lsmr when we detect this situation, according to the docstring of lsmr, see: https://github.com/scipy/scipy/blob/main/scipy/sparse/linalg/_isolve/lsmr.py, and then only warn if we do not converge with maxiter=2 * n_features in total, for instance by running lsmr twice in a row if the first call resulted in istop == 7.
Note that issuing a warning when the inner solver failed to converge while still proceeding with the outer loop might just result in a wall of warnings and a badly converged model. Maybe we could also choose to early interrupt the outer solver loop with a ConvergenceWarning in such a case. This is related to lorentzenchr#6 (which investigates a similar case for the "newton-cholesky" solver).
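As a hedged sketch of that idea (not the PR's actual code; lsmr_with_retry, A, b, atol and btol are hypothetical names), the retry-then-warn logic could look roughly like this:
import warnings
from scipy.sparse.linalg import lsmr
from sklearn.exceptions import ConvergenceWarning

def lsmr_with_retry(A, b, atol, btol, n_features):
    # First attempt with the default iteration budget of n_features.
    result = lsmr(A, b, atol=atol, btol=btol, maxiter=n_features)
    if result[1] == 7:  # istop == 7: iteration limit reached before convergence
        # Second attempt, warm-started from the current iterate, so that the
        # total budget is 2 * n_features before we give up and warn.
        result = lsmr(A, b, atol=atol, btol=btol, maxiter=n_features, x0=result[0])
        if result[1] == 7:
            warnings.warn(
                "Inner LSMR solver did not converge; the system may be "
                "ill-conditioned, consider increasing the regularization.",
                ConvergenceWarning,
            )
    return result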
…kit-learn into glm_newton_lsmr
- Do not use lbfgs steps, fall back completely to lbfgs
Stricter test for glms with newton-cholesky solver on collinear data
test_linalg_warning_with_newton_solver
Notes on multi-class logistic / multinomial regression
Is this possible with newton-lsmr?
Each submatrix
Update
There is some literature on this, in particular https://doi.org/10.1111/j.2517-6161.1992.tb01875.x. As this is not open access, I tried to derive it myself, resulting in:
Cholesky Decomposition of multinomial covariance matrix
I verified that for
import numpy as np
p = np.array([0.1, 0.2, 0.3, 0.4])
C = np.diag(p) - np.outer(p, p)
L = np.zeros_like(C)
for i in range(L.shape[0]):
L[i, i] = np.sqrt(p[i] * (1 - np.sum(p[:i+1])) / (1 - np.sum(p[:i])))
for i in range(L.shape[0]):
for j in range(i):
den = p[j] * (1 - np.sum(p[:j])) * (1 - np.sum(p[:j+1]))
L[i, j] = -p[i] * p[j] / np.sqrt(den)
np.allclose(L @ L.T, C)
I think this makes multinomial newton-lsmr possible! |
LDL of multinomial covariance matrix
The square-root-free LDL decomposition, given in https://doi.org/10.1111/j.2517-6161.1992.tb01875.x, reads:
|
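As a numerical sanity check of a square-root-free LDL factorization of C = diag(p) - p p^T, here is my own reconstruction derived from the Cholesky factor above (the closed forms for L and d below are my derivation, not necessarily the exact form given in the paper):
import numpy as np

p = np.array([0.1, 0.2, 0.3, 0.4])
C = np.diag(p) - np.outer(p, p)
q = np.concatenate(([0.0], np.cumsum(p)))  # q[i] = sum(p[:i])
n = p.shape[0]
L = np.eye(n)    # unit lower triangular factor
d = np.empty(n)  # diagonal of D
for i in range(n):
    d[i] = p[i] * (1 - q[i + 1]) / (1 - q[i])
    for j in range(i):
        L[i, j] = -p[i] / (1 - q[j + 1])
np.allclose(L @ np.diag(d) @ L.T, C)  # True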
That's very promising. |
Adding multiclass/multinomial would be super cool, but I guess this PR would explode. I'd rather defer it to a later PR. |
I will open a new PR to add LSMR. I'll keep the branch as it is important for https://github.com/lorentzenchr/notebooks/blob/master/bench_glm_newton_solvers_cholesky_lsmr.ipynb. |
Reference Issues/PRs
This is a follow-up on #23314.
What does this implement/fix? Explain your changes.
It uses the machinery of #23314 and implements a further Newton solver for GLMs based on LSMR. It uses the IRLS formulation of the Newton step, which can be read as the normal equation of a least squares problem. This least squares problem is then solved via LSMR with a careful choice of the stopping criterion (atol).
This solver works well for dense and sparse input, only uses matrix-vector products, can stop early once the convergence criteria are met, can handle both tall and wide data, and can deal with singular (collinear) X, but may fall short for very ill-conditioned X.
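To illustrate the idea, here is a small self-contained sketch (a toy construction with my own names, not the PR's implementation; g, h, alpha and coef stand for the per-sample loss derivatives, Hessian weights, L2 strength and current coefficients, and intercept handling is omitted): the penalized Newton system X.T @ diag(h) @ X @ d + alpha * d = -(X.T @ g + alpha * coef) is the normal equation of a stacked least squares problem, which LSMR solves using only matrix-vector products.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lsmr

rng = np.random.default_rng(0)
n_samples, n_features = 100, 5
X = sp.csr_matrix(rng.standard_normal((n_samples, n_features)))
coef = rng.standard_normal(n_features)
g = rng.standard_normal(n_samples)         # per-sample first derivatives of the loss
h = rng.uniform(0.5, 2.0, size=n_samples)  # per-sample second derivatives (>= 0)
alpha = 1e-4                               # L2 regularization strength

# Stack sqrt(h) * X on top of sqrt(alpha) * I so that the normal equation of
# min_d ||A d - b||^2 is exactly the penalized Newton system.
sqrt_h = np.sqrt(h)
A = sp.vstack([sp.diags(sqrt_h) @ X, np.sqrt(alpha) * sp.identity(n_features)])
b = np.concatenate([-g / sqrt_h, -np.sqrt(alpha) * coef])
newton_step = lsmr(A, b, atol=1e-10, btol=1e-10)[0]

# Check against solving the explicit Newton system.
H = (X.T @ sp.diags(h) @ X).toarray() + alpha * np.eye(n_features)
rhs = -(X.T @ g + alpha * coef)
np.allclose(newton_step, np.linalg.solve(H, rhs))  # True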
Any other comments?
I have never seen the combination of IRLS with LSMR (or LSQR). At the moment, this is my favorite solver 😉
As promised, LSMR seems to be able to stop a little earlier and therefore is a little faster than LSQR.