FEA add GLM Newton-LSMR Solver on top of Newton-Cholesky by lorentzenchr · Pull Request #23507 · scikit-learn/scikit-learn

FEA add GLM Newton-LSMR Solver on top of Newton-Cholesky #23507


Closed
wants to merge 108 commits

Conversation

lorentzenchr
Member

Reference Issues/PRs

This is a follow-up on #23314.

What does this implement/fix? Explain your changes.

It builds on the machinery of #23314 and implements a further Newton solver for GLMs based on LSMR. It uses the IRLS formulation of the Newton step, which can be read as the normal equations of a least squares problem. This least squares problem is then solved via LSMR with a careful choice of the stopping criterion (atol).

This solver works for dense and sparse input, uses only matrix-vector products, can stop early once the convergence criteria are met, can handle both tall (n_samples > n_features) and wide (n_features > n_samples) data, can deal with singular (collinear) X, but may fall short for very ill-conditioned X.
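
As a rough illustration of the idea, here is a minimal sketch (hypothetical names like newton_lsmr_step, not the implementation in this PR; assumes dense X, a log link and no intercept) of how one penalized Newton/IRLS step can be obtained from LSMR applied to the weighted least squares subproblem instead of forming and factorizing X'WX:

import numpy as np
from scipy.sparse.linalg import lsmr


def newton_lsmr_step(X, y, coef, l2_reg, atol=1e-6):
    # One IRLS/Newton step for a Poisson GLM with log link, solved via LSMR.
    eta = X @ coef        # linear predictor
    mu = np.exp(eta)      # inverse log link
    g = mu - y            # per-sample gradient of the (half) Poisson deviance
    w = mu                # per-sample Hessian weights
    z = eta - g / w       # IRLS working response
    sqrt_w = np.sqrt(w)
    # The normal equations of  min ||sqrt(W) (X c - z)||^2 + l2_reg * ||c||^2
    # are (X'WX + l2_reg * I) c = X'W z, i.e. exactly the L2-penalized Newton
    # system for the (unscaled) sum-of-losses objective, so LSMR with
    # damp=sqrt(l2_reg) returns the Newton update directly.
    return lsmr(
        X * sqrt_w[:, None], sqrt_w * z, damp=np.sqrt(l2_reg), atol=atol, btol=atol
    )[0]

The actual solver in this PR is more involved (intercept term, sparse input, and the careful choice of atol mentioned above).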

Any other comments?

I have never seen the combination of IRLS with LSMR (or LSQR) before. At the moment, this is my favorite solver 😉

As promised, LSMR seems to be able to stop a little earlier and therefore is a little faster than LSQR.
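
For what it's worth, a quick way to compare the two in isolation (a toy least squares problem with scipy's solvers, nothing specific to this PR's GLM solvers) is to look at the iteration counts they report for the same tolerances:

# Toy comparison of scipy's LSMR and LSQR on a moderately ill-conditioned
# least squares problem; iteration counts depend on the problem at hand.
import numpy as np
from scipy.sparse.linalg import lsmr, lsqr

rng = np.random.default_rng(0)
A = rng.standard_normal((2000, 200)) * np.logspace(0, 3, 200)  # scale columns
b = rng.standard_normal(2000)

itn_lsmr = lsmr(A, b, atol=1e-10, btol=1e-10)[2]  # third entry is the iteration count
itn_lsqr = lsqr(A, b, atol=1e-10, btol=1e-10)[2]
print(f"lsmr: {itn_lsmr} iterations, lsqr: {itn_lsqr} iterations")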

@lorentzenchr
Member Author
lorentzenchr commented May 31, 2022

Some benchmarks

As in #23314.
Updated with a115afa.

import numpy as np
from sklearn._loss import HalfPoissonLoss
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.linear_model._linear_loss import LinearModelLoss
from sklearn.linear_model import PoissonRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder
from sklearn.preprocessing import StandardScaler, KBinsDiscretizer


df = fetch_openml(data_id=41214, as_frame=True).frame
df["Frequency"] = df["ClaimNb"] / df["Exposure"]
log_scale_transformer = make_pipeline(
    FunctionTransformer(np.log, validate=False), StandardScaler()
)
linear_model_preprocessor = ColumnTransformer(
    [
        ("passthrough_numeric", "passthrough", ["BonusMalus"]),
        ("binned_numeric", KBinsDiscretizer(n_bins=10, subsample=None), ["VehAge", "DrivAge"]),
        ("log_scaled_numeric", log_scale_transformer, ["Density"]),
        (
            "onehot_categorical",
            OneHotEncoder(),
            ["VehBrand", "VehPower", "VehGas", "Region", "Area"],
        ),
    ],
    remainder="drop",
)
y = df["Frequency"]
w = df["Exposure"]
X = linear_model_preprocessor.fit_transform(df)

Note that X is sparse

%time PoissonRegressor(solver="lbfgs", alpha=1e-4, tol=1e-4, max_iter=1000).fit(X, y, sample_weight=w); None

Wall time: 5.01 s

%time PoissonRegressor(solver="newton-cholesky", alpha=1e-4, tol=1e-4).fit(X, y, sample_weight=w); None

Wall time: 1.98 s

%time PoissonRegressor(solver="newton-lsmr", alpha=1e-4, tol=1e-4).fit(X, y, sample_weight=w); None

Wall time: 1.1 s

On dense data with

X = X.toarray()

I get

  • lbfgs: 18.5 s
  • newton-cholesky: 2.51 s
  • newton-lsmr: 4.35 s

They achieve different losses and gradients:

# loss
reg_lbfgs = PoissonRegressor(solver="lbfgs", alpha=1e-4, tol=1e-4, max_iter=1000).fit(X, y, sample_weight=w)
reg_cholesky = PoissonRegressor(solver="newton-cholesky", alpha=1e-4, tol=1e-4, max_iter=1000).fit(X, y, sample_weight=w)
reg_lsmr = PoissonRegressor(solver="newton-lsmr", alpha=1e-4, tol=1e-4, max_iter=1000).fit(X, y, sample_weight=w)

lml = LinearModelLoss(base_loss=HalfPoissonLoss(), fit_intercept=True)

for reg, solver in [(reg_lbfgs, "lbfgs"), (reg_cholesky, "cholesky"), (reg_lsmr, "lsmr")]:
    this_loss = lml.loss(
        coef=np.r_[reg.coef_, reg.intercept_],
        X=X,
        y=np.asarray(y),
        sample_weight=np.asarray(w) / np.sum(w),
        l2_reg_strength=1e-4,
    )
    this_grad = lml.gradient(
        coef=np.r_[reg.coef_, reg.intercept_],
        X=X,
        y=np.asarray(y),
        sample_weight=np.asarray(w) / np.sum(w),
        l2_reg_strength=1e-4,
    )
    print(f"solver={solver:<8} has loss = {this_loss:0.10f} "
          f"gradient 2-norm = {np.linalg.norm(this_grad):0.10f} "
          f"gradient inf-norm = {np.linalg.norm(this_grad, ord=np.inf)}"
         )
solver=lbfgs    has loss = 0.3159339334 gradient 2-norm = 0.0001658628 gradient inf-norm = 7.403807907072436e-05
solver=cholesky has loss = 0.3159052779 gradient 2-norm = 0.0000049585 gradient inf-norm = 4.956622873102174e-06
solver=lsmr     has loss = 0.3159284188 gradient 2-norm = 0.0000654960 gradient inf-norm = 3.575219334355157e-05

@ogrisel
Member
ogrisel commented Jun 1, 2022

Interesting! Since the lsmr solver is the fastest but stops at a higher loss, maybe the stopping criteria are not directly comparable.

Could you please run benchmarks with a grid of tolerance values to check whether one solver is Pareto optimal (uniformly dominates the others on this task)?

#23314 (review)

Also, you might want to contribute both those solvers and the Poisson regression task to the benchopt project:

https://benchopt.github.io/

@lorentzenchr
Member Author
lorentzenchr commented Jun 1, 2022

Solver profiles

Poisson Regression on French MTPL

As of a115afa.
Sparse X with X.shape = (678013, 75)

import warnings
from pathlib import Path
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.linear_model import PoissonRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder
from sklearn.preprocessing import StandardScaler, KBinsDiscretizer
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model._linear_loss import LinearModelLoss
from sklearn.model_selection import train_test_split
from time import perf_counter
import pandas as pd
import joblib


def prepare_data():
    df = fetch_openml(data_id=41214, as_frame=True).frame
    df["Frequency"] = df["ClaimNb"] / df["Exposure"]
    log_scale_transformer = make_pipeline(
        FunctionTransformer(np.log, validate=False), StandardScaler()
    )
    linear_model_preprocessor = ColumnTransformer(
        [
            ("passthrough_numeric", "passthrough", ["BonusMalus"]),
            (
                "binned_numeric",
                KBinsDiscretizer(n_bins=10, subsample=None),
                ["VehAge", "DrivAge"],
            ),
            ("log_scaled_numeric", log_scale_transformer, ["Density"]),
            (
                "onehot_categorical",
                OneHotEncoder(),
                ["VehBrand", "VehPower", "VehGas", "Region", "Area"],
            ),
        ],
        remainder="drop",
    )
    y = df["Frequency"]
    w = df["Exposure"]
    X = linear_model_preprocessor.fit_transform(df)
    return X, y, w


X, y, w = prepare_data()
# X = X.toarray()
X_train, X_test, y_train, y_test, w_train, w_test = train_test_split(
    X, y, w, train_size=100_000, test_size=10_000, random_state=0
)
print(f"{X_train.shape = }")

results = []
loss_sw = np.full_like(y_train, fill_value=(1. / y_train.shape[0]))
# loss_sw = None
for tol in np.logspace(-1, -10, 10):
    for solver in ["lbfgs", "newton-cholesky", "newton-lsmr"]:
        tic = perf_counter()
        with warnings.catch_warnings():
            warnings.filterwarnings("ignore", category=ConvergenceWarning)
            reg = PoissonRegressor(
                alpha=1e-4, solver=solver, tol=tol, max_iter=10000
            ).fit(X_train, y_train)
        toc = perf_counter()
        train_time = toc - tic
        train_loss = LinearModelLoss(
            base_loss=reg._get_loss(), fit_intercept=reg.fit_intercept
        ).loss(
            coef=np.r_[reg.coef_, reg.intercept_],
            X=X_train,
            y=y_train.values,
            l2_reg_strength=reg.alpha,
            sample_weight=loss_sw,
        )
        result = {
            "solver": solver,
            "tol": tol,
            "train_loss": train_loss,
            "train_time": train_time,
            "train_score": reg.score(X_train, y_train),
            "test_score": reg.score(X_test, y_test),
            "n_iter": reg.n_iter_,
            "converged": reg.n_iter_ < reg.max_iter,
        }
        print(result)
        results.append(result)


results = pd.DataFrame.from_records(results)
results.to_csv(Path().resolve() / "bench_poisson_reg.csv")

Plotting

import pandas as pd
from pathlib import Path
import matplotlib.pyplot as plt


results = pd.read_csv(Path().resolve() / "bench_poisson_reg.csv")
results["suboptimality"] = results["train_loss"] - results["train_loss"].min() + 1e-15
fig, ax = plt.subplots(figsize=(8, 6))
for label, group in results.groupby("solver"):
    group.sort_values("tol").plot(x="n_iter", y="suboptimality", loglog=True, marker="o", label=label, ax=ax)
ax.set_ylabel("suboptimality")
ax.set_title("Suboptimality by iterations")

fig, ax = plt.subplots(figsize=(8, 6))
for label, group in results.groupby("solver"):
    group.sort_values("tol").plot(x="train_time", y="suboptimality", loglog=True, marker="o", label=label, ax=ax)
ax.set_ylabel("suboptimality")
ax.set_title("Suboptimality by time")

[Figures: suboptimality vs. n_iter and suboptimality vs. train_time for the Poisson regression benchmark]

Binary Logistic Regression on Kicks Dataset

As of 9de52bf.
Sparse X with X.shape = (72983, 1076)

import warnings
from pathlib import Path
import numpy as np
from sklearn._loss import HalfBinomialLoss
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.impute import SimpleImputer
from sklearn.linear_model._glm.glm import _GeneralizedLinearRegressor
from sklearn.linear_model._linear_loss import LinearModelLoss
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from time import perf_counter
import pandas as pd
import joblib


class BinomialRegressor(_GeneralizedLinearRegressor):
    def _get_loss(self):
        return HalfBinomialLoss()


def prepare_data():
    df = fetch_openml(data_id=41162, as_frame=True, parser="auto").frame
    linear_model_preprocessor = ColumnTransformer(
        [
            (
                "passthrough_numeric",
                make_pipeline(SimpleImputer(), StandardScaler()),
                [
                    "MMRAcquisitionAuctionAveragePrice",
                    "MMRAcquisitionAuctionCleanPrice",
                    "MMRCurrentAuctionAveragePrice",
                    "MMRCurrentAuctionCleanPrice",
                    "MMRCurrentRetailAveragePrice",
                    "MMRCurrentRetailCleanPrice",
                    "MMRCurrentRetailAveragePrice",
                    "MMRCurrentRetailCleanPrice",
                    "VehBCost",
                    "VehOdo",
                    "VehYear",
                    "VehicleAge",
                    "WarrantyCost",
                ],
            ),
            (
                "onehot_categorical",
                OneHotEncoder(min_frequency=10),
                [
                    "Auction",
                    "Color",
                    "IsOnlineSale",
                    "Make",
                    "Model",
                    "Nationality",
                    "Size",
                    "SubModel",
                    "Transmission",
                    "Trim",
                    "WheelType",
                ],
            ),
        ],
        remainder="drop",
    )
    y = np.asarray(df["IsBadBuy"] == "1", dtype=float)
    X = linear_model_preprocessor.fit_transform(df)
    return X, y


X, y = prepare_data()
# X = X.toarray()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.9, random_state=0
)
print(f"{X_train.shape = }")

results = []
loss_sw = np.full_like(y_train, fill_value=(1. / y_train.shape[0]))
alpha = 1e-4
for tol in np.logspace(-1, -10, 10):
    for solver in ["lbfgs", "newton-cholesky", "newton-lsmr", "logreg-lbfgs", "logreg-newton-cg"]:
        tic = perf_counter()
        with warnings.catch_warnings():
            warnings.filterwarnings("ignore", category=ConvergenceWarning)
            if solver in ["lbfgs", "newton-cholesky", "newton-lsmr"]:
                reg = BinomialRegressor(
                    alpha=alpha, solver=solver, tol=tol, max_iter=10000
                ).fit(X_train, y_train)
            else:
                reg = LogisticRegression(
                    solver=solver[7:], C=1 / alpha / X.shape[0], tol=tol, max_iter=1000
                ).fit(X_train, y_train)
        toc = perf_counter()
        train_time = toc - tic
        train_loss = LinearModelLoss(
            base_loss=HalfBinomialLoss(), fit_intercept=reg.fit_intercept
        ).loss(
            coef=np.r_[np.squeeze(reg.coef_), reg.intercept_],
            X=X_train,
            y=y_train,
            l2_reg_strength=alpha,
            sample_weight=loss_sw,
        )
        result = {
            "solver": solver,
            "tol": tol,
            "train_loss": train_loss,
            "train_time": train_time,
            "train_score": reg.score(X_train, y_train),
            "test_score": reg.score(X_test, y_test),
            "n_iter": np.squeeze(reg.n_iter_),
            "converged": np.squeeze(reg.n_iter_) < np.squeeze(reg.max_iter),
        }
        print(result)
        results.append(result)


results = pd.DataFrame.from_records(results)
results.to_csv(Path().resolve() / "bench_logistic_reg.csv")

Plotting

import pandas as pd
from pathlib import Path
import matplotlib.pyplot as plt


results = pd.read_csv(Path().resolve() / "bench_logistic_reg.csv")
results["suboptimality"] = results["train_loss"] - results["train_loss
8000
"].min() + 1e-15
fig, ax = plt.subplots(figsize=(8, 6))
for label, group in results.groupby("solver"):
    group.sort_values("tol").plot(x="n_iter", y="suboptimality", loglog=True, marker="o", label=label, ax=ax)
ax.set_ylabel("suboptimality")
ax.set_title("Suboptimality by iterations, sparse X")


fig, ax = plt.subplots(figsize=(8, 6))
for label, group in results.groupby("solver"):
    group.sort_values("tol").plot(x="train_time", y="suboptimality", loglog=True, marker="o", label=label, ax=ax)
ax.set_ylabel("suboptimality")
ax.set_title("Suboptimality by time, sparse X")

[Figures: suboptimality vs. n_iter and suboptimality vs. train_time for the logistic regression benchmark on sparse X]

Summary

  • Note that results look similar with dense X.
  • For these datasets with n_samples > n_features and some one-hot encoded (OHE) features, newton-cholesky is the fastest at higher precision (smaller tol or suboptimality).
  • At lower precision, newton-lsmr or lbfgs can be faster; which one is faster depends on the dataset.
  • The behaviour of LogisticRegression (logreg-lbfgs, logreg-newton-cg) is very strange. These solvers seem unable to converge any further towards the true minimum. This is all the more strange as the only difference between logreg-lbfgs and lbfgs is the normalization of the objective function (logreg-lbfgs uses sum(loss_i), lbfgs uses mean(loss_i)); everything else is exactly identical!
    Conclusion: mean(objective) leads to numerically better behaved stopping criteria than sum(objective); see the short illustration below.
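
A rough numerical illustration of that last point (my reading, assumed rather than an analysis of these solvers): the gradient of sum(loss_i) has an absolute floating point noise floor that grows with n_samples and therefore cannot be driven arbitrarily close to zero, while the mean gradient's floor is n_samples times smaller, so small absolute tolerances remain reachable.

import numpy as np

rng = np.random.default_rng(0)
n = 65_000
g_i = rng.normal(size=n)
g_i -= g_i.mean()       # per-sample gradients that should sum to exactly zero

print(abs(g_i.sum()))   # tiny but nonzero: rounding-noise floor of the summed gradient
print(abs(g_i.mean()))  # n times smaller for the mean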

@ogrisel
Member
ogrisel commented Jun 2, 2022

BTW, thanks for the benchmarks, it also looks like a strong contender. It would be interesting to systematically compare all those solvers for various data shapes (n_samples, n_features, n_classes/n_outputs), duplicated/collinear features, regularization and also measure memory usage on dense and sparse data.

@ogrisel
Member
ogrisel commented Jun 2, 2022

can deal with singular (collinear) X, but may fall short for very ill-conditioned X

Not sure what this means. It can handle collinear data as long as n_samples >> n_uniquely_informative_features?

# Estimated Frobenius norm of A and norm of residual r for tolerance of
# next iteration.
self.A_norm = result[5]
self.r_norm = result[3]
Member
@ogrisel ogrisel Jun 2, 2022


We might want to issue a warning (and suggest increasing the regularization) if result[1] == 7 (istop), which means that lsmr has not converged within the default maxiter (which should be n_features by construction in our case), possibly because of ill-conditioning.

Instead of warning, we might even want to experiment with doing more iterations of lsmr when we detect this situation, as suggested in the docstring of lsmr, see: https://github.com/scipy/scipy/blob/main/scipy/sparse/linalg/_isolve/lsmr.py

and then only warn if we do not converge with maxiter=2 * n_features in total for instance (e.g. running lsmr twice in a row if the first call resulted in istop == 7).

Note that issuing a warning when the inner solver fails to converge while still proceeding with the outer loop might just result in a wall of warnings and a badly converged model. Maybe we could also choose to interrupt the outer solver loop early with a ConvergenceWarning in such a case. This is related to: lorentzenchr#6 (which investigates a similar case for the "newton-cholesky" solver).
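
A rough sketch of that retry-then-warn idea (a hypothetical helper using only scipy's public lsmr API, not code from this PR):

import warnings
from scipy.sparse.linalg import lsmr
from sklearn.exceptions import ConvergenceWarning


def lsmr_with_retry(A, b, atol, btol, maxiter):
    x, istop, *rest = lsmr(A, b, atol=atol, btol=btol, maxiter=maxiter)
    if istop == 7:  # maxiter reached before the stopping tolerances were met
        # Warm-start a second run from the current iterate.
        x, istop, *rest = lsmr(A, b, atol=atol, btol=btol, maxiter=maxiter, x0=x)
        if istop == 7:
            warnings.warn(
                "Inner LSMR iteration did not converge; the problem may be "
                "ill-conditioned. Consider increasing the regularization (alpha).",
                ConvergenceWarning,
            )
    return x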

@lorentzenchr
Member Author
lorentzenchr commented Jul 2, 2022

Notes on multi-class logistic / multinomial regression

Is this possible with newton-lsmr?
Consider 3 classes. We concatenate the coefficients of each class, i.e. coef = (coef_class_1, coef_class_2, coef_class_3). The Hessian becomes

(X'    ) (W11 W12 W13) (X  )
(  X'  ) (W12 W22 W23) ( X )
(    X') (W13 W23 W33) (  X)

Each submatrix W_ij is diagonal of dimension n_samples x n_samples, with W_ii = p_i * (1 - p_i) and W_ij = -p_i * p_j for i != j (elementwise, with predicted probabilities p_i per class). For LSMR to work, we need to find a square root of the whole W, say V with V'V = W. One could achieve this via a Cholesky decomposition, but W is very sparse and very large, i.e. W.shape = (n_classes * n_samples, n_classes * n_samples). I tried to find analytic expressions, but so far without luck.
See Chapter 3.2 of https://www.jstatsoft.org/article/view/v075i03.

Update

There is some literature on this, in particular https://doi.org/10.1111/j.2517-6161.1992.tb01875.x. As this is not open access, I tried to derive it myself, resulting in:

Cholesky Decomposition of multinomial covariance matrix

Mathematical sum convention, starting from 1.
p = n-dimensional vector of probabilities summing to 1.
C = diag(p) - p p', meaning C_ii = p_i - p_i^2 and C_ij = -p_i p_j for i != j

Cholesky decomposition C = L L' with lower triangular L.
L_ii = sqrt(p_i * (1 - sum(p_k, k=1..i)) / (1 - sum(p_k, k=1..i-1)))
Note that L_nn = 0.

for i > j
L_ij = -p_i p_j / sqrt(p_j * (1 - sum(p_k, k=1..j-1)) * (1 - sum(p_k, k=1..j)))

I verified this for n=4:

import numpy as np

p = np.array([0.1, 0.2, 0.3, 0.4])
C = np.diag(p) - np.outer(p, p)

L = np.zeros_like(C)
for i in range(L.shape[0]):
    L[i, i] = np.sqrt(p[i] * (1 - np.sum(p[:i+1])) / (1 - np.sum(p[:i])))
for i in range(L.shape[0]):
    for j in range(i):
        den = p[j] * (1 - np.sum(p[:j])) * (1 - np.sum(p[:j+1]))
        L[i, j] = -p[i] * p[j] / np.sqrt(den)

np.allclose(L @ L.T, C)

I think this makes multinomial newton-lsmr possible!

@lorentzenchr
Member Author
lorentzenchr commented Jul 4, 2022

LDL of multinomial covariance matrix

The square root free LDL decomposition, given in https://doi.org/10.1111/j.2517-6161.1992.tb01875.x, reads:

Mathematical sum convention, starting from 1.
p = n-dimensional vector of probabilities summing to 1.
C = diag(p) - p p', meaning C_ii = p_i - p_i^2 and C_ij = -p_i p_j for i != j

LDL (square root free Cholesky) decomposition C = L D L' with lower triangular L and diagonal D.
q_0 = 1
q_i = 1 - sum(p_k, k=1..i)

L_ii = 1
for i > j:
    L_ij = -p_i / q_j

D_ii = p_i * q_i / q_{i-1}

If the p_i sum to 1, then q_n = 0 and D_nn = 0.
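
For completeness, a quick numerical check of these LDL formulas, analogous to the Cholesky check above:

import numpy as np

p = np.array([0.1, 0.2, 0.3, 0.4])
C = np.diag(p) - np.outer(p, p)

n = p.shape[0]
q = 1 - np.cumsum(p)          # q_i = 1 - sum(p_k, k=1..i), so q[-1] = 0
q_prev = np.r_[1.0, q[:-1]]   # q_0 = 1

L = np.eye(n)
for i in range(n):
    for j in range(i):
        L[i, j] = -p[i] / q[j]
D = np.diag(p * q / q_prev)   # D_nn = 0 because q_n = 0

np.allclose(L @ D @ L.T, C)   # True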

@ogrisel
Member
ogrisel commented Jul 20, 2022

I think this makes multinomial newton-lsmr possible!

That's very promising.

@lorentzenchr
Member Author

Adding multiclass/multinomial would be super cool, but I guess this PR would explode. I'd rather defer it to a later PR.

@lorentzenchr changed the title from "FEA add GLM Newton-LSMR Solver" to "FEA add GLM Newton-LSMR Solver on top of Newton-Cholesky" on Nov 3, 2022
@lorentzenchr
Member Author

I will open a new PR to add LSMR. I'll keep this branch, as it is important for https://github.com/lorentzenchr/notebooks/blob/master/bench_glm_newton_solvers_cholesky_lsmr.ipynb.
