diff --git a/doc/modules/linear_model.rst b/doc/modules/linear_model.rst
index 6d176e8482537..9bd210f887e61 100644
--- a/doc/modules/linear_model.rst
+++ b/doc/modules/linear_model.rst
@@ -860,28 +860,71 @@ regularization.
    that it improves numerical stability. No regularization amounts to
    setting C to a very high value.
 
-As an optimization problem, binary class :math:`\ell_2` penalized logistic
-regression minimizes the following cost function:
+Binary Case
+-----------
 
-.. math:: \min_{w, c} \frac{1}{2}w^T w + C \sum_{i=1}^n \log(\exp(- y_i (X_i^T w + c)) + 1) .
+For notational ease, we assume that the target :math:`y_i` takes values in the
+set :math:`\{0, 1\}` for data point :math:`i`. As an optimization problem,
+binary class logistic regression with regularization term :math:`r(w)`
+minimizes the following cost function:
 
-Similarly, :math:`\ell_1` regularized logistic regression solves the following
+.. math:: \min_{w, w_0} r(w) + C \sum_{i=1}^n \left(\log(1 + \exp(X_i^T w + w_0)) - y_i (X_i^T w + w_0)\right).
+
+Once fitted, the ``predict_proba`` method of ``LogisticRegression`` predicts the probability of the positive class, :math:`P(y_i=1|X_i) = \operatorname{expit}(X_i^T w + w_0) = \frac{1}{1 + \exp(-X_i^T w - w_0)}`.
+
+Multinomial Case
+----------------
+
+The binary case can be extended to :math:`K` classes, leading to multinomial
+logistic regression, see also `log-linear model`_.
+
+.. note::
+    It is possible in a :math:`K`-class context to parameterize the model
+    using only :math:`K-1` weight vectors, leaving one class probability fully
+    determined by the other class probabilities by leveraging the fact that
+    all class probabilities must sum to one. We deliberately choose to
+    overparameterize the model using :math:`K` weight vectors for ease of
+    implementation and to preserve the symmetrical inductive bias regarding
+    the ordering of classes, see [1]. This effect becomes especially
+    important when using regularization.
+
+Let :math:`J_i` be a binary vector that is :math:`1` at element :math:`i` and
+:math:`0` everywhere else (a one-hot encoding). In the multinomial context
+with :math:`K` classes, we define the target vector of :math:`X_n` as
+:math:`Y_n=J_t` where :math:`t` is the true class of :math:`X_n`. Instead of a
+single weight vector, we now have a matrix of weights :math:`W` where each
+vector :math:`W_k` corresponds to class :math:`k`. Then we can define the
+vector of predicted class probabilities :math:`z_n` component-wise as:
+
+.. math:: P(Y_n=J_k|X_n) = z_{n,k} = \frac{\exp (W_k^T X_n)}{\sum_j \exp (W_j^T X_n)}
+
+Finding the weight matrix :math:`W` corresponds to solving the following
+optimization problem:
 
-.. math:: \min_{w, c} \|w\|_1 + C \sum_{i=1}^n \log(\exp(- y_i (X_i^T w + c)) + 1).
+.. math:: \min_W r(W) - C\sum_n Y_n^T \log(z_n),
+
+where the logarithm is applied to :math:`z_n` element-wise.
+
+.. note::
 
-Elastic-Net regularization is a combination of :math:`\ell_1` and
-:math:`\ell_2`, and minimizes the following cost function:
+    In the multinomial case, the regularization function internally flattens
+    the matrix of weights into a vector. This is equivalent to concatenating
+    each individual :math:`W_k` vector. Thus, for a matrix :math:`W`, using
+    :math:`\ell_2` regularization is equivalent to penalizing the squared
+    Frobenius norm: :math:`r(W) = \frac{1}{2}\|W\|_F^2`.
+
+Regularization
+--------------
+
+We currently implement four choices of regularization term:
+
-.. math:: \min_{w, c} \frac{1 - \rho}{2}w^T w + \rho \|w\|_1 + C \sum_{i=1}^n \log(\exp(- y_i (X_i^T w + c)) + 1),
+#. None, :math:`r(w) = 0`
+#. :math:`\ell_1,\, r(w) = \|w\|_1`
+#. :math:`\ell_2,\, r(w) = \frac{1}{2}\|w\|_2^2 = \frac{1}{2}w^T w`
+#. ElasticNet, :math:`r(w) = \frac{1 - \rho}{2}w^T w + \rho \|w\|_1`
 
-where :math:`\rho` controls the strength of :math:`\ell_1` regularization vs.
-:math:`\ell_2` regularization (it corresponds to the `l1_ratio` parameter).
+For ElasticNet, :math:`\rho` (which corresponds to the `l1_ratio` parameter)
+controls the strength of :math:`\ell_1` regularization vs. :math:`\ell_2`
+regularization. Elastic-Net is equivalent to :math:`\ell_1` when
+:math:`\rho = 1` and equivalent to :math:`\ell_2` when :math:`\rho=0`.
 
-Note that, in this notation, it's assumed that the target :math:`y_i` takes
-values in the set :math:`{-1, 1}` at trial :math:`i`. We can also see that
-Elastic-Net is equivalent to :math:`\ell_1` when :math:`\rho = 1` and equivalent
-to :math:`\ell_2` when :math:`\rho=0`.
+Solvers
+-------
 
 The solvers implemented in the class :class:`LogisticRegression`
 are "liblinear", "newton-cg", "lbfgs", "sag" and "saga":
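A quick way to sanity-check the new "Binary Case" wording outside of this patch: the minimal sketch below (not part of the diff; the synthetic dataset, the default solver, and ``C=1.0`` are arbitrary illustration choices) verifies that ``predict_proba`` equals ``expit(X_i^T w + w_0)`` and evaluates the penalized objective with the l2 term ``(1/2) w^T w``.

```python
# Illustration only, not part of the patch: dataset and C=1.0 are assumptions.
import numpy as np
from scipy.special import expit
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
C = 1.0
clf = LogisticRegression(penalty="l2", C=C).fit(X, y)

w, w_0 = clf.coef_.ravel(), clf.intercept_[0]
z = X @ w + w_0

# predict_proba for the positive class should equal expit(X_i^T w + w_0).
assert np.allclose(clf.predict_proba(X)[:, 1], expit(z))

# Penalized objective from the docs text:
# r(w) + C * sum_i [log(1 + exp(z_i)) - y_i * z_i], with r(w) = 0.5 * w^T w.
objective = 0.5 * w @ w + C * np.sum(np.logaddexp(0.0, z) - y * z)
print(f"penalized objective at the fitted coefficients: {objective:.4f}")
```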
diff --git a/examples/manifold/plot_lle_digits.py b/examples/manifold/plot_lle_digits.py
index 3a30ad2256762..dd4048e400ddd 100644
--- a/examples/manifold/plot_lle_digits.py
+++ b/examples/manifold/plot_lle_digits.py
@@ -48,8 +48,11 @@
 from sklearn.preprocessing import MinMaxScaler
 
 
-def plot_embedding(X, title, ax):
+def plot_embedding(X, title):
     X = MinMaxScaler().fit_transform(X)
+    plt.figure()
+    ax = plt.subplot(111)
+
     for digit in digits.target_names:
         ax.scatter(
             *X[y == digit].T,
@@ -175,15 +178,8 @@ def plot_embedding(X, title, ax):
 # %%
 # Finally, we can plot the resulting projection given by each method.
 
-from itertools import zip_longest
-
-fig, axs = plt.subplots(nrows=7, ncols=2, figsize=(17, 24))
-
-for name, ax in zip_longest(timing, axs.ravel()):
-    if name is None:
-        ax.axis("off")
-        continue
+for name in timing:
     title = f"{name} (time {timing[name]:.3f}s)"
-    plot_embedding(projections[name], title, ax)
+    plot_embedding(projections[name], title)
 
 plt.show()
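For readers skimming the example change: after this patch each call to ``plot_embedding`` opens its own figure instead of drawing into a pre-allocated grid of axes, so the final loop no longer needs ``zip_longest`` or empty-axis handling. The self-contained sketch below shows that call pattern in isolation; it is not the actual example file, and the toy random points and integer labels stand in for the digits projections.

```python
# Standalone sketch of the new plot_embedding() call pattern (illustration
# only). Toy random 2D points and labels replace the real digits projections.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X_2d = rng.normal(size=(30, 2))        # stand-in for one computed projection
labels = rng.integers(0, 3, size=30)   # stand-in for the digit targets


def plot_embedding(X, title):
    # As in the patched example: rescale to [0, 1] and open a fresh figure
    # instead of receiving an Axes from the caller.
    X = MinMaxScaler().fit_transform(X)
    plt.figure()
    ax = plt.subplot(111)
    for label in np.unique(labels):
        ax.scatter(*X[labels == label].T, label=str(label), alpha=0.6)
    ax.set_title(title)
    ax.legend()


plot_embedding(X_2d, "toy embedding (time 0.000s)")
plt.show()
```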