[DOC] Fix MDS images by Micky774 · Pull Request #22464 · scikit-learn/scikit-learn · GitHub

[DOC] Fix MDS images #22464

Closed
wants to merge 9 commits into from

71 changes: 57 additions & 14 deletions doc/modules/linear_model.rst
@@ -860,28 +860,71 @@ regularization.
that it improves numerical stability. No regularization amounts to
setting C to a very high value.

Binary Case
-----------

For notational ease, we assume that the target :math:`y_i` takes values in the
set :math:`\{0, 1\}` for data point :math:`i`. As an optimization problem, binary
class logistic regression with regularization term :math:`r(w)` minimizes the
following cost function:

.. math:: \min_{w, w_0} r(w) + C \sum_{i=1}^n \left(\log(1 + \exp(X_i^T w + w_0)) - y_i (X_i^T w + w_0)\right).

Once fitted, the ``predict_proba`` method of ``LogisticRegression`` predicts the
probability of the positive class,
:math:`P(y_i=1|X_i) = \operatorname{expit}(X_i^T w + w_0) = \frac{1}{1 + \exp(-X_i^T w - w_0)}`.
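
As a minimal sketch of how this maps onto the fitted estimator (the toy dataset
and the names ``clf`` and ``manual_proba`` below are only illustrative), the
positive-class probability can be recomputed from the ``coef_`` and
``intercept_`` attributes::

    import numpy as np
    from scipy.special import expit  # the logistic sigmoid 1 / (1 + exp(-t))
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # Illustrative binary dataset; any two-class data works the same way.
    X, y = make_classification(n_samples=100, n_features=4, random_state=0)
    clf = LogisticRegression().fit(X, y)

    # P(y_i = 1 | X_i) = expit(X_i^T w + w_0), recomputed from the fitted parameters.
    manual_proba = expit(X @ clf.coef_.ravel() + clf.intercept_)
    assert np.allclose(manual_proba, clf.predict_proba(X)[:, 1])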

Multinomial Case
----------------

The binary case can be extended to :math:`K` classes, leading to multinomial
logistic regression; see also the `log-linear model
<https://en.wikipedia.org/wiki/Multinomial_logistic_regression#As_a_log-linear_model>`_.

.. note::
    It is possible in a :math:`K`-class context to parameterize the model
    using only :math:`K-1` weight vectors, leaving one class probability fully
    determined by the other class probabilities by leveraging the fact that all
    class probabilities must sum to one. We deliberately choose to overparameterize
    the model using :math:`K` weight vectors for ease of implementation and to
    preserve the symmetrical inductive bias regarding ordering of classes, see [1].
    This effect becomes especially important when using regularization.
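
For instance, a small sketch (using the iris dataset purely for illustration)
shows that the fitted coefficient matrix stores one weight vector per class
rather than :math:`K-1`::

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    # One weight vector per class: K = 3 classes, 4 features.
    print(clf.coef_.shape)  # (3, 4)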

Let :math:`J_i` be the one-hot vector with a :math:`1` at element :math:`i` and
:math:`0` everywhere else. In the multinomial context with :math:`K` classes,
we define the target vector of :math:`X_n` as :math:`Y_n=J_t` where :math:`t`
is the true class of :math:`X_n`. Instead of a single weight vector, we now have
a matrix of weights :math:`W` where each vector :math:`W_k` corresponds to class
:math:`k`. Then we can define the vector of class probabilities :math:`z_n`
component-wise as:

.. math:: p(Y_n=J_k|X_n) = z_{n,k} = \frac{\exp (W_k^T X_n)}{\sum_j \exp (W_j^T X_n)}

Finding the weight matrix :math:`W` corresponds to solving the following
optimization problem:

.. math:: \min_W r(W) - C \sum_n Y_n^T \log(z_n),

where the logarithm is applied to :math:`z_n` element-wise.
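
As a plain-NumPy sketch of these quantities (the shapes and random data below
are hypothetical), the class probabilities :math:`z_n` and the data-fit term of
this objective can be computed as::

    import numpy as np

    rng = np.random.default_rng(0)
    n_samples, n_features, n_classes = 6, 4, 3
    X = rng.normal(size=(n_samples, n_features))
    W = rng.normal(size=(n_classes, n_features))
    Y = np.eye(n_classes)[rng.integers(n_classes, size=n_samples)]  # one-hot targets

    scores = X @ W.T                                                # W_k^T X_n for every n, k
    Z = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # softmax, row-wise
    data_term = -(Y * np.log(Z)).sum()                              # -sum_n Y_n^T log(z_n)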

.. note::

    In the multinomial case, the regularization function internally flattens the
    matrix of weights into a vector. This is equivalent to concatenating the
    individual :math:`W_k` vectors. Thus, for a matrix :math:`W`, using
    :math:`\ell_2` regularization is equivalent to penalizing the squared
    Frobenius norm: :math:`r(W) = \frac{1}{2}\|W\|_F^2`.
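
A quick numerical check of this equivalence, using an arbitrary small matrix as
a stand-in for :math:`W`::

    import numpy as np

    W = np.arange(6.0).reshape(2, 3)  # small illustrative weight matrix
    w = W.ravel()                     # flatten, i.e. concatenate the W_k vectors
    # l2 penalty of the flattened vector equals the squared Frobenius norm of W
    # (up to the common 1/2 factor).
    assert np.isclose(w @ w, np.linalg.norm(W, "fro") ** 2)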

Regularization
--------------
We currently implement four choices of regularization term :math:`r(w)`:

#. None, :math:`r(w) = 0`
#. :math:`\ell_1,\, r(w) = \|w\|_1`
#. :math:`\ell_2,\, r(w) = \frac{1}{2}\|w\|_2^2 = \frac{1}{2}w^T w`
#. ElasticNet, :math:`r(w) = \frac{1 - \rho}{2}w^T w + \rho \|w\|_1`

For ElasticNet, :math:`\rho` (which corresponds to the `l1_ratio` parameter)
controls the strength of :math:`\ell_1` regularization vs. :math:`\ell_2`
regularization. Elastic-Net is equivalent to :math:`\ell_1` when
:math:`\rho = 1` and equivalent to :math:`\ell_2` when :math:`\rho=0`.
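
These choices map onto the ``penalty`` (and, for Elastic-Net, ``l1_ratio``)
parameters of :class:`LogisticRegression`. The following sketch shows one
possible configuration per regularization term; which solvers support which
penalties depends on the scikit-learn version::

    from sklearn.linear_model import LogisticRegression

    no_reg = LogisticRegression(penalty=None)  # r(w) = 0; older versions use penalty="none"
    l1_reg = LogisticRegression(penalty="l1", solver="liblinear")
    l2_reg = LogisticRegression(penalty="l2")  # the default penalty
    enet_reg = LogisticRegression(penalty="elasticnet", solver="saga", l1_ratio=0.5)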

Solvers
-------

The solvers implemented in the class :class:`LogisticRegression`
are "liblinear", "newton-cg", "lbfgs", "sag" and "saga":
Expand Down
16 changes: 6 additions & 10 deletions examples/manifold/plot_lle_digits.py
@@ -48,8 +48,11 @@
from sklearn.preprocessing import MinMaxScaler


def plot_embedding(X, title):
    X = MinMaxScaler().fit_transform(X)
    # Create a dedicated figure so each embedding gets its own plot.
    plt.figure()
    ax = plt.subplot(111)

    for digit in digits.target_names:
        ax.scatter(
            *X[y == digit].T,
@@ -175,15 +178,8 @@ def plot_embedding(X, title, ax):

# %%
# Finally, we can plot the resulting projection given by each method.
for name in timing:
    title = f"{name} (time {timing[name]:.3f}s)"
    plot_embedding(projections[name], title)

plt.show()