From 084d73e26a64b518c11f8392df2def47627ceadf Mon Sep 17 00:00:00 2001
From: Micky774
Date: Fri, 4 Feb 2022 16:29:54 -0500
Subject: [PATCH 1/6] Initial attempt at restructuring logistic regression guide

---
 doc/modules/linear_model.rst | 71 ++++++++++++++++++++++++++++--------
 1 file changed, 56 insertions(+), 15 deletions(-)

diff --git a/doc/modules/linear_model.rst b/doc/modules/linear_model.rst
index 7243990bb5ffe..f6a618bfae4b7 100644
--- a/doc/modules/linear_model.rst
+++ b/doc/modules/linear_model.rst
@@ -860,28 +860,69 @@ regularization.
    that it improves numerical stability. No regularization amounts to setting
    C to a very high value.
 
-As an optimization problem, binary class :math:`\ell_2` penalized logistic
-regression minimizes the following cost function:
+Binary Case
+-----------
 
-.. math:: \min_{w, c} \frac{1}{2}w^T w + C \sum_{i=1}^n \log(\exp(- y_i (X_i^T w + c)) + 1) .
+For notational ease, we assume that the target :math:`y_i` takes values in the
+set :math:`{-1, 1}` at trial :math:`i`. As an optimization problem, binary
+class logistic regression using :math:`r(w)` regularization minimizes the
+following cost function:
 
-Similarly, :math:`\ell_1` regularized logistic regression solves the following
-optimization problem:
+.. math:: \min_{w, c} r(w) + C \sum_{i=1}^n \log(\exp(- y_i (X_i^T w + c)) + 1) .
+
+
+Multinomial Case
+----------------
+
+We may then extend logistic regression to obtain a multinomial estimator by
+considering the logistic regression as a `log-linear model
+`.
+Note that t is possible in a :math:`K`-class context to parameterize the model
+using only :math:`K-1` weight vectors, leaving one class probability fully
+determined by the other class probabilities by leveraging the fact that all
+class probabilities must sum to one. We choose to overparameterize the model
+using :math:`K` weight vectors for ease of implementation and to preserve the
+symmetrical inductive bias regarding ordering of classes. This effect becomes
+especially important when using regularization.
 
-.. math:: \min_{w, c} \|w\|_1 + C \sum_{i=1}^n \log(\exp(- y_i (X_i^T w + c)) + 1).
 
-Elastic-Net regularization is a combination of :math:`\ell_1` and
-:math:`\ell_2`, and minimizes the following cost function:
+In the multinomial context with :math:`K`-many classes, we define the target
+vector of :math:`x_n` as :math:`y_n`, a binary vector with all zeros except for
+at element :math:`y_{n, t}` where :math:`t` is the true class of :math:`x_n`.
+Let :math:`C_i` be a binary vector with a :math:`0` for every element except
+for element :math:`k`, and :math:`W` be a matrix of weights where each vector
+:math:`W_k` corresponds to class :math:`k`. Then we find that
 
-.. math:: \min_{w, c} \frac{1 - \rho}{2}w^T w + \rho \|w\|_1 + C \sum_{i=1}^n \log(\exp(- y_i (X_i^T w + c)) + 1),
+.. math:: p(y_n=C_k|x_n) = z_{n,k} = \frac{\exp (W_k^T x_n)}{\sum_j W_j^T x_n}
 
-where :math:`\rho` controls the strength of :math:`\ell_1` regularization vs.
-:math:`\ell_2` regularization (it corresponds to the `l1_ratio` parameter).
+ Then the multinomial logistic regression solves this
+optimization problem:
 
-Note that, in this notation, it's assumed that the target :math:`y_i` takes
-values in the set :math:`{-1, 1}` at trial :math:`i`. We can also see that
-Elastic-Net is equivalent to :math:`\ell_1` when :math:`\rho = 1` and equivalent
-to :math:`\ell_2` when :math:`\rho=0`.
+.. math:: \min_W r(W) - C\sum_n \sum_k y_{n,k} z_{n,k}
+
+.. note::
+
+    In the multinomial case, the regularization function internally flattens the
+    matrix of weights into a vector which is equivalent to concatenating each
+    individual vector :math:`W_k`. Then using the :math:`\ell_2` regularization,
+    the regularization equivalently takes the Frobenius norm:
+    :math:`r(W) = \|W\|_F`
+
+Regularization
+--------------
+We currently implement four choices of regularization term:
+- None, :math:`r(w) = 0`
+- :math:`\ell_1,\, r(w) = \|w\|_1`
+- :math:`\ell_2,\, r(w) = \frac{1}{2}\|w\|_2^2 = \frac{1}{2}w^T w`
+- ElasticNet, :math:`r(w) = \frac{1 - \rho}{2}w^T w + \rho \|w\|_1`
+
+For ElasticNet, :math:`\rho` (which corresponds to the `l1_ratio` parameter)
+controls the strength of :math:`\ell_1` regularization vs. :math:`\ell_2`
+regularization. Elastic-Net is equivalent to :math:`\ell_1` when
+:math:`\rho = 1` and equivalent to :math:`\ell_2` when :math:`\rho=0`.
+
+Solvers
+-------
 
 The solvers implemented in the class :class:`LogisticRegression`
 are "liblinear", "newton-cg", "lbfgs", "sag" and "saga":
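As a quick, concrete check of the penalized binary objective introduced in the patch above, here is a small NumPy sketch (an editorial illustration only, not part of the patch series; the helper name, the toy data and the choice of an l2 penalty are invented for the example)::

    import numpy as np

    def binary_logistic_objective(w, c, X, y, C=1.0):
        # r(w) = 0.5 * w.T @ w (l2), labels y_i in {-1, +1}, as in the formulas above.
        margins = y * (X @ w + c)              # y_i (X_i^T w + c)
        losses = np.log1p(np.exp(-margins))    # log(1 + exp(-y_i (X_i^T w + c)))
        return 0.5 * (w @ w) + C * losses.sum()

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 3))
    y = np.where(rng.normal(size=20) > 0, 1.0, -1.0)
    # With w = 0 and c = 0 every term is log(2), so the value is 20 * log(2).
    print(binary_logistic_objective(np.zeros(3), 0.0, X, y))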
From 65ac1ecb8a198e902ba8e8cc3ab7e5dcba6a96f4 Mon Sep 17 00:00:00 2001
From: Micky774
Date: Fri, 4 Feb 2022 16:39:25 -0500
Subject: [PATCH 2/6] Minor typo

---
 doc/modules/linear_model.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/modules/linear_model.rst b/doc/modules/linear_model.rst
index f6a618bfae4b7..3921bf935a743 100644
--- a/doc/modules/linear_model.rst
+++ b/doc/modules/linear_model.rst
@@ -877,7 +877,7 @@ Multinomial Case
 We may then extend logistic regression to obtain a multinomial estimator by
 considering the logistic regression as a `log-linear model
 `.
-Note that t is possible in a :math:`K`-class context to parameterize the model
+Note that it is possible in a :math:`K`-class context to parameterize the model
 using only :math:`K-1` weight vectors, leaving one class probability fully
 determined by the other class probabilities by leveraging the fact that all
 class probabilities must sum to one. We choose to overparameterize the model

From 3315027ff8be00439934e99cf905903279c3fd4d Mon Sep 17 00:00:00 2001
From: Meekail Zain
Date: Sat, 5 Feb 2022 16:46:26 -0500
Subject: [PATCH 3/6] Fixed syntax errors and rephrased explanation a bit

---
 doc/modules/linear_model.rst | 33 +++++++++++++++++----------------
 1 file changed, 17 insertions(+), 16 deletions(-)

diff --git a/doc/modules/linear_model.rst b/doc/modules/linear_model.rst
index 3921bf935a743..e7e621bc7a6fa 100644
--- a/doc/modules/linear_model.rst
+++ b/doc/modules/linear_model.rst
@@ -864,7 +864,7 @@ Binary Case
 -----------
 
 For notational ease, we assume that the target :math:`y_i` takes values in the
-set :math:`{-1, 1}` at trial :math:`i`. As an optimization problem, binary
+set :math:`\{-1, 1\}` at trial :math:`i`. As an optimization problem, binary
 class logistic regression using :math:`r(w)` regularization minimizes the
 following cost function:
 
@@ -876,7 +876,7 @@ Multinomial Case
 
 We may then extend logistic regression to obtain a multinomial estimator by
 considering the logistic regression as a `log-linear model
-`.
+`_.
 Note that it is possible in a :math:`K`-class context to parameterize the model
 using only :math:`K-1` weight vectors, leaving one class probability fully
 determined by the other class probabilities by leveraging the fact that all
 class probabilities must sum to one. We choose to overparameterize the model
 using :math:`K` weight vectors for ease of implementation and to preserve the
 symmetrical inductive bias regarding ordering of classes. This effect becomes
 especially important when using regularization.
 
-In the multinomial context with :math:`K`-many classes, we define the target
-vector of :math:`x_n` as :math:`y_n`, a binary vector with all zeros except for
-at element :math:`y_{n, t}` where :math:`t` is the true class of :math:`x_n`.
-Let :math:`C_i` be a binary vector with a :math:`0` for every element except
-for element :math:`k`, and :math:`W` be a matrix of weights where each vector
-:math:`W_k` corresponds to class :math:`k`. Then we find that
+Let :math:`J_i` be a binary vector with a :math:`0` for every element except
+for element :math:`i`. In the multinomial context with :math:`K`-many classes,
+we define the target vector of :math:`x_n` as :math:`y_n=J_t` where :math:`t`
+is the true class of :math:`x_n`. Instead of a single weight vector, we now have
+a matrix of weights :math:`W` where each vector :math:`W_k` corresponds to class
+:math:`k`. Then we find that
 
-.. math:: p(y_n=C_k|x_n) = z_{n,k} = \frac{\exp (W_k^T x_n)}{\sum_j W_j^T x_n}
+.. math:: p(y_n=J_k|x_n) = z_{n,k} = \frac{\exp (W_k^T x_n)}{\sum_j W_j^T x_n}
 
- Then the multinomial logistic regression solves this
+Then the multinomial logistic regression solves this
 optimization problem:
 
 .. math:: \min_W r(W) - C\sum_n \sum_k y_{n,k} z_{n,k}
 
 .. note::
 
     In the multinomial case, the regularization function internally flattens the
     matrix of weights into a vector which is equivalent to concatenating each
-    individual vector :math:`W_k`. Then using the :math:`\ell_2` regularization,
-    the regularization equivalently takes the Frobenius norm:
+    individual :math:`W_k` vector. Thus, for a matrix :math:`W`, using
+    :math:`\ell_2` regularization is equivalent to taking the Frobenius norm:
     :math:`r(W) = \|W\|_F`
 
 Regularization
 --------------
 We currently implement four choices of regularization term:
-- None, :math:`r(w) = 0`
-- :math:`\ell_1,\, r(w) = \|w\|_1`
-- :math:`\ell_2,\, r(w) = \frac{1}{2}\|w\|_2^2 = \frac{1}{2}w^T w`
-- ElasticNet, :math:`r(w) = \frac{1 - \rho}{2}w^T w + \rho \|w\|_1`
+
+#. None, :math:`r(w) = 0`
+#. :math:`\ell_1,\, r(w) = \|w\|_1`
+#. :math:`\ell_2,\, r(w) = \frac{1}{2}\|w\|_2^2 = \frac{1}{2}w^T w`
+#. ElasticNet, :math:`r(w) = \frac{1 - \rho}{2}w^T w + \rho \|w\|_1`
 
 For ElasticNet, :math:`\rho` (which corresponds to the `l1_ratio` parameter)
 controls the strength of :math:`\ell_1` regularization vs. :math:`\ell_2`
 regularization. Elastic-Net is equivalent to :math:`\ell_1` when
 :math:`\rho = 1` and equivalent to :math:`\ell_2` when :math:`\rho=0`.
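The four regularization choices listed in the patches above map directly onto the ``penalty`` and ``l1_ratio`` parameters of :class:`LogisticRegression`. A short usage sketch follows (toy data from ``make_classification``; note that, depending on the scikit-learn version, "no penalty" is spelled either ``penalty=None`` or ``penalty="none"``)::

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=200, n_features=10, random_state=0)

    # No regularization: r(w) = 0 (penalty=None in recent releases,
    # penalty="none" in older ones).
    no_reg = LogisticRegression(penalty=None).fit(X, y)

    # l1: requires a solver that supports it, e.g. "liblinear" or "saga".
    l1 = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)

    # l2: the default penalty, works with the default "lbfgs" solver.
    l2 = LogisticRegression(penalty="l2", C=1.0).fit(X, y)

    # Elastic-Net: only "saga" supports it; l1_ratio plays the role of rho.
    enet = LogisticRegression(
        penalty="elasticnet", solver="saga", l1_ratio=0.5, C=1.0, max_iter=10_000
    ).fit(X, y)

With ``penalty="l2"`` and a very large ``C``, the fit approaches the unpenalized solution, which matches the note at the top of the section that "no regularization amounts to setting C to a very high value".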
From cfd9bb2461d85a3b14a28c6f16d18a85b2b622ca Mon Sep 17 00:00:00 2001
From: Meekail Zain
Date: Sat, 5 Feb 2022 19:50:43 -0500
Subject: [PATCH 4/6] Improved formatting and renamed variables for consistency

---
 doc/modules/linear_model.rst | 27 ++++++++++++++-------------
 1 file changed, 14 insertions(+), 13 deletions(-)

diff --git a/doc/modules/linear_model.rst b/doc/modules/linear_model.rst
index 38d6c14c8c18e..79924a394e0c0 100644
--- a/doc/modules/linear_model.rst
+++ b/doc/modules/linear_model.rst
@@ -877,33 +877,34 @@ Multinomial Case
 
 We may then extend logistic regression to obtain a multinomial estimator by
 considering the logistic regression as a `log-linear model
 `_.
-Note that it is possible in a :math:`K`-class context to parameterize the model
-using only :math:`K-1` weight vectors, leaving one class probability fully
-determined by the other class probabilities by leveraging the fact that all
-class probabilities must sum to one. We choose to overparameterize the model
-using :math:`K` weight vectors for ease of implementation and to preserve the
-symmetrical inductive bias regarding ordering of classes. This effect becomes
-especially important when using regularization.
+.. note::
+    It is possible in a :math:`K`-class context to parameterize the model
+    using only :math:`K-1` weight vectors, leaving one class probability fully
+    determined by the other class probabilities by leveraging the fact that all
+    class probabilities must sum to one. We choose to overparameterize the model
+    using :math:`K` weight vectors for ease of implementation and to preserve the
+    symmetrical inductive bias regarding ordering of classes. This effect becomes
+    especially important when using regularization.
 
 Let :math:`J_i` be a binary vector with a :math:`0` for every element except
 for element :math:`i`. In the multinomial context with :math:`K`-many classes,
-we define the target vector of :math:`x_n` as :math:`y_n=J_t` where :math:`t`
-is the true class of :math:`x_n`. Instead of a single weight vector, we now have
+we define the target vector of :math:`X_n` as :math:`Y_n=J_t` where :math:`t`
+is the true class of :math:`X_n`. Instead of a single weight vector, we now have
 a matrix of weights :math:`W` where each vector :math:`W_k` corresponds to class
-:math:`k`. Then we find that
+:math:`k`. Then we can define the evidence vector :math:`z_n` component-wise as:
 
-.. math:: p(y_n=J_k|x_n) = z_{n,k} = \frac{\exp (W_k^T x_n)}{\sum_j W_j^T x_n}
+.. math:: p(Y_n=J_k|X_n) = z_{n,k} = \frac{\exp (W_k^T X_n)}{\sum_j \exp (W_j^T X_n)}
 
 Then the multinomial logistic regression solves this
 optimization problem:
 
-.. math:: \min_W r(W) - C\sum_n \sum_k y_{n,k} z_{n,k}
+.. math:: \min_W r(W) - C\sum_n Y_n^T z_n
 
 .. note::
 
     In the multinomial case, the regularization function internally flattens the
-    matrix of weights into a vector which is equivalent to concatenating each
+    matrix of weights into a vector. This is equivalent to concatenating each
     individual :math:`W_k` vector. Thus, for a matrix :math:`W`, using
     :math:`\ell_2` regularization is equivalent to taking the Frobenius norm:
     :math:`r(W) = \|W\|_F`
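To make the overparameterized multinomial formulation and the Frobenius-norm note in the patch above concrete, here is a minimal NumPy sketch (illustrative only; the array names and toy sizes are invented)::

    import numpy as np

    rng = np.random.default_rng(0)
    n_classes, n_features = 3, 4
    W = rng.normal(size=(n_classes, n_features))   # one weight vector W_k per class
    x = rng.normal(size=n_features)

    scores = W @ x                                 # W_k^T x for every class k
    z = np.exp(scores - scores.max())              # softmax, shifted for numerical stability
    z /= z.sum()
    print(z, z.sum())                              # class probabilities, summing to 1

    # The note above: l2 on the flattened weights equals the Frobenius norm of W.
    print(np.isclose(np.linalg.norm(W.ravel()), np.linalg.norm(W, "fro")))  # True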
From 40dc10e1c55118efa898525e2864d197cf3d8c9b Mon Sep 17 00:00:00 2001
From: Meekail Zain <34613774+Micky774@users.noreply.github.com>
Date: Sat, 12 Feb 2022 14:54:16 -0500
Subject: [PATCH 5/6] Apply suggestions from code review

Co-authored-by: Christian Lorentzen
---
 doc/modules/linear_model.rst | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/doc/modules/linear_model.rst b/doc/modules/linear_model.rst
index 79924a394e0c0..9bd210f887e61 100644
--- a/doc/modules/linear_model.rst
+++ b/doc/modules/linear_model.rst
@@ -864,27 +864,27 @@ Binary Case
 -----------
 
 For notational ease, we assume that the target :math:`y_i` takes values in the
-set :math:`\{-1, 1\}` at trial :math:`i`. As an optimization problem, binary
-class logistic regression using :math:`r(w)` regularization minimizes the
+set :math:`\{0, 1\}` for data point :math:`i`. As an optimization problem, binary
+class logistic regression with regularization term :math:`r(w)` minimizes the
 following cost function:
 
-.. math:: \min_{w, c} r(w) + C \sum_{i=1}^n \log(\exp(- y_i (X_i^T w + c)) + 1) .
+.. math:: \min_{w, c} r(w) + C \sum_{i=1}^n \left(\log(1 + \exp(X_i^T w + w_0)) - y_i (X_i^T w + w_0)\right).
 
+Once fitted, the ``predict_proba`` method of ``LogisticRegression`` predicts the probability of the positive class, :math:`P(y_i=1|X_i) = \operatorname{expit}(X_i^T w + w_0) = \frac{1}{1 + \exp(-X_i^T w - w_0)}`.
 
 Multinomial Case
 ----------------
 
-We may then extend logistic regression to obtain a multinomial estimator by
-considering the logistic regression as a `log-linear model
+The binary case can be extended to :math:`K` classes, leading to the multinomial logistic regression, see also `log-linear model
 `_.
 
 .. note::
     It is possible in a :math:`K`-class context to parameterize the model
     using only :math:`K-1` weight vectors, leaving one class probability fully
     determined by the other class probabilities by leveraging the fact that all
-    class probabilities must sum to one. We choose to overparameterize the model
+    class probabilities must sum to one. We deliberately choose to overparameterize the model
     using :math:`K` weight vectors for ease of implementation and to preserve the
-    symmetrical inductive bias regarding ordering of classes. This effect becomes
+    symmetrical inductive bias regarding ordering of classes, see [1]. This effect becomes
    especially important when using regularization.
 
 Let :math:`J_i` be a binary vector with a :math:`0` for every element except
@@ -896,7 +896,7 @@ a matrix of weights :math:`W` where each vector :math:`W_k` corresponds to class
 
 .. math:: p(Y_n=J_k|X_n) = z_{n,k} = \frac{\exp (W_k^T X_n)}{\sum_j \exp (W_j^T X_n)}
 
-Then the multinomial logistic regression solves this
+Finding the weight matrix :math:`W` corresponds to solving the following
 optimization problem:
 
 .. math:: \min_W r(W) - C\sum_n Y_n^T z_n
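A small sketch of the ``predict_proba`` relationship stated in the patch above, checking on a synthetic dataset that the returned probability of the positive class equals :math:`\operatorname{expit}(X_i^T w + w_0)` (editorial illustration only, not part of the patch series)::

    import numpy as np
    from scipy.special import expit
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=100, n_features=4, random_state=0)
    clf = LogisticRegression().fit(X, y)

    # The second column of predict_proba is P(y_i = 1 | X_i) = expit(X_i^T w + w_0).
    manual = expit(X @ clf.coef_.ravel() + clf.intercept_[0])
    print(np.allclose(manual, clf.predict_proba(X)[:, 1]))  # True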
From 528dfe890f8e601f95b8d550856e322909286d32 Mon Sep 17 00:00:00 2001
From: Meekail Zain
Date: Sat, 12 Feb 2022 16:48:16 -0500
Subject: [PATCH 6/6] Unpacked figures from subplots to individual plots

---
 examples/manifold/plot_lle_digits.py | 16 ++++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/examples/manifold/plot_lle_digits.py b/examples/manifold/plot_lle_digits.py
index 3a30ad2256762..dd4048e400ddd 100644
--- a/examples/manifold/plot_lle_digits.py
+++ b/examples/manifold/plot_lle_digits.py
@@ -48,8 +48,11 @@ from sklearn.preprocessing import MinMaxScaler
 
 
-def plot_embedding(X, title, ax):
+def plot_embedding(X, title):
     X = MinMaxScaler().fit_transform(X)
+    plt.figure()
+    ax = plt.subplot(111)
+
     for digit in digits.target_names:
         ax.scatter(
             *X[y == digit].T,
@@ -175,15 +178,8 @@ def plot_embedding(X, title, ax):
 # %%
 # Finally, we can plot the resulting projection given by each method.
 
-from itertools import zip_longest
-
-fig, axs = plt.subplots(nrows=7, ncols=2, figsize=(17, 24))
-
-for name, ax in zip_longest(timing, axs.ravel()):
-    if name is None:
-        ax.axis("off")
-        continue
+for name in timing:
     title = f"{name} (time {timing[name]:.3f}s)"
-    plot_embedding(projections[name], title, ax)
+    plot_embedding(projections[name], title)
 
 plt.show()
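For reference, a self-contained sketch of the one-figure-per-embedding flow introduced by the last patch, with random 2D arrays standing in for the real ``projections``, ``digits`` and ``y`` defined earlier in the full example (the names and data here are invented; this is not the example file itself)::

    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    rng = np.random.default_rng(0)
    labels = rng.integers(0, 3, size=60)
    projections = {
        "Fake embedding A": rng.normal(size=(60, 2)),
        "Fake embedding B": rng.normal(size=(60, 2)),
    }

    def plot_embedding(X, title):
        # Mirrors the refactor: every call opens its own figure instead of
        # drawing into a pre-allocated grid of subplots.
        X = MinMaxScaler().fit_transform(X)
        plt.figure()
        ax = plt.subplot(111)
        ax.scatter(*X.T, c=labels, s=20)
        ax.set_title(title)
        ax.axis("off")

    for name, X_2d in projections.items():
        plot_embedding(X_2d, name)
    plt.show()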