DOC Math/code formatting in docs by joshhilton · Pull Request #31325 · scikit-learn/scikit-learn · GitHub

DOC Math/code formatting in docs #31325


Merged · 11 commits · May 14, 2025
12 changes: 6 additions & 6 deletions doc/modules/calibration.rst
@@ -103,7 +103,7 @@ difficulty making predictions near 0 and 1 because variance in the
underlying base models will bias predictions that should be near zero or one
away from these values. Because predictions are restricted to the interval
[0,1], errors caused by variance tend to be one-sided near zero and one. For
example, if a model should predict p = 0 for a case, the only way bagging
example, if a model should predict :math:`p = 0` for a case, the only way bagging
can achieve this is if all bagged trees predict zero. If we add noise to the
trees that bagging is averaging over, this noise will cause some trees to
predict values larger than 0 for this case, thus moving the average
@@ -146,7 +146,7 @@ Usage
The :class:`CalibratedClassifierCV` class is used to calibrate a classifier.

:class:`CalibratedClassifierCV` uses a cross-validation approach to ensure
unbiased data is always used to fit the calibrator. The data is split into k
unbiased data is always used to fit the calibrator. The data is split into :math:`k`
`(train_set, test_set)` couples (as determined by `cv`). When `ensemble=True`
(default), the following procedure is repeated independently for each
cross-validation split:
@@ -157,13 +157,13 @@ cross-validation split:
regressor) (when the data is multiclass, a calibrator is fit for every class)

This results in an
ensemble of k `(classifier, calibrator)` couples where each calibrator maps
ensemble of :math:`k` `(classifier, calibrator)` couples where each calibrator maps
the output of its corresponding classifier into [0, 1]. Each couple is exposed
in the `calibrated_classifiers_` attribute, where each entry is a calibrated
classifier with a :term:`predict_proba` method that outputs calibrated
probabilities. The output of :term:`predict_proba` for the main
:class:`CalibratedClassifierCV` instance corresponds to the average of the
predicted probabilities of the `k` estimators in the `calibrated_classifiers_`
predicted probabilities of the :math:`k` estimators in the `calibrated_classifiers_`
list. The output of :term:`predict` is the class that has the highest
probability.
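
A minimal usage sketch of the behaviour described above; the base estimator, data, and
parameter values are illustrative assumptions rather than part of the documentation::

    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.datasets import make_classification
    from sklearn.naive_bayes import GaussianNB

    X, y = make_classification(n_samples=300, random_state=0)

    # cv=3 with ensemble=True yields three (classifier, calibrator) couples.
    calibrated = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=3, ensemble=True)
    calibrated.fit(X, y)

    print(len(calibrated.calibrated_classifiers_))  # 3
    proba = calibrated.predict_proba(X)             # average over the 3 calibrated models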

@@ -244,12 +244,12 @@ subject to :math:`\hat{f}_i \geq \hat{f}_j` whenever
:math:`f_i \geq f_j`. :math:`y_i` is the true
label of sample :math:`i` and :math:`\hat{f}_i` is the output of the
calibrated classifier for sample :math:`i` (i.e., the calibrated probability).
This method is more general when compared to 'sigmoid' as the only restriction
This method is more general when compared to `'sigmoid'` as the only restriction
is that the mapping function is monotonically increasing. It is thus more
powerful as it can correct any monotonic distortion of the un-calibrated model.
However, it is more prone to overfitting, especially on small datasets [6]_.

Overall, 'isotonic' will perform as well as or better than 'sigmoid' when
Overall, `'isotonic'` will perform as well as or better than `'sigmoid'` when
there is enough data (greater than ~ 1000 samples) to avoid overfitting [3]_.
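
As a rough comparison sketch of the two `method` options (the dataset, classifier, and
metric below are assumptions chosen for illustration)::

    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import brier_score_loss
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for method in ("sigmoid", "isotonic"):
        clf = CalibratedClassifierCV(LogisticRegression(max_iter=1000), method=method, cv=5)
        clf.fit(X_train, y_train)
        # A lower Brier score indicates better calibrated probabilities.
        print(method, brier_score_loss(y_test, clf.predict_proba(X_test)[:, 1]))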

.. note:: Impact on ranking metrics like AUC
19 changes: 9 additions & 10 deletions doc/modules/cross_validation.rst
@@ -372,8 +372,7 @@ Thus, one can create the training/test sets using numpy indexing::
Repeated K-Fold
^^^^^^^^^^^^^^^

:class:`RepeatedKFold` repeats K-Fold n times. It can be used when one
requires to run :class:`KFold` n times, producing different splits in
:class:`RepeatedKFold` repeats :class:`KFold` :math:`n` times, producing different splits in
each repetition.

Example of 2-fold K-Fold repeated 2 times::
@@ -392,7 +391,7 @@ Example of 2-fold K-Fold repeated 2 times::
[1 3] [0 2]


Similarly, :class:`RepeatedStratifiedKFold` repeats Stratified K-Fold n times
Similarly, :class:`RepeatedStratifiedKFold` repeats :class:`StratifiedKFold` :math:`n` times
with different randomization in each repetition.
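
A small sketch of both repeated splitters on a toy array (the data and `random_state`
are assumptions made for the example)::

    import numpy as np
    from sklearn.model_selection import RepeatedKFold, RepeatedStratifiedKFold

    X = np.arange(8).reshape(4, 2)
    y = np.array([0, 0, 1, 1])

    rkf = RepeatedKFold(n_splits=2, n_repeats=2, random_state=0)
    for train_index, test_index in rkf.split(X):
        print(train_index, test_index)      # 2 folds x 2 repetitions = 4 splits

    rskf = RepeatedStratifiedKFold(n_splits=2, n_repeats=2, random_state=0)
    for train_index, test_index in rskf.split(X, y):
        print(train_index, test_index)      # class ratios preserved in every split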

.. _leave_one_out:
@@ -434,10 +433,10 @@ folds are virtually identical to each other and to the model built from the
entire training set.

However, if the learning curve is steep for the training size in question,
then 5- or 10- fold cross validation can overestimate the generalization error.
then 5 or 10-fold cross validation can overestimate the generalization error.

As a general rule, most authors, and empirical evidence, suggest that 5- or 10-
fold cross validation should be preferred to LOO.
As a general rule, most authors and empirical evidence suggest that 5 or 10-fold
cross validation should be preferred to LOO.
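
As an illustration of this trade-off, both strategies can be passed to
:func:`~sklearn.model_selection.cross_val_score` (the dataset and estimator here are
assumptions made for the example)::

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

    X, y = load_iris(return_X_y=True)
    clf = LogisticRegression(max_iter=1000)

    five_fold = cross_val_score(clf, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
    loo = cross_val_score(clf, X, y, cv=LeaveOneOut())   # 150 fits, one per left-out sample

    print(five_fold.mean(), loo.mean())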

.. dropdown:: References

@@ -553,10 +552,10 @@ relative class frequencies are approximately preserved in each fold.

.. _stratified_k_fold:

Stratified k-fold
Stratified K-fold
^^^^^^^^^^^^^^^^^

:class:`StratifiedKFold` is a variation of *k-fold* which returns *stratified*
:class:`StratifiedKFold` is a variation of *K-fold* which returns *stratified*
folds: each set contains approximately the same percentage of samples of each
target class as the complete set.
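
A minimal sketch with an imbalanced toy target (the 9:3 class split below is an
assumption chosen to make the stratification visible)::

    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    X = np.zeros((12, 1))
    y = np.array([0] * 9 + [1] * 3)

    skf = StratifiedKFold(n_splits=3)
    for train_index, test_index in skf.split(X, y):
        print(np.bincount(y[test_index]))   # every test fold keeps the 3:1 ratio -> [3 1]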

@@ -648,10 +647,10 @@ parameter.

.. _group_k_fold:

Group k-fold
Group K-fold
^^^^^^^^^^^^

:class:`GroupKFold` is a variation of k-fold which ensures that the same group is
:class:`GroupKFold` is a variation of K-fold which ensures that the same group is
not represented in both testing and training sets. For example if the data is
obtained from different subjects with several samples per-subject and if the
model is flexible enough to learn from highly person specific features it
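
A minimal sketch of the group-aware splitting described above (the toy data and groups
are assumptions; think of each group as one subject)::

    import numpy as np
    from sklearn.model_selection import GroupKFold

    X = np.arange(12).reshape(6, 2)
    y = np.array([0, 1, 0, 1, 0, 1])
    groups = np.array([1, 1, 2, 2, 3, 3])

    gkf = GroupKFold(n_splits=3)
    for train_index, test_index in gkf.split(X, y, groups=groups):
        # Each group lands entirely in either the training set or the test set.
        print(groups[train_index], groups[test_index])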
6 changes: 3 additions & 3 deletions doc/modules/kernel_ridge.rst
@@ -7,7 +7,7 @@ Kernel ridge regression
.. currentmodule:: sklearn.kernel_ridge

Kernel ridge regression (KRR) [M2012]_ combines :ref:`ridge_regression`
(linear least squares with l2-norm regularization) with the `kernel trick
(linear least squares with :math:`L_2`-norm regularization) with the `kernel trick
<https://en.wikipedia.org/wiki/Kernel_method>`_. It thus learns a linear
function in the space induced by the respective kernel and the data. For
non-linear kernels, this corresponds to a non-linear function in the original
@@ -16,7 +16,7 @@ space.
The form of the model learned by :class:`KernelRidge` is identical to support
vector regression (:class:`~sklearn.svm.SVR`). However, different loss
functions are used: KRR uses squared error loss while support vector
regression uses :math:`\epsilon`-insensitive loss, both combined with l2
regression uses :math:`\epsilon`-insensitive loss, both combined with :math:`L_2`
regularization. In contrast to :class:`~sklearn.svm.SVR`, fitting
:class:`KernelRidge` can be done in closed-form and is typically faster for
medium-sized datasets. On the other hand, the learned model is non-sparse and
@@ -31,7 +31,7 @@ plotted, where both complexity/regularization and bandwidth of the RBF kernel
have been optimized using grid-search. The learned functions are very
similar; however, fitting :class:`KernelRidge` is approximately seven times
faster than fitting :class:`~sklearn.svm.SVR` (both with grid-search).
However, prediction of 100000 target values is more than three times faster
However, prediction of 100,000 target values is more than three times faster
with :class:`~sklearn.svm.SVR` since it has learned a sparse model using only
approximately 1/3 of the 100 training datapoints as support vectors.
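
A rough sketch of the sparsity difference mentioned above (data, kernel, and
hyperparameters are illustrative assumptions)::

    import numpy as np
    from sklearn.kernel_ridge import KernelRidge
    from sklearn.svm import SVR

    rng = np.random.RandomState(0)
    X = 5 * rng.rand(100, 1)
    y = np.sin(X).ravel() + 0.1 * rng.randn(100)

    krr = KernelRidge(kernel="rbf", alpha=1.0, gamma=0.5).fit(X, y)
    svr = SVR(kernel="rbf", C=1.0, epsilon=0.1, gamma=0.5).fit(X, y)

    # KernelRidge implicitly uses every training sample at prediction time,
    # while SVR keeps only its support vectors.
    print(X.shape[0], svr.support_.shape[0])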

14 changes: 7 additions & 7 deletions doc/modules/lda_qda.rst
@@ -173,11 +173,11 @@ In this scenario, the empirical sample covariance is a poor
estimator, and shrinkage helps improving the generalization performance of
the classifier.
Shrinkage LDA can be used by setting the ``shrinkage`` parameter of
the :class:`~discriminant_analysis.LinearDiscriminantAnalysis` class to 'auto'.
the :class:`~discriminant_analysis.LinearDiscriminantAnalysis` class to `'auto'`.
This automatically determines the optimal shrinkage parameter in an analytic
way following the lemma introduced by Ledoit and Wolf [2]_. Note that
currently shrinkage only works when setting the ``solver`` parameter to 'lsqr'
or 'eigen'.
currently shrinkage only works when setting the ``solver`` parameter to `'lsqr'`
or `'eigen'`.

The ``shrinkage`` parameter can also be manually set between 0 and 1. In
particular, a value of 0 corresponds to no shrinkage (which means the empirical
@@ -192,7 +192,7 @@ best choice. For example if the distribution of the data
is normally distributed, the
Oracle Approximating Shrinkage estimator :class:`sklearn.covariance.OAS`
yields a smaller Mean Squared Error than the one given by Ledoit and Wolf's
formula used with shrinkage="auto". In LDA, the data are assumed to be gaussian
formula used with `shrinkage="auto"`. In LDA, the data are assumed to be gaussian
conditionally to the class. If these assumptions hold, using LDA with
the OAS estimator of covariance will yield a better classification
accuracy than if Ledoit and Wolf or the empirical covariance estimator is used.
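
A minimal sketch contrasting the analytic Ledoit-Wolf shrinkage with a custom OAS
covariance estimator, on assumed synthetic data with few samples and many features::

    from sklearn.covariance import OAS
    from sklearn.datasets import make_classification
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    X, y = make_classification(n_samples=30, n_features=20, n_informative=5, random_state=0)

    lda_lw = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto").fit(X, y)
    lda_oas = LinearDiscriminantAnalysis(solver="lsqr", covariance_estimator=OAS()).fit(X, y)

    print(lda_lw.score(X, y), lda_oas.score(X, y))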
@@ -239,17 +239,17 @@ computing :math:`S` and :math:`V` via the SVD of :math:`X` is enough. For
LDA, two SVDs are computed: the SVD of the centered input matrix :math:`X`
and the SVD of the class-wise mean vectors.

The 'lsqr' solver is an efficient algorithm that only works for
The `'lsqr'` solver is an efficient algorithm that only works for
classification. It needs to explicitly compute the covariance matrix
:math:`\Sigma`, and supports shrinkage and custom covariance estimators.
This solver computes the coefficients
:math:`\omega_k = \Sigma^{-1}\mu_k` by solving for :math:`\Sigma \omega =
\mu_k`, thus avoiding the explicit computation of the inverse
:math:`\Sigma^{-1}`.

The 'eigen' solver is based on the optimization of the between class scatter to
The `'eigen'` solver is based on the optimization of the between class scatter to
within class scatter ratio. It can be used for both classification and
transform, and it supports shrinkage. However, the 'eigen' solver needs to
transform, and it supports shrinkage. However, the `'eigen'` solver needs to
compute the covariance matrix, so it might not be suitable for situations with
a high number of features.
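
A short sketch of the solver choice (the iris data is an assumed example); `'eigen'`
supports both shrinkage and dimensionality reduction via `transform`::

    from sklearn.datasets import load_iris
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    X, y = load_iris(return_X_y=True)

    lda = LinearDiscriminantAnalysis(solver="eigen", shrinkage="auto", n_components=2)
    X_2d = lda.fit(X, y).transform(X)
    print(X_2d.shape)   # (150, 2)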

18 changes: 9 additions & 9 deletions doc/modules/partial_dependence.rst
@@ -211,11 +211,11 @@ Computation methods
===================

There are two main methods to approximate the integral above, namely the
'brute' and 'recursion' methods. The `method` parameter controls which method
`'brute'` and `'recursion'` methods. The `method` parameter controls which method
to use.

The 'brute' method is a generic method that works with any estimator. Note that
computing ICE plots is only supported with the 'brute' method. It
The `'brute'` method is a generic method that works with any estimator. Note that
computing ICE plots is only supported with the `'brute'` method. It
approximates the above integral by computing an average over the data `X`:

.. math::
@@ -231,7 +231,7 @@ at :math:`x_{S}`. Computing this for multiple values of :math:`x_{S}`, one
obtains a full ICE line. As one can see, the average of the ICE lines
corresponds to the partial dependence line.
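
A brief sketch of requesting ICE curves together with their average (the estimator and
data are assumptions made for illustration); because ICE is involved, the `'brute'`
method is selected automatically::

    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.inspection import PartialDependenceDisplay

    X, y = make_regression(n_samples=200, n_features=4, random_state=0)
    est = RandomForestRegressor(random_state=0).fit(X, y)

    # kind="both" draws the individual ICE lines and their average (the PDP).
    PartialDependenceDisplay.from_estimator(est, X, features=[0], kind="both")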

The 'recursion' method is faster than the 'brute' method, but it is only
The `'recursion'` method is faster than the `'brute'` method, but it is only
supported for PDP plots by some tree-based estimators. It is computed as
follows. For a given point :math:`x_S`, a weighted tree traversal is performed:
if a split node involves an input feature of interest, the corresponding left
@@ -240,23 +240,23 @@ being weighted by the fraction of training samples that entered that branch.
Finally, the partial dependence is given by a weighted average of all the
visited leaves' values.

With the 'brute' method, the parameter `X` is used both for generating the
With the `'brute'` method, the parameter `X` is used both for generating the
grid of values :math:`x_S` and the complement feature values :math:`x_C`.
However with the 'recursion' method, `X` is only used for the grid values:
implicitly, the :math:`x_C` values are those of the training data.

By default, the 'recursion' method is used for plotting PDPs on tree-based
By default, the `'recursion'` method is used for plotting PDPs on tree-based
estimators that support it, and 'brute' is used for the rest.
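
An illustrative sketch of selecting the method explicitly through
:func:`~sklearn.inspection.partial_dependence` (the estimator and synthetic data are
assumptions made for the example)::

    from sklearn.datasets import make_regression
    from sklearn.ensemble import HistGradientBoostingRegressor
    from sklearn.inspection import partial_dependence

    X, y = make_regression(n_samples=500, n_features=4, random_state=0)
    est = HistGradientBoostingRegressor(random_state=0).fit(X, y)

    # 'recursion' is supported by this tree-based model (averaged PDP only), while
    # 'brute' works for any estimator and also yields ICE curves via kind="both".
    pd_recursion = partial_dependence(est, X, features=[0], method="recursion")
    pd_brute = partial_dependence(est, X, features=[0], method="brute", kind="both")

    print(pd_recursion["average"].shape, pd_brute["average"].shape)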

.. _pdp_method_differences:

.. note::

While both methods should be close in general, they might differ in some
specific settings. The 'brute' method assumes the existence of the
specific settings. The `'brute'` method assumes the existence of the
data points :math:`(x_S, x_C^{(i)})`. When the features are correlated,
such artificial samples may have a very low probability mass. The 'brute'
and 'recursion' methods will likely disagree regarding the value of the
such artificial samples may have a very low probability mass. The `'brute'`
and `'recursion'` methods will likely disagree regarding the value of the
partial dependence, because they will treat these unlikely
samples differently. Remember, however, that the primary assumption for
interpreting PDPs is that the features should be independent.