DOC Math/code formatting in docs (#31325) · scikit-learn/scikit-learn@8cfc72b · GitHub

Commit 8cfc72b

DOC Math/code formatting in docs (#31325)
1 parent ce4a40f commit 8cfc72b

6 files changed: +66 / -66 lines changed

doc/modules/calibration.rst

Lines changed: 6 additions & 6 deletions
@@ -103,7 +103,7 @@ difficulty making predictions near 0 and 1 because variance in the
 underlying base models will bias predictions that should be near zero or one
 away from these values. Because predictions are restricted to the interval
 [0,1], errors caused by variance tend to be one-sided near zero and one. For
-example, if a model should predict p = 0 for a case, the only way bagging
+example, if a model should predict :math:`p = 0` for a case, the only way bagging
 can achieve this is if all bagged trees predict zero. If we add noise to the
 trees that bagging is averaging over, this noise will cause some trees to
 predict values larger than 0 for this case, thus moving the average
@@ -146,7 +146,7 @@ Usage
 The :class:`CalibratedClassifierCV` class is used to calibrate a classifier.

 :class:`CalibratedClassifierCV` uses a cross-validation approach to ensure
-unbiased data is always used to fit the calibrator. The data is split into k
+unbiased data is always used to fit the calibrator. The data is split into :math:`k`
 `(train_set, test_set)` couples (as determined by `cv`). When `ensemble=True`
 (default), the following procedure is repeated independently for each
 cross-validation split:
@@ -157,13 +157,13 @@ cross-validation split:
 regressor) (when the data is multiclass, a calibrator is fit for every class)

 This results in an
-ensemble of k `(classifier, calibrator)` couples where each calibrator maps
+ensemble of :math:`k` `(classifier, calibrator)` couples where each calibrator maps
 the output of its corresponding classifier into [0, 1]. Each couple is exposed
 in the `calibrated_classifiers_` attribute, where each entry is a calibrated
 classifier with a :term:`predict_proba` method that outputs calibrated
 probabilities. The output of :term:`predict_proba` for the main
 :class:`CalibratedClassifierCV` instance corresponds to the average of the
-predicted probabilities of the `k` estimators in the `calibrated_classifiers_`
+predicted probabilities of the :math:`k` estimators in the `calibrated_classifiers_`
 list. The output of :term:`predict` is the class that has the highest
 probability.

@@ -244,12 +244,12 @@ subject to :math:`\hat{f}_i \geq \hat{f}_j` whenever
 :math:`f_i \geq f_j`. :math:`y_i` is the true
 label of sample :math:`i` and :math:`\hat{f}_i` is the output of the
 calibrated classifier for sample :math:`i` (i.e., the calibrated probability).
-This method is more general when compared to 'sigmoid' as the only restriction
+This method is more general when compared to `'sigmoid'` as the only restriction
 is that the mapping function is monotonically increasing. It is thus more
 powerful as it can correct any monotonic distortion of the un-calibrated model.
 However, it is more prone to overfitting, especially on small datasets [6]_.

-Overall, 'isotonic' will perform as well as or better than 'sigmoid' when
+Overall, `'isotonic'` will perform as well as or better than `'sigmoid'` when
 there is enough data (greater than ~ 1000 samples) to avoid overfitting [3]_.

 .. note:: Impact on ranking metrics like AUC
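
For context only (not part of this commit), a minimal sketch of the :class:`CalibratedClassifierCV` workflow that the hunks above document; the base classifier, dataset, and `cv`/`method` values are illustrative assumptions::

    # Sketch (illustrative): calibrate a classifier with k = 3 CV couples.
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.datasets import make_classification
    from sklearn.naive_bayes import GaussianNB

    X, y = make_classification(n_samples=1000, random_state=0)

    # With ensemble=True (the default), cv=3 produces 3 (classifier, calibrator)
    # couples, each fit on a different (train_set, test_set) split.
    calibrated = CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=3)
    calibrated.fit(X, y)
    print(len(calibrated.calibrated_classifiers_))  # 3

    # predict_proba averages the calibrated probabilities of the 3 couples;
    # predict returns the class with the highest averaged probability.
    proba = calibrated.predict_proba(X[:2])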

doc/modules/cross_validation.rst

Lines changed: 9 additions & 10 deletions
@@ -372,8 +372,7 @@ Thus, one can create the training/test sets using numpy indexing::
 Repeated K-Fold
 ^^^^^^^^^^^^^^^

-:class:`RepeatedKFold` repeats K-Fold n times. It can be used when one
-requires to run :class:`KFold` n times, producing different splits in
+:class:`RepeatedKFold` repeats :class:`KFold` :math:`n` times, producing different splits in
 each repetition.

 Example of 2-fold K-Fold repeated 2 times::
@@ -392,7 +391,7 @@ Example of 2-fold K-Fold repeated 2 times::
 [1 3] [0 2]


-Similarly, :class:`RepeatedStratifiedKFold` repeats Stratified K-Fold n times
+Similarly, :class:`RepeatedStratifiedKFold` repeats :class:`StratifiedKFold` :math:`n` times
 with different randomization in each repetition.

 .. _leave_one_out:
@@ -434,10 +433,10 @@ folds are virtually identical to each other and to the model built from the
 entire training set.

 However, if the learning curve is steep for the training size in question,
-then 5- or 10- fold cross validation can overestimate the generalization error.
+then 5 or 10-fold cross validation can overestimate the generalization error.

-As a general rule, most authors, and empirical evidence, suggest that 5- or 10-
-fold cross validation should be preferred to LOO.
+As a general rule, most authors and empirical evidence suggest that 5 or 10-fold
+cross validation should be preferred to LOO.

 .. dropdown:: References

@@ -553,10 +552,10 @@ relative class frequencies are approximately preserved in each fold.

 .. _stratified_k_fold:

-Stratified k-fold
+Stratified K-fold
 ^^^^^^^^^^^^^^^^^

-:class:`StratifiedKFold` is a variation of *k-fold* which returns *stratified*
+:class:`StratifiedKFold` is a variation of *K-fold* which returns *stratified*
 folds: each set contains approximately the same percentage of samples of each
 target class as the complete set.

@@ -648,10 +647,10 @@ parameter.

 .. _group_k_fold:

-Group k-fold
+Group K-fold
 ^^^^^^^^^^^^

-:class:`GroupKFold` is a variation of k-fold which ensures that the same group is
+:class:`GroupKFold` is a variation of K-fold which ensures that the same group is
 not represented in both testing and training sets. For example if the data is
 obtained from different subjects with several samples per-subject and if the
 model is flexible enough to learn from highly person specific features it
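
For context only (not part of this commit), a minimal sketch of the cross-validation iterators touched above; the toy arrays and split counts are illustrative assumptions::

    # Sketch (illustrative): RepeatedKFold, StratifiedKFold and GroupKFold splits.
    import numpy as np
    from sklearn.model_selection import GroupKFold, RepeatedKFold, StratifiedKFold

    X = np.arange(16).reshape(8, 2)
    y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
    groups = np.array([1, 1, 2, 2, 3, 3, 4, 4])

    # RepeatedKFold: KFold run n_repeats times with different randomization.
    for train, test in RepeatedKFold(n_splits=2, n_repeats=2, random_state=0).split(X):
        print("RepeatedKFold  ", train, test)

    # StratifiedKFold: class frequencies approximately preserved in each fold.
    for train, test in StratifiedKFold(n_splits=2).split(X, y):
        print("StratifiedKFold", train, test)

    # GroupKFold: the same group never appears in both training and test sets.
    for train, test in GroupKFold(n_splits=2).split(X, y, groups=groups):
        print("GroupKFold     ", train, test)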

doc/modules/kernel_ridge.rst

Lines changed: 3 additions & 3 deletions
@@ -7,7 +7,7 @@ Kernel ridge regression
 .. currentmodule:: sklearn.kernel_ridge

 Kernel ridge regression (KRR) [M2012]_ combines :ref:`ridge_regression`
-(linear least squares with l2-norm regularization) with the `kernel trick
+(linear least squares with :math:`L_2`-norm regularization) with the `kernel trick
 <https://en.wikipedia.org/wiki/Kernel_method>`_. It thus learns a linear
 function in the space induced by the respective kernel and the data. For
 non-linear kernels, this corresponds to a non-linear function in the original
@@ -16,7 +16,7 @@ space.
 The form of the model learned by :class:`KernelRidge` is identical to support
 vector regression (:class:`~sklearn.svm.SVR`). However, different loss
 functions are used: KRR uses squared error loss while support vector
-regression uses :math:`\epsilon`-insensitive loss, both combined with l2
+regression uses :math:`\epsilon`-insensitive loss, both combined with :math:`L_2`
 regularization. In contrast to :class:`~sklearn.svm.SVR`, fitting
 :class:`KernelRidge` can be done in closed-form and is typically faster for
 medium-sized datasets. On the other hand, the learned model is non-sparse and
@@ -31,7 +31,7 @@ plotted, where both complexity/regularization and bandwidth of the RBF kernel
 have been optimized using grid-search. The learned functions are very
 similar; however, fitting :class:`KernelRidge` is approximately seven times
 faster than fitting :class:`~sklearn.svm.SVR` (both with grid-search).
-However, prediction of 100000 target values is more than three times faster
+However, prediction of 100,000 target values is more than three times faster
 with :class:`~sklearn.svm.SVR` since it has learned a sparse model using only
 approximately 1/3 of the 100 training datapoints as support vectors.

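
For context only (not part of this commit), a minimal sketch contrasting :class:`KernelRidge` and :class:`~sklearn.svm.SVR` as described above; the synthetic data and hyperparameters are illustrative assumptions, not the grid-searched values from the referenced example::

    # Sketch (illustrative): same model form, different loss functions.
    import numpy as np
    from sklearn.kernel_ridge import KernelRidge
    from sklearn.svm import SVR

    rng = np.random.RandomState(0)
    X = 5 * rng.rand(100, 1)
    y = np.sin(X).ravel() + 0.1 * rng.randn(100)

    # KernelRidge: squared error loss + L2 regularization, closed-form fit.
    krr = KernelRidge(kernel="rbf", alpha=1.0, gamma=0.5).fit(X, y)
    # SVR: epsilon-insensitive loss + L2 regularization, sparse model.
    svr = SVR(kernel="rbf", C=1.0, gamma=0.5, epsilon=0.1).fit(X, y)

    # SVR keeps only a subset of the training points as support vectors,
    # which is what makes its predictions faster on large test sets.
    print(svr.support_vectors_.shape[0], "support vectors out of", X.shape[0])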

doc/modules/lda_qda.rst

Lines changed: 7 additions & 7 deletions
@@ -173,11 +173,11 @@ In this scenario, the empirical sample covariance is a poor
 estimator, and shrinkage helps improving the generalization performance of
 the classifier.
 Shrinkage LDA can be used by setting the ``shrinkage`` parameter of
-the :class:`~discriminant_analysis.LinearDiscriminantAnalysis` class to 'auto'.
+the :class:`~discriminant_analysis.LinearDiscriminantAnalysis` class to `'auto'`.
 This automatically determines the optimal shrinkage parameter in an analytic
 way following the lemma introduced by Ledoit and Wolf [2]_. Note that
-currently shrinkage only works when setting the ``solver`` parameter to 'lsqr'
-or 'eigen'.
+currently shrinkage only works when setting the ``solver`` parameter to `'lsqr'`
+or `'eigen'`.

 The ``shrinkage`` parameter can also be manually set between 0 and 1. In
 particular, a value of 0 corresponds to no shrinkage (which means the empirical
@@ -192,7 +192,7 @@ best choice. For example if the distribution of the data
 is normally distributed, the
 Oracle Approximating Shrinkage estimator :class:`sklearn.covariance.OAS`
 yields a smaller Mean Squared Error than the one given by Ledoit and Wolf's
-formula used with shrinkage="auto". In LDA, the data are assumed to be gaussian
+formula used with `shrinkage="auto"`. In LDA, the data are assumed to be gaussian
 conditionally to the class. If these assumptions hold, using LDA with
 the OAS estimator of covariance will yield a better classification
 accuracy than if Ledoit and Wolf or the empirical covariance estimator is used.
@@ -239,17 +239,17 @@ computing :math:`S` and :math:`V` via the SVD of :math:`X` is enough. For
 LDA, two SVDs are computed: the SVD of the centered input matrix :math:`X`
 and the SVD of the class-wise mean vectors.

-The 'lsqr' solver is an efficient algorithm that only works for
+The `'lsqr'` solver is an efficient algorithm that only works for
 classification. It needs to explicitly compute the covariance matrix
 :math:`\Sigma`, and supports shrinkage and custom covariance estimators.
 This solver computes the coefficients
 :math:`\omega_k = \Sigma^{-1}\mu_k` by solving for :math:`\Sigma \omega =
 \mu_k`, thus avoiding the explicit computation of the inverse
 :math:`\Sigma^{-1}`.

-The 'eigen' solver is based on the optimization of the between class scatter to
+The `'eigen'` solver is based on the optimization of the between class scatter to
 within class scatter ratio. It can be used for both classification and
-transform, and it supports shrinkage. However, the 'eigen' solver needs to
+transform, and it supports shrinkage. However, the `'eigen'` solver needs to
 compute the covariance matrix, so it might not be suitable for situations with
 a high number of features.

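
For context only (not part of this commit), a minimal sketch of the shrinkage and solver options discussed above; the synthetic dataset is an illustrative assumption::

    # Sketch (illustrative): shrinkage requires solver='lsqr' or 'eigen'.
    from sklearn.covariance import OAS
    from sklearn.datasets import make_classification
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    X, y = make_classification(n_samples=60, n_features=30, random_state=0)

    # shrinkage='auto' picks the shrinkage intensity via the Ledoit-Wolf lemma.
    lda_auto = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto").fit(X, y)

    # Alternatively, plug in a custom covariance estimator such as OAS.
    lda_oas = LinearDiscriminantAnalysis(solver="eigen", covariance_estimator=OAS()).fit(X, y)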

doc/modules/partial_dependence.rst

Lines changed: 9 additions & 9 deletions
@@ -211,11 +211,11 @@ Computation methods
 ===================

 There are two main methods to approximate the integral above, namely the
-'brute' and 'recursion' methods. The `method` parameter controls which method
+`'brute'` and `'recursion'` methods. The `method` parameter controls which method
 to use.

-The 'brute' method is a generic method that works with any estimator. Note that
-computing ICE plots is only supported with the 'brute' method. It
+The `'brute'` method is a generic method that works with any estimator. Note that
+computing ICE plots is only supported with the `'brute'` method. It
 approximates the above integral by computing an average over the data `X`:

 .. math::
@@ -231,7 +231,7 @@ at :math:`x_{S}`. Computing this for multiple values of :math:`x_{S}`, one
 obtains a full ICE line. As one can see, the average of the ICE lines
 corresponds to the partial dependence line.

-The 'recursion' method is faster than the 'brute' method, but it is only
+The `'recursion'` method is faster than the `'brute'` method, but it is only
 supported for PDP plots by some tree-based estimators. It is computed as
 follows. For a given point :math:`x_S`, a weighted tree traversal is performed:
 if a split node involves an input feature of interest, the corresponding left
@@ -240,23 +240,23 @@ being weighted by the fraction of training samples that entered that branch.
 Finally, the partial dependence is given by a weighted average of all the
 visited leaves' values.

-With the 'brute' method, the parameter `X` is used both for generating the
+With the `'brute'` method, the parameter `X` is used both for generating the
 grid of values :math:`x_S` and the complement feature values :math:`x_C`.
 However with the 'recursion' method, `X` is only used for the grid values:
 implicitly, the :math:`x_C` values are those of the training data.

-By default, the 'recursion' method is used for plotting PDPs on tree-based
+By default, the `'recursion'` method is used for plotting PDPs on tree-based
 estimators that support it, and 'brute' is used for the rest.

 .. _pdp_method_differences:

 .. note::

 While both methods should be close in general, they might differ in some
-specific settings. The 'brute' method assumes the existence of the
+specific settings. The `'brute'` method assumes the existence of the
 data points :math:`(x_S, x_C^{(i)})`. When the features are correlated,
-such artificial samples may have a very low probability mass. The 'brute'
-and 'recursion' methods will likely disagree regarding the value of the
+such artificial samples may have a very low probability mass. The `'brute'`
+and `'recursion'` methods will likely disagree regarding the value of the
 partial dependence, because they will treat these unlikely
 samples differently. Remember, however, that the primary assumption for
 interpreting PDPs is that the features should be independent.
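
For context only (not part of this commit), a minimal sketch contrasting the `'brute'` and `'recursion'` methods of :func:`~sklearn.inspection.partial_dependence`; the estimator and data are illustrative assumptions::

    # Sketch (illustrative): 'recursion' is the fast tree-only path (PDP only);
    # 'brute' works with any estimator and is the only method supporting ICE.
    from sklearn.datasets import make_regression
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.inspection import partial_dependence

    X, y = make_regression(n_samples=200, n_features=4, random_state=0)
    gbr = GradientBoostingRegressor(random_state=0).fit(X, y)

    # 'recursion': weighted tree traversal; X only supplies the grid of x_S values.
    pd_rec = partial_dependence(gbr, X, features=[0], method="recursion")

    # 'brute': averages predictions over X; kind="individual" returns ICE values.
    ice = partial_dependence(gbr, X, features=[0], method="brute", kind="individual")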
