[MRG+1] QuantileTransformer (#8363) · raghavrv/scikit-learn@26a1027 · GitHub

Commit 26a1027

glemaitre authored and GaelVaroquaux committed
[MRG+1] QuantileTransformer (scikit-learn#8363)
* resurrect quantile scaler
* move the code in the pre-processing module
* first draft
* Add tests.
* Fix bug in QuantileNormalizer.
* Add quantile_normalizer.
* Implement pickling
* create a specific function for dense transform
* Create a fit function for the dense case
* Create a toy examples
* First draft with sparse matrices
* remove useless functions and non-negative sparse compatibility
* fix slice call
* Fix tests of QuantileNormalizer.
* Fix estimator compatibility
* List of functions became tuple of functions
* Check X consistency at transform and inverse transform time
* fix doc
* Add negative ValueError tests for QuantileNormalizer.
* Fix cosmetics
* Fix compatibility numpy <= 1.8
* Add n_features tests and correct ValueError.
* PEP8
* fix fill_value for early scipy compatibility
* simplify sampling
* Fix tests.
* removing last pring
* Change choice for permutation
* cosmetics
* fix remove remaining choice
* DOC
* Fix inconsistencies
* pep8
* Add checker for init parameters.
* hack bounds and make a test
* FIX/TST bounds are provided by the fitting and not X at transform
* PEP8
* FIX/TST axis should be <= 1
* PEP8
* ENH Add parameter ignore_implicit_zeros
* ENH match output distribution
* ENH clip the data to avoid infinity due to output PDF
* FIX ENH restraint to uniform and norm
* [MRG] ENH Add example comparing the distribution of all scaling preprocessor (#2)
* ENH Add example comparing the distribution of all scaling preprocessor
* Remove Jupyter notebook convert
* FIX/ENH Select feat before not after; Plot interquantile data range for all
* Add heatmap legend
* Remove comment maybe?
* Move doc from robust_scaling to plot_all_scaling; Need to update doc
* Update the doc
* Better aesthetics; Better spacing and plot colormap only at end
* Shameless author re-ordering ;P
* Use env python for she-bang
* TST Validity of output_pdf
* EXA Use OrderedDict; Make it easier to add more transformations
* FIX PEP8 and replace scipy.stats by str in example
* FIX remove useless import
* COSMET change variable names
* FIX change output_pdf occurence to output_distribution
* FIX partial fixies from comments
* COMIT change class name and code structure
* COSMIT change direction to inverse
* FIX factorize transform in _transform_col
* PEP8
* FIX change the magic 10
* FIX add interp1d to fixes
* FIX/TST allow negative entries when ignore_implicit_zeros is True
* FIX use np.interp instead of sp.interpolate.interp1d
* FIX/TST fix tests
* DOC start checking doc
* TST add test to check the behaviour of interp numpy
* TST/EHN Add the possibility to add noise to compute quantile
* FIX factorize quantile computation
* FIX fixes issues
* PEP8
* FIX/DOC correct doc
* TST/DOC improve doc and add random state
* EXA add examples to illustrate the use of smoothing_noise
* FIX/DOC fix some grammar
* DOC fix example
* DOC/EXA make plot titles more succint
* EXA improve explanation
* EXA improve the docstring
* DOC add a bit more documentation
* FIX advance review
* TST add subsampling test
* DOC/TST better example for the docstring
* DOC add ellipsis to docstring
* FIX address olivier comments
* FIX remove random_state in sparse.rand
* FIX spelling doc
* FIX cite example in user guide and docstring
* FIX olivier comments
* EHN improve the example comparing all the pre-processing methods
* FIX/DOC remove title
* FIX change the scaling of the figure
* FIX plotting layout
* FIX ratio w/h
* Reorder and reword the plot_all_scaling example
* Fix aspect ratio and better explanations in the plot_all_scaling.py example
* Fix broken link and remove useless sentence
* FIX fix couples of spelling
* FIX comments joel
* FIX/DOC address documentation comments
* FIX address comments joel
* FIX inline sparse and dense transform
* PEP8
* TST/DOC temporary skipping test
* FIX raise an error if n_quantiles > subsample
* FIX wording in smoothing_noise example
* EXA Denis comments
* FIX rephrasing
* FIX make smoothing_noise to be a boolearn and change doc
* FIX address comments
* FIX verbose the doc slightly more
* PEP8/DOC
* ENH: 2-ways interpolation to avoid smoothing_noise. Simplifies also the code, examples, and documentation
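The log entries "FIX use np.interp instead of sp.interpolate.interp1d" and "ENH: 2-ways interpolation" point at the core mechanism: values are mapped through the empirical quantile function of the training data. A minimal one-feature sketch of that idea, using `np.interp` as the log describes (this is an illustrative simplification, not the code this commit adds):

```python
import numpy as np

def quantile_transform_1d(x_train, x, n_quantiles=100):
    """Map `x` to [0, 1] via the empirical CDF of `x_train` (sketch)."""
    # Evenly spaced reference probabilities and the matching
    # empirical quantiles of the training data.
    references = np.linspace(0.0, 1.0, n_quantiles)
    quantiles = np.percentile(x_train, references * 100)
    # Interpolate each value's rank; values outside the training
    # range are clamped to 0 or 1 by np.interp's boundary behavior.
    return np.interp(x, quantiles, references)

rng = np.random.RandomState(0)
x_train = rng.lognormal(size=1000)
uniform = quantile_transform_1d(x_train, x_train)
```

Because the map is monotone, a heavily skewed input such as this lognormal sample comes out approximately uniform on [0, 1], with the training median landing near 0.5.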
1 parent 494a240 · commit 26a1027

File tree: 9 files changed, +1296 −105 lines

build_tools/travis/flake8_diff.sh

Lines changed: 2 additions & 2 deletions

@@ -137,8 +137,8 @@ check_files() {
     if [[ "$MODIFIED_FILES" == "no_match" ]]; then
         echo "No file outside sklearn/externals and doc/sphinxext/sphinx_gallery has been modified"
     else
-        check_files "$(echo "$MODIFIED_FILES" | grep -v ^examples)"
+        check_files "$(echo "$MODIFIED_FILES" | grep -v ^examples)" --ignore=W503
         # Examples are allowed to not have imports at top of file
-        check_files "$(echo "$MODIFIED_FILES" | grep ^examples)" --ignore=E402
+        check_files "$(echo "$MODIFIED_FILES" | grep ^examples)" --ignore=E402,W503
     fi
     echo -e "No problem detected by flake8\n"

doc/modules/classes.rst

Lines changed: 2 additions & 0 deletions

@@ -1198,6 +1198,7 @@ See the :ref:`metrics` section of the user guide for further details.
    preprocessing.Normalizer
    preprocessing.OneHotEncoder
    preprocessing.PolynomialFeatures
+   preprocessing.QuantileTransformer
    preprocessing.RobustScaler
    preprocessing.StandardScaler

@@ -1211,6 +1212,7 @@ See the :ref:`metrics` section of the user guide for further details.
    preprocessing.maxabs_scale
    preprocessing.minmax_scale
    preprocessing.normalize
+   preprocessing.quantile_transform
    preprocessing.robust_scale
    preprocessing.scale

doc/modules/preprocessing.rst

Lines changed: 78 additions & 7 deletions

@@ -10,6 +10,13 @@ The ``sklearn.preprocessing`` package provides several common
 utility functions and transformer classes to change raw feature vectors
 into a representation that is more suitable for the downstream estimators.
 
+In general, learning algorithms benefit from standardization of the data set. If
+some outliers are present in the set, robust scalers or transformers are more
+appropriate. The behaviors of the different scalers, transformers, and
+normalizers on a dataset containing marginal outliers are highlighted in
+:ref:`sphx_glr_auto_examples_preprocessing_plot_all_scaling.py`.
+
 .. _preprocessing_scaler:
 
 Standardization, or mean removal and variance scaling

@@ -39,10 +46,10 @@ operation on a single array-like dataset::
 
   >>> from sklearn import preprocessing
   >>> import numpy as np
-  >>> X = np.array([[ 1., -1.,  2.],
-  ...               [ 2.,  0.,  0.],
-  ...               [ 0.,  1., -1.]])
-  >>> X_scaled = preprocessing.scale(X)
+  >>> X_train = np.array([[ 1., -1.,  2.],
+  ...                     [ 2.,  0.,  0.],
+  ...                     [ 0.,  1., -1.]])
+  >>> X_scaled = preprocessing.scale(X_train)
 
   >>> X_scaled                                          # doctest: +ELLIPSIS
   array([[ 0.  ..., -1.22...,  1.33...],

@@ -71,7 +78,7 @@ able to later reapply the same transformation on the testing set.
 This class is hence suitable for use in the early steps of a
 :class:`sklearn.pipeline.Pipeline`::
 
-  >>> scaler = preprocessing.StandardScaler().fit(X)
+  >>> scaler = preprocessing.StandardScaler().fit(X_train)
   >>> scaler
   StandardScaler(copy=True, with_mean=True, with_std=True)

@@ -81,7 +88,7 @@ This class is hence suitable for use in the early steps of a
   >>> scaler.scale_                                       # doctest: +ELLIPSIS
   array([ 0.81...,  0.81...,  1.24...])
 
-  >>> scaler.transform(X)                               # doctest: +ELLIPSIS
+  >>> scaler.transform(X_train)                         # doctest: +ELLIPSIS
   array([[ 0.  ..., -1.22...,  1.33...],
          [ 1.22...,  0.  ..., -0.26...],
          [-1.22...,  1.22..., -1.06...]])

@@ -90,7 +97,8 @@ This class is hence suitable for use in the early steps of a
 The scaler instance can then be used on new data to transform it the
 same way it did on the training set::
 
-  >>> scaler.transform([[-1.,  1., 0.]])                # doctest: +ELLIPSIS
+  >>> X_test = [[-1., 1., 0.]]
+  >>> scaler.transform(X_test)                          # doctest: +ELLIPSIS
   array([[-2.44...,  1.22..., -0.26...]])
 
 It is possible to disable either centering or scaling by either

@@ -248,6 +256,69 @@ a :class:`KernelCenterer` can transform the kernel matrix
 so that it contains inner products in the feature space
 defined by :math:`phi` followed by removal of the mean in that space.
 
+.. _preprocessing_transformer:
+
+Non-linear transformation
+=========================
+
+Like scalers, :class:`QuantileTransformer` puts each feature into the same
+range or distribution. However, by performing a rank transformation, it smooths
+out unusual distributions and is less influenced by outliers than scaling
+methods. It does, however, distort correlations and distances within and across
+features.
+
+:class:`QuantileTransformer` and :func:`quantile_transform` provide a
+non-parametric transformation based on the quantile function to map the data to
+a uniform distribution with values between 0 and 1::
+
+  >>> from sklearn.datasets import load_iris
+  >>> from sklearn.model_selection import train_test_split
+  >>> iris = load_iris()
+  >>> X, y = iris.data, iris.target
+  >>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
+  >>> quantile_transformer = preprocessing.QuantileTransformer(random_state=0)
+  >>> X_train_trans = quantile_transformer.fit_transform(X_train)
+  >>> X_test_trans = quantile_transformer.transform(X_test)
+  >>> np.percentile(X_train[:, 0], [0, 25, 50, 75, 100]) # doctest: +SKIP
+  array([ 4.3,  5.1,  5.8,  6.5,  7.9])
+
+This feature corresponds to the sepal length in cm. Once the quantile
+transformation is applied, those landmarks closely approach the percentiles
+previously defined::
+
+  >>> np.percentile(X_train_trans[:, 0], [0, 25, 50, 75, 100])
+  ... # doctest: +ELLIPSIS +SKIP
+  array([ 0.00... ,  0.24...,  0.49...,  0.73...,  0.99... ])
+
+This can be confirmed on an independent testing set, with similar observations::
+
+  >>> np.percentile(X_test[:, 0], [0, 25, 50, 75, 100])
+  ... # doctest: +SKIP
+  array([ 4.4  ,  5.125,  5.75 ,  6.175,  7.3  ])
+  >>> np.percentile(X_test_trans[:, 0], [0, 25, 50, 75, 100])
+  ... # doctest: +ELLIPSIS +SKIP
+  array([ 0.01...,  0.25...,  0.46...,  0.60... ,  0.94...])
+
+It is also possible to map the transformed data to a normal distribution by
+setting ``output_distribution='normal'``::
+
+  >>> quantile_transformer = preprocessing.QuantileTransformer(
+  ...     output_distribution='normal', random_state=0)
+  >>> X_trans = quantile_transformer.fit_transform(X)
+  >>> quantile_transformer.quantiles_ # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
+  array([[ 4.3...,   2...,     1...,     0.1...],
+         [ 4.31...,  2.02...,  1.01...,  0.1...],
+         [ 4.32...,  2.05...,  1.02...,  0.1...],
+         ...,
+         [ 7.84...,  4.34...,  6.84...,  2.5...],
+         [ 7.87...,  4.37...,  6.87...,  2.5...],
+         [ 7.9...,   4.4...,   6.9...,   2.5...]])
+
+Thus the median of the input becomes the mean of the output, centered at 0. The
+normal output is clipped so that the input's minimum and maximum ---
+corresponding to the 1e-7 and 1 - 1e-7 quantiles respectively --- do not
+become infinite under the transformation.
+
 .. _preprocessing_normalization:
 
 Normalization
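The documented behavior above can be exercised end to end with a short script mirroring the doctest examples (here `n_quantiles` is lowered below the training sample count to avoid a warning on recent scikit-learn releases; that adjustment is not part of the doc snippet):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import QuantileTransformer

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Uniform output: each feature is replaced by its rank in the
# training distribution, scaled to [0, 1].
qt = QuantileTransformer(n_quantiles=100, random_state=0)
X_train_trans = qt.fit_transform(X_train)
X_test_trans = qt.transform(X_test)

# Normal output: the same ranks pushed through the Gaussian
# quantile function, so the training median maps near 0.
qt_normal = QuantileTransformer(n_quantiles=100,
                                output_distribution='normal',
                                random_state=0)
X_train_norm = qt_normal.fit_transform(X_train)
```

Note that the fitted `quantiles_` come from the training set only, so test values outside the training range are clipped rather than extrapolated.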

doc/whats_new.rst

Lines changed: 9 additions & 2 deletions

@@ -62,6 +62,13 @@ New features
      during the first epochs of ridge and logistic regression.
      By `Arthur Mensch`_.
 
+   - Added :class:`preprocessing.QuantileTransformer` class and
+     :func:`preprocessing.quantile_transform` function for features
+     normalization based on quantiles.
+     :issue:`8363` by :user:`Denis Engemann <dengemann>`,
+     :user:`Guillaume Lemaitre <glemaitre>`, `Olivier Grisel`_, `Raghav RV`_,
+     :user:`Thierry Guillemot <tguillemot>`_, and `Gael Varoquaux`_.
+
 Enhancements
 ............

@@ -172,7 +179,7 @@ Enhancements
 - Add ``sample_weight`` parameter to :func:`metrics.cohen_kappa_score` by
   Victor Poughon.
 
-- In :class:`gaussian_process.GaussianProcessRegressor`, method ``predict``
+- In :class:`gaussian_process.GaussianProcessRegressor`, method ``predict``
   is a lot faster with ``return_std=True`` by :user:`Hadrien Bertrand <hbertrand>`.
 
 - Added ability to use sparse matrices in :func:`feature_selection.f_regression`

@@ -331,7 +338,7 @@ Bug fixes
   both ``'binary'`` but the union of ``y_true`` and ``y_pred`` was
   ``'multiclass'``. :issue:`8377` by `Loic Esteve`_.
 
-- Fix :func:`sklearn.linear_model.BayesianRidge.fit` to return
+- Fix :func:`sklearn.linear_model.BayesianRidge.fit` to return
   ridge parameter `alpha_` and `lambda_` consistent with calculated
   coefficients `coef_` and `intercept_`.
   :issue:`8224` by :user:`Peter Gedeck <gedeck>`.

0 commit comments