8000 [MRG+2] Add fixed width discretization to scikit-learn by hlin117 · Pull Request #7668 · scikit-learn/scikit-learn · GitHub
[MRG+2] Add fixed width discretization to scikit-learn #7668


Closed · wants to merge 42 commits

Changes from all commits · 42 commits
285c80e
Issue #5778: Proof of concept rewrite of fixed width discretization
hlin117 Oct 14, 2016
230bc35
Removing copy parameter
hlin117 Oct 14, 2016
6b61f7b
Improved postprocess clipping
hlin117 Oct 15, 2016
1a7f6f1
A few changes
hlin117 Oct 22, 2016
face238
Wrote all code for new fitting function
hlin117 Oct 22, 2016
503ef2f
Adding comma to __init__.py
hlin117 Oct 23, 2016
960e5ca
Using relative imports to pass tests
hlin117 Oct 23, 2016
ddb0357
Pyflakes
hlin117 Oct 23, 2016
914d639
Fixing flake8 errors
hlin117 Oct 28, 2016
f15c06c
Addressing code comments
Feb 19, 2017
86d3e32
Testing n_bin array case
Feb 19, 2017
620fd63
Addressing jnothman's comments
Feb 20, 2017
f470f10
fix doctest for python3
Feb 20, 2017
08220d2
set clip_min and clip_max to be private, and removing n_features attr
Feb 20, 2017
ce96c63
Addressed code comments
Feb 22, 2017
8621dbd
Editing documentation, bug fixes, and implementing inverse_transform
Feb 22, 2017
391a9e2
Superficial changes
Feb 26, 2017
1b43a19
Cleaning code up
Feb 27, 2017
cfaaa4e
Modified and used _transform_selected, fixed build
Mar 1, 2017
f3f64d1
Removing unnecessary change to _check_transform_selected
Mar 1, 2017
942d132
Updating based on comments
Mar 26, 2017
becd582
Forgot about flake8
Mar 26, 2017
2753c53
Addressed a small comment change
Apr 7, 2017
b569d56
Spaces before comments
Apr 7, 2017
1d5c91c
Preventing numeric instability errors
Apr 17, 2017
49e8a33
Last code comments
hlin117 May 2, 2017
994db71
min -> offset
hlin117 May 2, 2017
681b027
Minor code comments
hlin117 May 12, 2017
f11e085
Avoiding adding 0.5 to values which would be truncated
hlin117 May 13, 2017
04553f5
Reorganizing the rst file for discretization
hlin117 May 14, 2017
87e571f
Small documentation fix
hlin117 May 14, 2017
8466653
Using original discretization correction method, test nano case
hlin117 May 14, 2017
075ced8
Simplifying numeric stability code
hlin117 May 18, 2017
d813ecb
Flake 8
hlin117 May 18, 2017
8144abb
Removing use of isclose
hlin117 Jun 3, 2017
3e67b2e
Putting back feature binarization tag
hlin117 Jun 7, 2017
0189c29
Merge branch 'master' into dis2
jnothman Jun 19, 2017
b397dd4
Taking into account of @glemaitre's comments
hlin117 Jun 24, 2017
953cde8
Still addressing code comments
hlin117 Jun 24, 2017
7bf4e1f
Still hacking at comments
hlin117 Jun 24, 2017
be49720
Type checking n_bins, and other nits
hlin117 Jul 2, 2017
54a0dc7
Fixing doctests
hlin117 Jul 2, 2017
1 change: 1 addition & 0 deletions doc/modules/classes.rst
@@ -1190,6 +1190,7 @@ See the :ref:`metrics` section of the user guide for further details.
preprocessing.Binarizer
preprocessing.FunctionTransformer
preprocessing.Imputer
preprocessing.KBinsDiscretizer

hlin117 (Contributor, Author) commented:
This is the rendered page:

[screenshot: rendered classes.rst page, 2017-02-26]

A Member replied:
We should do some subdivision of this section... but not in this PR.

preprocessing.KernelCenterer
preprocessing.LabelBinarizer
preprocessing.LabelEncoder
162 changes: 103 additions & 59 deletions doc/modules/preprocessing.rst
@@ -381,10 +381,107 @@ The normalizer instance can then be used on sample vectors as any transformer::
efficient Cython routines. To avoid unnecessary memory copies, it is
recommended to choose the CSR representation upstream.

.. _preprocessing_binarization:

A Member commented:
This heading is referenced elsewhere and needs to be somewhere...

hlin117 (Contributor, Author) replied on Jun 3, 2017:
On the master branch, I did grep preprocessing_binarization --include="*.rst" . -nr, and found that this was the only instance where this header was being called. Do you still think that it's being referenced somewhere?

A Member replied:
I found it referenced in the Binarizer docstring last I checked

hlin117 (Contributor, Author) replied:
Good catch. You're right, it's in the Binarizer docstring.

.. _preprocessing_categorical_features:

Encoding categorical features
=============================
Often features are not given as continuous values but categorical.
For example a person could have features ``["male", "female"]``,
``["from Europe", "from US", "from Asia"]``,
``["uses Firefox", "uses Chrome", "uses Safari", "uses Internet Explorer"]``.
Such features can be efficiently coded as integers, for instance
``["male", "from US", "uses Internet Explorer"]`` could be expressed as
``[0, 1, 3]`` while ``["female", "from Asia", "uses Chrome"]`` would be
``[1, 2, 1]``.

Such integer representation can not be used directly with scikit-learn estimators, as these
expect continuous input, and would interpret the categories as being ordered, which is often
not desired (i.e. the set of browsers was ordered arbitrarily).

One possibility to convert categorical features to features that can be used
with scikit-learn estimators is to use a one-of-K or one-hot encoding, which is
implemented in :class:`OneHotEncoder`. This estimator transforms each
categorical feature with ``m`` possible values into ``m`` binary features, with
only one active.

Continuing the example above::

>>> enc = preprocessing.OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]]) # doctest: +ELLIPSIS
OneHotEncoder(categorical_features='all', dtype=<... 'numpy.float64'>,
handle_unknown='error', n_values='auto', sparse=True)
>>> enc.transform([[0, 1, 3]]).toarray()
array([[ 1., 0., 0., 1., 0., 0., 0., 0., 1.]])

By default, how many values each feature can take is inferred automatically from the dataset.
It is possible to specify this explicitly using the parameter ``n_values``.
There are two genders, three possible continents and four web browsers in our
dataset.
Then we fit the estimator, and transform a data point.
In the result, the first two numbers encode the gender, the next set of three
numbers the continent and the last four the web browser.
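
This mapping can be inspected on the fitted encoder. A small sketch, assuming
the ``n_values_`` and ``feature_indices_`` attributes of the current
:class:`OneHotEncoder` (illustrative, not part of this diff)::

>>> enc.n_values_ # doctest: +SKIP
array([2, 3, 4])
>>> enc.feature_indices_ # doctest: +SKIP
array([0, 2, 5, 9])
>>> # columns [0, 2) encode the gender, [2, 5) the continent, [5, 9) the browser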

Note that, if there is a possibility that the training data might have missing categorical
features, one has to explicitly set ``n_values``. For example,

>>> enc = preprocessing.OneHotEncoder(n_values=[2, 3, 4])
>>> # Note that there are missing categorical values for the 2nd and 3rd
>>> # features
>>> enc.fit([[1, 2, 3], [0, 2, 0]]) # doctest: +ELLIPSIS
OneHotEncoder(categorical_features='all', dtype=<... 'numpy.float64'>,
handle_unknown='error', n_values=[2, 3, 4], sparse=True)
>>> enc.transform([[1, 0, 0]]).toarray()
array([[ 0., 1., 1., 0., 0., 1., 0., 0., 0.]])

See :ref:`dict_feature_extraction` for categorical features that are represented
as a dict, not as integers.
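
For reference, a minimal sketch of that dict-based route, assuming the
standard :class:`~sklearn.feature_extraction.DictVectorizer` API (illustrative,
not part of this PR)::

>>> from sklearn.feature_extraction import DictVectorizer
>>> measurements = [
...     {'gender': 'male', 'browser': 'Internet Explorer'},
...     {'gender': 'female', 'browser': 'Chrome'},
... ]
>>> vec = DictVectorizer(sparse=False)
>>> vec.fit_transform(measurements) # doctest: +SKIP
array([[ 0.,  1.,  0.,  1.],
       [ 1.,  0.,  1.,  0.]])
>>> vec.get_feature_names() # doctest: +SKIP
['browser=Chrome', 'browser=Internet Explorer', 'gender=female', 'gender=male']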

.. _discretization:

Discretization
==============

`Discretization <https://en.wikipedia.org/wiki/Discretization_of_continuous_features>`_
(otherwise known as quantization or binning) provides a way to partition continuous
features into discrete values. Certain datasets with continuous features
may benefit from discretization, because it can transform a dataset of
continuous attributes into one with only nominal attributes.

K-bins discretization
---------------------

:class:`KBinsDiscretizer` discretizes features into ``k`` equal width bins::

>>> X = np.array([[ -3., 5., 15 ],
... [ 0., 6., 14 ],
... [ 6., 3., 11 ]])
>>> est = preprocessing.KBinsDiscretizer(n_bins=[3, 3, 2]).fit(X)
>>> est.bin_width_
array([ 3., 1., 2.])

For each feature, the bin width is computed during ``fit``; together with
the number of bins, it defines the discretization intervals. For the current
example, these intervals are:

- feature 1: :math:`{[-\infty, 0), [0, 3), [3, \infty)}`
- feature 2: :math:`{[-\infty, 4), [4, 5), [5, \infty)}`
- feature 3: :math:`{[-\infty, 13), [13, \infty)}`

Based on these bin intervals, ``X`` is transformed as follows::

>>> est.transform(X) # doctest: +SKIP
array([[ 0., 2., 1.],
[ 1., 2., 1.],
[ 2., 0., 0.]])
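
The same bin indices can be reproduced with plain NumPy; a conceptual sketch
of the equal width rule (not the estimator's actual implementation)::

>>> import numpy as np
>>> width = (X.max(axis=0) - X.min(axis=0)) / [3, 3, 2] # per-feature bin width
>>> edges = [X[:, j].min() + width[j] * np.arange(1, k)
...          for j, k in enumerate([3, 3, 2])] # interior bin edges
>>> np.column_stack([np.digitize(X[:, j], edges[j])
...                  for j in range(X.shape[1])]) # doctest: +SKIP
array([[0, 2, 1],
       [1, 2, 1],
       [2, 0, 0]])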

The resulting dataset contains ordinal attributes which can be further used
in a :class:`sklearn.pipeline.Pipeline`.
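
For instance, a hedged sketch of such a pipeline (the dataset and downstream
estimator are illustrative choices, not part of this PR)::

>>> from sklearn.datasets import load_iris
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.pipeline import make_pipeline
>>> X_iris, y_iris = load_iris(return_X_y=True)
>>> pipe = make_pipeline(preprocessing.KBinsDiscretizer(n_bins=4),
...                      LogisticRegression())
>>> pipe.fit(X_iris, y_iris) # doctest: +SKIP
Pipeline(...)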

Discretization is similar to constructing histograms for continuous data.
However, histograms focus on counting the number of values that fall into
particular bins, whereas discretization focuses on assigning each feature
value to one of these bins.

.. _preprocessing_binarization:

Feature binarization
--------------------
@@ -431,6 +528,9 @@ As for the :class:`StandardScaler` and :class:`Normalizer` classes, the
preprocessing module provides a companion function :func:`binarize`
to be used when the transformer API is not necessary.

Note that the :class:`Binarizer` is similar to the :class:`KBinsDiscretizer`
when ``k = 2``, and when the bin edge is at the value ``threshold``.
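
In other words, ``threshold`` acts as a single bin edge that splits each
feature into two bins. A small sketch (the threshold value is illustrative)::

>>> X = np.array([[ 1., -1.,  2.],
...               [ 2.,  0.,  0.],
...               [ 0.,  1., -1.]])
>>> preprocessing.Binarizer(threshold=1.1).transform(X) # doctest: +SKIP
array([[ 0.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 0.,  0.,  0.]])
>>> (X > 1.1).astype(X.dtype) # the same two-bin split, done by hand # doctest: +SKIP
array([[ 0.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 0.,  0.,  0.]])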

.. topic:: Sparse input

:func:`binarize` and :class:`Binarizer` accept **both dense array-like
@@ -441,62 +541,6 @@ to be used when the transformer API is not necessary.
To avoid unnecessary memory copies, it is recommended to choose the CSR
representation upstream.


.. _preprocessing_categorical_features:

Encoding categorical features
=============================
Often features are not given as continuous values but categorical.
For example a person could have features ``["male", "female"]``,
``["from Europe", "from US", "from Asia"]``,
``["uses Firefox", "uses Chrome", "uses Safari", "uses Internet Explorer"]``.
Such features can be efficiently coded as integers, for instance
``["male", "from US", "uses Internet Explorer"]`` could be expressed as
``[0, 1, 3]`` while ``["female", "from Asia", "uses Chrome"]`` would be
``[1, 2, 1]``.

Such integer representation can not be used directly with scikit-learn estimators, as these
expect continuous input, and would interpret the categories as being ordered, which is often
not desired (i.e. the set of browsers was ordered arbitrarily).

One possibility to convert categorical features to features that can be used
with scikit-learn estimators is to use a one-of-K or one-hot encoding, which is
implemented in :class:`OneHotEncoder`. This estimator transforms each
categorical feature with ``m`` possible values into ``m`` binary features, with
only one active.

Continuing the example above::

>>> enc = preprocessing.OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]]) # doctest: +ELLIPSIS
OneHotEncoder(categorical_features='all', dtype=<... 'numpy.float64'>,
handle_unknown='error', n_values='auto', sparse=True)
>>> enc.transform([[0, 1, 3]]).toarray()
array([[ 1., 0., 0., 1., 0., 0., 0., 0., 1.]])

By default, how many values each feature can take is inferred automatically from the dataset.
It is possible to specify this explicitly using the parameter ``n_values``.
There are two genders, three possible continents and four web browsers in our
dataset.
Then we fit the estimator, and transform a data point.
In the result, the first two numbers encode the gender, the next set of three
numbers the continent and the last four the web browser.

Note that, if there is a possibility that the training data might have missing categorical
features, one has to explicitly set ``n_values``. For example,

>>> enc = preprocessing.OneHotEncoder(n_values=[2, 3, 4])
>>> # Note that there are missing categorical values for the 2nd and 3rd
>>> # features
>>> enc.fit([[1, 2, 3], [0, 2, 0]]) # doctest: +ELLIPSIS
OneHotEncoder(categorical_features='all', dtype=<... 'numpy.float64'>,
handle_unknown='error', n_values=[2, 3, 4], sparse=True)
>>> enc.transform([[1, 0, 0]]).toarray()
array([[ 0., 1., 1., 0., 0., 1., 0., 0., 0.]])

See :ref:`dict_feature_extraction` for categorical features that are represented
as a dict, not as integers.

.. _imputation:

Imputation of missing values
2 changes: 2 additions & 0 deletions sklearn/preprocessing/__init__.py
@@ -24,6 +24,7 @@
from .data import OneHotEncoder

from .data import PolynomialFeatures
from .discretization import KBinsDiscretizer

from .label import label_binarize
from .label import LabelBinarizer
@@ -37,6 +38,7 @@
'Binarizer',
'FunctionTransformer',
'Imputer',
'KBinsDiscretizer',
'KernelCenterer',
'LabelBinarizer',
'LabelEncoder',
48 changes: 36 additions & 12 deletions sklearn/preprocessing/data.py
@@ -1677,8 +1677,12 @@ def add_dummy_feature(X, value=1.0):
return np.hstack((np.ones((n_samples, 1)) * value, X))


def _transform_selected(X, transform, selected="all", copy=True):
"""Apply a transform function to portion of selected features
def _transform_selected(X, transform, selected="all", copy=True,
retain_order=False):
"""Apply a transform function to portion of selected features.

Returns an array Xt, where the non-selected features appear on the right
side (largest column indices) of Xt.

Parameters
----------
@@ -1688,18 +1692,28 @@ def _transform_selected(X, transform, selected="all", copy=True):
transform : callable
A callable transform(X) -> X_transformed

copy : boolean, optional
copy : boolean, default=True
Copy X even if it could be avoided.

selected: "all" or array of indices or mask
selected : "all" or array of indices or mask
Specify which features to apply the transform to.

retain_order : boolean, default=False
If True, the non-selected features will not be displaced to the right
side of the transformed array. The number of features in Xt must

A Member commented:
"but reinserted into Xt"

hlin117 (Contributor, Author) replied:
Technically, the non-selected features aren't reinserted into Xt; they're just not modified by the transformation.

match the number of features in X. Furthermore, X and Xt cannot be
sparse.

Returns
-------
X : array or sparse matrix, shape=(n_samples, n_features_new)
Xt : array or sparse matrix, shape=(n_samples, n_features_new)
"""
X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)

if sparse.issparse(X) and retain_order:
raise ValueError("The retain_order option can only be set to True "
"for dense matrices.")

if isinstance(selected, six.string_types) and selected == "all":
return transform(X)

@@ -1719,14 +1733,24 @@ def _transform_selected(X, transform, selected="all", copy=True):
elif n_selected == n_features:
# All features selected.
return transform(X)
else:
X_sel = transform(X[:, ind[sel]])
X_not_sel = X[:, ind[not_sel]]

if sparse.issparse(X_sel) or sparse.issparse(X_not_sel):
return sparse.hstack((X_sel, X_not_sel))
else:
return np.hstack((X_sel, X_not_sel))
X_sel = transform(X[:, ind[sel]])
X_not_sel = X[:, ind[not_sel]]

if retain_order:
if X_sel.shape[1] + X_not_sel.shape[1] != n_features:
raise ValueError("The retain_order option can only be set to True "
"if the dimensions of the input array match the "
"dimensions of the transformed array.")

# Fancy indexing not supported for sparse matrices

A Member commented:
FWIW it's not particularly hard to do this kind of splicing operation. Three options:

  • create a sparse diagonal matrix D of size (X_sel.shape[1], X.shape[1]) such that X_sel @ D effectively inserts empty columns in X_sel. Then do the same for X_not_sel and sum them together. (Easiest to read?)
  • convert to COO, map .col to the new space in both X_sel and X_not_sel, then create a new COO from the concatenation of their row, col and data.
  • convert to CSC, and use np.insert on indices, on data and on np.diff(indptr) to define the output matrix. (Most efficient but hardest to implement.)

But it's fine being crippled for now.
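
For concreteness, a hedged sketch of the first option above, built with
scipy.sparse scatter matrices (names are illustrative, not part of this PR):

import numpy as np
import scipy.sparse as sp

def _splice_columns(X_sel, X_not_sel, sel, not_sel, n_features):
    # Scatter matrix with ones at (i, positions[i]): right-multiplying moves
    # column i of the source matrix to column positions[i] of the output.
    def scatter(n_cols, positions):
        return sp.csr_matrix((np.ones(n_cols), (np.arange(n_cols), positions)),
                             shape=(n_cols, n_features))
    return (X_sel.dot(scatter(X_sel.shape[1], sel)) +
            X_not_sel.dot(scatter(X_not_sel.shape[1], not_sel)))

# Columns 0 and 2 were transformed, column 1 was left untouched:
X_sel = sp.csr_matrix([[1., 2.], [3., 4.]])
X_not_sel = sp.csr_matrix([[9.], [8.]])
_splice_columns(X_sel, X_not_sel, sel=[0, 2], not_sel=[1], n_features=3).toarray()
# -> array([[ 1., 9., 2.],
#           [ 3., 8., 4.]])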

X[:, ind[sel]] = X_sel
return X

if sparse.issparse(X_sel) or sparse.issparse(X_not_sel):
return sparse.hstack((X_sel, X_not_sel))
else:
return np.hstack((X_sel, X_not_sel))
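
# A hedged usage sketch of the helper above (expected results follow from the
# documented behaviour; illustrative, not a doctest in this PR):
#
#   X = np.array([[1., 10.],
#                 [2., 20.]])
#   double = lambda Z: 2 * Z
#
#   # Default: transformed columns are stacked first, untouched ones move right.
#   _transform_selected(X, double, selected=[1])
#   # -> array([[20., 1.], [40., 2.]])
#
#   # retain_order=True keeps every column at its original index (dense only).
#   _transform_selected(X, double, selected=[1], retain_order=True)
#   # -> array([[ 1., 20.], [ 2., 40.]])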


class OneHotEncoder(BaseEstimator, TransformerMixin):