[MRG] Efficiency updates to KBinsDiscretizer by glevv · Pull Request #19290 · scikit-learn/scikit-learn · GitHub
[MRG] Efficiency updates to KBinsDiscretizer #19290


Closed · wants to merge 79 commits

Commits (79)
d93b74c
KBD changes
glevv Jan 26, 2021
59a84d6
Small fix
glevv Jan 26, 2021
566ac59
Added checks for n_bins=str
glevv Jan 26, 2021
675e1e6
Lint changes
glevv Jan 28, 2021
ab4f868
Changed behaviour to catch n_bins<2 in 'auto'
glevv Jan 28, 2021
87579ed
Update _discretization.py
glevv Jan 28, 2021
44ea08d
Update _discretization.py
glevv Jan 28, 2021
731de9d
Update sklearn/preprocessing/_discretization.py
glevv Jan 28, 2021
55b84f6
Update sklearn/preprocessing/_discretization.py
glevv Jan 28, 2021
fc9f935
Update test_discretization.py
glevv Jan 28, 2021
f5bff72
Update test_discretization.py
glevv Jan 28, 2021
a8e24e5
Update _discretization.py
glevv Jan 28, 2021
c363827
Update _discretization.py
glevv Jan 28, 2021
6d767c5
Update _discretization.py
glevv Jan 28, 2021
dc6b095
Update _discretization.py
glevv Jan 28, 2021
1dac5f9
Update _discretization.py
glevv Jan 28, 2021
abff576
Update _discretization.py
glevv Jan 29, 2021
bcb118d
Update _discretization.py
glevv Jan 29, 2021
cca972c
Update test_discretization.py
glevv Jan 29, 2021
7b552e6
Update _discretization.py
glevv Jan 29, 2021
617bf90
Update _discretization.py
glevv Jan 29, 2021
9563a1c
Update _discretization.py
glevv Jan 29, 2021
28dbbc5
Update _discretization.py
glevv Jan 29, 2021
3e5a86d
Update _discretization.py
glevv Jan 29, 2021
0d7cc14
Update sklearn/preprocessing/_discretization.py
glevv Jan 29, 2021
0c9eb18
Update sklearn/preprocessing/_discretization.py
glevv Jan 29, 2021
119976b
Update test_discretization.py
glevv Jan 29, 2021
fdcbd29
Update _discretization.py
glevv Jan 29, 2021
fe5bc1d
added test for auto
glevv Jan 29, 2021
aede54a
Update test_discretization.py
glevv Jan 29, 2021
a0b04da
Update test_discretization.py
glevv Jan 29, 2021
ca735de
Update sklearn/preprocessing/_discretization.py
glevv Jan 31, 2021
08f5467
Update sklearn/preprocessing/tests/test_discretization.py
glevv Jan 31, 2021
eba8dbb
Update test_discretization.py
glevv Jan 31, 2021
1eb2322
Update _discretization.py
glevv Jan 31, 2021
d623a3c
Update test_docstring_parameters.py
glevv Jan 31, 2021
78cdde1
Update test_common.py
glevv Jan 31, 2021
396d4a8
Update v1.0.rst
glevv Jan 31, 2021
f577a03
Update v1.0.rst
glevv Jan 31, 2021
16abf9e
Update _discretization.py
glevv Jan 31, 2021
94f9c87
Update _discretization.py
glevv Jan 31, 2021
f499549
Update test_docstring_parameters.py
glevv Jan 31, 2021
e266fad
Update v1.0.rst
glevv Jan 31, 2021
cb72479
Update v1.0.rst
glevv Jan 31, 2021
66a4682
Update v1.0.rst
glevv Jan 31, 2021
840c77a
Update _discretization.py
glevv Jan 31, 2021
9cdb920
Update test_common.py
glevv Jan 31, 2021
ff127f5
Update _discretization.py
glevv Jan 31, 2021
641d58d
Update test_docstring_parameters.py
glevv Jan 31, 2021
3e82bee
Update v1.0.rst
glevv Jan 31, 2021
71520f3
Update test_common.py
glevv Jan 31, 2021
76c68a7
Update estimator_checks.py
glevv Jan 31, 2021
4e01fd5
Update estimator_checks.py
glevv Jan 31, 2021
71bb2e5
Update _discretization.py
glevv Jan 31, 2021
28ae05e
Update test_discretization.py
glevv Jan 31, 2021
835a514
Update estimator_checks.py
glevv Jan 31, 2021
6ff5ee6
Update _discretization.py
glevv Jan 31, 2021
01ad4de
Update _discretization.py
glevv Jan 31, 2021
fce49a5
Update _discretization.py
glevv Feb 1, 2021
0d9ffca
Merge branch 'kbd_changes' into main
glevv Feb 2, 2021
2303075
Revert "DOC Add URL to reference of Minka paper used in PCA (#19207)"
glevv Feb 2, 2021
4f99c48
Revert "DOC update Keras description in related projects (#19265)"
glevv Feb 2, 2021
cdee357
Revert "CLN Removes duplicated or unneeded code in ColumnTransformer …
glevv Feb 2, 2021
667c7b8
Merge pull request #1 from GLevV/kbd_changes
glevv Feb 2, 2021
6e69d3a
Merge branch 'main' of https://github.com/GLevV/scikit-learn into main
glevv Feb 2, 2021
15ba412
Revert "Kbd changes"
glevv Feb 8, 2021
cb320b6
Merge pull request #2 from GLevV/revert-1-kbd_changes
glevv Feb 8, 2021
f7f394a
reverse
glevv Feb 8, 2021
173f18f
reverse
glevv Feb 8, 2021
192a37c
reverse
glevv Feb 8, 2021
f003f99
reverse
glevv Feb 8, 2021
181e0c4
reverse
glevv Feb 8, 2021
4a3380a
Update _discretization.py
glevv Feb 8, 2021
cf29075
reverse
glevv Apr 20, 2021
f4d30ab
reverse
glevv Apr 20, 2021
334011a
Merge pull request #3 from scikit-learn/main
glevv Apr 20, 2021
6e3f50f
Update _discretization.py
glevv Apr 20, 2021
0b32083
Merge branch 'kbd_changes' into main
glevv Apr 20, 2021
e8ecd14
Merge pull request #5 from GLevV/main
glevv Apr 20, 2021
4 changes: 2 additions & 2 deletions doc/related_projects.rst
@@ -151,8 +151,8 @@ and tasks.
- `nolearn <https://github.com/dnouri/nolearn>`_ A number of wrappers and
abstractions around existing neural network libraries

- `Keras <https://www.tensorflow.org/api_docs/python/tf/keras>`_ High-level API for
TensorFlow with a scikit-learn inspired API.
- `keras <https://github.com/fchollet/keras>`_ Deep Learning library capable of
running on top of either TensorFlow or Theano.

- `lasagne <https://github.com/Lasagne/Lasagne>`_ A lightweight library to
build and train neural networks in Theano.
4 changes: 4 additions & 0 deletions doc/whats_new/v1.0.rst
@@ -386,6 +386,10 @@ Changelog
supporting sparse matrix and raise the appropriate error message.
:pr:`19879` by :user:`Guillaume Lemaitre <glemaitre>`.

- |Efficiency| Changed ``algorithm`` argument for :class:`cluster.KMeans` in
:class:`preprocessing.KBinsDiscretizer` from ``auto`` to ``full``.
:pr:`19290` by :user:`Gleb Levitskiy <GLevV>`.

:mod:`sklearn.tree`
...................

17 changes: 12 additions & 5 deletions sklearn/compose/_column_transformer.py
@@ -19,6 +19,7 @@
from ..utils import Bunch
from ..utils import _safe_indexing
from ..utils import _get_column_indices
from ..utils import _determine_key_type
from ..utils.metaestimators import _BaseComposition
from ..utils.validation import check_array, check_is_fitted
from ..utils.validation import _deprecate_positional_args
@@ -327,6 +328,12 @@ def _validate_remainder(self, X):
"'passthrough', or estimator. '%s' was passed instead" %
self.remainder)

# Make it possible to check for reordered named columns on transform
self._has_str_cols = any(_determine_key_type(cols) == 'str'
for cols in self._columns)
if hasattr(X, 'columns'):
self._df_columns = X.columns

self._n_features = X.shape[1]
cols = []
for columns in self._columns:
@@ -362,12 +369,12 @@ def get_feature_names(self):
if trans == 'drop' or _is_empty_column_selection(column):
continue
if trans == 'passthrough':
if self._feature_names_in is not None:
if hasattr(self, '_df_columns'):
if ((not isinstance(column, slice))
and all(isinstance(col, str) for col in column)):
feature_names.extend(column)
else:
feature_names.extend(self._feature_names_in[column])
feature_names.extend(self._df_columns[column])
else:
indices = np.arange(self._n_features)
feature_names.extend(['x%d' % i for i in indices[column]])
@@ -463,7 +470,7 @@ def _fit_transform(self, X, y, func, fitted=False):
message_clsname='ColumnTransformer',
message=self._log_message(name, idx, len(transformers)))
for idx, (name, trans, column, weight) in enumerate(
transformers, 1))
self._iter(fitted=fitted, replace_strings=True), 1))
except ValueError as e:
if "Expected 2D array, got 1D array instead" in str(e):
raise ValueError(_ERR_MSG_1DCOLUMN) from e
@@ -629,9 +636,9 @@ def _sk_visual_block_(self):
transformers = self.transformers
elif hasattr(self, "_remainder"):
remainder_columns = self._remainder[2]
if self._feature_names_in is not None:
if hasattr(self, '_df_columns'):
remainder_columns = (
self._feature_names_in[remainder_columns].tolist()
self._df_columns[remainder_columns].tolist()
)
transformers = chain(self.transformers,
[('remainder', self.remainder,
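To make the attribute change in this diff concrete, here is a standalone sketch of the passthrough branch of `get_feature_names`: a list of string selectors is already a list of names, other selectors index into the DataFrame columns remembered at fit time (`_df_columns`), and generated `x%d` names are the last resort. The helper name `passthrough_feature_names` is mine for illustration; it is not scikit-learn API.

```python
import numpy as np

def passthrough_feature_names(column, df_columns, n_features):
    """Illustrative reconstruction of the passthrough naming fallback."""
    if df_columns is not None:
        # A list of string selectors already is the list of names.
        if not isinstance(column, slice) and all(
                isinstance(col, str) for col in column):
            return list(column)
        # Otherwise index into the remembered DataFrame columns.
        return [str(c) for c in np.asarray(df_columns)[column]]
    # No DataFrame was seen at fit time: fall back to positional names.
    indices = np.arange(n_features)
    return ['x%d' % i for i in indices[column]]

print(passthrough_feature_names(['a', 'b'], ['a', 'b', 'c'], 3))  # ['a', 'b']
print(passthrough_feature_names([0, 2], ['a', 'b', 'c'], 3))      # ['a', 'c']
print(passthrough_feature_names([0, 2], None, 3))                 # ['x0', 'x2']
```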
30 changes: 12 additions & 18 deletions sklearn/decomposition/_pca.py
@@ -32,8 +32,7 @@ def _assess_dimension(spectrum, rank, n_samples):
"""Compute the log-likelihood of a rank ``rank`` dataset.

The dataset is assumed to be embedded in gaussian noise of shape(n,
dimf) having spectrum ``spectrum``. This implements the method of
T. P. Minka.
dimf) having spectrum ``spectrum``.

Parameters
----------
@@ -51,11 +50,10 @@ def _assess_dimension(spectrum, rank, n_samples):
ll : float
The log-likelihood.

References
----------
Notes
-----
This implements the method of `Thomas P. Minka:
Automatic Choice of Dimensionality for PCA. NIPS 2000: 598-604
<https://proceedings.neurips.cc/paper/2000/file/7503cfacd12053d309b6bed5c89de212-Paper.pdf>`_
Automatic Choice of Dimensionality for PCA. NIPS 2000: 598-604`
"""

n_features = spectrum.shape[0]
@@ -274,30 +272,26 @@ class PCA(_BasePCA):

References
----------
For n_components == 'mle', this class uses the method from:
`Minka, T. P.. "Automatic choice of dimensionality for PCA".
In NIPS, pp. 598-604 <https://tminka.github.io/papers/pca/minka-pca.pdf>`_
For n_components == 'mle', this class uses the method of *Minka, T. P.
"Automatic choice of dimensionality for PCA". In NIPS, pp. 598-604*

Implements the probabilistic PCA model from:
`Tipping, M. E., and Bishop, C. M. (1999). "Probabilistic principal
Tipping, M. E., and Bishop, C. M. (1999). "Probabilistic principal
component analysis". Journal of the Royal Statistical Society:
Series B (Statistical Methodology), 61(3), 611-622.
<http://www.miketipping.com/papers/met-mppca.pdf>`_
via the score and score_samples methods.
See http://www.miketipping.com/papers/met-mppca.pdf

For svd_solver == 'arpack', refer to `scipy.sparse.linalg.svds`.

For svd_solver == 'randomized', see:
`Halko, N., Martinsson, P. G., and Tropp, J. A. (2011).
*Halko, N., Martinsson, P. G., and Tropp, J. A. (2011).
"Finding structure with randomness: Probabilistic algorithms for
constructing approximate matrix decompositions".
SIAM review, 53(2), 217-288.
<https://doi.org/10.1137/090771806>`_
and also
`Martinsson, P. G., Rokhlin, V., and Tygert, M. (2011).
SIAM review, 53(2), 217-288.* and also
*Martinsson, P. G., Rokhlin, V., and Tygert, M. (2011).
"A randomized algorithm for the decomposition of matrices".
Applied and Computational Harmonic Analysis, 30(1), 47-68
<https://doi.org/10.1016/j.acha.2010.02.003>`_.
Applied and Computational Harmonic Analysis, 30(1), 47-68.*

Examples
--------
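The Minka reference edited in this diff is what backs `n_components='mle'`. A minimal usage sketch, on synthetic data of my own (not from the PR): a rank-2 signal embedded in 5 dimensions with small isotropic noise, where Minka's criterion should pick out a low dimensionality.

```python
import numpy as np
from sklearn.decomposition import PCA

# Rank-2 signal in 5-D plus small isotropic Gaussian noise.
rng = np.random.RandomState(0)
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.01 * rng.normal(size=(500, 5))

# n_components='mle' applies Minka's automatic dimensionality choice;
# it requires the exact ('full') SVD solver.
pca = PCA(n_components='mle', svd_solver='full').fit(X)
print(pca.n_components_)  # small: the MLE detects the low-rank structure
```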
3 changes: 2 additions & 1 deletion sklearn/preprocessing/_discretization.py
@@ -205,7 +205,8 @@ def fit(self, X, y=None):
init = (uniform_edges[1:] + uniform_edges[:-1])[:, None] * 0.5

# 1D k-means procedure
km = KMeans(n_clusters=n_bins[jj], init=init, n_init=1)
km = KMeans(n_clusters=n_bins[jj], init=init,
n_init=1, algorithm='full')
centers = km.fit(column[:, None]).cluster_centers_[:, 0]
# Must sort, centers may be unsorted even with sorted init
centers.sort()
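For context on this one-line change: `algorithm='full'` pins `KMeans` to plain Lloyd iterations instead of the default dispatch, which may select the Elkan variant whose triangle-inequality bookkeeping buys nothing for 1-D data (newer scikit-learn renamed `'full'` to `'lloyd'`). The NumPy-only sketch below is my reconstruction of the binning step, mirroring the diff's variable names; it is not the library source.

```python
import numpy as np

def lloyd_1d(column, init_centers, n_iter=20):
    """Plain Lloyd's k-means in one dimension -- what algorithm='full' runs."""
    centers = init_centers.astype(float).copy()
    for _ in range(n_iter):
        # Assign each point to its nearest center.
        labels = np.argmin(np.abs(column[:, None] - centers[None, :]), axis=1)
        # Move each center to the mean of its assigned points.
        for k in range(len(centers)):
            pts = column[labels == k]
            if len(pts):
                centers[k] = pts.mean()
    return np.sort(centers)  # centers may be unsorted even with sorted init

rng = np.random.RandomState(0)
column = rng.normal(size=200)
n_bins = 5

# Deterministic init, as in the diff: midpoints of uniform bin edges.
uniform_edges = np.linspace(column.min(), column.max(), n_bins + 1)
init = (uniform_edges[1:] + uniform_edges[:-1]) * 0.5

centers = lloyd_1d(column, init)
# Bin edges are midpoints between consecutive sorted centers.
bin_edges = np.r_[column.min(),
                  (centers[1:] + centers[:-1]) * 0.5,
                  column.max()]
print(bin_edges.shape[0])  # prints 6 (n_bins + 1 edges)
```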