8000 ENH Scalable MiniBatchKMeans plus cln / fixes / refactoring by jeremiedbb · Pull Request #17622 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

ENH Scalable MiniBatchKMeans plus cln / fixes / refactoring #17622

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 93 commits into from
Apr 15, 2021
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
93 commits
Select commit Hold shift + click to select a range
e199214
refactoring
jeremiedbb Feb 24, 2020
7f85bca
wip
jeremiedbb Feb 27, 2020
ca728d5
wip
jeremiedbb Feb 28, 2020
b799aeb
wip
jeremiedbb Mar 2, 2020
b5b46f4
wip
jeremiedbb Mar 3, 2020
6fb2333
wip
jeremiedbb Mar 5, 2020
bcaa022
wip
jeremiedbb Mar 6, 2020
76c3589
wip
jeremiedbb Mar 6, 2020
231542d
wip
jeremiedbb Mar 6, 2020
aae7bcc
Merge branch 'master' into refactor-kmeans
jeremiedbb Mar 6, 2020
3f475f6
wip
jeremiedbb Mar 6, 2020
f73077b
wip
jeremiedbb Mar 9, 2020
21d5d24
wip
jeremiedbb Mar 9, 2020
a5f9cad
wip
jeremiedbb Mar 10, 2020
c4fb7a8
wip
jeremiedbb Mar 10, 2020
3713094
wip
jeremiedbb Mar 10, 2020
6a6fbfb
wip
jeremiedbb Mar 10, 2020
3be5343
wip
jeremiedbb Mar 10, 2020
2add01e
wip
jeremiedbb Apr 22, 2020
6730aa6
Merge remote-tracking branch 'upstream/master' into refactor-kmeans
jeremiedbb Jun 15, 2020
7d7ab15
wip
jeremiedbb Jun 15, 2020
0523c65
wip
jeremiedbb Jun 17, 2020
9186d85
Merge remote-tracking branch 'upstream/master' into refactor-kmeans
jeremiedbb Jun 17, 2020
2d789aa
wip
jeremiedbb Jun 17, 2020
37408e6
wip
jeremiedbb Jun 17, 2020
78915ac
wip
jeremiedbb Jun 17, 2020
fcc2718
wip
jeremiedbb Jun 17, 2020
73f1bc2
wip
jeremiedbb Jun 17, 2020
7325c89
wip
jeremiedbb Jun 17, 2020
a824566
wip
jeremiedbb Jun 18, 2020
a4edafb
reduce diff
jeremiedbb Jul 6, 2020
0993a85
reduce diff
jeremiedbb Jul 6, 2020
b308961
reduce diff
jeremiedbb Jul 6, 2020
121450b
reduce diff
jeremiedbb Jul 6, 2020
398e305
Merge remote-tracking branch 'upstream/master' into refactor-kmeans
jeremiedbb Jul 8, 2020
6c67dd1
reduce diff
jeremiedbb Jul 8, 2020
6e78c7e
reduce diff
jeremiedbb Jul 9, 2020
f13441b
reduce diff
jeremiedbb Jul 9, 2020
1b97c0b
Merge remote-tracking branch 'upstream/master' into refactor-kmeans
jeremiedbb Jul 9, 2020
f08d3d2
fix merge conflicts
jeremiedbb Jul 9, 2020
b712de6
Estimator
jeremiedbb Jul 9, 2020
3a1bbe8
Merge remote-tracking branch 'upstream/master' into refactor-kmeans
jeremiedbb Jul 13, 2020
4a2bf11
Merge remote-tracking branch 'upstream/master' into refactor-kmeans
jeremiedbb Jul 15, 2020
43a74ae
Merge remote-tracking branch 'upstream/master' into refactor-kmeans
jeremiedbb Jul 17, 2020
153a06f
cln
jeremiedbb Jul 17, 2020
d2d6832
cln
jeremiedbb Jul 17, 2020
690f5b9
cln
jeremiedbb Jul 17, 2020
158aeed
cln
jeremiedbb Jul 22, 2020
e11cedb
wip
jeremiedbb Jul 23, 2020
bbdabf5
threadpool-limit protection
jeremiedbb Jul 24, 2020
53691e4
idx
jeremiedbb Jul 24, 2020
24a267f
random_reassign
jeremiedbb Jul 24, 2020
146a93b
wip
jeremiedbb Jul 23, 2020
412864f
threadpool-limit protection
jeremiedbb Jul 24, 2020
421a041
idx
jeremiedbb Jul 24, 2020
a6862f8
random_reassign
jeremiedbb Jul 24, 2020
355627d
wip
jeremiedbb Jul 28, 2020
d5a8935
wip
jeremiedbb Jul 29, 2020
4d29bc3
wip
jeremiedbb Jul 31, 2020
0fdbc6e
Merge remote-tracking branch 'upstream/master' into refactor-kmeans
jeremiedbb Jul 31, 2020
de3180e
wip
jeremiedbb Jul 31, 2020
a3b55b7
ellipsis
jeremiedbb Jul 31, 2020
1a46b99
Merge branch 'refactor-kmeans' of github.com:jeremiedbb/scikit-learn …
jeremiedbb Jul 31, 2020
4d06cc2
idx
jeremiedbb Jul 31, 2020
d354434
wip
jeremiedbb Jul 31, 2020
c6a0456
avoid calling openmp_effective_n_threads again
jeremiedbb Jul 31, 2020
be97ca2
Merge branch 'master' into refactor-kmeans
jeremiedbb Aug 4, 2020
9c93037
cln
jeremiedbb Aug 4, 2020
3efab58
Merge branch 'master' into refactor-kmeans
jeremiedbb Oct 23, 2020
00259a7
Merge branch 'refactor-kmeans' of github.com:jeremiedbb/scikit-learn …
jeremiedbb Oct 23, 2020
d263d30
fix merging mistake
jeremiedbb Oct 23, 2020
d2ee373
Merge branch 'master' into refactor-kmeans
jeremiedbb Nov 3, 2020
b14492a
merge master
jeremiedbb Nov 3, 2020
a3e1b11
change batch_size default
jeremiedbb Nov 3, 2020
10695c6
actually change batch size
jeremiedbb Nov 3, 2020
3d2bc49
Merge remote-tracking branch 'upstream/master' into refactor-kmeans
jeremiedbb Nov 13, 2020
e4e15f5
reassignment_ratio docstring
jeremiedbb Nov 13, 2020
214feca
Merge branch 'master' into refactor-kmeans
jeremiedbb Jan 28, 2021
5f4f065
cln
jeremiedbb Jan 28, 2021
35f6e58
Merge branch 'master' into refactor-kmeans
jeremiedbb Jan 28, 2021
ab4310a
make n_iter_ count number of started epochs
jeremiedbb Jan 28, 2021
5aafed3
improve tests and docs
jeremiedbb Feb 5, 2021
0a71a92
don't move kmpp
jeremiedbb Mar 12, 2021
eacb6cc
address comments
jeremiedbb Mar 12, 2021
c71b5b5
add what's new entry
jeremiedbb Mar 12, 2021
5d4e3d9
remove warning in test
jeremiedbb Mar 12, 2021
42435f1
Merge branch 'master' into refactor-kmeans
jeremiedbb Mar 12, 2021
ebefe18
lint
jeremiedbb Mar 12, 2021
14bac02
Merge branch 'master' into refactor-kmeans
jeremiedbb Apr 7, 2021
be0c948
adress comments
jeremiedbb Apr 7, 2021
5ff60c8
Update sklearn/cluster/_kmeans.py
jeremiedbb Apr 7, 2021
e30d8ef
Merge branch 'master' into refactor-kmeans
jeremiedbb Apr 14, 2021
b18f57b
Merge branch 'refactor-kmeans' of github.com:jeremiedbb/scikit-learn …
jeremiedbb Apr 14, 2021
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8000
22 changes: 20 additions & 2 deletions doc/whats_new/v1.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -95,12 +95,30 @@ Changelog
in multicore settings. :pr:`19052` by
:user:`Yusuke Nagasaka <YusukeNagasaka>`.

- |API| :class:`cluster.Birch` attributes, `fit_` and `partial_fit_`, are
deprecated and will be removed in 1.2. :pr:`19297` by `Thomas Fan`_.
- |Efficiency| :class:`cluster.MiniBatchKMeans` is now faster in multicore
settings. :pr:`17622` by :user:`Jérémie du Boisberranger <jeremiedbb>`.

- |Fix| Fixed a bug in :class:`cluster.MiniBatchKMeans` where the sample
weights were partially ignored when the input is sparse. :pr:`17622` by
:user:`Jérémie du Boisberranger <jeremiedbb>`.

- |Fix| Improved convergence detection based on center change in
:class:`cluster.MiniBatchKMeans` which was almost never achievable.
:pr:`17622` by :user:`Jérémie du Boisberranger <jeremiedbb>`.

- |FIX| :class:`cluster.AgglomerativeClustering` now supports readonly
memory-mapped datasets. :pr:`19883` by `Julien Jerphanion <jjerphan>`.

- |API| :class:`cluster.Birch` attributes, `fit_` and `partial_fit_`, are
deprecated and will be removed in 1.2. :pr:`19297` by `Thomas Fan`_.

- |API| the default value for the `batch_size` parameter of
:class:`MiniBatchKMeans` was changed from 100 to 1024 due to efficiency
reasons. The `n_iter_` attribute of :class:`MiniBatchKMeans` now reports the
number of started epochs and the `n_steps_` attribute reports the number of
mini batches processed. :pr:`17622`
by :user:`Jérémie du Boisberranger <jeremiedbb>`.

:mod:`sklearn.compose`
......................

Expand Down
File renamed without changes.
Original file line number Diff line number Diff line change
Expand Up @@ -14,18 +14,15 @@

import numpy as np
cimport numpy as np
cimport cython
from cython cimport floating
from cython.parallel cimport prange
from libc.math cimport sqrt

from ..utils.extmath import row_norms


np.import_array()

ctypedef np.float64_t DOUBLE
ctypedef np.int32_t INT


# Number of samples per data chunk defined as a global constant.
CHUNK_SIZE = 256
Expand Down Expand Up @@ -103,7 +100,8 @@ cpdef floating _inertia_dense(
np.ndarray[floating, ndim=2, mode='c'] X, # IN
floating[::1] sample_weight, # IN
floating[:, ::1] centers, # IN
int[::1] labels): # IN
int[::1] labels, # IN
int n_threads):
"""Compute inertia for dense input data

Sum of squared distance between each sample and its assigned center.
Expand All @@ -116,7 +114,8 @@ cpdef floating _inertia_dense(
floating sq_dist = 0.0
floating inertia = 0.0

for i in range(n_samples):
for i in prange(n_samples, nogil=True, num_threads=n_threads,
schedule='static'):
j = labels[i]
sq_dist = _euclidean_dense_dense(&X[i, 0], &centers[j, 0],
n_features, True)
Expand All @@ -129,7 +128,8 @@ cpdef floating _inertia_sparse(
X, # IN
floating[::1] sample_weight, # IN
floating[:, ::1] centers, # IN
int[::1] labels): # IN
int[::1] labels, # IN
int n_threads):
"""Compute inertia for sparse input data

Sum of squared distance between each sample and its assigned center.
Expand All @@ -148,7 +148,8 @@ cpdef floating _inertia_sparse(

floating[::1] centers_squared_norms = row_norms(centers, squared=True)

for i in range(n_samples):
for i in prange(n_samples, nogil=True, num_threads=n_threads,
schedule='static'):
j = labels[i]
sq_dist = _euclidean_sparse_dense(
X_data[X_indptr[i]: X_indptr[i + 1]],
Expand Down Expand Up @@ -286,104 +287,3 @@ cdef void _center_shift(
for j in range(n_clusters):
center_shift[j] = _euclidean_dense_dense(
&centers_new[j, 0], &centers_old[j, 0], n_features, False)


def _mini_batch_update_csr(X, np.ndarray[floating, ndim=1] sample_weight,
np.ndarray[floating, ndim=1] x_squared_norms,
np.ndarray[floating, ndim=2] centers,
np.ndarray[floating, ndim=1] weight_sums,
np.ndarray[INT, ndim=1] nearest_center,
np.ndarray[floating, ndim=1] old_center,
int compute_squared_diff):
"""Incremental update of the centers for sparse MiniBatchKMeans.

Parameters
----------

X : CSR matrix, dtype float
The complete (pre allocated) training set as a CSR matrix.

centers : array, shape (n_clusters, n_features)
The cluster centers

counts : array, shape (n_clusters,)
The vector in which we keep track of the numbers of elements in a
cluster

Returns
-------
inertia : float
The inertia of the batch prior to centers update, i.e. the sum
of squared distances to the closest center for each sample. This
is the objective function being minimized by the k-means algorithm.

squared_diff : float
The sum of squared update (squared norm of the centers position
change). If compute_squared_diff is 0, this computation is skipped and
0.0 is returned instead.

Both squared diff and inertia are commonly used to monitor the convergence
of the algorithm.
"""
cdef:
np.ndarray[floating, ndim=1] X_data = X.data
np.ndarray[int, ndim=1] X_indices = X.indices
np.ndarray[int, ndim=1] X_indptr = X.indptr
unsigned int n_samples = X.shape[0]
unsigned int n_clusters = centers.shape[0]
unsigned int n_features = centers.shape[1]

unsigned int sample_idx, center_idx, feature_idx
unsigned int k
DOUBLE old_weight_sum, new_weight_sum
DOUBLE center_diff
DOUBLE squared_diff = 0.0

# move centers to the mean of both old and newly assigned samples
for center_idx in range(n_clusters):
old_weight_sum = weight_sums[center_idx]
new_weight_sum = old_weight_sum

# count the number of samples assigned to this center
for sample_idx in range(n_samples):
if nearest_center[sample_idx] == center_idx:
new_weight_sum += sample_weight[sample_idx]

if new_weight_sum == old_weight_sum:
# no new sample: leave this center as it stands
continue

# rescale the old center to reflect it previous accumulated weight
# with regards to the new data that will be incrementally contributed
if compute_squared_diff:
old_center[:] = centers[center_idx]
centers[center_idx] *= old_weight_sum

# iterate of over samples assigned to this cluster to move the center
# location by inplace summation
for sample_idx in range(n_samples):
if nearest_center[sample_idx] != center_idx:
continue

# inplace sum with new samples that are members of this cluster
# and update of the incremental squared difference update of the
# center position
for k in range(X_indptr[sample_idx], X_indptr[sample_idx + 1]):
centers[center_idx, X_indices[k]] += X_data[k]

# inplace rescale center with updated count
if new_weight_sum > old_weight_sum:
# update the count statistics for this center
weight_sums[center_idx] = new_weight_sum

# re-scale the updated center with the total new counts
centers[center_idx] /= new_weight_sum

# update the incremental computation of the squared total
# centers position change
if compute_squared_diff:
for feature_idx in range(n_features):
squared_diff += (old_center[feature_idx]
- centers[center_idx, feature_idx]) ** 2

return squared_diff
14 changes: 7 additions & 7 deletions sklearn/cluster/_k_means_elkan.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -18,13 +18,13 @@ from libc.stdlib cimport calloc, free
from libc.string cimport memset, memcpy

from ..utils.extmath import row_norms
from ._k_means_fast import CHUNK_SIZE
from ._k_means_fast cimport _relocate_empty_clusters_dense
from ._k_means_fast cimport _relocate_empty_clusters_sparse
from ._k_means_fast cimport _euclidean_dense_dense
from ._k_means_fast cimport _euclidean_sparse_dense
from ._k_means_fast cimport _average_centers
from ._k_means_fast cimport _center_shift
from ._k_means_common import CHUNK_SIZE
from ._k_means_common cimport _relocate_empty_clusters_dense
from ._k_means_common cimport _relocate_empty_clusters_sparse
from ._k_means_common cimport _euclidean_dense_dense
from ._k_means_common cimport _euclidean_sparse_dense
from ._k_means_common cimport _average_centers
from ._k_means_common cimport _center_shift


np.import_array()
Expand Down
10 changes: 5 additions & 5 deletions sklearn/cluster/_k_means_lloyd.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -11,16 +11,16 @@ cimport numpy as np
from cython cimport floating
from cython.parallel import prange, parallel
from libc.stdlib cimport malloc, calloc, free
from libc.string cimport memset, memcpy
from libc.string cimport memset
from libc.float cimport DBL_MAX, FLT_MAX

from ..utils.extmath import row_norms
from ..utils._cython_blas cimport _gemm
from ..utils._cython_blas cimport RowMajor, Trans, NoTrans
from ._k_means_fast import CHUNK_SIZE
from ._k_means_fast cimport _relocate_empty_clusters_dense
from ._k_means_fast cimport _relocate_empty_clusters_sparse
from ._k_means_fast cimport _average_centers, _center_shift
from ._k_means_common import CHUNK_SIZE
from ._k_means_common cimport _relocate_empty_clusters_dense
from ._k_means_common cimport _relocate_empty_clusters_sparse
from ._k_means_common cimport _average_centers, _center_shift


np.import_array()
Expand Down
Loading
0