Fix errorbars legend by vene · Pull Request #2 · ogrisel/scikit-learn · GitHub
Closed
68 commits
61af784
first stab at nearest center in cython (+30% perf, need check correct…
ogrisel Oct 9, 2011
6ade898
factorized label assignement as a reusable python func for the predic…
ogrisel Oct 9, 2011
ff9bd5b
use direct blas ddot call and reuse _assign_labels in predict
ogrisel Oct 10, 2011
2b04cff
FIX: broken test cause by the use of todense which return a matrix in…
ogrisel Oct 10, 2011
9aabbdb
WIP on simpler cython impl of the center update (still buggy)
ogrisel Oct 13, 2011
aa13538
compute inertia + remove code :)
ogrisel Oct 13, 2011
6cd6c30
update renamed function call
ogrisel Oct 13, 2011
87a6f5b
factorize dot product and bootstrap implementation for the dense case
ogrisel Oct 14, 2011
b7fe3bc
use cpdef + less array overhead in ddot
ogrisel Oct 15, 2011
5576ecf
started kmeans test suite refactoring
ogrisel Oct 15, 2011
6168d9c
more code factorization
ogrisel Oct 15, 2011
b2a8956
refactored the kmeans tests
ogrisel Oct 15, 2011
5f8d554
test and fix input checks for various dypes
ogrisel Oct 15, 2011
6d1dda8
much cheaper yet stable stopping criterion for the minibatch kmeans
ogrisel Oct 15, 2011
a14778e
Merge branch 'master' into minibatch-kmeans-optim
ogrisel Oct 25, 2011
c68a368
unused import
ogrisel Oct 27, 2011
983556e
low memory computation of the square diff
ogrisel Oct 30, 2011
054b682
be more consistent with the usual behavior of fitted attributes
ogrisel Oct 30, 2011
e7c02a3
base convergence detection on EWA inertia monitoring
ogrisel Oct 30, 2011
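The EWA-based convergence detection referenced in this commit can be sketched as follows; this is an illustrative reconstruction, not the PR's actual code, and the function name, smoothing factor, and exact stopping rule are assumptions:

```python
def ewa_early_stop(batch_inertias, alpha=0.3, max_no_improvement=10):
    """Return the batch index at which training would stop, or None.

    Maintains an exponentially weighted average (EWA) of the per-batch
    inertia and stops once the EWA has failed to improve on its best
    value for `max_no_improvement` consecutive batches.
    """
    ewa = None
    best = float("inf")
    stalled = 0
    for i, inertia in enumerate(batch_inertias):
        ewa = inertia if ewa is None else alpha * inertia + (1 - alpha) * ewa
        if ewa < best:
            best, stalled = ewa, 0
        else:
            stalled += 1
            if stalled >= max_no_improvement:
                return i
    return None

# A run whose per-batch inertia deteriorates stalls quickly and stops.
print(ewa_early_stop([100.0, 50.0] + [200.0] * 10, max_no_improvement=3))  # 4
```

Smoothing the noisy per-batch inertia before checking for improvement is what makes this criterion cheap (no full-dataset pass) yet stable.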
8170fb5
various cython cleanups
ogrisel Oct 31, 2011
744072e
working in progress to make it possible to use a speedy version based…
ogrisel Nov 1, 2011
76e6197
merge master
ogrisel Nov 1, 2011
e8ddec5
preparing new stopping criterion impl
ogrisel Nov 1, 2011
562fcae
Merge branch 'master' into minibatch-kmeans-optim
ogrisel Nov 1, 2011
7813be0
work in progress (broken tests) on early stopping with both tol and i…
ogrisel Nov 1, 2011
e13b3f0
make min_dist test more explicit
ogrisel Nov 2, 2011
77c9663
fixed broken test
ogrisel Nov 2, 2011
d17fba0
optimize label assignment for dense minibatch and new test
ogrisel Nov 2, 2011
a2d136f
fix tests
ogrisel Nov 2, 2011
c10964d
fix tests
ogrisel Nov 2, 2011
8579119
start with zero counts in tests
ogrisel Nov 2, 2011
d117ec5
fix bug: x_squared_norms should follow the shuffle...
ogrisel Nov 3, 2011
77db343
ensure that the sparse and dense variant of the minibatch update comp…
ogrisel Nov 3, 2011
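The mini-batch center update that these commits keep consistent between the dense and sparse code paths follows the usual per-center learning-rate scheme (a step of 1/count toward each assigned sample); a sketch under that assumption, with illustrative names:

```python
import numpy as np

def minibatch_update(centers, counts, X_batch, labels):
    """Update `centers` in place with one mini-batch.

    Each center moves toward its assigned samples with a step size of
    1 / (total samples seen by that center), so early batches move
    centers a lot and later batches barely perturb them.
    """
    for x, k in zip(X_batch, labels):
        counts[k] += 1
        eta = 1.0 / counts[k]
        centers[k] = (1.0 - eta) * centers[k] + eta * x
    return centers, counts

centers = np.zeros((1, 2))
counts = np.zeros(1, dtype=int)
X_batch = np.array([[2.0, 0.0], [4.0, 0.0]])
minibatch_update(centers, counts, X_batch, labels=[0, 0])
print(centers[0])  # running mean of the assigned samples: [3. 0.]
```

With this step-size schedule each center is exactly the running mean of the samples assigned to it, which is why the dense and sparse variants can be checked against each other for identical results.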
82c3f62
better default value and parameter handling for max_no_improvement
ogrisel Nov 3, 2011
a0f2598
switch to lazy sampling with explicit index to divide memory usage al…
ogrisel Nov 3, 2011
28b4d88
more code simplification
ogrisel Nov 3, 2011
9224538
started example to check the convergence stability in various settings
ogrisel Nov 5, 2011
161430c
tracking changes from master
ogrisel Nov 5, 2011
113d394
merge master
ogrisel Nov 5, 2011
37df796
implemented n_init for MiniBatchKMeans
ogrisel Nov 6, 2011
4f9f32c
Merge branch 'master' into minibatch-kmeans-optim
ogrisel Nov 6, 2011
3d58c49
refactored the init logic for MiniBatchKMeans
ogrisel Nov 6, 2011
a68c85f
Merge branch 'master' into minibatch-kmeans-optim
ogrisel Nov 6, 2011
8858a80
fix stability and warning in tests
ogrisel Nov 6, 2011
9fbe559
make k-means++ work on sparse input and use it as default for MB k-means
ogrisel Nov 6, 2011
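k-means++ picks each new center with probability proportional to the squared distance to the nearest already-chosen center. A dense-input sketch of that D² sampling (the commit above extends the same idea to sparse input); names are illustrative, not the library's:

```python
import numpy as np

def kmeans_plusplus(X, n_clusters, rng):
    """Pick initial centers from the rows of X with D^2 weighting."""
    n_samples = X.shape[0]
    centers = [X[rng.randint(n_samples)]]  # first center: uniform pick
    for _ in range(n_clusters - 1):
        # Squared distance from each sample to its closest chosen center.
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        # Sample the next center proportionally to that distance.
        centers.append(X[rng.choice(n_samples, p=d2 / d2.sum())])
    return np.array(centers)

rng = np.random.RandomState(42)
X = np.concatenate([rng.randn(50, 2), rng.randn(50, 2) + 10])
centers = kmeans_plusplus(X, n_clusters=2, rng=rng)
print(centers.shape)  # (2, 2)
```

Because the weighting favors samples far from existing centers, well-separated blobs almost always receive one center each, which is what makes it a safer default init than uniform random sampling.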
768e471
add version info in deprecation message
ogrisel Nov 6, 2011
6af7996
factorized out the early stopping logic in a dedicated method
ogrisel Nov 6, 2011
982359c
first stab at a reinit strategy that work on low dim data only
ogrisel Nov 6, 2011
f87248d
new example to emphasize issues with current naive reinit scheme on s…
ogrisel Nov 6, 2011
3f1901c
second experiment on reinit that does not work on high dim sparse dat…
ogrisel Nov 7, 2011
d9aa128
Merge branch 'master' into minibatch-kmeans-optim
ogrisel Nov 14, 2011
e1325f2
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn
ogrisel Dec 19, 2011
50af0c3
track changes from master
ogrisel Dec 19, 2011
695ce94
pep8
ogrisel Dec 19, 2011
ae26835
fix k_means docstring to better match the scikit naming conventions
ogrisel Dec 19, 2011
6db1ff8
WIP: n_init refactoring
ogrisel Dec 19, 2011
70c0aa1
Merge branch 'master' into minibatch-kmeans-optim
ogrisel Dec 19, 2011
38d8444
scale tolerance of minibatch kmeans on CSR input variance
ogrisel Dec 19, 2011
29fa29f
delete broken example
ogrisel Dec 19, 2011
f9aacba
example script is not meant to be executed when building the doc as i…
ogrisel Dec 19, 2011
fde9807
Add score method to KMeans.
mblondel Dec 19, 2011
af05e76
Merge branch 'master' into minibatch-kmeans-optim
ogrisel Dec 20, 2011
2f3cca5
typo: accross => across
ogrisel Dec 20, 2011
da38d74
Use python int for indices and indptr of scipy sparse matrices to ens…
ogrisel Dec 20, 2011
adfae9b
Make init less expensive by default on MinibatchKMeans to avoid domin…
ogrisel Dec 20, 2011
e93482e
Fix broken duplicated / tests and more practical init
ogrisel Dec 20, 2011
712301d
Merge branch 'master' into ogrisel_minibatch-kmeans-optim
vene Dec 20, 2011
a9e013b
Fix legend for errorbar plot
vene Dec 20, 2011
4 changes: 2 additions & 2 deletions .gitignore
@@ -5,8 +5,8 @@
*.swp
.DS_Store
build
scikits/learn/datasets/__config__.py
scikits/learn/**/*.html
sklearn/datasets/__config__.py
sklearn/**/*.html

dist/
doc/_build/
121 changes: 121 additions & 0 deletions examples/cluster/kmeans_stability_low_dim_dense.py
@@ -0,0 +1,121 @@
"""
============================================================
Empirical evaluation of the impact of k-means initialization
============================================================

Evaluate the ability of k-means initialization strategies to make
the algorithm's convergence robust, as measured by the relative
standard deviation of the inertia of the clustering (i.e. the sum
of squared distances to the nearest cluster center).

The dataset used for evaluation is a 2D grid of widely spaced
isotropic Gaussian clusters.

"""
print __doc__

# Author: Olivier Grisel <olivier.grisel@ensta.org>
# License: Simplified BSD

import numpy as np
import pylab as pl
import matplotlib.cm as cm

from sklearn.utils import shuffle
from sklearn.utils import check_random_state
from sklearn.cluster import MiniBatchKMeans
from sklearn.cluster import KMeans

random_state = np.random.RandomState(0)

# Number of runs (with randomly generated datasets) for each strategy,
# so as to compute an estimate of the standard deviation
n_runs = 5

# k-means models can do several random inits so as to be able to trade
# CPU time for convergence robustness
n_init_range = np.array([1, 5, 10, 15, 20])

# Datasets generation parameters
n_samples_per_center = 100
grid_size = 3
scale = 0.1
n_clusters = grid_size ** 2


def make_data(random_state, n_samples_per_center, grid_size, scale):
random_state = check_random_state(random_state)
centers = np.array([[i, j]
for i in range(grid_size)
for j in range(grid_size)])
n_clusters_true, n_features = centers.shape

noise = random_state.normal(
scale=scale, size=(n_samples_per_center, centers.shape[1]))

X = np.concatenate([c + noise for c in centers])
y = np.concatenate([[i] * n_samples_per_center
for i in range(n_clusters_true)])
return shuffle(X, y, random_state=random_state)


fig = pl.figure()
plots = []
legends = []

cases = [
(KMeans, 'k-means++', {}),
(KMeans, 'random', {}),
(MiniBatchKMeans, 'k-means++', {'max_no_improvement': 3}),
(MiniBatchKMeans, 'random', {'max_no_improvement': 3, 'init_size': 500}),
]

for factory, init, params in cases:
print "Evaluation of %s with %s init" % (factory.__name__, init)
inertia = np.empty((len(n_init_range), n_runs))

for run_id in range(n_runs):
X, y = make_data(run_id, n_samples_per_center, grid_size, scale)
for i, n_init in enumerate(n_init_range):
km = factory(k=n_clusters,
init=init,
random_state=run_id,
n_init=n_init,
**params).fit(X)
inertia[i, run_id] = km.inertia_
print "Inertia for n_init=%02d, run_id=%d: %0.3f" % (
n_init, run_id, km.inertia_)

plots.append(
pl.errorbar(n_init_range, inertia.mean(axis=1), inertia.std(axis=1)))
n_reinit = params.get('n_reinit')
if n_reinit is not None:
legends.append("%s with %s init and %d reinit" % (
factory.__name__, init, n_reinit))
else:
legends.append("%s with %s init" % (factory.__name__, init))

plots = [plot[0] for plot in plots] # take only the first line in each plot
pl.xlabel('n_init')
pl.ylabel('inertia')
pl.legend(plots, legends)
pl.title("Mean inertia for various k-means init across %d runs" % n_runs)

# Part 2: qualitative visual inspection of the convergence

X, y = make_data(random_state, n_samples_per_center, grid_size, scale)
km = MiniBatchKMeans(k=n_clusters, init='random', n_init=1,
random_state=random_state).fit(X)

fig = pl.figure()
for k in range(n_clusters):
my_members = km.labels_ == k
color = cm.spectral(float(k) / n_clusters, 1)
pl.plot(X[my_members, 0], X[my_members, 1], 'o', marker='.', c=color)
cluster_center = km.cluster_centers_[k]
pl.plot(cluster_center[0], cluster_center[1], 'o',
markerfacecolor=color, markeredgecolor='k', markersize=6)
pl.title("Example cluster allocation with a single random init\n"
"with MiniBatchKMeans")

pl.show()
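The legend fix this PR is named after comes down to matplotlib's `errorbar` returning a container (data line, caps, and error-bar collections) rather than a bare line; handing the whole container to `legend` confuses it, so the script above keeps only the first element of each plot. A minimal standalone illustration using modern `matplotlib.pyplot` names:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, no display needed
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(5)
container = plt.errorbar(x, x ** 2, yerr=0.5)
# `errorbar` returns an ErrorbarContainer, not a Line2D; its first
# element is the data line, which is what `legend` should receive.
line = container[0]
plt.legend([line], ["quadratic"])
print(type(line).__name__)  # Line2D
```

In the matplotlib of the PR's era the return value was a plain tuple, but the fix is the same: index out the line before building the legend.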
5 changes: 3 additions & 2 deletions examples/cluster/plot_mini_batch_kmeans.py
@@ -35,7 +35,7 @@
##############################################################################
# Compute clustering with Means

k_means = KMeans(init='k-means++', k=3)
k_means = KMeans(init='k-means++', k=3, n_init=10)
t0 = time.time()
k_means.fit(X)
t_batch = time.time() - t0
@@ -46,7 +46,8 @@
##############################################################################
# Compute clustering with MiniBatchKMeans

mbk = MiniBatchKMeans(init='k-means++', k=3, chunk_size=batch_size)
mbk = MiniBatchKMeans(init='k-means++', k=3, batch_size=batch_size,
n_init=10, max_no_improvement=10, verbose=0)
t0 = time.time()
mbk.fit(X)
t_mini_batch = time.time() - t0
8 changes: 4 additions & 4 deletions examples/document_clustering.py
@@ -49,7 +49,6 @@
print "%d categories" % len(dataset.target_names)
print

# split a training set and a test set
labels = dataset.target
true_k = np.unique(labels).shape[0]

@@ -63,10 +62,11 @@
print

###############################################################################
# Now sparse MiniBatchKmeans
# Sparse MiniBatchKmeans

mbkm = MiniBatchKMeans(init="random", k=true_k, max_iter=10, random_state=13,
chunk_size=1000, verbose=0)
mbkm = MiniBatchKMeans(k=true_k, init='k-means++', n_init=1,
init_size=1000,
batch_size=1000, verbose=1)
print "Clustering sparse data with %s" % mbkm
t0 = time()
mbkm.fit(X)
2 changes: 1 addition & 1 deletion sklearn/base.py
@@ -172,7 +172,7 @@ def _get_params(self, deep=True):
"""
out = dict()
for key in self._get_param_names():
value = getattr(self, key)
value = getattr(self, key, None)
if deep and hasattr(value, '_get_params'):
deep_items = value._get_params().items()
out.update((key + '__' + k, val) for k, val in deep_items)
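The one-line change to `_get_params` above swaps a hard attribute lookup for `getattr` with a `None` default, so introspecting an estimator that never stored one of its declared constructor parameters no longer raises. The difference in isolation, with a hypothetical estimator (not a class from the library):

```python
class Estimator:
    """Toy estimator that declares a parameter it never stores."""

    _param_names = ["alpha", "beta"]

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        # note: `beta` is listed in _param_names but never set

    def get_params(self):
        # `getattr(self, key)` would raise AttributeError for "beta";
        # the None default makes introspection robust instead.
        return {key: getattr(self, key, None) for key in self._param_names}

print(Estimator().get_params())  # {'alpha': 1.0, 'beta': None}
```

Returning `None` for the missing parameter keeps cloning and repr logic working even while an estimator is only partially initialized.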