Use sparse contingency matrix for supervised cluster metrics by stuppie · Pull Request #7203 · scikit-learn/scikit-learn · GitHub

Use sparse contingency matrix for supervised cluster metrics #7203


Closed · stuppie wants to merge 4 commits

Conversation

@stuppie (Contributor) commented on Aug 18, 2016:

See #4788.

Replaces max_n_classes.

In sklearn.metrics.cluster.supervised, constructing the contingency matrix with a large number of clusters (approx. >100k) runs out of memory (raises MemoryError):

>>> from random import randrange
>>> from sklearn.metrics.cluster import contingency_matrix
>>> labels_true = [randrange(100000) for x in range(100000)]
>>> labels_pred = [randrange(100000) for x in range(100000)]
>>> contingency = contingency_matrix(labels_true, labels_pred)
... MemoryError
>>> contingency = contingency_matrix(labels_true, labels_pred, sparse=True)
... no error

Using a sparse matrix instead allows construction of the contingency matrix.
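For context, a minimal sketch of how such a sparse contingency matrix can be built with scipy.sparse (illustrative only; the function name and exact code here are not the PR's actual diff):

import numpy as np
from scipy.sparse import coo_matrix

def sparse_contingency(labels_true, labels_pred):
    # np.unique(..., return_inverse=True) maps each label to an integer index
    classes, class_idx = np.unique(labels_true, return_inverse=True)
    clusters, cluster_idx = np.unique(labels_pred, return_inverse=True)
    # Each sample contributes a single 1 at (its class, its cluster);
    # duplicate coordinates are summed when converting COO to CSR.
    data = np.ones(len(class_idx), dtype=int)
    C = coo_matrix((data, (class_idx, cluster_idx)),
                   shape=(len(classes), len(clusters)))
    return C.tocsr()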

Modified adjusted_rand_score, homogeneity_completeness_v_measure, and mutual_info_score to accept a sparse matrix.

@jnothman (Member) commented:

We can't just drop max_n_classes now that it's part of the public API. It needs to be deprecated, unfortunately :(
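For illustration, a minimal sketch of what deprecating the parameter could look like (the changed default and the warning text are assumptions, not this PR's actual diff):

import warnings

def contingency_matrix(labels_true, labels_pred, eps=None,
                       max_n_classes=None, sparse=False):
    # Hypothetical deprecation path: accept the old parameter but warn.
    if max_n_classes is not None:
        warnings.warn("'max_n_classes' is deprecated and has no effect; "
                      "it will be removed in a future release.",
                      DeprecationWarning)
    ...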


@stuppie (Contributor, Author) commented on Aug 18, 2016:

@jnothman OK, now both are in.

@@ -46,7 +48,7 @@ def check_clusterings(labels_true, labels_pred):
     return labels_true, labels_pred


-def contingency_matrix(labels_true, labels_pred, eps=None, max_n_classes=5000):
+def contingency_matrix(labels_true, labels_pred, eps=None, max_n_classes=5000, sparse=False):
Review comment (Member):

PEP8 max line length = 79.

@jnothman (Member) commented:
I reckon we should just always use sparse=True when calculating contingency internally. It's hard to imagine a setting where a sparse matrix would be inefficient, i.e. where the contingency matrix is densely populated but there are many classes.

Corollaries:

  • This makes max_n_classes irrelevant. It should be deprecated from all the metric functions (but can be kept in contingency_matrix if we really want).
  • It's still okay to allow the user to supply their own contingency matrix, but I think the API for this remains questionable, so we should leave it out of this PR.

# For a sparse matrix (`find` is scipy.sparse.find; comb2(n) is n choose 2)
sum_comb_c = sum(comb2(n_c) for n_c in np.array(contingency.sum(axis=1)))
sum_comb_k = sum(comb2(n_k) for n_k in np.array(contingency.sum(axis=0)).T)
sum_comb = sum(comb2(n_ij) for n_ij in find(contingency)[2])
Review comment (Member):
If it's a _data_matrix surely contingency.data.flatten() will do without the more expensive find operation?

@stuppie (Contributor, Author) replied on Aug 18, 2016:
If it's a coo_matrix, data contains only 1s:

In [121]: C.toarray()
Out[121]: 
array([[5, 1, 0],
       [1, 4, 1],
       [0, 2, 3]])

In [122]: C.data
Out[122]: array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

I can do:

In [123]: C.tocsr().data
Out[123]: array([5, 1, 1, 4, 1, 2, 3], dtype=int64)

?
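Putting this exchange together, a self-contained sketch of the CSR-based variant (scipy.special.comb for the comb2 helper is an assumption about how that helper is defined):

import numpy as np
from scipy.sparse import coo_matrix
from scipy.special import comb

def comb2(n):
    # n choose 2: the number of unordered pairs among n items
    return comb(int(n), 2, exact=True)

C = coo_matrix(np.array([[5, 1, 0],
                         [1, 4, 1],
                         [0, 2, 3]]))

sum_comb_c = sum(comb2(n_c) for n_c in np.ravel(C.sum(axis=1)))
sum_comb_k = sum(comb2(n_k) for n_k in np.ravel(C.sum(axis=0)))
# COO -> CSR conversion sums duplicate entries, so .data then holds the
# actual cell counts rather than repeated 1s:
sum_comb = sum(comb2(n_ij) for n_ij in C.tocsr().data)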

@jnothman (Member) commented:
Thanks for the PEP8 fixes, @stuppie. What do you think of my always sparse suggestion? With the current version, it should be straightforward to run a few benchmarks to assess whether sparse=False is ever substantially more performant than sparse=True.
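One rough way to run such a benchmark (assuming contingency_matrix exposes the sparse flag added in this PR; sizes here are arbitrary):

from timeit import timeit
from random import randrange
from sklearn.metrics.cluster import contingency_matrix

labels_true = [randrange(100) for _ in range(10000)]
labels_pred = [randrange(100) for _ in range(10000)]

for sparse in (False, True):
    # Time 100 constructions of the contingency matrix for each mode
    t = timeit(lambda: contingency_matrix(labels_true, labels_pred,
                                          sparse=sparse), number=100)
    print("sparse=%s: %.3fs" % (sparse, t))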

@stuppie (Contributor, Author) commented on Aug 22, 2016:

@jnothman I think this would be fine, except that expected_mutual_information does not accept sparse matrices, so adjusted_mutual_info_score wouldn't work either. I suppose those two functions could explicitly set sparse=False.

@stuppie (Contributor, Author) commented on Aug 22, 2016:

@jnothman Here's a rough comparison: https://gist.github.com/stuppie/4d45459df3be477c5e46535aeb5c9b7e
I think it would be best to convert the coo_matrix to a CSC sparse matrix by default, and use that for the other metrics.

I also tried this with a small contingency matrix:

labels_true = [randint(0, 10) for _ in range(100)]
labels_pred = [randint(0, 10) for _ in range(100)]

But it says it's caching intermediate results, so I don't know what to make of it. The differences seem very small anyway.

@jnothman (Member) commented:

Yes, the point is to check the small-number-of-clusters case. I don't have enough time to review all the issues I have open, though I'd like to see this merged for 0.18, in particular if we can deprecate max_n_classes, which I find inappropriate! :) So if you can clearly make the case for whether or not it's safe to always use sparse=True, perhaps with some real clustering of (real or artificial) data, and post the results here, that would be great.

@amueller (Member) commented:

I thought max_n_classes was a hack to check whether someone provided floating-point numbers? (I didn't like it either ;)
What happens now if someone passes (unique) floats? No blow-up, I guess, but some garbage result. Can we maybe catch that?

@jnothman (Member) commented:

max_n_classes is bad API design if what we want is validation. Passing floats to a cluster scorer is a pretty bad violation of the documented usage... We can use type_of_target and throw an error, but we don't need to muddy our API with unnecessary parameters.

@amueller (Member) commented:

I don't think it's a great design either. I just wanted to make sure that we can catch the problem it was intended to solve (IIRC). Passing a float is bad, but crashing the computer because of RAM issues (I guess that will be less of an issue with sparse matrices) is also bad.

@amueller amueller added this to the 0.18 milestone Aug 26, 2016
@jnothman (Member) commented:

Putting regression output into a clustering metric is very unlikely to result in problems if a sparse matrix is used internally.

# Compute contingency matrix if we weren't given it
if contingency is None:
    contingency = contingency_matrix(labels_true, labels_pred,
                                     max_n_classes=max_n_classes)
@ogrisel (Member) commented on Sep 14, 2016:
This still builds a dense contingency matrix by default. Maybe we should have an internal heuristic: if n_classes is small, we use a dense array (because it's faster to compute), and if it's larger than some threshold, we switch to sparse.

Or we could decide to always build sparse contingency matrices internally to keep the code simpler, and expect that the difference in speed will never exceed a couple of seconds for small to medium inputs, while very large inputs will only be addressable via sparse contingency matrices anyway.
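A hypothetical sketch of such a heuristic (the helper name and threshold are illustrative, not part of the PR):

import numpy as np
from sklearn.metrics.cluster import contingency_matrix

def _auto_contingency(labels_true, labels_pred, max_dense_cells=10 ** 6):
    # Hypothetical heuristic: dense arrays are faster for small label
    # spaces; sparse avoids MemoryError when n_classes * n_clusters is huge.
    n_classes = np.unique(labels_true).shape[0]
    n_clusters = np.unique(labels_pred).shape[0]
    use_sparse = n_classes * n_clusters > max_dense_cells
    return contingency_matrix(labels_true, labels_pred, sparse=use_sparse)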

Reply (Member):

I'm benchmarking now. Will post a WIP to supersede this one.

@ogrisel (Member) commented on Sep 14, 2016:

> Putting regression output into a clustering metric is very unlikely to result in problems if a sparse matrix is used internally.

I agree. We could still issue a warning if n_samples == n_classes and the input types are float, but that should not crash.
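A sketch of what such a warning check might look like (purely illustrative; the function name and warning text are assumptions, not code from this PR):

import warnings
import numpy as np

def _warn_if_continuous(labels):
    # Warn when labels look like continuous regression output rather
    # than discrete cluster assignments: float dtype, all values unique.
    labels = np.asarray(labels)
    if labels.dtype.kind == 'f' and np.unique(labels).size == labels.size:
        warnings.warn("Clustering metrics expect discrete labels; these "
                      "values look like continuous (regression) output.")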

@jnothman (Member) commented:

Completed in #7419

@jnothman jnothman closed this Sep 14, 2016