Use sparse contingency matrix for supervised cluster metrics by stuppie · Pull Request #7203 · scikit-learn/scikit-learn · GitHub

Use sparse contingency matrix for supervised cluster metrics #7203


Closed · stuppie wants to merge 4 commits

Conversation

@stuppie (Contributor) commented on Aug 18, 2016:

See #4788.

Replaces max_n_classes.

In sklearn.metrics.cluster.supervised, constructing the contingency matrix with a large number of clusters (approx. >100k) runs out of memory (raises MemoryError):

>>> from random import randrange
>>> from sklearn.metrics.cluster import contingency_matrix
>>> labels_true = [randrange(100000) for x in range(100000)]
>>> labels_pred = [randrange(100000) for x in range(100000)]
>>> contingency = contingency_matrix(labels_true, labels_pred)
... MemoryError
>>> contingency = contingency_matrix(labels_true, labels_pred, sparse=True)
... no error

Using a sparse matrix instead allows construction of the contingency matrix.
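For context, a minimal sketch of how such a sparse contingency matrix can be built with scipy.sparse (illustrative only; the function name and exact code here are not the PR's actual diff):

import numpy as np
from scipy.sparse import coo_matrix

def sparse_contingency(labels_true, labels_pred):
    # np.unique(..., return_inverse=True) maps each label to an integer index
    classes, class_idx = np.unique(labels_true, return_inverse=True)
    clusters, cluster_idx = np.unique(labels_pred, return_inverse=True)
    # Each sample contributes a single 1 at (its class, its cluster);
    # duplicate coordinates are summed when converting COO to CSR.
    data = np.ones(len(class_idx), dtype=int)
    C = coo_matrix((data, (class_idx, cluster_idx)),
                   shape=(len(classes), len(clusters)))
    return C.tocsr()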

Modified adjusted_rand_score, homogeneity_completeness_v_measure, and mutual_info_score to accept a sparse matrix.

@jnothman (Member) commented:

We can't just drop max_n_classes now that it's part of the public API. It needs to be deprecated, unfortunately :(
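For illustration, a minimal sketch of what deprecating the parameter could look like (the changed default and the warning text are assumptions, not this PR's actual diff):

import warnings

def contingency_matrix(labels_true, labels_pred, eps=None,
                       max_n_classes=None, sparse=False):
    # Hypothetical deprecation path: accept the old parameter but warn.
    if max_n_classes is not None:
        warnings.warn("'max_n_classes' is deprecated and has no effect; "
                      "it will be removed in a future release.",
                      DeprecationWarning)
    ...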


@stuppie (Contributor, Author) commented on Aug 18, 2016:

@jnothman OK, now both are in.

@@ -46,7 +48,7 @@ def check_clusterings(labels_true, labels_pred):
     return labels_true, labels_pred


-def contingency_matrix(labels_true, labels_pred, eps=None, max_n_classes=5000):
+def contingency_matrix(labels_true, labels_pred, eps=None, max_n_classes=5000, sparse=False):
Review comment (Member):

PEP8 max line length = 79.

@jnothman (Member) commented:
I reckon we should just always use sparse=True when calculating contingency internally. It's hard to imagine a setting where a sparse matrix would be inefficient, i.e. where the contingency matrix is densely populated but there are many classes.

Corollaries:

  • This makes max_n_classes irrelevant. It should be deprecated from all the metric functions (but can be kept in contingency_matrix if we really want).
  • It's still okay to allow the user to supply their own contingency matrix, but I think the API for this remains questionable, so we should leave it out of this PR.

# For a sparse matrix (`find` is scipy.sparse.find; comb2(n) is n choose 2)
sum_comb_c = sum(comb2(n_c) for n_c in np.array(contingency.sum(axis=1)))
sum_comb_k = sum(comb2(n_k) for n_k in np.array(contingency.sum(axis=0)).T)
sum_comb = sum(comb2(n_ij) for n_ij in find(contingency)[2])
Review comment (Member):
If it's a _data_matrix surely contingency.data.flatten() will do without the more expensive find operation?

@stuppie (Contributor, Author) replied on Aug 18, 2016:
If it's a coo_matrix, data contains only 1s:

In [121]: C.toarray()
Out[121]: 
array([[5, 1, 0],
       [1, 4, 1],
       [0, 2, 3]])

In [122]: C.data
Out[122]: array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

I can do:

In [123]: C.tocsr().data
Out[123]: array([5, 1, 1, 4, 1, 2, 3], dtype=int64)

?
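Putting this exchange together, a self-contained sketch of the CSR-based variant (scipy.special.comb for the comb2 helper is an assumption about how that helper is defined):

import numpy as np
from scipy.sparse import coo_matrix
from scipy.special import comb

def comb2(n):
    # n choose 2: the number of unordered pairs among n items
    return comb(int(n), 2, exact=True)

C = coo_matrix(np.array([[5, 1, 0],
                         [1, 4, 1],
                         [0, 2, 3]]))

sum_comb_c = sum(comb2(n_c) for n_c in np.ravel(C.sum(axis=1)))
sum_comb_k = sum(comb2(n_k) for n_k in np.ravel(C.sum(axis=0)))
# COO -> CSR conversion sums duplicate entries, so .data then holds the
# actual cell counts rather than repeated 1s:
sum_comb = sum(comb2(n_ij) for n_ij in C.tocsr().data)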

@jnothman (Member) commented:
Thanks for the PEP8 fixes, @stuppie. What do you think of my always sparse suggestion? With the current version, it should be straightforward to run a few benchmarks to assess whether sparse=False is ever substantially more performant than sparse=True.
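One rough way to run such a benchmark (assuming contingency_matrix exposes the sparse flag added in this PR; sizes here are arbitrary):

from timeit import timeit
from random import randrange
from sklearn.metrics.cluster import contingency_matrix

labels_true = [randrange(100) for _ in range(10000)]
labels_pred = [randrange(100) for _ in range(10000)]

for sparse in (False, True):
    # Time 100 constructions of the contingency matrix for each mode
    t = timeit(lambda: contingency_matrix(labels_true, labels_pred,
                                          sparse=sparse), number=100)
    print("sparse=%s: %.3fs" % (sparse, t))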

@stuppie (Contributor, Author) commented on Aug 22, 2016:

@jnothman I think this would be fine, except that expected_mutual_information does not accept sparse matrices, so adjusted_mutual_info_score wouldn't work either. I suppose those two functions could explicitly set sparse=False.

@stuppie (Contributor, Author) commented on Aug 22, 2016:

@jnothman Here's a rough comparison: https://gist.github.com/stuppie/4d45459df3be477c5e46535aeb5c9b7e
I think it would be best to convert the coo_matrix to a CSC sparse matrix by default, and use that for the other metrics.

I also tried this with a small contingency matrix:

labels_true = [randint(0, 10) for _ in range(100)]
labels_pred = [randint(0, 10) for _ in range(100)]

But it says it's caching intermediate results, so I don't know what to make of it. The differences seem very small anyway.

@jnothman (Member) commented:

Yes, the point is to check the small-number-of-clusters case. I don't have enough time to review all the issues I have open, though I'd like to see this merged for 0.18, in particular if we can deprecate max_n_classes, which I find inappropriate! :) So if you can clearly make the case for whether or not it's safe to always use sparse=True, perhaps with some real clustering of (real or artificial) data, and post the results here, that would be great.

@amueller (Member) commented:

I thought max_n_classes was a hack to check whether someone provided floating-point numbers? (I didn't like it either ;)
What happens now if someone passes (unique) floats? No blow-up, I guess, but some garbage result. Can we maybe catch that?

@jnothman (Member) commented:

max_n_classes is bad API design if what we want is validation. Passing floats to a cluster scorer is a pretty bad violation of the documented usage... We can use type_of_target and throw an error, but we don't need to muddy our API with unnecessary parameters.

@amueller (Member) commented:

I don't think it's a great design either. I just wanted to make sure that we can catch the problem it was intended to solve (IIRC). Passing a float is bad, but crashing the computer because of RAM issues (I guess that will be less of an issue with sparse matrices) is also bad.

@amueller amueller added this to the 0.18 milestone Aug 26, 2016
@jnothman (Member) commented:

Putting regression output into a clustering metric is very unlikely to result in problems if a sparse matrix is used internally.

# Compute contingency matrix if we weren't given it
if contingency is None:
    contingency = contingency_matrix(labels_true, labels_pred,
                                     max_n_classes=max_n_classes)
@ogrisel (Member) commented on Sep 14, 2016:
This still builds a dense contingency matrix by default. Maybe we should have an internal heuristic: if n_classes is small, we use a dense array (because it's faster to compute), and if it's larger than some threshold, we switch to sparse.

Or we could decide to always build sparse contingency matrices internally to keep the code simpler, and expect that the difference in speed will never exceed a couple of seconds for small to medium inputs, while very large inputs will only be addressable via sparse contingency matrices anyway.
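A hypothetical sketch of such a heuristic (the helper name and threshold are illustrative, not part of the PR):

import numpy as np
from sklearn.metrics.cluster import contingency_matrix

def _auto_contingency(labels_true, labels_pred, max_dense_cells=10 ** 6):
    # Hypothetical heuristic: dense arrays are faster for small label
    # spaces; sparse avoids MemoryError when n_classes * n_clusters is huge.
    n_classes = np.unique(labels_true).shape[0]
    n_clusters = np.unique(labels_pred).shape[0]
    use_sparse = n_classes * n_clusters > max_dense_cells
    return contingency_matrix(labels_true, labels_pred, sparse=use_sparse)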

Reply (Member):

I'm benchmarking now. Will post a WIP to supersede this one.

@ogrisel (Member) commented on Sep 14, 2016:

> Putting regression output into a clustering metric is very unlikely to result in problems if a sparse matrix is used internally.

I agree. We could still issue a warning if n_samples == n_classes and the input types are float, but that should not crash.
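A sketch of what such a warning check might look like (purely illustrative; the function name and warning text are assumptions, not code from this PR):

import warnings
import numpy as np

def _warn_if_continuous(labels):
    # Warn when labels look like continuous regression output rather
    # than discrete cluster assignments: float dtype, all values unique.
    labels = np.asarray(labels)
    if labels.dtype.kind == 'f' and np.unique(labels).size == labels.size:
        warnings.warn("Clustering metrics expect discrete labels; these "
                      "values look like continuous (regression) output.")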

@jnothman (Member) commented:

Completed in #7419

@jnothman jnothman closed this Sep 14, 2016