Use sparse contingency matrix for supervised cluster metrics #7203
Conversation
Remove max_n_classes option
We can't just drop max_n_classes; it needs a deprecation cycle.
@jnothman Ok, now both are in
```diff
@@ -46,7 +48,7 @@ def check_clusterings(labels_true, labels_pred):
     return labels_true, labels_pred


-def contingency_matrix(labels_true, labels_pred, eps=None, max_n_classes=5000):
+def contingency_matrix(labels_true, labels_pred, eps=None, max_n_classes=5000, sparse=False):
```
pep8 max line length = 79
I reckon we should just always use sparse. Corollaries:
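(Aside: a minimal sketch of how such a sparse contingency matrix can be built; the helper name is hypothetical and this is not the PR's exact code.)

```python
import numpy as np
from scipy import sparse as sp

def sparse_contingency(labels_true, labels_pred):
    # Hypothetical helper, not the PR's exact code: rows index true
    # classes, columns index predicted clusters.
    classes, class_idx = np.unique(labels_true, return_inverse=True)
    clusters, cluster_idx = np.unique(labels_pred, return_inverse=True)
    data = np.ones(class_idx.shape[0], dtype=np.int64)
    contingency = sp.coo_matrix((data, (class_idx, cluster_idx)),
                                shape=(len(classes), len(clusters)))
    # tocsr() sums duplicate (i, j) entries, turning the all-ones COO
    # data into per-cell counts (see the coo_matrix discussion below).
    return contingency.tocsr()
```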
```python
# For a sparse matrix
sum_comb_c = sum(comb2(n_c) for n_c in np.array(contingency.sum(axis=1)))
sum_comb_k = sum(comb2(n_k) for n_k in np.array(contingency.sum(axis=0)).T)
sum_comb = sum(comb2(n_ij) for n_ij in find(contingency)[2])
```
If it's a _data_matrix, surely contingency.data.flatten() will do without the more expensive find operation?
If it's a coo_matrix, data contains only 1s:
In [121]: C.toarray()
Out[121]:
array([[5, 1, 0],
       [1, 4, 1],
       [0, 2, 3]])
In [122]: C.data
Out[122]: array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
I can do:
In [123]: C.tocsr().data
Out[123]: array([5, 1, 1, 4, 1, 2, 3], dtype=int64)
?
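(For reference, these three pair-counting sums feed the standard adjusted Rand index formula; the functions below are an illustrative sketch, not the PR's code.)

```python
from scipy.special import comb

def comb2(n):
    # Number of unordered pairs among n items: n * (n - 1) / 2.
    return comb(n, 2, exact=True)

def ari_from_sums(sum_comb, sum_comb_c, sum_comb_k, n_samples):
    # Standard pair-counting form of the adjusted Rand index, computed
    # from the cell, row and column sums gathered above.
    expected_index = sum_comb_c * sum_comb_k / comb2(n_samples)
    max_index = (sum_comb_c + sum_comb_k) / 2.0
    return (sum_comb - expected_index) / (max_index - expected_index)
```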
Thanks for the PEP8 fixes, @stuppie. What do you think of my always sparse suggestion? With the current version, it should be straightforward to run a few benchmarks to assess whether sparse is slower for a small number of clusters.
@jnothman I think this would be fine except that
@jnothman Here's a rough comparison: https://gist.github.com/stuppie/4d45459df3be477c5e46535aeb5c9b7e. I also tried this with a small contingency matrix.
Yes, the point is to check the small number of clusters case. I don't have enough time for reviewing all the issues I have open, though I'd like to see this merged for 0.18, in particular if we can deprecate the max_n_classes parameter.
I thought max_n_classes was not a great design in the first place.
I don't think it's a great design either. I just wanted to make sure that we can catch the problem it was intended to solve (IIRC). Passing a float is bad, but crashing the computer because of RAM issues (I guess that will be less of an issue with sparse matrices) is also bad.
Putting regression output into a clustering metric is very unlikely to result in problems if a sparse matrix is used internally.
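(To put assumed numbers on that: continuous labels make every value its own class, but the sparse representation keeps one nonzero per sample. The sizes below are illustrative, and sparse_contingency is the hypothetical helper sketched above.)

```python
import numpy as np

rng = np.random.RandomState(0)
y_true = rng.rand(100000)   # regression-like output: ~100k distinct values
y_pred = rng.rand(100000)

# A dense contingency matrix would need 100000 x 100000 int64 cells,
# roughly 80 GB; the sparse one stores only ~100000 nonzeros.
C = sparse_contingency(y_true, y_pred)
print(C.shape, C.nnz)       # (100000, 100000) with 100000 nonzeros
```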
```python
# Compute contingency matrix if we weren't given it
if contingency is None:
    contingency = contingency_matrix(labels_true, labels_pred,
                                     max_n_classes=max_n_classes)
```
This still builds a dense contingency matrix by default. Maybe we should have an internal heuristic: if n_classes is small we use a dense array (because it's faster to compute), and if it's larger than some threshold we switch to sparse.
Or we could decide to always build sparse contingency matrices internally to keep the code simpler and expect that a difference in speed will never exceed a couple of seconds for small to medium inputs while very large input will only be addressable via sparse contingency matrices anyway.
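(A sketch of the first option; the threshold value is made up for illustration, and the sparse flag is the one added in this PR's diff.)

```python
import numpy as np
from sklearn.metrics.cluster import contingency_matrix

def _auto_contingency(labels_true, labels_pred, dense_threshold=1000):
    # Hypothetical heuristic, not in the PR: dense is faster to build
    # for small label sets, sparse is the only option for huge ones.
    n_classes = np.unique(labels_true).shape[0]
    n_clusters = np.unique(labels_pred).shape[0]
    use_sparse = max(n_classes, n_clusters) > dense_threshold
    return contingency_matrix(labels_true, labels_pred, sparse=use_sparse)
```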
I'm benchmarking now. Will post a WIP to supersede this one.
I agree. We could still issue a warning if the number of classes is very large.
Completed in #7419.
See #4788. Replaces max_n_classes in sklearn.metrics.cluster.supervised.

With large numbers of clusters (approx. >100k), construction of a dense contingency matrix runs out of memory (throws MemoryError). Using a sparse matrix instead allows the contingency matrix to be constructed.

Modified adjusted_rand_score, homogeneity_completeness_v_measure and mutual_info_score to accept a sparse contingency matrix.
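(A usage sketch under the API from this PR's diff; mutual_info_score's contingency parameter is existing API, sparse=True is the new flag, and the sizes are illustrative.)

```python
import numpy as np
from sklearn.metrics.cluster import contingency_matrix, mutual_info_score

rng = np.random.RandomState(0)
labels_true = rng.randint(0, 200000, size=1000000)
labels_pred = rng.randint(0, 200000, size=1000000)

# A dense 200k x 200k contingency matrix would need ~320 GB as int64;
# the sparse one stores at most one nonzero cell per sample.
C = contingency_matrix(labels_true, labels_pred, sparse=True)
mi = mutual_info_score(labels_true, labels_pred, contingency=C)
```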