
Bi-/coclustering API prevents scalability #2484


Closed
larsmans opened this issue Sep 30, 2013 · 21 comments
Labels: Enhancement · Moderate · module:cluster · Needs Decision

Comments

@larsmans (Member)

The biclustering and coclustering estimators promise to store boolean arrays rows_ and columns_, of size n_clusters * n_samples and n_clusters * n_features, respectively, and convert these to indices only on demand. For use with large, sparse matrices and large numbers of clusters, these should be arrays of indices rather than boolean masks.
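
For concreteness, a sketch of the two representations (toy shapes; the names are illustrative, not a proposed API):

    import numpy as np

    # Current: dense boolean indicators with n_clusters * n_samples entries,
    # regardless of how sparse the memberships actually are.
    rows_bool = np.array([[True, False, True, False],
                          [False, True, False, True]])

    # Proposed: one index array per cluster; storage is proportional to the
    # number of memberships rather than to n_clusters * n_samples.
    rows_idx = [np.flatnonzero(mask) for mask in rows_bool]
    # [array([0, 2]), array([1, 3])]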

@larsmans (Member, Author)

Ping @kemaleren.

@kemaleren (Contributor)

I'll take care of it. It should not require too much work, I think.

@kaushik94 (Contributor)

Hi,

Correct me if I'm wrong: instead of storing a 2-D array each for rows_ and columns_, should they be 1-D arrays, with rows_[i] and columns_[i] containing the index of the cluster that row or column belongs to?

@jnothman (Member)

According to the documentation, SpectralBiclustering will assign each row (sample) and column (feature) to multiple clusters. So unless the number of clusters per sample and feature is fixed for a particular problem, it's difficult to store the data as an array, except as a boolean mask.

Storing this as a sparse indicator matrix may be appropriate.

@kaushik94 (Contributor)

So you mean the rows_ and columns_ should be converted to sparse matrices?

@jnothman (Member)

Well, they should be stored sparsely at least, which is what "1d arrays each with rows_[i] and columns_[i] containing the index of the cluster it belongs to" would be doing anyway. Using a sparse matrix just provides an array-like interface to query "is feature i in cluster j?" or "what features are in cluster j?" or "what clusters is feature i in?".
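
For example, with a hypothetical CSR indicator (illustrative shapes only), all three queries are one-liners:

    import numpy as np
    from scipy.sparse import csr_matrix

    # rows_[j, i] is True iff sample i belongs to bicluster j.
    rows_ = csr_matrix(np.array([[1, 0, 1],
                                 [0, 1, 0]], dtype=bool))

    rows_[1, 2]                 # is sample 2 in cluster 1?
    rows_[1].nonzero()[1]       # which samples are in cluster 1?
    rows_[:, 2].nonzero()[0]    # which clusters contain sample 2?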

@kaushik94 (Contributor)

Yes, understood. Do you have a sparse format in mind?
My feeling is that no single sparse format (like CSR or CSC) can support all three types of queries with the same efficiency; at least one of them will be slower (just my two cents). A sparse structure will definitely improve performance, though. I just want to know whether there is a particularly efficient sparse format for this.

@jnothman (Member)

Sure. Seeing as all the methods currently available work through get_indices which gets "row and column indices of the ith bicluster", that should suggest the answer.

@jnothman (Member)

Or choose the one that's easiest to construct. I don't know the code well enough.

@kaushik94 (Contributor)

Yup, right: CSR matrices. I'd like to open a WIP PR on this one.


@larsmans (Member, Author)

It doesn't have to be CSR. Two 2-d arrays of indices would suffice, I think. If these are kept in sorted order, get_indices can be done with at most four binary searches.
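
A sketch of that scheme, assuming (hypothetically) that the row indices are stored in a flat array sorted by cluster label, alongside the sorted labels:

    import numpy as np

    row_labels_ = np.array([0, 0, 1, 1, 1, 2])   # sorted cluster labels
    row_indices_ = np.array([4, 7, 0, 2, 9, 5])  # row indices, aligned with labels

    def get_row_indices(i):
        # Two binary searches locate the slice belonging to cluster i;
        # the column side needs two more, hence at most four in total.
        lo = np.searchsorted(row_labels_, i, side='left')
        hi = np.searchsorted(row_labels_, i, side='right')
        return row_indices_[lo:hi]

    get_row_indices(1)  # array([0, 2, 9])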

@jnothman (Member)

Sorry, I misunderstood a docstring comment and thought that this may actually assign each row and column to multiple clusters. It's only single cluster assignment.

There are already simple assignment vectors, {column,row}_labels_. The attributes we're talking about redundantly store an indicator matrix, so that it's slightly easier to get back indices...

I don't understand your code snippet (what is rows? where is cluster used? search column 0 of rows?), but perhaps a list of arrays will suffice... If this data must be stored redundantly, at least it should be easy to use.

@larsmans (Member, Author)

Never mind, that code is irrelevant if clusters are singly assigned to each point. We just need to store two arrays of indices.

The code can easily be changed by inserting a simple transformation. The implementation still won't scale terribly well, but at least the API would be fixed so that optimizations become possible.
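
One possible shape for that transformation (a sketch, not the actual patch): keep only the label vectors and rebuild the boolean rows_ lazily, so the documented API still works while the dense masks are no longer stored:

    import numpy as np

    def rows_from_labels(row_labels, n_clusters):
        # Broadcasts to an (n_clusters, n_samples) boolean indicator,
        # built only when someone actually asks for it.
        return row_labels == np.arange(n_clusters)[:, np.newaxis]

    rows_from_labels(np.array([2, 0, 1, 0, 2]), 3)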

@jnothman (Member)

I note that the biclustering metric as implemented operates over the rows_ and columns_ rather than {row,column}_labels_ structures. So if we no longer materialise those structures, evaluation needs to change to be over the unbinarized representation, or to handle either.

@jnothman (Member)

I just noticed that for the existing spectral biclustering implementations, rows_ and columns_ indicator matrices are not correctly documented. They each have n_col_clusters * n_row_clusters rows, not n_col_clusters or n_row_clusters alone: one is tiled, the other repeated. This seems a strange data format.
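
To illustrate (my reading of the current code, not a spec): for n_row_clusters=2 and n_col_clusters=3, the indicators enumerate all six biclusters, with the row labels repeated and the column labels tiled:

    import numpy as np

    n_row_clusters, n_col_clusters = 2, 3
    row_labels = np.array([0, 1, 0])   # hypothetical fit result
    col_labels = np.array([2, 0, 1])

    # One indicator row per (row cluster, column cluster) pair.
    rows_ = np.repeat(np.arange(n_row_clusters),
                      n_col_clusters)[:, np.newaxis] == row_labels
    columns_ = np.tile(np.arange(n_col_clusters),
                       n_row_clusters)[:, np.newaxis] == col_labels

    rows_.shape     # (6, 3), i.e. n_row_clusters * n_col_clusters rows
    columns_.shape  # (6, 3)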

Is there any reason we need to keep this indicator format? @kemaleren?

@jnothman (Member)

I note also that the row and column indicators from make_biclusters have 3 rows for n_clusters=3, while make_checkerboard with the same parameters generates 9-row indicators!

@kemaleren (Contributor)

@jnothman SpectralCoclustering and SpectralBiclustering have different bicluster semantics. In SpectralCoclustering, each row and each column are members of exactly one bicluster. In SpectralBiclustering, each row is a member of every column cluster, and vice versa, so the number of biclusters is the product of the number of row and column clusters.

make_biclusters() follows the semantics of SpectralCoclustering, so n_clusters=3 is shorthand for 3 row clusters and 3 column clusters, thus 3 biclusters total.

make_checkerboard() follows the semantics of SpectralBiclustering, so n_clusters=3 is shorthand for 9 biclusters total. You can also give it a tuple, so n_clusters=(3, 4) means 3 row clusters, 4 column clusters, and 12 biclusters.
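
For example (shapes as the semantics above imply; a quick illustration rather than doctest output):

    from sklearn.datasets import make_biclusters, make_checkerboard

    _, rows, cols = make_biclusters(shape=(30, 20), n_clusters=3)
    rows.shape  # (3, 30): 3 biclusters

    _, rows, cols = make_checkerboard(shape=(30, 20), n_clusters=(3, 4))
    rows.shape  # (12, 30): 3 * 4 = 12 biclusters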

@jnothman (Member)

I think this needs to be more explicit in the documentation, and certainly in the docstrings.


@larsmans (Member, Author) commented Sep 4, 2014

I'm not sure yet if this is the same problem, but I just tried running the bicluster_newsgroups.py example and killed it after ~fifteen minutes, when it had done 5/20 iterations of bicluster_ncut. The function itself is entirely vectorized, and it's still dead slow.

@larsmans (Member, Author) commented Sep 4, 2014

Profiling with kernprof shows that

    cut = (X[row_complement[:, np.newaxis], cols].sum() +
           X[rows[:, np.newaxis], col_complement].sum())

are taking up 92.5% of the time in this function (though I only ran it for two iterations, which took 14 minutes with profiling on).

I must admit I don't even understand what the function is doing. From the example docs, it seems to compute a "normalized cut" of something (?)
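
For what it's worth, my reading of the snippet (an assumption from the code above, not anything documented): it computes cut / weight, where weight is the total mass inside the bicluster and cut is the mass sharing its rows or columns but lying outside it. If that's right, the expensive complement indexing can be avoided entirely (a sketch, assuming X is a scipy.sparse CSR matrix and rows/cols are the bicluster's index arrays):

    def bicluster_ncut_fast(X, rows, cols):
        weight = X[rows][:, cols].sum()  # mass inside the bicluster
        # X[rows].sum() and X[:, cols].sum() each count the inside mass
        # once, so subtracting it twice leaves exactly the cut.
        cut = X[rows].sum() + X[:, cols].sum() - 2 * weight
        return cut / weight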

@adrinjalali (Member)

Since we're not sure how often biclustering algorithms are used and we might deprecate them (#9608), adding features to them would be out of scope.

@adrinjalali closed this as not planned on Apr 17, 2024.