
Bi-/coclustering API prevents scalability #2484


Closed
larsmans opened this issue Sep 30, 2013 · 21 comments
Labels: Enhancement · Moderate · module:cluster · Needs Decision

Comments

@larsmans (Member)

The biclustering and coclustering estimators promise to store boolean arrays rows_ and columns_, of size n_clusters * n_samples and n_clusters * n_features, respectively, and convert these to indices only on demand. For use with large, sparse matrices and large numbers of clusters, these should be arrays of indices rather than boolean masks.
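
For concreteness, a sketch of the two representations (toy shapes; the names are illustrative, not a proposed API):

    import numpy as np

    # Current: dense boolean indicators with n_clusters * n_samples entries,
    # regardless of how sparse the memberships actually are.
    rows_bool = np.array([[True, False, True, False],
                          [False, True, False, True]])

    # Proposed: one index array per cluster; storage is proportional to the
    # number of memberships rather than to n_clusters * n_samples.
    rows_idx = [np.flatnonzero(mask) for mask in rows_bool]
    # [array([0, 2]), array([1, 3])]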

@larsmans (Member, Author)

Ping @kemaleren.

@kemaleren (Contributor)

I'll take care of it. It should not require too much work, I think.

@kaushik94 (Contributor)

Hi,

Correct me if I'm wrong: instead of storing a 2-D array each for rows_ and columns_, should they be 1-D arrays, with rows_[i] and columns_[i] containing the index of the cluster that row or column belongs to?

@jnothman (Member)

According to the documentation, SpectralBiclustering will assign each row (sample) and column (feature) to multiple clusters. So unless the number of clusters per sample and feature is fixed for a particular problem, it's difficult to store the data as an array, except as a boolean mask.

Storing this as a sparse indicator matrix may be appropriate.

@kaushik94 (Contributor)

So you mean the rows_ and columns_ should be converted to sparse matrices?

@jnothman (Member)

Well, they should be stored sparsely at least, which is what "1d arrays each with rows_[i] and columns_[i] containing the index of the cluster it belongs to" would be doing anyway. Using a sparse matrix just provides an array-like interface to query "is feature i in cluster j?" or "what features are in cluster j?" or "what clusters is feature i in?".
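
For example, with a hypothetical CSR indicator (illustrative shapes only), all three queries are one-liners:

    import numpy as np
    from scipy.sparse import csr_matrix

    # rows_[j, i] is True iff sample i belongs to bicluster j.
    rows_ = csr_matrix(np.array([[1, 0, 1],
                                 [0, 1, 0]], dtype=bool))

    rows_[1, 2]                 # is sample 2 in cluster 1?
    rows_[1].nonzero()[1]       # which samples are in cluster 1?
    rows_[:, 2].nonzero()[0]    # which clusters contain sample 2?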

@kaushik94 (Contributor)

Yes, understood. Do you have a sparse format in mind?
My feeling is that no single sparse format (like CSR or CSC) can support all three types of queries with the same efficiency; at least one of them will be slower (just my two cents). A sparse structure will definitely improve performance, though. I just want to know whether there is a particularly efficient sparse format for this.

@jnothman (Member)

Sure. Seeing as all the methods currently available work through get_indices which gets "row and column indices of the ith bicluster", that should suggest the answer.

@jnothman (Member)

Or choose the one that's easiest to construct. I don't know the code well enough.

@kaushik94 (Contributor)

Yup, right: CSR matrices. I'd like to open a WIP PR on this one.


@larsmans (Member, Author)

It doesn't have to be CSR. Two 2-d arrays of indices would suffice, I think. If these are kept in sorted order, get_indices can be done with at most four binary searches.
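
A sketch of that scheme, assuming (hypothetically) that the row indices are stored in a flat array sorted by cluster label, alongside the sorted labels:

    import numpy as np

    row_labels_ = np.array([0, 0, 1, 1, 1, 2])   # sorted cluster labels
    row_indices_ = np.array([4, 7, 0, 2, 9, 5])  # row indices, aligned with labels

    def get_row_indices(i):
        # Two binary searches locate the slice belonging to cluster i;
        # the column side needs two more, hence at most four in total.
        lo = np.searchsorted(row_labels_, i, side='left')
        hi = np.searchsorted(row_labels_, i, side='right')
        return row_indices_[lo:hi]

    get_row_indices(1)  # array([0, 2, 9])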

@jnothman (Member)

Sorry, I misunderstood a docstring comment and thought that this may actually assign each row and column to multiple clusters. It's only single cluster assignment.

There are already simple assignment vectors, {column,row}_labels_. The attributes we're talking about redundantly store an indicator matrix, so that it's slightly easier to get back indices...

I don't understand your code snippet (what is rows? where is cluster used? search column 0 of rows?), but perhaps a list of arrays will suffice... If this data must be stored redundantly, at least it should be easy to use.

@larsmans (Member, Author)

Never mind, that code is irrelevant if clusters are singly assigned to each point. We just need to store two arrays of indices.

The code can easily be changed by inserting a simple transformation. The implementation still won't scale terribly well, but at least the API would be fixed so that optimizations become possible.
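
One possible shape for that transformation (a sketch, not the actual patch): keep only the label vectors and rebuild the boolean rows_ lazily, so the documented API still works while the dense masks are no longer stored:

    import numpy as np

    def rows_from_labels(row_labels, n_clusters):
        # Broadcasts to an (n_clusters, n_samples) boolean indicator,
        # built only when someone actually asks for it.
        return row_labels == np.arange(n_clusters)[:, np.newaxis]

    rows_from_labels(np.array([2, 0, 1, 0, 2]), 3)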

@jnothman (Member)

I note that the biclustering metric as implemented operates over the rows_ and columns_ rather than {row,column}_labels_ structures. So if we no longer materialise those structures, evaluation needs to change to be over the unbinarized representation, or to handle either.

@jnothman (Member)

I just noticed that for the existing spectral biclustering implementations, rows_ and columns_ indicator matrices are not correctly documented. They each have n_col_clusters * n_row_clusters rows, not n_col_clusters or n_row_clusters alone: one is tiled, the other repeated. This seems a strange data format.
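
To illustrate (my reading of the current code, not a spec): for n_row_clusters=2 and n_col_clusters=3, the indicators enumerate all six biclusters, with the row labels repeated and the column labels tiled:

    import numpy as np

    n_row_clusters, n_col_clusters = 2, 3
    row_labels = np.array([0, 1, 0])   # hypothetical fit result
    col_labels = np.array([2, 0, 1])

    # One indicator row per (row cluster, column cluster) pair.
    rows_ = np.repeat(np.arange(n_row_clusters),
                      n_col_clusters)[:, np.newaxis] == row_labels
    columns_ = np.tile(np.arange(n_col_clusters),
                       n_row_clusters)[:, np.newaxis] == col_labels

    rows_.shape     # (6, 3), i.e. n_row_clusters * n_col_clusters rows
    columns_.shape  # (6, 3)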

Is there any reason we need to keep this indicator format? @kemaleren?

@jnothman (Member)

I note also that the row and column indicators from make_biclusters have 3 rows for n_clusters=3, while make_checkerboard with the same parameters generates 9-row indicators!

@kemaleren (Contributor)

@jnothman SpectralCoclustering and SpectralBiclustering have different bicluster semantics. In SpectralCoclustering, each row and each column are members of exactly one bicluster. In SpectralBiclustering, each row is a member of every column cluster, and vice versa, so the number of biclusters is the product of the number of row and column clusters.

make_biclusters() follows the semantics of SpectralCoclustering, so n_clusters=3 is shorthand for 3 row clusters and 3 column clusters, thus 3 biclusters total.

make_checkerboard() follows the semantics of SpectralBiclustering, so n_clusters=3 is shorthand for 9 biclusters total. You can also give it a tuple, so n_clusters=(3, 4) means 3 row clusters, 4 column clusters, and 12 biclusters.
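
For example (shapes as the semantics above imply; a quick illustration rather than doctest output):

    from sklearn.datasets import make_biclusters, make_checkerboard

    _, rows, cols = make_biclusters(shape=(30, 20), n_clusters=3)
    rows.shape  # (3, 30): 3 biclusters

    _, rows, cols = make_checkerboard(shape=(30, 20), n_clusters=(3, 4))
    rows.shape  # (12, 30): 3 * 4 = 12 biclusters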

@jnothman (Member)

I think this needs to be more explicit in the documentation, and certainly in the docstrings.


@larsmans (Member, Author) commented Sep 4, 2014

I'm not sure yet if this is the same problem, but I just tried running the bicluster_newsgroups.py example and killed it after ~fifteen minutes, when it had done 5/20 iterations of bicluster_ncut. The function itself is entirely vectorized, and it's still dead slow.

@larsmans (Member, Author) commented Sep 4, 2014

Profiling with kernprof shows that

    cut = (X[row_complement[:, np.newaxis], cols].sum() +
           X[rows[:, np.newaxis], col_complement].sum())

are taking up 92.5% of the time in this function (though I only ran it for two iterations, which took 14 minutes with profiling on).

I must admit I don't even understand what the function is doing. From the example docs, it seems to compute a "normalized cut" of something (?)
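
For what it's worth, my reading of the snippet (an assumption from the code above, not anything documented): it computes cut / weight, where weight is the total mass inside the bicluster and cut is the mass sharing its rows or columns but lying outside it. If that's right, the expensive complement indexing can be avoided entirely (a sketch, assuming X is a scipy.sparse CSR matrix and rows/cols are the bicluster's index arrays):

    def bicluster_ncut_fast(X, rows, cols):
        weight = X[rows][:, cols].sum()  # mass inside the bicluster
        # X[rows].sum() and X[:, cols].sum() each count the inside mass
        # once, so subtracting it twice leaves exactly the cut.
        cut = X[rows].sum() + X[:, cols].sum() - 2 * weight
        return cut / weight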

@adrinjalali (Member)

Since we're not sure how often biclustering algorithms are used and we might deprecate them (#9608), adding features to them would be out of scope.

@adrinjalali closed this as not planned on Apr 17, 2024.