-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
Bi-/coclustering API prevents scalability #2484
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Ping @kemaleren. |
I'll take care of it. It should not require too much work, I think. |
Hi, Correct me if im wrong, instead of implementing a 2d array each for rows_ and columns_ , they must be 1d arrays each with rows_[i] and columns_[i] containing the index of the cluster it belongs to right ? |
According to the documentation, Storing this as a sparse indicator matrix may be appropriate. |
So you mean to say the rows_ and columns_ should be converted to sparse ? |
Well, they should be stored sparsely at least, which is what "1d arrays each with rows_[i] and columns_[i] containing the index of the cluster it belongs to" would be doing anyway. Using a sparse matrix just provides an array-like interface to query "is feature i in cluster j?" or "what features are in cluster j?" or "what clusters is feature i in?". |
Yes I understood, do you have any sparse formats in mind ? |
Sure. Seeing as all the methods currently available work through |
Or choose the one that's easiest to construct. I don't know the code well enough. |
Yup right, CSR matrices, i would like to open a WIP on this one On Wed, Mar 19, 2014 at 8:03 AM, jnothman notifications@github.com wrote:
|
It doesn't have to be CSR. Two 2-d arrays of indices would suffice, I think. If these are kept in sorted order, |
Sorry, I misunderstood a docstring comment and thought that this may actually assign multiple clusters to each cluster. It's only single cluster assignment. There are already simple assignment vectors, I don't understand your code snippet (what is |
Never mind, that code is irrelevant if clusters are singly assigned to each point. We just need to store two arrays of indices. The code can easily be changed by putting in a simple transformation. Then, the implementation won't scale terribly well but at least we fixed the API to make optimizations possible. |
I note that the biclustering metric as implemented operates over the |
I just noticed that for the existing spectral biclustering implementations, Is there any reason we need to keep this indicator format? @kemaleren? |
I note also that |
@jnothman
|
I think this needs to be more explicit in the documentation, and certainly On 19 August 2014 06:48, Kemal Eren notifications@github.com wrote:
|
I'm not sure yet if this is the same problem, but I just tried running the |
Profiling with kernprof shows that cut = (X[row_complement[:, np.newaxis], cols].sum() +
X[rows[:, np.newaxis], col_complement].sum()) are taking up 92.5% of the time in this function (but I only ran the function for two iterations, which took 14min with profiling on). I must admit I don't even understand what the function is doing. From the example docs, it seems to compute a "normalized cut" of something (?) |
Since we're not sure how often biclustering algorithms are used and we might deprecate them (#9608), adding features to them would be out of scope. |
The biclustering and coclustering estimators promise to store boolean arrays
rows_
andcolumns_
, of sizen_clusters * n_samples
andn_clusters * n_features
, respectively, and convert these to indices only on demand. For use with large, sparse matrices and large numbers of clusters, these should be arrays of indices rather than boolean masks.The text was updated successfully, but these errors were encountered: