- 
          
- 
                Notifications
    You must be signed in to change notification settings 
- Fork 26.4k
[WIP] MAINT: Return self for fit in Spectral Biclustering and CoClustering #6141
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| Please let me know if it is a good idea to chain the returning of the  Thanks | 
| this was not caught by common tests? | 
| @agramfort there is a  | 
| then it should be updated in this PR :) | 
| I am sorry if I am missing something but I couldn't see the  | 
| Subclasses of  | 
| Thanks for the information. This means none of the non_meta estimator checks are being done for subclasses of  What is the best thing to do now ? Would it be better to just add a test to address this issue or actually look at all the tests and figure out the ones which are doable rather than skipping all of them altogether ? Will move forward based on the opinion of experts. Thanks. | 
| @TomDLT that is not the reason. The reason is that they take a distance matrix, not a data matrix, methinks. They should totally take a  | 
| Fix looks good, but we should try and expand the tests. | 
| I suppose it would be better to expand the tests, but right now none of the non_meta_estimator checks are being done for the BiClusterMixin classes. What would be the best way to move forward ? | 
| What happens if you remove the classes from  | 
| Thanks for the reply! I haven't tried that. Will do that and get back. | 
| Ah sorry was in my sleep then :) It seems that this causes the non meta estimator checks to not be performed for classes of  | 
| Hi, sorry for the delay. If common tests are allowed for these classes, the only error as of now is with regard to fit  | 
| Then there should be a  | 
| Sorry for the trivial doubt but should I modify the signature of the  | 
| Sorry for disturbing again, but could you clarify what you mean by having  | 
| def fit(self, X, y=None):
    ... | 
| @TomDLT thanks for the input. What would be better to mention in the documentation regarding the parameter  | 
| You don't need to mention it. | 
| Thanks for the clarification. Yeah, the class goes through the non-meta estimator checks after this change. Sorry didn't push the modified code. Wanted to clarify first and then push. | 
| @TomDLT, it seems that the tests are failing in travis but not on my local system. Any help regarding the reason for this behavior ? Do I need to rebase ? | 
| sorry for the delay. Can you rebase and see where the error comes from? | 
| Sure I will rebase and try to debug the error but in case anyone wants to reproduce, AFAIR they will occur when you remove the condition for the non meta estimator checks to skip bicluster modules. | 
| Hi, I have rebased on master and tried to fix some of the failing tests. In particular when the input is 1D, the svd decomposition in SpectralBiClustering discards the only feature/sample in the input. I suppose the checks weren't written to handle this case. Also in SpectralCoclustering the svd decompostion and further kmeans clustering is failing because of  | 
| We should make modifications. If  | 
| @maniteja123 are you still working on this? | 
| Hi @amueller, sorry for not completing it. Would it be okay to work on it tomorrow and implement your suggestions ? | 
dc57cf6    to
    f71f6d6      
    Compare
  
    | Hi, I have tried to get this PR to merge state but have done some changes to the code of bicluster package for the tests to pass. In particular 
 Please do have a look at it and let me know if anything else could be done here. Thanks. | 
| row_diag = np.asarray(1.0 / np.sqrt(X.sum(axis=1))).squeeze() | ||
| col_diag = np.asarray(1.0 / np.sqrt(X.sum(axis=0))).squeeze() | ||
| row_diag = np.asarray(1.0 / np.sqrt(X.sum(axis=1))) | ||
| if row_diag.shape[0] != 1: | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, can you explain this change. It's been a while since I looked at this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could run squeeze and then reshape. Or, probably better, make sure min_samples=2 in check_array
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe actually ensure n_samples >= n_clusters and n_clusters > 1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion. I suppose this will take care of this issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey. are you still working on this? I'd like to get this into 0.12.1, which I'm planning to do over the weekend.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @amueller sorry for the delay. I will really try to get this into 0.18.1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually looking into this, the failing common tests set n_clusters = 1. So if we ensure this, it might still need refactoring to the common tests. In particular, they are check_fit2d_1sample, check_fit1d_1feature and check_fit1d_1sample in estimator_checks. What would be better to do here ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah the tests seem to be odd. In particular the comment is wrong. We should allow estimators to raise the error that is raised by check_X_y with min_samples.
|  | ||
| assert_all_finite(u) | ||
| assert_all_finite(vt) | ||
| if u.shape[1] == 1 and vt.shape[0] == 1: | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, explain please? Can you also add a comment maybe? So their's only one row in X?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
checking number of samples in the beginning should fix this, too, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIU this is happening when n_components is 1 and even that is getting discarded. So I suppose we can check n_discard < n_components. Please do let me know what you think. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the algorithm make sense for n_components = 1? I should work, right? Who's the expert on these algorithms?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the number of support vectors is increased here: https://github.com/scikit-learn/scikit-learn/blob/65a4de5/sklearn/cluster/bicluster.py#L442 so it should work, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Who's the expert on these algorithms?
Well, they were implemented by @kemaleren, and it looks like @larsmans attempted some cleanup, but I've had the impression for a while that they are somewhat orphaned, given how long issues like #2484 have been around.
| # SVC | ||
| estimator.set_params(decision_function_shape='ovo') | ||
|  | ||
| if "n_best" in params: | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, makes sense. Shouldn't it be set to that by default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry could you clarify what should be set by default ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
n_best should be set to n_components by default, I think. Seems like a usability issue in the class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for clarifying. It seems right. Should I add this logic to the class then ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure will do it
| X_csr = sparse.csr_matrix(X) | ||
| y = (4 * rng.rand(40)).astype(np.int) | ||
| for sparse_format in ['csr', 'csc', 'dok', 'lil', 'coo', 'dia', 'bsr']: | ||
| if name == 'SpectralCoclustering': | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why? It should raise an appropriate error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even here, similar to above scenario this call is failing. This is the stacktrace
sklearn/cluster/bicluster.py:298: RuntimeWarning: invalid value encountered in multiply
  z = np.vstack((row_diag[:, np.newaxis] * u,
Estimator SpectralCoclustering doesn't seem to fail gracefully on sparse data: it should raise a TypeError if sparse input is explicitly not supported.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "sklearn/utils/estimator_checks.py", line 353, in check_estimator_sparse_data
    estimator.fit(X, y)
  File "sklearn/cluster/bicluster.py", line 136, in fit
    return self._fit(X)
  File "sklearn/cluster/bicluster.py", line 301, in _fit
    _, labels = self._k_means(z, self.n_clusters)
  File "sklearn/cluster/bicluster.py", line 189, in _k_means
    model.fit(data)
  File "sklearn/cluster/k_means_.py", line 880, in fit
    X = self._check_fit_data(X)
  File "sklearn/cluster/k_means_.py", line 855, in _check_fit_data
    X = check_array(X, accept_sparse='csr', dtype=[np.float64, np.float32])
  File "sklearn/utils/validation.py", line 407, in check_array
    _assert_all_finite(array)
  File "sklearn/utils/validation.py", line 58, in _assert_all_finite
    " or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well is that a consequence of the data content or the data being sparse?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems it is due to the data content rather than sparsity.
| def check_estimators_dtypes(name, Estimator): | ||
| rnd = np.random.RandomState(0) | ||
| X_train_32 = 3 * rnd.uniform(size=(20, 5)).astype(np.float32) | ||
| if name in ["BoxCoxTransformer", "SpectralCoclustering", | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The array being randomly generated contains 0s which is causing inf and nan which results in ValueError due to this. The only approach was to generate positive numbers. Is there something I am doing wrong here ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BoxCoxTransformer is not in master yet, right?
Why do zeros create inf an nan?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry yeah BoxCoxTransformer is not in master yet, but the tests were failing in my local machine when I was switching between the branches. will remove that. But in BiCluster module, the svd vectors u and v are failing finite tests when there are zeroes in X_train_32.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not very familiar with the algorithm but that seems strange to me. Have you understood why this is happening?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so that's when an entire row is zero?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's still the BoxCoxTransformer here. I guess we could leave it like this for now with a FIXME and open an issue. It's more important to me that the basic API is correct than having all the edge-cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah it is happening when an entire row is zero. But surprisingly happens only for X_train_32. Will remove boxcox transformer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah it is happening when an entire row is zero. But surprisingly happens only for X_train_32. Will remove boxcox transformer
|  | ||
| def check_estimators_overwrite_params(name, Estimator): | ||
| X, y = make_blobs(random_state=0, n_samples=9) | ||
| X, y = make_blobs(random_state=0, n_samples=9, n_features=3) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For bicluster classes, there is singular vector vt of shape [n_features, n_components] on which k_means is applied for which the default value for n_clusters is 3 which should be greater than equal to the number of samples of vt, which is n_features(which has default of 2 in make_blobs), hence changed it to 3. And the other estimators were still passing, so I hoped it would be okay.
| Thanks for the changes. It would be great if you could explain them a bit more, I haven't looked at this in a while. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @amueller , thanks for the review. I tried to give an explanation to the best of my understanding. Please do have a look at your convenience and let me know if something can be made better. Thanks.
| def check_estimators_dtypes(name, Estimator): | ||
| rnd = np.random.RandomState(0) | ||
| X_train_32 = 3 * rnd.uniform(size=(20, 5)).astype(np.float32) | ||
| if name in ["BoxCoxTransformer", "SpectralCoclustering", | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The array being randomly generated contains 0s which is causing inf and nan which results in ValueError due to this. The only approach was to generate positive numbers. Is there something I am doing wrong here ?
|  | ||
| def check_estimators_overwrite_params(name, Estimator): | ||
| X, y = make_blobs(random_state=0, n_samples=9) | ||
| X, y = make_blobs(random_state=0, n_samples=9, n_features=3) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For bicluster classes, there is singular vector vt of shape [n_features, n_components] on which k_means is applied for which the default value for n_clusters is 3 which should be greater than equal to the number of samples of vt, which is n_features(which has default of 2 in make_blobs), hence changed it to 3. And the other estimators were still passing, so I hoped it would be okay.
bf459f4    to
    978c48b      
    Compare
  
    978c48b    to
    fe4a8aa      
    Compare
  
    | On 10/27/2016 01:36 PM, Maniteja Nandana wrote: 
 | 
| I was referring to your comment here about the  | 
| @maniteja123 Could you refactor the  | 
| Sure @raghavrv , I can do that if that is okay. Sorry I don't think I can fix all that failures by 0.18.1 but will definitely do that. If anyone can do it by then, please feel free to take over. | 
| I mean I can just push the two "return self" lines to master... but without any tests... well... I guess it's better than nothing? | 
| do it On 3 November 2016 at 06:56, Andreas Mueller notifications@github.com 
 | 
| Hi, I just created a PR at gh-7814 for returning  | 
| Please rebase / merge with master. | 
Closes gh-6126
fitmethod forSpectralBiclusteringandSpectralCoclustering.Please do have a look at the changes and let me know if changes need to be done. I have also tried to address the comments. Hope they address the issues.
This change is