[WIP] MAINT: Return self for fit in Spectral Biclustering and CoClustering by maniteja123 · Pull Request #6141 · scikit-learn/scikit-learn

[WIP] MAINT: Return self for fit in Spectral Biclustering and CoClustering #6141


Closed
wants to merge 2 commits into from

Conversation

@maniteja123 (Contributor) commented Jan 8, 2016

Closes gh-6126

  • Fix fit method for SpectralBiclustering and SpectralCoclustering.
  • Make them pass the common tests
  • Minimal changes to the code to fix some test failures (they might not be completely the right solution, but the bicluster tests are still passing)

Please have a look at the changes and let me know if anything more needs to be done. I have also tried to address the review comments; I hope the changes resolve the issues.



@maniteja123 (Contributor Author)

Please let me know whether it is a good idea to chain calls by returning the self object.

Thanks
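For reference, a minimal sketch of the `return self` convention under discussion (the class name is hypothetical, not the real scikit-learn estimator):

```python
class ToyBiclusterer:
    """Illustrative stand-in for an estimator; not the actual sklearn class."""

    def fit(self, X):
        # A real estimator would compute biclusters here; we just record
        # the number of samples seen.
        self.n_samples_ = len(X)
        return self  # returning self is what enables chained calls

# chaining: construct, fit, and read a fitted attribute in one expression
rows_seen = ToyBiclusterer().fit([[1.0, 2.0], [3.0, 4.0]]).n_samples_
```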

@agramfort (Member)

this was not caught by common tests?

@MechCoder (Member)

@agramfort there is a DONT_TEST list ;)

@agramfort (Member)

then it should be updated in this PR :)

@maniteja123 (Contributor Author)

I am sorry if I am missing something, but I couldn't see the BaseSpectral class or any of its subclasses in the DONT_TEST list in testing.py. It would also help if someone could point me to the common test which checks for this.
Thanks.

@TomDLT (Member) commented Jan 13, 2016

Subclasses of BiclusterMixin are not tested in the common tests, since they use fit(self, X) instead of fit(self, X, y). (see the estimator check)

@maniteja123 (Contributor Author)

Thanks for the information. This means none of the non-meta estimator checks are being run for subclasses of BiclusterMixin. I understand that one of the reasons is the signature of the `fit` method, as mentioned by @TomDLT above.

What is the best thing to do now? Would it be better to just add a test to address this issue, or to go through all the common tests and figure out which ones are applicable, rather than skipping all of them altogether?

Will move forward based on the opinion of experts. Thanks.

@amueller (Member)

@TomDLT that is not the reason. The reason is that they take a distance matrix, not a data matrix, methinks. They should totally take a y. I'm not sure how to improve upon this.
Maybe we can "just" remove them from the DONT_TEST? I wasn't clear on the requirements on X in biclustering when I expanded the common tests, I think.

@amueller (Member)

Fix looks good, but we should try and expand the tests.

@maniteja123 (Contributor Author)

I suppose it would be better to expand the tests, but right now none of the non-meta estimator checks are being run for the BiclusterMixin classes. What would be the best way to move forward?
Thanks

@amueller (Member)

What happens if you remove the classes from DONT_TEST?

@maniteja123 (Contributor Author)

Thanks for the reply! I haven't tried that. Will do that and get back.

@maniteja123 (Contributor Author)

Ah sorry, I was asleep then :) It seems that even with the classes removed from DONT_TEST, the non-meta estimator checks are still not performed for the BiclusterMixin classes.

@maniteja123 (Contributor Author)

Hi, sorry for the delay. If the common tests are enabled for these classes, the only error as of now concerns fit:

TypeError: fit() takes exactly 2 arguments (3 given)

@amueller (Member) commented Feb 9, 2016

Then there should be a y=None in the fit method.

@maniteja123 (Contributor Author)

Sorry for the trivial doubt, but should I modify the signature of the fit method in the bicluster classes?

@maniteja123 (Contributor Author)

Sorry for disturbing again, but could you clarify what you mean by having y=None? Thanks.

@TomDLT (Member) commented Feb 23, 2016
def fit(self, X, y=None):
    ...
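Fleshing out that signature as a hedged sketch (the class name is hypothetical): the common checks call fit(X, y), so the method must tolerate a second argument even though it ignores it:

```python
import numpy as np

class CoclusterSketch:
    """Hypothetical estimator; y is accepted purely for API compatibility."""

    def fit(self, X, y=None):
        # y is ignored; it exists so generic checks can call fit(X, y).
        X = np.asarray(X, dtype=float)
        self.row_means_ = X.mean(axis=1)
        return self

# works both with and without a y argument
model = CoclusterSketch().fit([[1.0, 3.0], [2.0, 4.0]], y=[0, 1])
```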

@maniteja123 (Contributor Author)

@TomDLT thanks for the input. What would be best to mention in the documentation regarding the parameter y? Thanks!

@TomDLT (Member) commented Feb 23, 2016

You don't need to mention it.
Does the class go through the non-meta estimator tests when you add y=None?

@maniteja123 (Contributor Author)

Thanks for the clarification. Yes, the class goes through the non-meta estimator checks after this change. Sorry, I didn't push the modified code; I wanted to clarify first and then push.

@maniteja123 (Contributor Author)

@TomDLT, it seems that the tests are failing on Travis but not on my local system. Any idea what might cause this behavior? Do I need to rebase?

@amueller (Member)

sorry for the delay. Can you rebase and see where the error comes from?

@maniteja123 (Contributor Author)

Sure, I will rebase and try to debug the error, but in case anyone wants to reproduce it: AFAIR the failures occur when you remove the condition that makes the non-meta estimator checks skip the bicluster modules.

@maniteja123 (Contributor Author)

Hi, I have rebased on master and tried to fix some of the failing tests. In particular, when the input is 1D, the SVD decomposition in SpectralBiclustering discards the only feature/sample in the input; I suppose the checks weren't written to handle this case. Also, in SpectralCoclustering the SVD decomposition and the subsequent k-means clustering fail because of nan and inf in the decomposed vectors. It would be really helpful if someone could look into this and let me know whether we should skip these tests or make some modifications to the biclustering algorithms. Thanks!

@amueller (Member)

We should make modifications. If n_samples < n_row_clusters or n_features < n_column_clusters we should raise an error. That can be done in check_array with min_samples and min_features maybe (or otherwise to give a more explicit error message).
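A minimal sketch of the suggested validation (the helper name is hypothetical; note that in scikit-learn the corresponding check_array keyword arguments are actually spelled ensure_min_samples / ensure_min_features):

```python
import numpy as np

def validate_bicluster_input(X, n_row_clusters, n_column_clusters):
    """Hypothetical helper: reject inputs too small for the requested grid."""
    X = np.asarray(X)
    if X.shape[0] < n_row_clusters:
        raise ValueError("Found n_samples=%d, but n_row_clusters=%d requires "
                         "at least that many samples."
                         % (X.shape[0], n_row_clusters))
    if X.shape[1] < n_column_clusters:
        raise ValueError("Found n_features=%d, but n_column_clusters=%d "
                         "requires at least that many features."
                         % (X.shape[1], n_column_clusters))
    return X
```

The explicit messages are the point: the bare NaN/inf errors seen later in the thread give the user no hint that the input was simply too small.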

@amueller (Member)

@maniteja123 are you still working on this?

@maniteja123 (Contributor Author)

Hi @amueller, sorry for not completing it. Would it be okay if I work on it tomorrow and implement your suggestions?

@maniteja123 maniteja123 changed the title MAINT: Return self for fit in Spectral Biclustering and CoClustering [MRG] MAINT: Return self for fit in Spectral Biclustering and CoClustering Sep 15, 2016
@maniteja123 (Contributor Author) commented Sep 15, 2016

Hi, I have tried to get this PR to a mergeable state, but I had to make some changes to the code of the bicluster package for the tests to pass. In particular:

  1. The tests were failing mostly in cases of either 1 feature or 1 sample. I couldn't come up with a better solution than explicitly checking the dimensions.
  2. In the sparse case, the SpectralBiclustering class was failing due to nan and inf in the data, which trips the finiteness check. I have skipped it for now, but I suppose sparse input should be handled.

Please do have a look at it and let me know if anything else could be done here. Thanks.

row_diag = np.asarray(1.0 / np.sqrt(X.sum(axis=1))).squeeze()
col_diag = np.asarray(1.0 / np.sqrt(X.sum(axis=0))).squeeze()
row_diag = np.asarray(1.0 / np.sqrt(X.sum(axis=1)))
if row_diag.shape[0] != 1:
Member

Sorry, can you explain this change. It's been a while since I looked at this.

Member

You could run squeeze and then reshape. Or, probably better, make sure min_samples=2 in check_array

Member

Maybe actually ensure n_samples >= n_clusters and n_clusters > 1?

Contributor Author

Thanks for the suggestion. I suppose this will take care of this issue.

Member

Hey, are you still working on this? I'd like to get this into 0.18.1, which I'm planning to do over the weekend.

Contributor Author

Hi @amueller sorry for the delay. I will really try to get this into 0.18.1.

Contributor Author

Actually, looking into this: the failing common tests set n_clusters = 1, so if we enforce this constraint, it might still require refactoring of the common tests. In particular, they are check_fit2d_1sample, check_fit1d_1feature and check_fit1d_1sample in estimator_checks. What would be the best thing to do here?

Member

Yeah the tests seem to be odd. In particular the comment is wrong. We should allow estimators to raise the error that is raised by check_X_y with min_samples.

@@ -156,6 +169,8 @@ def _svd(self, array, n_components, n_discard):

assert_all_finite(u)
assert_all_finite(vt)
if u.shape[1] == 1 and vt.shape[0] == 1:
@amueller (Member) commented Sep 16, 2016

Again, can you explain please? Can you also add a comment maybe? So there's only one row in X?

Member

checking number of samples in the beginning should fix this, too, right?

Contributor Author

AFAIU this is happening when n_components is 1 and even that is getting discarded. So I suppose we can check n_discard < n_components. Please do let me know what you think. Thanks.
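The proposed guard could look like this (function name hypothetical; the real _svd also computes the randomized or exact SVD first):

```python
import numpy as np

def keep_components(u, vt, n_components, n_discard):
    """Hypothetical guard: discarding every component leaves nothing to cluster."""
    if n_discard >= n_components:
        raise ValueError("n_discard=%d would discard all %d components."
                         % (n_discard, n_components))
    # Drop the first n_discard singular vectors, as the existing code does.
    return u[:, n_discard:], vt[n_discard:]

u = np.ones((4, 2))   # shape (n_samples, n_components)
vt = np.ones((2, 5))  # shape (n_components, n_features)
u_kept, vt_kept = keep_components(u, vt, n_components=2, n_discard=1)
```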

Member

Does the algorithm make sense for n_components = 1? It should work, right? Who's the expert on these algorithms?

Member

It looks like the number of singular vectors is increased here: https://github.com/scikit-learn/scikit-learn/blob/65a4de5/sklearn/cluster/bicluster.py#L442 so it should work, right?

Member

Who's the expert on these algorithms?

Well, they were implemented by @kemaleren, and it looks like @larsmans attempted some cleanup, but I've had the impression for a while that they are somewhat orphaned, given how long issues like #2484 have been around.

@@ -293,7 +293,9 @@ def set_testing_parameters(estimator):
if "decision_function_shape" in params:
# SVC
estimator.set_params(decision_function_shape='ovo')

if "n_best" in params:
Member

why is this needed?

Member

ah, makes sense. Shouldn't it be set to that by default?

Contributor Author

Sorry, could you clarify what should be set by default?

Member

n_best should be set to n_components by default, I think. Seems like a usability issue in the class.
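A sketch of that default (class name hypothetical; per scikit-learn convention, __init__ stores parameters untouched and any default is resolved at fit time so get_params/set_params stay faithful):

```python
class BiclusterSketch:
    """Hypothetical estimator showing n_best defaulting to n_components."""

    def __init__(self, n_components=6, n_best=None):
        self.n_components = n_components
        self.n_best = n_best  # stored as-is; resolved in fit

    def fit(self, X, y=None):
        # Resolve the default lazily rather than mutating a parameter.
        self.n_best_ = self.n_components if self.n_best is None else self.n_best
        return self

best = BiclusterSketch(n_components=3).fit([[1.0, 2.0]]).n_best_
```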

Contributor Author

Thanks for clarifying. That seems right. Should I add this logic to the class then?

Member

I think so.

Contributor Author

Sure will do it

@@ -340,6 +342,8 @@ def check_estimator_sparse_data(name, Estimator):
X_csr = sparse.csr_matrix(X)
y = (4 * rng.rand(40)).astype(np.int)
for sparse_format in ['csr', 'csc', 'dok', 'lil', 'coo', 'dia', 'bsr']:
if name == 'SpectralCoclustering':
Member

Why? It should raise an appropriate error.

Contributor Author

Here too, similar to the scenario above, this call is failing. This is the stack trace:

sklearn/cluster/bicluster.py:298: RuntimeWarning: invalid value encountered in multiply
  z = np.vstack((row_diag[:, np.newaxis] * u,
Estimator SpectralCoclustering doesn't seem to fail gracefully on sparse data: it should raise a TypeError if sparse input is explicitly not supported.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "sklearn/utils/estimator_checks.py", line 353, in check_estimator_sparse_data
    estimator.fit(X, y)
  File "sklearn/cluster/bicluster.py", line 136, in fit
    return self._fit(X)
  File "sklearn/cluster/bicluster.py", line 301, in _fit
    _, labels = self._k_means(z, self.n_clusters)
  File "sklearn/cluster/bicluster.py", line 189, in _k_means
    model.fit(data)
  File "sklearn/cluster/k_means_.py", line 880, in fit
    X = self._check_fit_data(X)
  File "sklearn/cluster/k_means_.py", line 855, in _check_fit_data
    X = check_array(X, accept_sparse='csr', dtype=[np.float64, np.float32])
  File "sklearn/utils/validation.py", line 407, in check_array
    _assert_all_finite(array)
  File "sklearn/utils/validation.py", line 58, in _assert_all_finite
    " or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Member

Well is that a consequence of the data content or the data being sparse?

Contributor Author

It seems it is due to the data content rather than sparsity.

@@ -687,7 +691,11 @@ def check_fit_score_takes_y(name, Estimator):
@ignore_warnings
def check_estimators_dtypes(name, Estimator):
rnd = np.random.RandomState(0)
X_train_32 = 3 * rnd.uniform(size=(20, 5)).astype(np.float32)
if name in ["BoxCoxTransformer", "SpectralCoclustering",
Member

why?

Contributor Author

The randomly generated array contains zeros, which produce inf and nan and result in a ValueError due to this. The only approach I found was to generate strictly positive numbers. Is there something I am doing wrong here?

Member

BoxCoxTransformer is not in master yet, right?
Why do zeros create inf and nan?

Contributor Author

Sorry, yeah, BoxCoxTransformer is not in master yet, but the tests were failing on my local machine when I was switching between branches; I will remove that. But in the bicluster module, the SVD vectors u and v fail the finiteness checks when there are zeros in X_train_32.

Member

I'm not very familiar with the algorithm but that seems strange to me. Have you understood why this is happening?

Contributor Author

I am not familiar with the algorithm either, but when the current code is executed, normalized_data here has NaN values and fails the finiteness check during this call.

Member

so that's when an entire row is zero?

Member

There's still the BoxCoxTransformer here. I guess we could leave it like this for now with a FIXME and open an issue. It's more important to me that the basic API is correct than having all the edge-cases.

Contributor Author

Yeah, it happens when an entire row is zero, but surprisingly only for X_train_32. I will remove the BoxCoxTransformer.
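A small numeric illustration of the failure mode (shapes chosen arbitrarily): an all-zero row makes the row sum zero, so the 1/sqrt(row_sum) scaling in the normalization overflows to inf, and downstream products then turn into nan:

```python
import numpy as np

X = np.array([[0.0, 0.0, 0.0],   # an all-zero row
              [1.0, 4.0, 4.0]])
with np.errstate(divide="ignore"):
    row_diag = 1.0 / np.sqrt(X.sum(axis=1))  # first entry overflows to inf
```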


@@ -1318,7 +1326,7 @@ def check_class_weight_balanced_linear_classifier(name, Classifier):


def check_estimators_overwrite_params(name, Estimator):
X, y = make_blobs(random_state=0, n_samples=9)
X, y = make_blobs(random_state=0, n_samples=9, n_features=3)
Member

why?

Contributor Author

For the bicluster classes, k-means is applied to the singular vectors vt, whose rows number n_features. k-means requires that number of samples to be at least the default n_clusters of 3, but make_blobs defaults to n_features=2, hence I changed it to 3. The other estimators were still passing, so I hoped it would be okay.
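The constraint being worked around, in miniature (numbers illustrative): k-means cannot form more clusters than it has points, and here the points number n_features:

```python
# With make_blobs' default n_features=2, the rows fed to k-means number 2,
# which cannot support the default n_clusters=3; raising n_features to 3 can.
n_clusters = 3
n_features_default = 2
n_features_patched = 3
default_ok = n_features_default >= n_clusters   # too few points for 3 clusters
patched_ok = n_features_patched >= n_clusters   # enough points
```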

@amueller (Member)

Thanks for the changes. It would be great if you could explain them a bit more; I haven't looked at this in a while.

@maniteja123 (Contributor Author) left a comment

Hi @amueller, thanks for the review. I tried to give an explanation to the best of my understanding. Please have a look at your convenience and let me know if something can be made better. Thanks.


@amueller (Member) commented Oct 27, 2016

On 10/27/2016 01:36 PM, Maniteja Nandana wrote:

Sorry could you clarify what should be set by default ?

Sorry I can't find where this comment was.

@maniteja123 (Contributor Author)

I was referring to your comment here about the n_best parameter for bicluster classes.

@raghavrv (Member) commented Nov 1, 2016

@maniteja123 Could you refactor the return self part out of this PR into a separate PR so we can merge it in time for 0.18.1... Fixing the testing setup can be done for 0.19... if @amueller is okay with that...

@maniteja123 (Contributor Author)

Sure @raghavrv, I can do that if that is okay. Sorry, I don't think I can fix all the failures by 0.18.1, but I will definitely do that. If anyone can do it by then, please feel free to take over.

@amueller (Member) commented Nov 2, 2016

I mean I can just push the two "return self" lines to master... but without any tests... well... I guess it's better than nothing?

@jnothman (Member) commented Nov 2, 2016

do it


@maniteja123 (Contributor Author)

Hi, I just created a PR at gh-7814 for returning self but not touching the tests. Please let me know if it is fine for the purpose of 0.18.1. Thanks.

@raghavrv raghavrv modified the milestones: 0.19, 0.18.1 Nov 6, 2016
@raghavrv raghavrv changed the title [MRG] MAINT: Return self for fit in Spectral Biclustering and CoClustering [WIP] MAINT: Return self for fit in Spectral Biclustering and CoClustering Nov 6, 2016
@jnothman (Member)

Please rebase / merge with master.

Successfully merging this pull request may close these issues:

SpectralBiclustering.fit doesn't return self

7 participants