[WIP] MAINT: Return self for fit in Spectral Biclustering and CoClustering by maniteja123 · Pull Request #6141 · scikit-learn/scikit-learn · GitHub
Closed · wants to merge 2 commits
26 changes: 22 additions & 4 deletions sklearn/cluster/bicluster.py
@@ -35,10 +35,19 @@ def _scale_normalize(X):

"""
X = make_nonnegative(X)
row_diag = np.asarray(1.0 / np.sqrt(X.sum(axis=1))).squeeze()
col_diag = np.asarray(1.0 / np.sqrt(X.sum(axis=0))).squeeze()
row_diag = np.asarray(1.0 / np.sqrt(X.sum(axis=1)))
if row_diag.shape[0] != 1:
Member: Sorry, can you explain this change? It's been a while since I looked at this.

Member: You could run squeeze and then reshape. Or, probably better, make sure min_samples=2 in check_array.
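
A minimal sketch of the squeeze pitfall under discussion (my own illustration, not part of the diff): on a single-row X, a bare squeeze() collapses the scaling vector to a 0-d array, so the row_diag[:, np.newaxis] indexing done later in _fit fails.

import numpy as np

X = np.array([[1.0, 4.0, 9.0]])    # a single sample
row_diag = np.asarray(1.0 / np.sqrt(X.sum(axis=1))).squeeze()
print(row_diag.shape)              # () -- squeezed down to a 0-d array
try:
    row_diag[:, np.newaxis]        # what _fit effectively does later
except IndexError as exc:
    print(exc)                     # too many indices for array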

Member: Maybe actually ensure n_samples >= n_clusters and n_clusters > 1?

Contributor Author: Thanks for the suggestion. I suppose this will take care of the issue.

Member: Hey, are you still working on this? I'd like to get this into 0.18.1, which I'm planning to do over the weekend.

Contributor Author: Hi @amueller, sorry for the delay. I will really try to get this into 0.18.1.

Contributor Author: Actually, looking into this, the failing common tests set n_clusters = 1. So if we ensure this, the common tests might still need refactoring. In particular, they are check_fit2d_1sample, check_fit1d_1feature and check_fit1d_1sample in estimator_checks. What would be better to do here?

Member: Yeah, the tests seem to be odd. In particular, the comment is wrong. We should allow estimators to raise the error that is raised by check_X_y with min_samples.
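
A sketch of that suggestion (mine, assuming the validation is wired into BaseSpectral.fit): check_array's ensure_min_samples parameter turns the degenerate input into an immediate, readable ValueError instead of letting NaNs surface later.

import numpy as np
from sklearn.utils import check_array

X = np.array([[1.0, 2.0, 3.0]])    # a single sample
try:
    check_array(X, accept_sparse='csr', dtype=np.float64,
                ensure_min_samples=2)
except ValueError as exc:
    print(exc)   # Found array with 1 sample(s) ... while a minimum of 2 is required.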

row_diag = row_diag.squeeze()

col_diag = np.asarray(1.0 / np.sqrt(X.sum(axis=0)))
if col_diag.ndim == 1 and col_diag.shape[0] != 1:
col_diag = col_diag.squeeze()
if col_diag.ndim == 2 and col_diag.shape[1] != 1:
col_diag = col_diag.squeeze()

row_diag = np.where(np.isnan(row_diag), 0, row_diag)
col_diag = np.where(np.isnan(col_diag), 0, col_diag)

if issparse(X):
n_rows, n_cols = X.shape
r = dia_matrix((row_diag, [0]), shape=(n_rows, n_rows))
@@ -110,17 +119,21 @@ def _check_parameters(self):
" one of {1}.".format(self.svd_method,
legal_svd_methods))

def fit(self, X):
def fit(self, X, y=None):
"""Creates a biclustering for X.

Parameters
----------
X : array-like, shape (n_samples, n_features)

Returns
-------
self : object
Returns the instance itself.
"""
X = check_array(X, accept_sparse='csr', dtype=np.float64)
self._check_parameters()
self._fit(X)
return self._fit(X)
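
A usage sketch of what this change buys (my own example, not part of the diff): once fit returns self, biclustering estimators support the chained style used elsewhere in scikit-learn.

import numpy as np
from sklearn.cluster import SpectralCoclustering

rng = np.random.RandomState(0)
X = rng.rand(30, 10) + 1.0   # strictly positive data
# With this patch applied, attributes can be read off the chained call:
labels = SpectralCoclustering(n_clusters=3, random_state=0).fit(X).row_labels_
print(labels.shape)          # (30,)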

def _svd(self, array, n_components, n_discard):
"""Returns first `n_components` left and right singular
@@ -156,6 +169,8 @@ def _svd(self, array, n_components, n_discard):

assert_all_finite(u)
assert_all_finite(vt)
if u.shape[1] == 1 and vt.shape[0] == 1:
Member (@amueller, Sep 16, 2016): Again, can you explain, please? Can you also add a comment, maybe? So there's only one row in X?

Member: Checking the number of samples at the beginning should fix this too, right?

Contributor Author: AFAIU this is happening when n_components is 1 and even that one component is getting discarded. So I suppose we can check n_discard < n_components. Please do let me know what you think. Thanks.
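
A sketch of that check (clip_n_discard is a hypothetical helper, not code from this PR):

def clip_n_discard(n_components, n_discard):
    # Never discard every computed singular vector.
    if n_discard >= n_components:
        n_discard = n_components - 1
    return n_discard

assert clip_n_discard(n_components=1, n_discard=1) == 0
assert clip_n_discard(n_components=3, n_discard=1) == 1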

Member: Does the algorithm make sense for n_components = 1? It should work, right? Who's the expert on these algorithms?

Member: It looks like the number of singular vectors is increased here: https://github.com/scikit-learn/scikit-learn/blob/65a4de5/sklearn/cluster/bicluster.py#L442 so it should work, right?

Member:

> Who's the expert on these algorithms?

Well, they were implemented by @kemaleren, and it looks like @larsmans attempted some cleanup, but I've had the impression for a while that they are somewhat orphaned, given how long issues like #2484 have been around.
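
To make the failing case concrete (my own worked note, using the n_sv formula visible in the _fit hunk below): for the common-test setting n_clusters = 1, the single computed singular vector is the one that gets discarded, which is exactly what the new n_discard = 0 guard handles.

import numpy as np

for n_clusters in (1, 2, 3):
    n_sv = 1 + int(np.ceil(np.log2(n_clusters)))
    # components requested, then components left after n_discard=1:
    print(n_clusters, n_sv, n_sv - 1)   # -> 1 1 0 / 2 2 1 / 3 3 2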

n_discard = 0
u = u[:, n_discard:]
vt = vt[n_discard:]
return u, vt.T
@@ -278,6 +293,7 @@ def _fit(self, X):
normalized_data, row_diag, col_diag = _scale_normalize(X)
n_sv = 1 + int(np.ceil(np.log2(self.n_clusters)))
u, v = self._svd(normalized_data, n_sv, n_discard=1)

z = np.vstack((row_diag[:, np.newaxis] * u,
col_diag[:, np.newaxis] * v))

@@ -291,6 +307,7 @@
for c in range(self.n_clusters))
self.columns_ = np.vstack(self.column_labels_ == c
for c in range(self.n_clusters))
return self


class SpectralBiclustering(BaseSpectral):
@@ -475,6 +492,7 @@ def _fit(self, X):
self.columns_ = np.vstack(self.column_labels_ == label
for _ in range(n_row_clusters)
for label in range(n_col_clusters))
return self

def _fit_best_piecewise(self, vectors, n_best, n_clusters):
"""Find the ``n_best`` vectors that are best approximated by piecewise
4 changes: 1 addition & 3 deletions sklearn/tests/test_common.py
@@ -23,7 +23,6 @@
from sklearn.utils.testing import _named_check

import sklearn
from sklearn.cluster.bicluster import BiclusterMixin

from sklearn.linear_model.base import LinearClassifierMixin
from sklearn.utils.estimator_checks import (
@@ -63,8 +62,6 @@ def test_non_meta_estimators():
# input validation etc for non-meta estimators
estimators = all_estimators()
for name, Estimator in estimators:
if issubclass(Estimator, BiclusterMixin):
continue
if name.startswith("_"):
continue
for check in _yield_all_checks(name, Estimator):
@@ -214,6 +211,7 @@ def test_transformer_n_iter():
check_transformer_n_iter, name), name, estimator



def test_get_params_invariance():
# Test for estimators that support get_params, that
# get_params(deep=False) is a subset of get_params(deep=True)
14 changes: 11 additions & 3 deletions sklearn/utils/estimator_checks.py
@@ -288,7 +288,9 @@ def set_testing_parameters(estimator):
if "decision_function_shape" in params:
# SVC
estimator.set_params(decision_function_shape='ovo')

if "n_best" in params:
Member: Why is this needed?

Member: Ah, makes sense. Shouldn't it be set to that by default?

Contributor Author: Sorry, could you clarify what should be set by default?

Member: n_best should be set to n_components by default, I think. Seems like a usability issue in the class.
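
One way the suggested default could look (SketchedBicluster is a hypothetical sketch, not the real class): treat n_best=None as "use n_components".

class SketchedBicluster:
    def __init__(self, n_components=6, n_best=None):
        self.n_components = n_components
        self.n_best = n_best   # None means "use n_components"

    def _effective_n_best(self):
        return self.n_components if self.n_best is None else self.n_best

assert SketchedBicluster()._effective_n_best() == 6
assert SketchedBicluster(n_best=3)._effective_n_best() == 3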

Contributor Author: Thanks for clarifying. It seems right. Should I add this logic to the class, then?

Member: I think so.

Contributor Author: Sure, will do.

# BiCluster
estimator.set_params(n_best=1)
if estimator.__class__.__name__ == "SelectFdr":
# be tolerant of noisy datasets (not actually speed)
estimator.set_params(alpha=.5)
@@ -335,6 +337,8 @@ def check_estimator_sparse_data(name, Estimator):
X_csr = sparse.csr_matrix(X)
y = (4 * rng.rand(40)).astype(np.int)
for sparse_format in ['csr', 'csc', 'dok', 'lil', 'coo', 'dia', 'bsr']:
if name == 'SpectralCoclustering':
Member: Why? It should raise an appropriate error.

Contributor Author: Even here, similar to the scenario above, this call is failing. This is the stack trace:

sklearn/cluster/bicluster.py:298: RuntimeWarning: invalid value encountered in multiply
  z = np.vstack((row_diag[:, np.newaxis] * u,
Estimator SpectralCoclustering doesn't seem to fail gracefully on sparse data: it should raise a TypeError if sparse input is explicitly not supported.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "sklearn/utils/estimator_checks.py", line 353, in check_estimator_sparse_data
    estimator.fit(X, y)
  File "sklearn/cluster/bicluster.py", line 136, in fit
    return self._fit(X)
  File "sklearn/cluster/bicluster.py", line 301, in _fit
    _, labels = self._k_means(z, self.n_clusters)
  File "sklearn/cluster/bicluster.py", line 189, in _k_means
    model.fit(data)
  File "sklearn/cluster/k_means_.py", line 880, in fit
    X = self._check_fit_data(X)
  File "sklearn/cluster/k_means_.py", line 855, in _check_fit_data
    X = check_array(X, accept_sparse='csr', dtype=[np.float64, np.float32])
  File "sklearn/utils/validation.py", line 407, in check_array
    _assert_all_finite(array)
  File "sklearn/utils/validation.py", line 58, in _assert_all_finite
    " or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Member: Well, is that a consequence of the data content or of the data being sparse?

Contributor Author: It seems it is due to the data content rather than sparsity.
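
The arithmetic behind that (a minimal illustration of my own): a zero-sum row makes its scaling factor infinite, the NaN-only cleanup in _scale_normalize leaves inf in place, and inf times the row's zero entries yields the NaN that k-means later rejects.

import numpy as np

row_sums = np.array([5.0, 0.0, 3.0])     # second row of X is all zeros
with np.errstate(divide='ignore'):
    row_diag = 1.0 / np.sqrt(row_sums)   # [0.447..., inf, 0.577...]
print(np.isnan(row_diag))                # [False False False] -- inf is not NaN
print(row_diag[1] * 0.0)                 # nan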

continue
X = X_csr.asformat(sparse_format)
# catch deprecation warnings
with ignore_warnings(category=DeprecationWarning):
@@ -684,7 +688,11 @@ def check_fit_score_takes_y(name, Estimator):
@ignore_warnings
def check_estimators_dtypes(name, Estimator):
rnd = np.random.RandomState(0)
X_train_32 = 3 * rnd.uniform(size=(20, 5)).astype(np.float32)
if name in ["BoxCoxTransformer", "SpectralCoclustering",
Member: Why?

Contributor Author: The randomly generated array contains 0s, which causes inf and nan and results in a ValueError due to this. The only approach was to generate positive numbers. Is there something I am doing wrong here?

Member: BoxCoxTransformer is not in master yet, right? Why do zeros create inf and nan?

Contributor Author: Sorry, yeah, BoxCoxTransformer is not in master yet, but the tests were failing on my local machine when I was switching between branches; I will remove that. But in the BiCluster module, the SVD vectors u and v fail the finiteness tests when there are zeros in X_train_32.

Member: I'm not very familiar with the algorithm, but that seems strange to me. Have you understood why this is happening?

Contributor Author: I am not familiar with the algorithm either, but when the current code is executed, normalized_data here has NaN and fails the finiteness check during this call.

Member: So that's when an entire row is zero?

Member: There's still the BoxCoxTransformer here. I guess we could leave it like this for now with a FIXME and open an issue. It's more important to me that the basic API is correct than having all the edge cases.

Contributor Author: Yeah, it is happening when an entire row is zero, but surprisingly it happens only for X_train_32. I will remove the BoxCoxTransformer.

"SpectralBiclustering"]:
X_train_32 = 3 * rnd.uniform(1., 2., size=(20, 5)).astype(np.float32)
else:
X_train_32 = 3 * rnd.uniform(size=(20, 5)).astype(np.float32)
X_train_64 = X_train_32.astype(np.float64)
X_train_int_64 = X_train_32.astype(np.int64)
X_train_int_32 = X_train_32.astype(np.int32)
@@ -1309,7 +1317,7 @@ def check_class_weight_balanced_linear_classifier(name, Classifier):

@ignore_warnings(category=DeprecationWarning)
def check_estimators_overwrite_params(name, Estimator):
X, y = make_blobs(random_state=0, n_samples=9)
X, y = make_blobs(random_state=0, n_samples=9, n_features=3)
Member: Why?

Contributor Author: For the bicluster classes, k-means is applied to a singular-vector matrix vt of shape [n_features, n_components], and k-means requires its number of samples, here n_features, to be at least n_clusters (default 3). make_blobs defaults to n_features=2, hence I changed it to 3. The other estimators were still passing, so I hoped it would be okay.
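
The constraint in miniature (my own illustration, not from the PR): k-means refuses to fit when it is given fewer samples than clusters, and here the "samples" are the n_features rows of vt.

import numpy as np
from sklearn.cluster import KMeans

vt = np.random.RandomState(0).randn(2, 4)   # n_features=2 rows as "samples"
try:
    KMeans(n_clusters=3, n_init=10).fit(vt)
except ValueError as exc:
    print(exc)   # roughly: n_samples=2 should be >= n_clusters=3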

y = multioutput_estimator_convert_y_2d(name, y)
# some want non-negative input
X -= X.min()