[WIP] MAINT: Return self for fit in Spectral Biclustering and CoClustering #6141
Changes to `sklearn/cluster/bicluster.py`:

```diff
@@ -35,10 +35,19 @@ def _scale_normalize(X):
     """
     X = make_nonnegative(X)
-    row_diag = np.asarray(1.0 / np.sqrt(X.sum(axis=1))).squeeze()
-    col_diag = np.asarray(1.0 / np.sqrt(X.sum(axis=0))).squeeze()
+    row_diag = np.asarray(1.0 / np.sqrt(X.sum(axis=1)))
+    if row_diag.shape[0] != 1:
+        row_diag = row_diag.squeeze()
+
+    col_diag = np.asarray(1.0 / np.sqrt(X.sum(axis=0)))
+    if col_diag.ndim == 1 and col_diag.shape[0] != 1:
+        col_diag = col_diag.squeeze()
+    if col_diag.ndim == 2 and col_diag.shape[1] != 1:
+        col_diag = col_diag.squeeze()
+
     row_diag = np.where(np.isnan(row_diag), 0, row_diag)
     col_diag = np.where(np.isnan(col_diag), 0, col_diag)

     if issparse(X):
         n_rows, n_cols = X.shape
         r = dia_matrix((row_diag, [0]), shape=(n_rows, n_rows))
```
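The conditional `squeeze()` guards against the single-row case, where an unconditional `squeeze()` collapses the scaling vector to a 0-d array. A minimal sketch of the pitfall in plain NumPy (independent of scikit-learn):

```python
import numpy as np

# A matrix with a single row: summing over axis 1 yields one value.
X = np.array([[1.0, 4.0, 4.0]])

row_diag = np.asarray(1.0 / np.sqrt(X.sum(axis=1)))   # shape (1,)

# An unconditional squeeze() collapses the length-1 axis to a 0-d array,
# which later breaks indexing such as row_diag[:, np.newaxis].
assert row_diag.squeeze().shape == ()

# The patched code only squeezes when the first axis is longer than 1,
# so a single-row input keeps its 1-d shape.
if row_diag.shape[0] != 1:
    row_diag = row_diag.squeeze()
assert row_diag.shape == (1,)
```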
```diff
@@ -110,17 +119,21 @@ def _check_parameters(self):
                              " one of {1}.".format(self.svd_method,
                                                    legal_svd_methods))

-    def fit(self, X):
+    def fit(self, X, y=None):
         """Creates a biclustering for X.

         Parameters
         ----------
         X : array-like, shape (n_samples, n_features)

+        Returns
+        -------
+        self : object
+            Returns the instance itself.
         """
         X = check_array(X, accept_sparse='csr', dtype=np.float64)
         self._check_parameters()
-        self._fit(X)
+        return self._fit(X)

     def _svd(self, array, n_components, n_discard):
         """Returns first `n_components` left and right singular
```
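Returning `self` from `fit` is what enables the usual scikit-learn chaining idiom, and accepting an unused `y` keeps the signature uniform across estimators. A minimal sketch with a hypothetical estimator (not the actual `BaseSpectral` class):

```python
import numpy as np

class ToyEstimator:
    """Hypothetical estimator illustrating the fit-returns-self convention."""

    def fit(self, X, y=None):
        # An unused ``y`` keeps the signature uniform with supervised
        # estimators, so generic test code can always call fit(X, y).
        self.mean_ = np.asarray(X).mean(axis=0)
        return self  # enables chaining, e.g. ToyEstimator().fit(X).mean_

mean = ToyEstimator().fit([[1.0, 2.0], [3.0, 4.0]]).mean_
# mean holds the per-column means [2., 3.]
```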
```diff
@@ -156,6 +169,8 @@ def _svd(self, array, n_components, n_discard):

         assert_all_finite(u)
         assert_all_finite(vt)
+        if u.shape[1] == 1 and vt.shape[0] == 1:
+            n_discard = 0
         u = u[:, n_discard:]
         vt = vt[n_discard:]
         return u, vt.T
```

Review discussion on this hunk:

> Again, explain please? Can you also add a comment maybe? So there's only one row in X?

> Checking the number of samples in the beginning should fix this, too, right?

> AFAIU this is happening when …

> Does the algorithm make sense for …

> It looks like the number of support vectors is increased here: https://github.com/scikit-learn/scikit-learn/blob/65a4de5/sklearn/cluster/bicluster.py#L442 so it should work, right?

> Well, they were implemented by @kemaleren, and it looks like @larsmans attempted some cleanup, but I've had the impression for a while that they are somewhat orphaned, given how long issues like #2484 have been around.
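The new guard handles the degenerate case where only one singular vector pair exists, so discarding the first would leave empty factors. A plain-NumPy sketch of that situation (an illustration, not the PR's `_svd` implementation):

```python
import numpy as np

# A single sample: the reduced SVD yields exactly one singular vector pair,
# so u has one column and vt has one row.
X = np.array([[1.0, 2.0, 3.0]])
u, s, vt = np.linalg.svd(X, full_matrices=False)
assert u.shape[1] == 1 and vt.shape[0] == 1

n_discard = 1
if u.shape[1] == 1 and vt.shape[0] == 1:
    n_discard = 0            # keep the only vector we have

u = u[:, n_discard:]
vt = vt[n_discard:]
assert u.shape[1] == 1 and vt.shape[0] == 1   # nothing was thrown away
```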
```diff
@@ -278,6 +293,7 @@ def _fit(self, X):
         normalized_data, row_diag, col_diag = _scale_normalize(X)
         n_sv = 1 + int(np.ceil(np.log2(self.n_clusters)))
         u, v = self._svd(normalized_data, n_sv, n_discard=1)
+
         z = np.vstack((row_diag[:, np.newaxis] * u,
                        col_diag[:, np.newaxis] * v))
```
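The number of singular vectors requested grows logarithmically with the number of clusters, per the `n_sv = 1 + int(np.ceil(np.log2(self.n_clusters)))` line in the hunk above. A quick standalone check of that expression:

```python
import numpy as np

def n_singular_vectors(n_clusters):
    # Mirrors the expression used in _fit: one extra vector on top of
    # ceil(log2(n_clusters)), since the first vector is later discarded.
    return 1 + int(np.ceil(np.log2(n_clusters)))

assert n_singular_vectors(2) == 2
assert n_singular_vectors(3) == 3    # log2(3) ~ 1.585 -> ceil = 2
assert n_singular_vectors(4) == 3
assert n_singular_vectors(8) == 4
# n_clusters=1 requests a single vector, which is the degenerate
# u.shape[1] == 1 case discussed in the review thread above.
assert n_singular_vectors(1) == 1
```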
```diff
@@ -291,6 +307,7 @@ def _fit(self, X):
                                for c in range(self.n_clusters))
         self.columns_ = np.vstack(self.column_labels_ == c
                                   for c in range(self.n_clusters))
+        return self


 class SpectralBiclustering(BaseSpectral):

@@ -475,6 +492,7 @@ def _fit(self, X):
         self.columns_ = np.vstack(self.column_labels_ == label
                                   for _ in range(n_row_clusters)
                                   for label in range(n_col_clusters))
+        return self

     def _fit_best_piecewise(self, vectors, n_best, n_clusters):
         """Find the ``n_best`` vectors that are best approximated by piecewise
```
Changes to the common estimator checks (`estimator_checks`):
```diff
@@ -288,7 +288,9 @@ def set_testing_parameters(estimator):
     if "decision_function_shape" in params:
         # SVC
         estimator.set_params(decision_function_shape='ovo')
+    if "n_best" in params:
+        # BiCluster
+        estimator.set_params(n_best=1)
     if estimator.__class__.__name__ == "SelectFdr":
         # be tolerant of noisy datasets (not actually speed)
         estimator.set_params(alpha=.5)
```

Review discussion on this hunk:

> Why is this needed?

> Ah, makes sense. Shouldn't it be set to that by default?

> Sorry, could you clarify what should be set by default?

> Thanks for clarifying. It seems right. Should I add this logic to the class then?

> I think so.

> Sure, will do it.
```diff
@@ -335,6 +337,8 @@ def check_estimator_sparse_data(name, Estimator):
     X_csr = sparse.csr_matrix(X)
     y = (4 * rng.rand(40)).astype(np.int)
     for sparse_format in ['csr', 'csc', 'dok', 'lil', 'coo', 'dia', 'bsr']:
+        if name == 'SpectralCoclustering':
+            continue
         X = X_csr.asformat(sparse_format)
         # catch deprecation warnings
         with ignore_warnings(category=DeprecationWarning):
```

Review discussion on this hunk:

> Why? It should raise an appropriate error.

> Even here, similar to the above scenario, this call is failing. This is the stacktrace: …

> Well, is that a consequence of the data content or the data being sparse?

> It seems it is due to the data content rather than sparsity.
```diff
@@ -684,7 +688,11 @@ def check_fit_score_takes_y(name, Estimator):

 @ignore_warnings
 def check_estimators_dtypes(name, Estimator):
     rnd = np.random.RandomState(0)
-    X_train_32 = 3 * rnd.uniform(size=(20, 5)).astype(np.float32)
+    if name in ["BoxCoxTransformer", "SpectralCoclustering",
+                "SpectralBiclustering"]:
+        X_train_32 = 3 * rnd.uniform(1., 2., size=(20, 5)).astype(np.float32)
+    else:
+        X_train_32 = 3 * rnd.uniform(size=(20, 5)).astype(np.float32)
     X_train_64 = X_train_32.astype(np.float64)
     X_train_int_64 = X_train_32.astype(np.int64)
     X_train_int_32 = X_train_32.astype(np.int32)
```

Review discussion on this hunk:

> Why?

> The array being randomly generated contains 0s which is causing …

> BoxCoxTransformer is not in master yet, right?

> Sorry, yeah, BoxCoxTransformer is not in master yet, but the tests were failing on my local machine when I was switching between the branches. Will remove that. But in the BiCluster module, the svd vectors …

> I'm not very familiar with the algorithm but that seems strange to me. Have you understood why this is happening?

> So that's when an entire row is zero?

> There's still the …

> Yeah, it is happening when an entire row is zero. But surprisingly it happens only for X_train_32. Will remove BoxCoxTransformer.
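The workaround above samples from `uniform(1., 2.)` so that no row of the test matrix sums to zero. A small demonstration of why an all-zero row is fatal to the scale-normalization step:

```python
import numpy as np

X = np.array([[0.0, 0.0],     # an all-zero row ...
              [1.0, 3.0]])

with np.errstate(divide='ignore'):
    row_diag = 1.0 / np.sqrt(X.sum(axis=1))

# ... yields an infinite scaling factor, which a downstream finiteness
# check (e.g. assert_all_finite) rejects.
assert np.isinf(row_diag[0])
assert np.isfinite(row_diag[1])
```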
```diff
@@ -1309,7 +1317,7 @@ def check_class_weight_balanced_linear_classifier(name, Classifier):

 @ignore_warnings(category=DeprecationWarning)
 def check_estimators_overwrite_params(name, Estimator):
-    X, y = make_blobs(random_state=0, n_samples=9)
+    X, y = make_blobs(random_state=0, n_samples=9, n_features=3)
     y = multioutput_estimator_convert_y_2d(name, y)
     # some want non-negative input
     X -= X.min()
```

Review discussion on this hunk:

> Why?

> For bicluster classes, there is singular vector …
General review discussion:

> Sorry, can you explain this change? It's been a while since I looked at this.

> You could run squeeze and then reshape. Or, probably better, make sure `min_samples=2` in `check_array`.

> Maybe actually ensure `n_samples >= n_clusters` and `n_clusters > 1`?

> Thanks for the suggestion. I suppose this will take care of this issue.

> Hey, are you still working on this? I'd like to get this into 0.12.1, which I'm planning to do over the weekend.

> Hi @amueller, sorry for the delay. I will really try to get this into 0.18.1.

> Actually, looking into this, the failing common tests set `n_clusters = 1`. So if we ensure this, it might still need refactoring of the common tests. In particular, they are `check_fit2d_1sample`, `check_fit1d_1feature` and `check_fit1d_1sample` in `estimator_checks`. What would be better to do here?

> Yeah, the tests seem to be odd. In particular, the comment is wrong. We should allow estimators to raise the error that is raised by `check_X_y` with `min_samples`.
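One reviewer suggestion above was to squeeze and then reshape, rather than squeezing conditionally. A sketch of that alternative (an illustration of the idea, not code from this PR; the helper name `as_1d` is hypothetical):

```python
import numpy as np

def as_1d(diag):
    # squeeze() may return a 0-d array for length-1 input;
    # reshape(-1) restores a 1-d vector in every case.
    return np.asarray(diag).squeeze().reshape(-1)

assert as_1d(np.ones((1, 1))).shape == (1,)   # 1x1 matrix -> length-1 vector
assert as_1d(np.ones((1, 4))).shape == (4,)   # row vector -> 1-d
assert as_1d(np.ones(3)).shape == (3,)        # already 1-d, unchanged
```

This removes the branching on `ndim` and `shape` entirely, at the cost of one extra reshape.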