test_extract_xi failing on nightly builds #13739
Some points are being labelled as outliers where they shouldn't be:
I can't really diagnose this one without a Mac machine, which I don't have. But this is the test which took us a while to get right on the PR, and it only passed after the predecessor correction was finally fixed. Now it is not failing because of that (the predecessor correction is doing its job according to the log), but it may be some numeric issue or something, hmm...
Seems that the test is not robust (i.e., the test will fail if we change the random_state), e.g.,
So you mean that it is not invariant to permuting the data? How does "predecessor correction" relate to this order invariance? Should we be testing in a separate test that the result is order-invariant most of the time, and somehow make this one a more stable test of predecessor correction?
Also, the problem is different between your example and the one that's failing here. The failing case is identifying several inliers as outliers.
The OPTICS plot is essentially a linearized version of a spanning tree. This linearization is not unique; it is easy to imagine that any merged subtrees can be swapped. But since the extraction is based on this linearization, some order dependence is to be expected, in particular for border points and outliers. I'd suggest fixing the permutation, not the random generator, i.e., dump the permutation produced by the shuffle and hard-code it. Then the test will no longer depend on the random generator.
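A minimal sketch of that suggestion, assuming the test's shuffle step (the toy arrays below are only placeholders for the test's X and expected_labels):

    import numpy as np

    # Placeholder data standing in for the test's X / expected_labels.
    X = np.arange(10, dtype=float).reshape(5, 2)
    expected_labels = np.array([0, 0, 1, 1, -1])

    # One-off: dump the permutation the current random_state produces...
    rng = np.random.RandomState(0)
    perm = rng.permutation(len(X))
    print(perm.tolist())  # paste this list literal into the test

    # ...then hard-code it so the test no longer depends on the random generator.
    fixed_perm = np.array(perm)  # in the test this would be the pasted literal
    X_shuffled = X[fixed_perm]
    labels_shuffled = expected_labels[fixed_perm]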
The shuffling order cannot be the reason this test is failing on some platforms, since numpy's random number generator should be more or less cross-platform (when generating small ints, certainly). Numeric issues are a more likely culprit. But without access to a machine on that platform, it's hard to work out where they may come from.
You are right, it shouldn't be the shuffling order, as the labels mostly agree; if the shuffling were different, they would be completely mixed up. But the points are well separated and in a well-behaved numerical range. I guess someone would need to look at the actual reachabilities and core distances for the points on the two platforms, and then at which points are considered steep by the xi method.
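A minimal sketch of that kind of comparison, using illustrative data and parameters rather than the test's exact setup; running it on both platforms and diffing the output would localise where the numerics diverge:

    import numpy as np
    from sklearn.cluster import OPTICS

    # Toy data; for the real investigation this would be the exact X
    # from test_extract_xi.
    rng = np.random.RandomState(0)
    X = np.vstack([rng.randn(50, 2), [10, 10] + rng.randn(50, 2)])

    clust = OPTICS(min_samples=5, cluster_method='xi', xi=0.1).fit(X)

    # Dump the quantities the xi extraction works from, in cluster order,
    # at full precision.
    np.set_printoptions(precision=17)
    print(clust.ordering_)
    print(clust.reachability_[clust.ordering_])
    print(clust.core_distances_[clust.ordering_])
    print(clust.predecessor_[clust.ordering_])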
I put a comment above that test saying that it may fail if the predecessor correction is not at work. What I meant by that was: if predecessor correction is not done right, the test fails to find the outliers as outliers, but only under certain permutations of the data.
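A hedged illustration of that failure mode (toy data rather than the test's, and only indicative parameters):

    import numpy as np
    from sklearn.cluster import OPTICS

    # Two blobs plus two far-away points that should come out as noise (-1).
    rng = np.random.RandomState(0)
    X = np.vstack([[-5, -2] + .8 * rng.randn(50, 2),
                   [4, -1] + .1 * rng.randn(50, 2),
                   [[100, 100], [200, 200]]])

    for correction in (True, False):
        labels = OPTICS(min_samples=5, cluster_method='xi', xi=0.3,
                        predecessor_correction=correction).fit_predict(X)
        # Without the correction, the two distant points may get absorbed into
        # a cluster for some orderings of the data; with it they should stay -1.
        print(correction, labels[-2:])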
+1
When generating the data, the points around the edges of a cluster sometimes "look"/"feel" more sparse. That sometimes results in OPTICS detecting a cluster in the middle of that cluster plus another "parent" cluster which includes those edge points; in that case the edge points are labeled -1, since there's a smaller cluster detected by OPTICS inside the larger one. But if we look at the cluster hierarchy, they are indeed part of a cluster and not detected as "outliers". That said, the cluster hierarchies are still sensitive to the random seed, to a somewhat extreme extent. For instance, the following code (the hyperparameters are somewhat tuned):

    from sklearn.cluster import OPTICS
    from sklearn.utils import shuffle
    import numpy as np
    from sklearn.metrics import v_measure_score

    def is_stable(min_samples, min_cluster_size, xi):
        min_score, max_score = 1, 0
        outliers_detected = True
        for seed in range(20):
            rng = np.random.RandomState(seed)
            n_points_per_cluster = 50
            C1 = [-5, -2] + .8 * rng.randn(n_points_per_cluster, 2)
            C2 = [4, -1] + .1 * rng.randn(n_points_per_cluster, 2)
            C3 = [1, -2] + .2 * rng.randn(n_points_per_cluster, 2)
            C4 = [-2, 3] + .3 * rng.randn(n_points_per_cluster, 2)
            C5 = [3, -2] + .6 * rng.randn(n_points_per_cluster, 2)
            C6 = [5, 6] + .2 * rng.randn(n_points_per_cluster, 2)
            X = np.vstack((C1, C2, C3, C4, C5, np.array([[100, 100]] * 2), C6))
            expected_labels = np.r_[[1] * 50, [3] * 50, [2] * 50, [0] * 50,
                                    [2] * 50, -1, -1, [4] * 50]
            X, expected_labels = shuffle(X, expected_labels, random_state=rng)
            clust = OPTICS(min_samples=min_samples,
                           min_cluster_size=min_cluster_size,
                           max_eps=np.inf, cluster_method='xi',
                           xi=xi, predecessor_correction=True).fit(X)
            score = v_measure_score(expected_labels, clust.labels_)
            min_score = min(min_score, score)
            max_score = max(max_score, score)
            if np.any(clust.labels_[expected_labels == -1] != -1):
                outliers_detected = False
        return min_score, max_score, outliers_detected

    for min_samples in [5, 10]:
        print(min_samples)
        for min_cluster_size in [20]:
            for xi in [0.2, 0.25, 0.3, 0.35, 0.4]:
                min_score, max_score, outliers_detected = is_stable(
                    min_samples, min_cluster_size, xi)
                print(min_samples, min_cluster_size, xi, min_score,
                      max_score, outliers_detected)

Running this, the good thing is that the outliers are always detected correctly. However, change the hyperparameters a bit and that no longer holds, i.e., there are always some circumstances under which the outliers are not detected as outliers. We can change the test to check for a high v-measure instead.
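A hedged sketch of what such a relaxed assertion could look like, reusing the is_stable helper above (the 0.9 threshold is purely illustrative, not a tuned value):

    # Relaxed check: the two injected outliers must be labeled -1, and the
    # v-measure must stay high, instead of requiring an exact label match.
    min_score, max_score, outliers_detected = is_stable(
        min_samples=10, min_cluster_size=20, xi=0.3)
    assert outliers_detected
    assert min_score > 0.9  # illustrative threshold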
Well, I don't think you'd get a high v-measure in the case of the current failure. Checking that the outliers are correctly detected (and that there are non-outlier clusters) might be a reasonable test of predecessor correction, but not of other things.

The hypothesis that it is due to a small cluster being detected spuriously on this platform seems reasonable. Can you see candidate troughs in the reachability plot that are on the cusp of being a cluster? Can we hack the data to make it less platform-sensitive?
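For anyone wanting to look for those troughs, a minimal sketch of the reachability plot (data and parameters are illustrative, not the test's exact setup):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.cluster import OPTICS

    rng = np.random.RandomState(0)
    X = np.vstack([[-5, -2] + .8 * rng.randn(50, 2),
                   [4, -1] + .1 * rng.randn(50, 2),
                   [1, -2] + .2 * rng.randn(50, 2)])

    clust = OPTICS(min_samples=10, cluster_method='xi', xi=0.3).fit(X)

    # Reachability profile in cluster order; shallow extra troughs here are
    # the candidates for spuriously detected sub-clusters.
    reach = clust.reachability_[clust.ordering_]
    plt.plot(np.where(np.isinf(reach), np.nan, reach))
    plt.xlabel("ordering index")
    plt.ylabel("reachability distance")
    plt.show()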
The v-measure (like all the other standard measures, unfortunately) does not say much about quality on data with a logical cluster hierarchy and with outliers, because these measures understand neither nested clusters nor outliers encoded in the result; they assume they are evaluating a flat and complete clustering.
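A small illustration of that limitation, assuming noise is encoded as -1 as scikit-learn does:

    import numpy as np
    from sklearn.metrics import v_measure_score

    # v_measure_score has no notion of noise or hierarchy: the -1 label is
    # treated as just another flat cluster, so the score drops noticeably
    # below 1 even though the grouping of the non-noise points is identical.
    truth = np.array([0, 0, 0, 1, 1, 1, -1, -1])
    pred = np.array([0, 0, 0, 1, 1, 1, 0, 1])  # the two noise points absorbed
    print(v_measure_score(truth, pred))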
Should we be disabling this test temporarily to get the release out?
Yes, because whatever we do now is a workaround for the issue anyway. I'd rather disable it, open an issue for it, and then fix it later.
I agree that it's reasonable to get different results when we shuffle the dataset in a different way.
Let's put the skip in the 0.21.X branch for now. I'm branching.
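A minimal sketch of such a temporary skip (the reason string is illustrative; the real change would go on the failing test in scikit-learn's test suite):

    import pytest

    # Hypothetical temporary marker until the platform-dependent behaviour of
    # the xi extraction is understood; see issue #13739.
    @pytest.mark.skip(reason="test_extract_xi is unstable on some platforms")
    def test_extract_xi():
        ...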
Hmm I have another question related to outliers @adrinjalali
Seems that ELKI is using another version, but similar to
@qinhanmin2014 So you're saying there's a typo in the original paper and we're implementing the typo?
I think so, not sure whether I'm wrong. |
@adrinjalali, I got misled by your comment above. The failure is on a Linux i686.
🤦♂️ sorry, I was misled by the MacPython in the build URL.
Yeah, I find it a really hard cognitive dissonance to deal with too!
@qinhanmin2014 the R version is a port of an older ELKI version, so it is likely to reproduce any error the ELKI implementation had at that time... This is obviously a typo in the paper: since the steep-up region is monotone, the minimum with < would always be sU, the start of the steep-up region. With a > this is consistent with the first case.
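A tiny numeric illustration of that argument (the reachability values are made up; sU marks the start of a steep-up region):

    # Within a steep-up region the reachabilities are non-decreasing, so the
    # index minimising r is always sU itself; a condition written with "<"
    # therefore degenerates, which is why ">" (a maximum) must be intended.
    r = [0.2, 0.35, 0.5, 0.9]  # made-up reachabilities over [sU, eU]
    sU = 0
    assert min(range(len(r)), key=lambda i: r[i]) == sU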
Thanks, we'll correct it. |
test_extract_xi failing for MB_PYTHON_VERSION=3.6 PLAT=i686 and MB_PYTHON_VERSION=3.7 PLAT=i686:
https://travis-ci.org/MacPython/scikit-learn-wheels/jobs/525429182
https://travis-ci.org/MacPython/scikit-learn-wheels/jobs/525429184