test_extract_xi failing on nightly builds · Issue #13739 · scikit-learn/scikit-learn

Closed · thomasjpfan opened this issue Apr 28, 2019 · 25 comments · Fixed by #14201

@thomasjpfan (Member)

test_extract_xi is failing for MB_PYTHON_VERSION=3.6 PLAT=i686 and MB_PYTHON_VERSION=3.7 PLAT=i686:

https://travis-ci.org/MacPython/scikit-learn-wheels/jobs/525429182
https://travis-ci.org/MacPython/scikit-learn-wheels/jobs/525429184

@jnothman (Member)

Some points are being labelled as outliers where they shouldn't be:

FAILURES ===================================
_______________________________ test_extract_xi ________________________________
    def test_extract_xi():
        # small and easy test (no clusters around other clusters)
        # but with a clear noise data.
        rng = np.random.RandomState(0)
        n_points_per_cluster = 5
    
        C1 = [-5, -2] + .8 * rng.randn(n_points_per_cluster, 2)
        C2 = [4, -1] + .1 * rng.randn(n_points_per_cluster, 2)
        C3 = [1, -2] + .2 * rng.randn(n_points_per_cluster, 2)
        C4 = [-2, 3] + .3 * rng.randn(n_points_per_cluster, 2)
        C5 = [3, -2] + .6 * rng.randn(n_points_per_cluster, 2)
        C6 = [5, 6] + .2 * rng.randn(n_points_per_cluster, 2)
    
        X = np.vstack((C1, C2, C3, C4, C5, np.array([[100, 100]]), C6))
        expected_labels = np.r_[[2] * 5, [0] * 5, [1] * 5, [3] * 5, [1] * 5,
                                -1, [4] * 5]
        X, expected_labels = shuffle(X, expected_labels, random_state=rng)
    
        clust = OPTICS(min_samples=3, min_cluster_size=2,
                       max_eps=np.inf, cluster_method='xi',
                       xi=0.4).fit(X)
        assert_array_equal(clust.labels_, expected_labels)
    
        X = np.vstack((C1, C2, C3, C4, C5, np.array([[100, 100]] * 2), C6))
        expected_labels = np.r_[[1] * 5, [3] * 5, [2] * 5, [0] * 5, [2] * 5,
                                -1, -1, [4] * 5]
        X, expected_labels = shuffle(X, expected_labels, random_state=rng)
    
        clust = OPTICS(min_samples=3, min_cluster_size=3,
                       max_eps=np.inf, cluster_method='xi',
                       xi=0.1).fit(X)
        # this may fail if the predecessor correction is not at work!
>       assert_array_equal(clust.labels_, expected_labels)
E       AssertionError: 
E       Arrays are not equal
E       
E       (mismatch 21.875%)
E        x: array([ 0,  0, -1, -1,  1, -1,  3,  2,  0,  3,  3, -1,  1,  1, -1,  2, -1,
E               4,  0, -1,  4,  0,  4,  2, -1,  1,  1,  4,  2,  3,  4, -1])
E        y: array([ 0,  0,  2,  2,  1,  3,  3,  2,  0,  3,  3,  2,  1,  1,  2,  2, -1,
E               4,  0,  2,  4,  0,  4,  2, -1,  1,  1,  4,  2,  3,  4,  2])
C1         = array([[-3.58875812, -1.67987423],
       [-4.21700961, -0.20728544],
       [-3.50595361, -2.7818223 ],
       [-4.23992927, -2.12108577],
       [-5.08257508, -1.6715212 ]])
C2         = array([[ 4.01440436, -0.85457265],
       [ 4.07610377, -0.9878325 ],
       [ 4.04438632, -0.96663257],
       [ 4.14940791, -1.02051583],
       [ 4.03130677, -1.08540957]])
C3         = array([[ 0.48940204, -1.86927628],
       [ 1.17288724, -2.148433  ],
       [ 1.45395092, -2.29087313],
       [ 1.0091517 , -2.03743677],
       [ 1.30655584, -1.70612825]])
C4         = array([[-1.95351577,  3.11344876],
       [-2.26633572,  2.40576106],
       [-2.10437364,  3.04690469],
       [-1.6309128 ,  3.36071395],
       [-2.11619805,  2.90930917]])
C5         = array([[ 2.37086822, -2.85201076],
       [ 1.97623789, -0.82953476],
       [ 2.69420869, -2.26284458],
       [ 2.24832278, -1.53350579],
       [ 2.03166129, -2.12764417]])
C6         = array([[ 4.82090669,  6.0773805 ],
       [ 4.89783897,  5.76387356],
       [ 4.99436355,  6.08566637],
       [ 5.01330344,  6.06049438],
       [ 4.87313558,  5.92745177]])
X          = array([[  -2.10437364,    3.04690469],
       [  -1.95351577,    3.11344876],
       [   2.24832278,   -1.53350579],
 ...,
       [   4.14940791,   -1.02051583],
       [   4.89783897,    5.76387356],
       [   2.03166129,   -2.12764417]])
clust      = OPTICS(algorithm='auto', cluster_method='xi', eps=None, leaf_size=30,
       max_eps=inf, metric='minkowski', metric_params=None, min_cluster_size=3,
       min_samples=3, n_jobs=None, p=2, predecessor_correction=True, xi=0.1)
expected_labels = array([ 0,  0,  2,  2,  1,  3,  3,  2,  0,  3,  3,  2,  1,  1,  2,  2, -1,
        4,  0,  2,  4,  0,  4,  2, -1,  1,  1,  4,  2,  3,  4,  2])
n_points_per_cluster = 5
rng        = <mtrand.RandomState object at 0xe9f3234c>

Ping @adrinjalali @qinhanmin2014 @kno10

@adrinjalali (Member)

I can't really diagnose this one w/o a Mac machine, which I don't have. But this is the test that took us a while to get right on the PR, and it only passed once the predecessor correction was fixed. Now it is not failing because of that (the predecessor correction is doing its job according to the log), but it may be some numeric issue or something, hmm...

@qinhanmin2014 (Member)

It seems that the test is not robust (i.e., it will fail if we change the random_state), e.g.:

import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.RandomState(0)
n_points_per_cluster = 5
C1 = [-5, -2] + .8 * rng.randn(n_points_per_cluster, 2)
C2 = [4, -1] + .1 * rng.randn(n_points_per_cluster, 2)
C3 = [1, -2] + .2 * rng.randn(n_points_per_cluster, 2)
C4 = [-2, 3] + .3 * rng.randn(n_points_per_cluster, 2)
C5 = [3, -2] + .6 * rng.randn(n_points_per_cluster, 2)
C6 = [5, 6] + .2 * rng.randn(n_points_per_cluster, 2)
X = np.vstack((C1, C2, C3, C4, C5, np.array([[100, 100]] * 2), C6))
expected_labels = np.r_[[1] * 5, [3] * 5, [2] * 5, [0] * 5, [2] * 5,
                        -1, -1, [4] * 5]
clust = OPTICS(min_samples=3, min_cluster_size=3,
               max_eps=np.inf, cluster_method='xi',
               xi=0.1).fit(X)
print(clust.labels_)
# [ 0  0  0  0  0 -1  3  3  3  3  2  2  2  2  2  1  1  1 -1  1  2  2  2  2
#   2  4  4  4  4  4  4  4]

@jnothman (Member)

So you mean that it is not invariant to permuting the data? How does "predecessor correction" relate to this order invariance? Should we be testing in a separate test that the result is order invariant most of the time, and somehow make this one a more stable test of predecessor correction?
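
A rough sketch of what such a separate order-invariance check could look like (the helper name, the 0.95 agreement threshold, and the hyperparameters here are made up for illustration; it compares labellings across permutations with v_measure_score, which is invariant to relabelling):

import numpy as np
from sklearn.cluster import OPTICS
from sklearn.metrics import v_measure_score
from sklearn.utils import shuffle

def check_order_invariance(X, n_permutations=10, min_agreement=0.95):
    # Fit OPTICS on the original ordering as the reference labelling.
    reference = OPTICS(min_samples=3, min_cluster_size=3,
                       cluster_method='xi', xi=0.1).fit(X).labels_
    for seed in range(n_permutations):
        Xp, idx = shuffle(X, np.arange(len(X)), random_state=seed)
        permuted = OPTICS(min_samples=3, min_cluster_size=3,
                          cluster_method='xi', xi=0.1).fit(Xp).labels_
        # Undo the permutation before comparing the two labellings.
        labels = np.empty_like(permuted)
        labels[idx] = permuted
        # Only the grouping has to match, not the exact label ids.
        assert v_measure_score(reference, labels) >= min_agreement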

@jnothman (Member)

Also, the problem is different between your example and the one that's failing here. The failing case is identifying several inliers as outliers.

@kno10 (Contributor) commented Apr 29, 2019

An OPTICS plot is essentially a linearized version of a spanning tree. This linearization is not unique; it is easy to imagine that any merged subtrees can be swapped. But since the extraction is based on this linearization, some order dependence is to be expected, in particular for border points and outliers.
Thus I am not surprised that a different random generator can cause such a problem.

I'd suggest fixing the permutation, not the random generator, i.e., dump the permutation produced by shuffle and put it into the unit test as an array, which would presumably look like this:

shuffle = [11, 22, 10,  2, 16, 14, 28, 26, 20, 13, 24,
            5, 17,  8, 30, 25, 23,  1, 31,  6,  4, 18,
           29, 19,  9,  7, 27,  3,  0, 21, 15, 12]
X, expected_labels = X[shuffle], expected_labels[shuffle]

Then the test will no longer depend on the random generator.
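
To obtain such an array in the first place, the permutation could be dumped once; a minimal sketch, assuming the 32-point layout of the second dataset in the test (six clusters of five points plus the two [100, 100] outliers):

import numpy as np
from sklearn.utils import shuffle

n_samples = 32  # six clusters of five points plus two outliers
rng = np.random.RandomState(0)
permutation = shuffle(np.arange(n_samples), random_state=rng)
print(repr(permutation))  # paste the printed indices into the test as `shuffle`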

@jnothman (Member)

The shuffling order cannot be the reason this test is failing on some platforms, since numpy's random number generator should be more-or-less cross platform (when generating small ints, certainly).

Numeric issues are a more likely culprit. But without having a hack on that platform, it's hard to work out where they may come from.

jnothman added a commit to jnothman/scikit-learn that referenced this issue Apr 29, 2019
@kno10 (Contributor) commented Apr 29, 2019

You are right, it shouldn't be the shuffling order, as the labels mostly agree. If it were a different shuffling, they would be completely mixed up. And the points are well separated, in a well-behaved numerical range.

I had seen that the first assert_array_equal worked, so I thought it was maybe the shuffle; but the first run is also shuffled, and the second one only adds a duplicate point; and both "outliers" appear to be correctly labeled.

I guess someone would need to look at the actual reachabilities and core sizes for the points on the two platforms, and then at which points are considered steep by the xi method.
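
A possible diagnostic to run on the two platforms, as a sketch only; it assumes `clust` is the estimator fitted as in the failing test and relies on the estimator's public attributes:

import numpy as np

np.set_printoptions(precision=17)  # print full double precision for comparison
order = clust.ordering_
print("reachability  :", clust.reachability_[order])
print("core distances:", clust.core_distances_[order])
print("predecessors  :", clust.predecessor_[order])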

@adrinjalali (Member)

How does "predecessor correction" relate to this order invariance?

I put a comment above that test saying that it may fail if the predecessor correction is not at work. What I meant was that, if the predecessor correction is not done right, the test fails to find the outliers as outliers, but only under certain permutations of the data.

Numeric issues are a more likely culprit. But without having a hack on that platform, it's hard to work out where they may come from.

+1

Then the test will no longer depend on the random generator.

When generating the data, the points around the edges of a cluster sometimes "look"/"feel" more sparse. That can result in OPTICS detecting a smaller cluster in the middle of that cluster plus a "parent" cluster that includes the edge points; in that case the edge points are labeled -1, since a smaller cluster was detected inside the larger one. But if we look at the cluster hierarchy, they are indeed part of a cluster and not detected as "outliers".
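
To see that nesting, one can inspect cluster_hierarchy_, which is available when cluster_method='xi'; a small sketch, assuming `clust` is the fitted estimator from the test:

# Each row of cluster_hierarchy_ is [start, end], inclusive indices into
# clust.ordering_, so X[clust.ordering_][start:end + 1] forms one cluster.
for start, end in clust.cluster_hierarchy_:
    members = clust.ordering_[start:end + 1]
    print("cluster over ordering positions %d-%d (%d points)"
          % (start, end, len(members)))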

That said, the cluster hierarchies are still sensitive to the random seed, to a somewhat extreme extent. For instance, the following code (with somewhat tuned hyperparameters):

from sklearn.cluster import OPTICS
from sklearn.utils import shuffle
import numpy as np
from sklearn.metrics import v_measure_score


def is_stable(min_samples, min_cluster_size, xi):
    min_score, max_score = 1, 0
    outliers_detected = True
    for seed in range(20):
        rng = np.random.RandomState(seed)
        n_points_per_cluster = 50

        C1 = [-5, -2] + .8 * rng.randn(n_points_per_cluster, 2)
        C2 = [4, -1] + .1 * rng.randn(n_points_per_cluster, 2)
        C3 = [1, -2] + .2 * rng.randn(n_points_per_cluster, 2)
        C4 = [-2, 3] + .3 * rng.randn(n_points_per_cluster, 2)
        C5 = [3, -2] + .6 * rng.randn(n_points_per_cluster, 2)
        C6 = [5, 6] + .2 * rng.randn(n_points_per_cluster, 2)

        X = np.vstack((C1, C2, C3, C4, C5, np.array([[100, 100]] * 2), C6))
        expected_labels = np.r_[[1] * 50, [3] * 50, [2] * 50, [0] * 50, [2] * 50,
                                -1, -1, [4] * 50]
        X, expected_labels = shuffle(X, expected_labels, random_state=rng)
        clust = OPTICS(min_samples=min_samples, min_cluster_size=min_cluster_size,
                       max_eps=np.inf, cluster_method='xi',
                       xi=xi, predecessor_correction=True).fit(X)
        score = v_measure_score(expected_labels, clust.labels_)
        min_score = min(min_score, score)
        max_score = max(max_score, score)

        if np.any(clust.labels_[expected_labels == -1] != -1):
            outliers_detected = False

    return min_score, max_score, outliers_detected


for min_samples in [5, 10]:
    print(min_samples)
    for min_cluster_size in [20]:
        for xi in [0.2, 0.25, 0.3, 0.35, 0.4]:
            min_score, max_score, outliers_detected = is_stable(
                min_samples, min_cluster_size, xi)
            print(min_samples, min_cluster_size, xi, min_score,
                  max_score, outliers_detected)

will output:

5
5 20 0.2 0.7539717725927957 0.9238319245670366 True
5 20 0.25 0.8030908150869507 0.9896449414305457 True
5 20 0.3 0.8445237371914895 0.9896449414305457 True
5 20 0.35 0.8765544810631446 0.9896449414305457 True
5 20 0.4 0.8438171811403292 0.9896449414305457 True
10
10 20 0.2 0.8523339691262533 0.9514079330881974 True
10 20 0.25 0.8703671103732291 0.9896449414305457 True
10 20 0.3 0.8733612366196282 0.9896449414305457 True
10 20 0.35 0.8438171811403292 0.9896449414305457 True
10 20 0.4 0.8447076457657651 0.9896449414305457 True

The good thing is the outliers are always detected correctly. However, change the range(20) to range(50), and you'll get:

5
5 20 0.2 0.7539717725927957 0.9241226358431036 False
5 20 0.25 0.8030908150869507 0.9896449414305457 False
5 20 0.3 0.8438171811403292 0.9896449414305457 False
5 20 0.35 0.8438171811403292 0.9896449414305457 False
5 20 0.4 0.8438171811403292 0.9896449414305457 False
10
10 20 0.2 0.8337238308523841 0.9514079330881974 False
10 20 0.25 0.8671112885340639 0.9896449414305457 False
10 20 0.3 0.8438171811403292 0.9896449414305457 False
10 20 0.35 0.8438171811403292 0.9896449414305457 False
10 20 0.4 0.8438171811403292 0.9896449414305457 False

i.e., there are always some circumstances under which the outliers are not detected as outliers.

We can change the test to check for a high v_measure_score and to check that the outliers are properly detected, and then later figure out why and when the issue happens.
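
Something along these lines, as a sketch; the 0.95 threshold is illustrative, and `clust`, `expected_labels` are as in the snippet above:

import numpy as np
from sklearn.metrics import v_measure_score

# Relaxed check: labels should agree strongly with the expected partition...
assert v_measure_score(expected_labels, clust.labels_) > 0.95
# ...and every point expected to be noise must still be labelled -1.
assert np.all(clust.labels_[expected_labels == -1] == -1)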

@jnothman (Member) commented Apr 29, 2019 via email

@kno10 (Contributor) commented Apr 29, 2019

The v-measure (and unfortunately every other standard measure) does not say much about quality on data with a logical cluster hierarchy and with outliers, because these measures understand neither nested clusters nor outliers encoded in the result; rather, they assume they are evaluating a flat and complete clustering.
For example, outliers are not a cluster; they may be completely unrelated to each other.
So you may find a result that is intuitively much better yet scores worse. People need to stop being obsessed with these scores; it only leads to overfitting and false discoveries.
It's okay for the purpose of regression testing, though: the value should not change due to unrelated code changes.

@jnothman (Member) commented Apr 29, 2019 via email

@adrinjalali (Member)

Should we be disabling this test temporarily to get the release out?

Yes, because whatever we do now is a workaround anyway. I'd rather disable it, open an issue for it, and then fix it later.

@qinhanmin2014 (Member)

I agree that it's reasonable to get different results when we shuffle the dataset in a different way.
Regarding the current test failure, I guess the reason might be some numeric issue when calculating the distances.
Since the failure only occurs on i686, maybe we can skip the test and release (is it possible to only skip the test on i686?).
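
One possible way, as a sketch rather than the eventual fix: detect a 32-bit interpreter and mark the test with pytest.mark.skipif (the reason string here is illustrative):

import struct

import pytest

IS_32BIT = struct.calcsize("P") * 8 == 32  # pointer size in bits

@pytest.mark.skipif(IS_32BIT,
                    reason="known failure on 32-bit platforms, see #13739")
def test_extract_xi():
    ...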

@jnothman (Member) commented Apr 29, 2019 via email

@qinhanmin2014 (Member)

Hmm, I have another question related to outliers @adrinjalali.
[screenshot: Condition 4 of the xi cluster extraction from the OPTICS paper]
I think r(x) < r(sD) in 4c should be r(x) > r(sD)? It seems that the R dbscan package uses r(x) > r(sD):

# Condition 4
{
  # Case b
  if (sda$maximum * object$ixi >= sua$maximum) {
    while(cstart < cend && object$ord_rd[cstart+1] > sua$maximum) cstart <- cstart + 1
  }
  # Case c
  else if (sua$maximum * object$ixi >= sda$maximum) {
    while(cend > cstart && object$ord_rd[cend-1] > sda$maximum) cend <- cend - 1
  }
}

@qinhanmin2014 (Member)

It seems that ELKI uses another variant, but one similar to r(x) > r(sD):

// However, we sometimes have to adjust this (Condition 4):
{
// Case b)
if(sda.maximum * ixi >= eU) {
  while(cstart < cend && reach.doubleValue(iter.seek(cstart + 1)) * ixi > eU) {
    cstart++;
  }
}
// Case c)
else if(eU * ixi >= sda.maximum) {
  while(cend > cstart && reach.doubleValue(iter.seek(cend)) * ixi > sda.maximum) {
    cend--;
  }
}
// Case a) is the default
}

@amueller (Member)

@qinhanmin2014 So you're saying there's a typo in the original paper and we're implementing the typo?

@qinhanmin2014 (Member)

@qinhanmin2014 So you're saying there's a typo in the original paper and we're implementing the typo?

I think so, but I'm not certain.

@jnothman (Member)

@adrinjalali, I got misled by your:

I can't really diagnose this one w/o a Mac machine, which I don't have

The failure is on a linux i686.

@adrinjalali (Member)

The failure is on a linux i686.

🤦‍♂️ Sorry, I was misled by the MacPython in the URL of the build log.

@jnothman (Member)

🤦‍♂️ Sorry, I was misled by the MacPython in the URL of the build log.

Yeah, I find that cognitive dissonance really hard to deal with too!

@kno10 (Contributor) commented Apr 30, 2019

@qinhanmin2014 the R version is a port of an older ELKI version, so it is likely to reproduce any error in the ELKI implementation at that time...

This is obviously a typo in the paper. Since the steep-up region is monotone, the minimum with < would always be sU, the start of the steep-up region. With > it is consistent with the first case.
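
For reference, a rough Python rendering of the corrected Condition 4, mirroring the ELKI snippet quoted above; the names are illustrative and this is not scikit-learn's implementation:

def correct_cluster_bounds(reach, c_start, c_end, sda_max, eU, ixi):
    # reach: reachabilities in OPTICS order; [c_start, c_end] is the candidate
    # cluster; sda_max is the maximum of the steep-down area, eU the
    # reachability at the end of the steep-up area, ixi = 1 - xi.
    if sda_max * ixi >= eU:
        # Case b: trim the start past points whose reachability is too large.
        while c_start < c_end and reach[c_start + 1] * ixi > eU:
            c_start += 1
    elif eU * ixi >= sda_max:
        # Case c: trim the end; note the comparison is ">", consistent with
        # the correction discussed above (the paper prints "<").
        while c_end > c_start and reach[c_end] * ixi > sda_max:
            c_end -= 1
    # Case a: otherwise, keep the bounds unchanged.
    return c_start, c_end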

@kno10 (Contributor) commented Apr 30, 2019

@jnothman you may want to retitle #13744 then to reduce the confusion.

@qinhanmin2014 (Member)

This is obviously a typo in the paper. Since the steep up region is monotone, the minimum with < would always be sU, the start of the steep-up region. With a > this is consistent with the first case.

Thanks, we'll correct it.
