FIX Adjusts xi in test_extract_xi by thomasjpfan · Pull Request #14201 · scikit-learn/scikit-learn · GitHub

FIX Adjusts xi in test_extract_xi #14201


Merged: 7 commits into scikit-learn:master on Jul 1, 2019

Conversation

@thomasjpfan (Member) commented on Jun 27, 2019

Reference Issues/PRs

Fixes #13739

What does this implement/fix? Explain your changes.

Adjusts xi in test_extract_xi

Any other comments?

This test passes on i686 with scikit-learn-wheels configured to point to this branch.

Here is the diff of scikit-learn-wheels used to run the test.

cc @qinhanmin2014 @adrinjalali

@@ -233,7 +233,7 @@ def fit(self, X, y=None):
self : instance of OPTICS
The instance.
"""
-        X = check_array(X, dtype=np.float)
+        X = check_array(X, dtype=np.float32)
Member

Hmm, if something fails in 64bit due to numerical issues, I don't understand why it would not fail in 32bit. What's the advantage of doing this? I would say keeping f64 is always good as far as numerical precision is concerned.

Member

This doesn't sound like a good idea to me. Most estimators accept float64, if not enforce it, which means that users who want their data not to be copied can usually feed float64 data. This change forces a copy of all that data, besides losing precision.
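
To illustrate the copy concern, here is a minimal sketch with made-up data (assuming check_array's default copy=False behavior): a float64 input passes through untouched when float64 is allowed, but forcing float32 converts, and therefore copies, it.

import numpy as np
from sklearn.utils import check_array

X64 = np.random.RandomState(0).uniform(size=(5, 2))

same = check_array(X64, dtype=np.float64)    # no conversion needed
forced = check_array(X64, dtype=np.float32)  # converted, hence copied

print(np.shares_memory(X64, same))    # True: the original buffer is reused
print(np.shares_memory(X64, forced))  # False: the data was copied (and downcast)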

@qinhanmin2014 (Member) commented on Jun 27, 2019

Most CI jobs pass and we're unable to find any related bugs, so I guess it's not worthwhile to drop float64 support in order to make CIs green. Maybe increase the tolerance or construct another dataset?

@adrinjalali (Member)

Maybe if this is fixing the issue, we should better generate the data for the tests?

@@ -97,6 +98,9 @@ def test_extract_xi():
                             -1, [4] * 5]
     X, expected_labels = shuffle(X, expected_labels, random_state=rng)
 
+    if _IS_32BIT:
+        X = X.astype(np.float32)
Member

float64 should still work on 32bit arch. If it fails it may mean that something internally in OPTICS creates arrays of dtype np.int (or np.intp) instead of X.dtype...
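
For reference, the _IS_32BIT flag used in the test diff above is, to my understanding, essentially a pointer-size check; a minimal equivalent sketch using plain NumPy and the standard library (not scikit-learn internals):

import struct
import numpy as np

# On a 32-bit build pointers are 4 bytes and np.intp maps to int32;
# on a 64-bit build they are 8 bytes and np.intp maps to int64.
is_32bit = 8 * struct.calcsize("P") == 32
print(is_32bit, np.dtype(np.intp))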

Member Author

Time for a deep dive into OPTICS. I hope I have better vision when I surface.

@thomasjpfan (Member, Author) commented on Jun 27, 2019

Here is where the error is coming from:

X = np.vstack((C1, C2, C3, C4, C5, np.array([[100, 100]] * 2), C6))
expected_labels = np.r_[[1] * 5, [3] * 5, [2] * 5, [0] * 5, [2] * 5,
                        -1, -1, [4] * 5]
X, expected_labels = shuffle(X, expected_labels, random_state=rng)

clust = OPTICS(min_samples=3, min_cluster_size=3,
               max_eps=20, cluster_method='xi',
               xi=0.1).fit(X)
# this may fail if the predecessor correction is not at work!
assert_array_equal(clust.labels_, expected_labels)

which is failing on:

E       AssertionError: 
E       Arrays are not equal
E       
E       Mismatch: 18.8%
E       Max absolute difference: 3
E       Max relative difference: nan
E        x: array([ 0,  0, -1, -1,  1,  3,  3,  2,  0,  3,  3, -1,  1,  1, -1,  2, -1,
E               4,  0, -1,  4,  0,  4,  2, -1,  1,  1,  4,  2,  3,  4, -1])
E        y: array([ 0,  0,  2,  2,  1,  3,  3,  2,  0,  3,  3,  2,  1,  1,  2,  2, -1,
E               4,  0,  2,  4,  0,  4,  2, -1,  1,  1,  4,  2,  3,  4,  2])

Some of the points expected to be in cluster 2 are instead marked as noise (-1).

@thomasjpfan (Member, Author) commented on Jun 27, 2019

In the following line:

point = index[np.argmin(reachability_[index])]

the result of np.argmin(reachability_[index]) starts to diverge between 32-bit and 64-bit. At that point, np.argmin(reachability_[index]) is 15 on 32-bit and 13 on 64-bit.

In 64 bit:

reachability_[index][13] = 0.6762013074479917
reachability_[index][15] = 0.6762013074479917

and in 32 bit:

reachability_[index][13] = 0.6762013074479917
reachability_[index][15] = 0.6762013074479916

This is why, on 32-bit, 15 is chosen as the argmin.

Edit:

  1. This happens because core_distances_ differs slightly between 32-bit and 64-bit for point_index 9.
  2. That means _compute_core_distances_ returns slightly different values on 32-bit and 64-bit.
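
A minimal sketch (with made-up surrounding values) of how that tiny last-digit difference flips the argmin, since np.argmin resolves exact ties to the first occurrence:

import numpy as np

reach_64 = np.full(16, 1.0)
reach_64[13] = 0.6762013074479917
reach_64[15] = 0.6762013074479917   # exact tie with index 13

reach_32 = reach_64.copy()
reach_32[15] = 0.6762013074479916   # slightly smaller, as observed on 32-bit

print(np.argmin(reach_64))  # 13: ties resolve to the first occurrence
print(np.argmin(reach_32))  # 15: the slightly smaller value wins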

@adrinjalali (Member)

@thomasjpfan that's probably because it's the nearest neighbor algorithm returning different values based on the dtype/architecture.

I had seen this issue, and kind of solved it by changing the dataset. It doesn't look like an OPTICS issue to me; i.e., this OPTICS test is surfacing an issue in the neighbors algorithm, I think.
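
A hypothetical illustration (not the PR's data) of that dtype sensitivity: querying the same points through NearestNeighbors in float64 versus float32 yields distances that differ in the last digits, which is enough to reorder near-ties downstream.

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X64 = rng.uniform(size=(50, 2))
X32 = X64.astype(np.float32)

d64, _ = NearestNeighbors(n_neighbors=3).fit(X64).kneighbors(X64)
d32, _ = NearestNeighbors(n_neighbors=3).fit(X32).kneighbors(X32)

# Small but nonzero differences in the computed neighbor distances.
print(np.abs(d64 - d32).max())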

@thomasjpfan thomasjpfan changed the title [MRG] Forces float32 in optics [WIP] Forces float32 in optics Jun 28, 2019
@jnothman jnothman added this to the 0.21.3 milestone Jul 1, 2019
@thomasjpfan thomasjpfan changed the title [WIP] Forces float32 in optics [MRG] Adjusts xi in test_extract_xi Jul 1, 2019
@thomasjpfan (Member, Author)

This PR was reduced to only updating the xi parameter. This seems to work: https://travis-ci.org/thomasjpfan/scikit-learn-wheels/builds/552555734
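
For context, a hypothetical sketch (not the PR's actual data or the value chosen in the test) of how the xi parameter steers xi-cluster extraction: a larger xi requires a steeper relative drop in reachability to start a cluster, so borderline clusters can shrink or disappear.

import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 0.3, size=(20, 2)),
               rng.normal(5, 0.3, size=(20, 2))])

for xi in (0.05, 0.35):
    labels = OPTICS(min_samples=5, cluster_method='xi', xi=xi).fit(X).labels_
    # Larger xi tends to produce fewer clusters and more -1 (noise) labels.
    print(xi, np.unique(labels))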

@jnothman (Member) left a comment

This is quite a comfortable fix!!

@rth (Member) left a comment

Thanks for investigating!

@rth rth changed the title [MRG] Adjusts xi in test_extract_xi FIX Adjusts xi in test_extract_xi Jul 1, 2019
@rth rth merged commit bd2cc10 into scikit-learn:master Jul 1, 2019
@qinhanmin2014 (Member)

LGTM, thanks.

koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019
jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Jul 24, 2019

Successfully merging this pull request may close these issues.

test_extract_xi failing on nightly builds