-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
FIX Adjusts xi in test_extract_xi #14201
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
sklearn/cluster/optics_.py
Outdated
@@ -233,7 +233,7 @@ def fit(self, X, y=None): | |||
self : instance of OPTICS | |||
The instance. | |||
""" | |||
X = check_array(X, dtype=np.float) | |||
X = check_array(X, dtype=np.float32) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, if something fails in 64bit due to numerical issues, I don't understand why it would not fail in 32bit. What's the advantage of doing this? I would say keeping f64 is always good as far as numerical precision is concerned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't sound like a good idea to me. Most estimators accept a float64, if not enforce it. Which means usually if the user wants the data not to be copied, they can feed a float64 data. This change forces a copy on all those data, other than loosing precision.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most CI jobs pass and we're unable to find any related bugs, so I guess it's not worthwhile to drop float64 support in order to make CIs green. Maybe increase the tolerance or construct another dataset?
Maybe if this is fixing the issue, we should better generate the data for the tests? |
sklearn/cluster/tests/test_optics.py
Outdated
@@ -97,6 +98,9 @@ def test_extract_xi(): | |||
-1, [4] * 5] | |||
X, expected_labels = shuffle(X, expected_labels, random_state=rng) | |||
|
|||
if _IS_32BIT: | |||
X = X.astype(np.float32) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
float64 should still work on 32bit arch. If it fails it may mean that something internally in OPTICS creates arrays of dtype np.int
(or np.intp
) instead of X.dtype
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Time for a deep dive into OPTICS. I hope I have better vision when I surface.
Here is where the error is coming from: X = np.vstack((C1, C2, C3, C4, C5, np.array([[100, 100]] * 2), C6))
expected_labels = np.r_[[1] * 5, [3] * 5, [2] * 5, [0] * 5, [2] * 5,
-1, -1, [4] * 5]
X, expected_labels = shuffle(X, expected_labels, random_state=rng)
clust = OPTICS(min_samples=3, min_cluster_size=3,
max_eps=20, cluster_method='xi',
xi=0.1).fit(X)
# this may fail if the predecessor correction is not at work!
assert_array_equal(clust.labels_, expected_labels) which is failing on: E AssertionError:
E Arrays are not equal
E
E Mismatch: 18.8%
E Max absolute difference: 3
E Max relative difference: nan
E x: array([ 0, 0, -1, -1, 1, 3, 3, 2, 0, 3, 3, -1, 1, 1, -1, 2, -1,
E 4, 0, -1, 4, 0, 4, 2, -1, 1, 1, 4, 2, 3, 4, -1])
E y: array([ 0, 0, 2, 2, 1, 3, 3, 2, 0, 3, 3, 2, 1, 1, 2, 2, -1,
E 4, 0, 2, 4, 0, 4, 2, -1, 1, 1, 4, 2, 3, 4, 2]) Some of the clusters in |
In the following line: scikit-learn/sklearn/cluster/optics_.py Line 476 in f339609
the In 64 bit: reachability_[index][13] = 0.6762013074479917
reachability_[index][15] = 0.6762013074479917 and in 32 bit: reachability_[index][13] = 0.6762013074479917
reachability_[index][15] = 0.6762013074479916 This is why in 32bit, 15 is chosen to be the Edit:
|
@thomasjpfan that's probably because it's the nearest neighbor algorithm returning different values based on the dtype/architecture. I had seen this issue, and kinda solved it with changing the dataset. It doesn't look like an optics issue to me, i.e. this optics test is surfacing an issue in the neighbors algorithm, I think. |
This PR was reduced to only updating the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is quite a comfortable fix!!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is quite a comfortable fix!!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for investigating!
LGTM, thanks. |
Reference Issues/PRs
Fixes #13739
What does this implement/fix? Explain your changes.
Adjusts xi in test_extract_xi
Any other comments?
This test pass on i686 on scikit-learn-wheels configured to point to this branch.
Here is the diff of scikit-learn-wheels used to run the test.
cc @qinhanmin2014 @adrinjalali