[WIP] Detect precision issues in euclidean distance calculations #12142
Conversation
Surprisingly bad precision, isn't it? Note that the traditional computation `sqrt(sum((x-y)**2))` gives the exact result.
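For instance, with the values discussed later in this thread, the two formulas can be compared directly in NumPy (a quick illustration, not code from this PR):

```python
import numpy as np

# Example values from #9354: the true distance is sqrt(2) ~ 1.4142
x = np.array([100000001.0, 100000000.0])
y = np.array([100000000.0, 100000001.0])

# Quadratic expansion: ||x||^2 + ||y||^2 - 2 x.y
expanded = np.sqrt(x @ x + y @ y - 2 * (x @ y))

# Traditional computation: sqrt(sum((x - y)**2))
direct = np.sqrt(np.sum((x - y) ** 2))

print(expanded)  # 0.0: the ~2e16-sized squared norms cancel and the distance is lost
print(direct)    # 1.4142135623730951: exact
```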
    warning_message += (
        "Consider standardizing features by removing the mean, "
        "or setting globally "
        "euclidean_distances_algorithm='exact' for slower but "
should we not just back off to using the exact algorithm in this case?
Yes, that would probably be the best outcome. But for this to actually be useful, I would like to explore more fine-grained detection of problematic points, instead of considering just the min/max of the distribution.
Standardizing, centering etc. will not help in all cases. Consider the 1d data set -10001,-10000,10000,10001 with FP32; essentially the symmetric version of the example I provided before. The mean is zero; and scaling it will not resolve the relative closeness of the pairs of values.
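A minimal sketch of this point, running the data set above through the expanded formula in float32 (centering would be a no-op since the mean is already zero):

```python
import numpy as np

# Symmetric 1-d data set: the mean is already zero, so centering changes
# nothing, yet float32 + quadratic expansion loses the close pairs.
X = np.array([[-10001.0], [-10000.0], [10000.0], [10001.0]], dtype=np.float32)

norms = (X ** 2).sum(axis=1)
# Expanded formula in float32: ||a||^2 + ||b||^2 - 2 a.b
D2 = norms[:, None] + norms[None, :] - 2 * (X @ X.T)
D = np.sqrt(np.maximum(D2, 0))

print(D[0, 1], D[2, 3])  # both 0.0 here, although the true distances are 1.0
print(np.abs(X[0, 0] - X[1, 0]), np.abs(X[2, 0] - X[3, 0]))  # 1.0 1.0 via the direct difference
```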
I agree, sample statistics are not the solution here; this was merely an attempt at prototyping. Theoretically, though, since we already have the norms, we could compare them as a proxy for detecting potentially problematic points, at a cost of `O(n_samples_a * n_samples_b)` compared to the overall cost `O(n_samples_a * n_samples_b * n_features)` of the full calculation, or less if we use some tree structure. We could then recompute those points exactly, though that would have its own limitations...
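A rough sketch of that norm-comparison idea (the helper name and threshold are made up for illustration, not part of this PR):

```python
import numpy as np

def flag_problematic_pairs(X, Y, rtol=1e-7):
    """Hypothetical helper: flag pairs whose norms are nearly equal.

    When ||x|| and ||y|| agree to within rtol, the expansion
    ||x||^2 + ||y||^2 - 2 x.y is prone to catastrophic cancellation,
    so the pair becomes a candidate for exact recomputation.
    Cost is O(n_samples_a * n_samples_b); false positives are expected,
    since equal norms do not imply a small distance.
    """
    nx = np.linalg.norm(X, axis=1)
    ny = np.linalg.norm(Y, axis=1)
    scale = np.maximum(nx[:, None], ny[None, :])
    return np.abs(nx[:, None] - ny[None, :]) <= rtol * scale

X = np.array([[100000001.0, 100000000.0]])
Y = np.array([[100000000.0, 100000001.0]])
print(flag_problematic_pairs(X, Y))  # [[ True]]: recompute this pair exactly
```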
Yes, I've realised that coarse-grained detection may lead to subset invariance issues (i.e. taking the pairwise distances over a subset of the data may result in a different method and hence different results).
Some additional good values for testing implementations follow. Note that these aim at a different thing: causing the distances to become negative due to rounding. They are derived from test values for verifying variance implementations, and the values are 1 ULP apart, while the numeric accuracy of the dot-based equation used here is supposedly sqrt(1 ULP).
These will cause havoc to a naive implementation:
Notice the NaNs:
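Not the specific values referred to above, but a brute-force float32 sketch of the same effect: the unclipped expanded formula produces negative squared distances, which turn into NaNs after the square root.

```python
import numpy as np

rng = np.random.RandomState(0)

# Many nearly-identical float32 rows: the true squared distances (at most ~1e-5)
# are far below the float32 rounding granularity of the ~1e7-sized squared norms.
base = rng.uniform(900, 1000, size=(1000, 10))
X = base.astype(np.float32)
Y = (base + rng.uniform(-1e-3, 1e-3, size=base.shape)).astype(np.float32)

nx = (X ** 2).sum(axis=1)
ny = (Y ** 2).sum(axis=1)
d2 = nx + ny - 2 * (X * Y).sum(axis=1)  # expanded formula, row-wise, no clipping

print("negative squared distances:", (d2 < 0).sum(), "out of", len(d2))
with np.errstate(invalid="ignore"):
    print("NaNs after sqrt:", np.isnan(np.sqrt(d2)).sum())
```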
The current released code should pass this test because of the clamping of negative squared distances to zero.
Happy to close?
Agreed, closing following discussion in #9354 (comment)
Following the discussion in #9354, this is an attempt to warn the user when numerical precision issues may occur in the euclidean distances computed with quadratic expansion. This affects both 32 bit and 64 bit.
The overall idea is that we need to detect when the distance between two vectors is much less than the vector norms (by a factor of 1e7 in 64 bit).
Here we take a simpler approach and only consider the global distribution of the compared vectors: if all considered vectors lie within a very thin shell, such that the shell thickness divided by the mean vector norm is below a given threshold, a warning is raised (cf. the image below).
For instance, this detects the problematic case of computing distances between `[[100000001, 100000000]]` and `[[100000000, 100000001]]`, as reported by @kno10 in Numerical precision of euclidean_distances with float32 #9354 (comment).
Here the cost of checking for precision issues is only `O(n_samples_a + n_samples_b)`, which is negligible compared to the cost `O(n_samples_a * n_samples_b * n_features)` of computing the pairwise distances.
This is more of a proof of concept; more work would be needed to make it useful.
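To make the shell heuristic concrete, here is a minimal sketch (the helper name and threshold are illustrative, not the actual code in this PR):

```python
import numpy as np

def norms_in_thin_shell(X, Y, threshold=1e-7):
    """Illustrative sketch of the shell-thickness heuristic (not the PR code).

    Cost is O(n_samples_a + n_samples_b), since only the row norms are
    inspected: if every norm falls in a shell whose thickness is tiny
    relative to the mean norm, the expanded formula is likely to lose
    precision and a warning (or an exact fallback) is warranted.
    """
    norms = np.concatenate([np.linalg.norm(X, axis=1),
                            np.linalg.norm(Y, axis=1)])
    thickness = norms.max() - norms.min()
    return thickness < threshold * norms.mean()

X = np.array([[100000001.0, 100000000.0]])
Y = np.array([[100000000.0, 100000001.0]])
print(norms_in_thin_shell(X, Y))  # True: this is exactly the reported failure case
```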
TODO
- detect problematic pairs of points (e.g. with `BallTree.query_radius`), then compute them accurately (see the sketch below).

The main idea is that comparing only vector norms may indicate when precision is problematic, with low computational effort, even though there are false positives.
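A rough sketch of what the `BallTree.query_radius` item could look like (the helper name, radius choice, and use of scipy's `cdist` are assumptions, not part of this PR):

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.neighbors import BallTree

def recompute_close_pairs(X, Y, D, radius):
    """Hypothetical sketch: recompute suspicious pairs with the direct formula.

    Pairs closer than `radius` are the ones most exposed to cancellation in
    the expanded formula, so their entries in the distance matrix D are
    redone with sqrt(sum((x - y)**2)) (here via scipy's cdist).
    """
    tree = BallTree(Y)
    neighbors = tree.query_radius(X, r=radius)  # for each X[i], indices of nearby Y rows
    for i, idx in enumerate(neighbors):
        if len(idx):
            D[i, idx] = cdist(X[i:i + 1], Y[idx])[0]  # exact recomputation
    return D
```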