-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
[MRG] set blockwise diagonals to zero for euclidean distance #12612
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] set blockwise diagonals to zero for euclidean distance #12612
Conversation
@@ -473,7 +473,7 @@ def check_pairwise_distances_chunked(X, Y, working_memory, metric='euclidean'): | |||
metric=metric) | |||
assert isinstance(gen, GeneratorType) | |||
blockwise_distances = list(gen) | |||
Y = np.array(X if Y is None else Y) | |||
Y = X if Y is None else Y |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this change is necessary otherwise the zeroing logic in pairwise_distances
doesn't kick in.
sklearn/metrics/pairwise.py
Outdated
# zeroing diagonal, taking care of aliases of "euclidean", | ||
# i.e. "l2" | ||
D_chunk[np.arange(D_chunk.shape[0]), | ||
np.arange(D_chunk.shape[1])[sl]] = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
surely there's a better way to force inner indexing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we do D_chunk.flat[offset::len(X)+1] = 0
?
sklearn/metrics/pairwise.py
Outdated
# zeroing diagonal, taking care of aliases of "euclidean", | ||
# i.e. "l2" | ||
D_chunk[np.arange(D_chunk.shape[0]), | ||
np.arange(D_chunk.shape[1])[sl]] = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we do D_chunk.flat[offset::len(X)+1] = 0
?
@@ -1271,6 +1271,13 @@ def pairwise_distances_chunked(X, Y=None, reduce_func=None, | |||
X_chunk = X[sl] | |||
D_chunk = pairwise_distances(X_chunk, Y, metric=metric, | |||
n_jobs=n_jobs, **kwds) | |||
if ((X is Y or Y is None) | |||
and PAIRWISE_DISTANCE_FUNCTIONS.get(metric, None) | |||
is euclidean_distances): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we be introducing this for all metrics, perhaps with a FutureWarning where we detect the diagonal is not close to 0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in both chunked and non-chunked? How do we know the callable we get passed is actually a metric? Should we try and detect and warn if it's not? Maybe the deprecation will be enough of a warning.
Do you think this should go into 0.20.1? |
I think this limited change should go into 0.20.1 as it fixes a regression
for brute neighbours and estimate_bandwidth.
I think we should expect that all metrics used with pairwise_metrics return
0 on the diagonal. We should deprecate support for metrics that do not.
(And we should consider whether we need a kernel/affinity equivalent of
pairwise_distances_chunked)
|
Agree on all accounts. Need to fix the indexing and we need a second review ;) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's new entry for the regression?
chunks = list(pairwise_distances_chunked(X, working_memory=1, | ||
metric=metric)) | ||
assert len(chunks) > 1 | ||
assert_array_almost_equal(np.diag(np.vstack(chunks)), 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is almost equal good enough? Set a very small tolerance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering. this has "decimal=6" by default. What is small?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's small enough :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
decimal=10
is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, merging.
Thanks! |
Oops, I was a bit too quick, I thought @jnothman had given his +1 but it's not the case. All comments seemed addressed though. |
whoops, you're right :-/ |
No.
It has to :) If not we need a more detailed feedback from someone with a machine that can reproduce the error. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GitHub won't let me approve posthumously.
…cikit-learn#12612)" This reverts commit 5a3ea57.
…cikit-learn#12612)" This reverts commit 5a3ea57.
A shot at the discussion in #12574.
Not sure this is the right solution also the indexing looks weird to me.