Add RadiusNeighborhood as a PairwiseDistancesReduction #3
Conversation
Casting is necessary because the APIs exposed via scipy.linalg.cython_blas do not reflect the const qualifier on their arguments.
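For context, a minimal sketch of the kind of cast this refers to; the `dot_row` helper is illustrative and not part of the PR, while `scipy.linalg.cython_blas.ddot` is a real routine:

```cython
from scipy.linalg.cython_blas cimport ddot

cdef double dot_row(const double* x, const double* y, int n) nogil:
    # Hypothetical helper: the cython_blas signatures are not const-qualified,
    # so the const pointers must be cast away before the call.
    cdef int inc = 1
    return ddot(&n, <double*> x, &inc, <double*> y, &inc)
```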
This creates a double free, as the buffers are tied to vectors attached to the instance. When the RadiusNeighborhood instance gets deleted, so do the buffers, and the content of the returned numpy arrays makes no sense. It also crashes when one deallocates the numpy arrays. The final buffers' lifetime must not be tied to the instance's vectors.
To solve the problem described above, we dynamically allocate vectors which won't themselves be freed; their buffers eventually will be, as their ownership is transferred to numpy arrays.
So that buffers are recreated and not freed.
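To illustrate the constraint, here is a minimal, hypothetical sketch (the helper name is not the PR's) of exposing a heap-allocated std::vector's buffer as a numpy array without copying. Nothing in it extends the buffer's lifetime: freeing the vector while the array is alive, or letting both try to release the buffer, gives exactly the double free and garbage content described above, while never freeing the vector avoids the crash at the cost of a leak.

```cython
# distutils: language = c++
from cython.operator cimport dereference as deref
from libcpp.vector cimport vector
cimport numpy as cnp

cnp.import_array()

cdef cnp.ndarray view_vector_buffer(vector[cnp.intp_t] * vect_ptr):
    # Hypothetical helper: the returned array merely *views* the vector's
    # buffer; its validity ends whenever the vector releases that buffer.
    cdef cnp.npy_intp size = deref(vect_ptr).size()
    return cnp.PyArray_SimpleNewFromData(
        1, &size, cnp.NPY_INTP, deref(vect_ptr).data())
```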
Could you check that this does not cause a memory leak by calling RadiusNeighbors' query method in a loop, discarding the results each time and monitoring memory usage? Try a big enough test set, with k large enough such that the resulting neighbors data is around 1 MB in memory, for instance. The training set does not need to be large.
Exciting! Here is some feedback:
Thanks for the review! Note that, as of now, this is still WIP, with some not-so-nice tactical shortcuts to see whether it can be made functional before cleaning it up, should it indeed turn out to be functional.
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
It gets automatically called then.
This reworks the synchronisation out of the first "with nogil, parallel" context, allowing more flexibility. For instance, this allows removing the lock previously needed for ArgKmin's synchronisation, as we now parallelize on samples. This adds a simple merge of vectors which should now work.
Use std::move to transfer data from the temporary vectors to the main ones.
Also add comments.
Co-authored-by: Jérémie du Boisberranger <jeremiedbb@users.noreply.github.com>
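To make the commit above concrete, here is a minimal, hypothetical sketch of such a merge using the range form of std::move from <algorithm>; the function and variable names are illustrative, and the element type is fixed to double for simplicity:

```cython
# distutils: language = c++
from libcpp.vector cimport vector

cdef extern from "<algorithm>" namespace "std" nogil:
    # Local declaration of the range version of std::move for double buffers.
    double* move(double* first, double* last, double* d_first)

cdef void merge_into(vector[double] & main_vec,
                     vector[double] & tmp_vec) nogil:
    cdef size_t last = main_vec.size()
    # Grow the destination, then move the temporary elements into place.
    main_vec.resize(last + tmp_vec.size())
    move(tmp_vec.data(), tmp_vec.data() + tmp_vec.size(),
         main_vec.data() + last)
```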
Introduce PairwiseDistancesReduction.is_usable_for, encapsulating the test used for branching logic. Also remove unused code made obsolete by the new implementation.
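A hypothetical sketch of how such a helper could drive the branching, assuming is_usable_for is a classmethod taking the operands and the metric; the fallback helper is purely illustrative:

```cython
# Hypothetical branching sketch; `legacy_radius_neighbors` is illustrative.
if PairwiseDistancesReduction.is_usable_for(X, Y, metric="euclidean"):
    # Fast path: delegate to the new Cython reduction.
    result = RadiusNeighborhood.get_for(X, Y, radius=radius).compute(
        return_distance=True,
    )
else:
    # Fallback to the pre-existing implementation.
    result = legacy_radius_neighbors(X, Y, radius=radius)
```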
Edit: cd0634e introduced proper memory management.

Quick report regarding memory usage: as of b8fe6e1, it looks like some data isn't deleted after calls to RadiusNeighborhood.compute. It looks like the numpy arrays do not free their buffer; one can attest to this by applying this patch:

```diff
diff --git a/sklearn/metrics/_parallel_reductions.pyx b/sklearn/metrics/_parallel_reductions.pyx
index 57f3146664..95b60b2cd6 100644
--- a/sklearn/metrics/_parallel_reductions.pyx
+++ b/sklearn/metrics/_parallel_reductions.pyx
@@ -1052,3 +1052,6 @@ cdef class RadiusNeighborhood(PairwiseDistancesReduction):
             # by a numpy array then.
             del self.neigh_distances
         return _coerce_vectors_to_np_nd_arrays(self.neigh_indices)
+
+    def indices(self):
+        return np.asarray(deref(self.neigh_indices))
\ No newline at end of file
```

and seeing that:

```
In [1]: %run script_given_bellow.py

In [2]: dists, indices = rn.compute(return_distance=return_distance, sort_results=True, strategy="parallel_on_Y")

In [3]: indices
Out[3]:
array([array([4, 2, 5, 3, 0, 1, 6, 7, 8, 9]),
       array([6, 2, 5, 9, 1, 4, 3, 0, 8, 7]),
       array([4, 0, 9, 1, 3, 2, 7, 5, 6, 8]), ...,
       array([6, 8, 0, 3, 7, 9, 4, 1, 2, 5]),
       array([6, 1, 4, 2, 0, 7, 8, 3, 5, 9]),
       array([8, 9, 1, 6, 5, 3, 4, 7, 0, 2])], dtype=object)

In [4]: del indices

In [5]: rn.indices()
Out[5]:
array([[4, 2, 5, ..., 7, 8, 9],
       [6, 2, 5, ..., 0, 8, 7],
       [4, 0, 9, ..., 5, 6, 8],
       ...,
       [6, 8, 0, ..., 1, 2, 5],
       [6, 1, 4, ..., 3, 5, 9],
       [8, 9, 1, ..., 7, 0, 2]])
```

Probably numpy's memory allocator is different and proper handling is needed, as suggested by this response.

Script:

```python
from sklearn.metrics._parallel_reductions import RadiusNeighborhood
import numpy as np
import psutil
import pandas as pd
import gc
import matplotlib.pyplot as plt

if __name__ == "__main__":
    memory_info = []
    mem_info = psutil.Process().memory_full_info()
    memory_info.append(mem_info)

    rng = np.random.RandomState(0)
    X = rng.rand(10000, 100)
    Y = rng.rand(10, 100)

    mem_info = psutil.Process().memory_full_info()
    memory_info.append(mem_info)

    n_calls = 100
    radius = 10000
    return_distance = True

    for i in range(n_calls):
        rn = RadiusNeighborhood.get_for(X, Y, radius=radius)
        # Results are discarded
        rn.compute(return_distance=return_distance, sort_results=True, strategy="parallel_on_Y")
        mem_info = psutil.Process().memory_full_info()
        print(f"Iteration {i} / {n_calls}", mem_info)
        memory_info.append(mem_info)

    gc.collect()
    mem_info = psutil.Process().memory_full_info()
    memory_info.append(mem_info)

    title = "Memory usage on calls to RadiusNeighborhood.compute"
    title += f"\n X.shape={X.shape} -- Y.shape={Y.shape} " \
             f"radius={radius} -- return_distance={return_distance}"

    df = pd.DataFrame(data=memory_info)
    df[["data", "rss", "uss"]].plot(title=title,
                                    xlabel="Number of calls",
                                    ylabel="Size of segment (in bytes)",
                                    figsize=(16, 9), fontsize=10)
    plt.savefig(f"/tmp/n_calls={n_calls}_radius={radius}_return_distance={return_distance}.png")
    plt.show()
```
StdVectorSentinel makes proper life-cycle management of std::vectors' buffers possible. Duplication seems needed, as fused types can't be used as attributes. One may get a missing symbol (`_ZSt28__throw_bad_array_new_lengthv`) at runtime. This is unrelated to the implementation here, and there are issues reporting the problem, e.g. cython/cython#4218. A temporary workaround: stan-dev/pystan#294 (comment)
Also removes _finalise_compute.
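Extending the sketch given earlier, here is a minimal, hypothetical sketch of the StdVectorSentinel idea (attribute and helper names are illustrative): a small extension type owns the vector's buffer and is set as the numpy array's base object, so the buffer stays alive exactly as long as the array does.

```cython
# distutils: language = c++
from cpython.ref cimport Py_INCREF
from cython.operator cimport dereference as deref
from libcpp.vector cimport vector
cimport numpy as cnp

cnp.import_array()

cdef class StdVectorSentinel:
    # Owns the buffer once the vector's content has been swapped in.
    cdef vector[double] vec

cdef cnp.ndarray vector_to_nd_array(vector[double] * vect_ptr):
    cdef:
        StdVectorSentinel sentinel = StdVectorSentinel()
        cnp.npy_intp size = deref(vect_ptr).size()
        cnp.ndarray arr = cnp.PyArray_SimpleNewFromData(
            1, &size, cnp.NPY_DOUBLE, deref(vect_ptr).data())
    # Steal the vector's buffer, then register the sentinel as the array's
    # base so the buffer is only released when the array is garbage-collected.
    sentinel.vec.swap(deref(vect_ptr))
    Py_INCREF(sentinel)  # PyArray_SetBaseObject steals a reference.
    cnp.PyArray_SetBaseObject(arr, sentinel)
    return arr
```

The per-dtype duplication mentioned above would come from needing one such sentinel per element type, since fused types can't be used as attributes.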
f69536c to 7c713a1 (Compare)
Reference Issues/PRs
Built on top of scikit-learn#20254
What does this implement/fix? Explain your changes.
Add a new reduction for querying neighbors using a radius (similarly to RadiusNeighborsMixin._radius_neighbors_reduce_func), using nested std::vectors (std::vector<std::vector<...>>); see this snippet.
Any other comments?
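For reference, here is a minimal NumPy-level sketch of this kind of per-chunk reduction (the `radius_reduce` name is illustrative; the PR performs the equivalent work on the Cython side with nested std::vectors):

```cython
import numpy as np

def radius_reduce(dist_chunk, radius, return_distance=True):
    # For each query row of a distance chunk, keep the indices (and,
    # optionally, the distances) of the points lying within `radius`.
    neigh_ind = [np.where(row <= radius)[0] for row in dist_chunk]
    if return_distance:
        neigh_dist = [row[ind] for row, ind in zip(dist_chunk, neigh_ind)]
        return neigh_dist, neigh_ind
    return neigh_ind
```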