Add RadiusNeighborhood as a PairwiseDistancesReduction by jjerphan · Pull Request #3 · jjerphan/scikit-learn

Add RadiusNeighborhood as a PairwiseDistancesReduction #3


Merged

Conversation

jjerphan
Owner
@jjerphan jjerphan commented Jul 5, 2021

Reference Issues/PRs

Built on top of scikit-learn#20254

What does this implement/fix? Explain your changes.

Add a new reduction for querying neighbors within a given radius (similar to RadiusNeighborsMixin._radius_neighbors_reduce_func), accumulating results in nested std::vector buffers; see this snippet.
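To make the shape of this reduction concrete, here is a minimal NumPy sketch (names are illustrative, not the PR's actual API) of a radius-neighbors query: for each query vector, collect the indices and distances of all training vectors lying within `radius`. The results are ragged (each query may have a different number of neighbors), which is why the Cython implementation uses growable nested vectors rather than a rectangular array.

```python
import numpy as np

def radius_neighbors(X, Y, radius):
    """For each row of Y, return distances and indices of rows of X within `radius`."""
    neigh_distances = []
    neigh_indices = []
    for y in Y:
        # Euclidean distances from this query to every training vector
        dist = np.sqrt(((X - y) ** 2).sum(axis=1))
        mask = dist <= radius
        neigh_indices.append(np.flatnonzero(mask))
        neigh_distances.append(dist[mask])
    return neigh_distances, neigh_indices

rng = np.random.RandomState(0)
X = rng.rand(100, 3)  # training vectors
Y = rng.rand(5, 3)    # query vectors
dists, indices = radius_neighbors(X, Y, radius=0.5)
```

Each entry of `dists`/`indices` is a separate 1-D array, mirroring the per-query `std::vector` buffers in the Cython code.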

Any other comments?

jjerphan added 8 commits July 5, 2021 11:29
Casts are necessary because the APIs exposed
via scipy.linalg.cython_blas do not reflect
the const qualifier on arguments
This creates a double free, as the buffers
are tied to vectors attached to the instance.

When the RadiusNeighborhood instance gets
deleted, so do the buffers, and the content
of the returned numpy arrays makes no sense.

It also crashes when one deallocates the numpy
arrays.

The final buffers' lifetime must not be tied
to the instance's vectors.
To solve the problem given above, we
dynamically allocate vectors which won't be
freed themselves; their buffers eventually
will be, as their ownership gets transferred
to numpy arrays.
This way, buffers are recreated and not freed.
@ogrisel
ogrisel commented Jul 5, 2021

Could you check that this does not cause a memory leak by calling RadiusNeighbors' query method in a loop, discarding the results each time, and monitoring psutil.Process().memory_info().rss every 100 iterations of the loop, for instance?

Try a big enough test set with k large enough such that the resulting neighbors data is around 1MB in memory for instance. The training set does not need to be large.

@ogrisel ogrisel left a comment


Exciting! Here is some feedback:

@jjerphan
Owner Author
jjerphan commented Jul 5, 2021

Thanks for the review! Note that as of now, it's still WIP, with some not-so-nice tactical shortcuts taken to see if it can be made functional; it can then be cleaned up without too much difficulty, should it be made functional.

jjerphan and others added 18 commits July 7, 2021 18:03
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
It gets automatically called then.
This reworks the synchronisation out of the first
"with nogil, parallel" context, allowing more
flexibility.

For instance, this allows removing the lock previously
needed for ArgKmin's synchronisation, as we now
parallelize on samples.

This adds a simple merge of vectors which should not
work.
Use std::move to copy data from temporary vector
to the main ones.
Co-authored-by: Jérémie du Boisberranger
<jeremiedbb@users.noreply.github.com>
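The merge step these commits describe can be sketched in plain Python (names and chunking are illustrative assumptions, not the PR's API): in the parallel_on_Y strategy, each worker scans one chunk of the training set and records, for every query, the training indices within `radius` in its own temporary lists; the temporaries are then merged into the main per-query lists. Where this sketch uses `list.extend`, the Cython code moves the thread-local `std::vector`s with `std::move` to avoid copying.

```python
import numpy as np

def radius_query_parallel_on_Y(X, Y, radius, n_chunks=4):
    """Toy sequential model of the parallel_on_Y strategy's merge step."""
    n_queries = X.shape[0]
    main = [[] for _ in range(n_queries)]  # final per-query neighbor lists
    for Y_chunk in np.array_split(np.arange(Y.shape[0]), n_chunks):
        # thread-local temporaries: one list per query
        tmp = [[] for _ in range(n_queries)]
        for j in Y_chunk:
            dist = np.sqrt(((X - Y[j]) ** 2).sum(axis=1))
            for i in np.flatnonzero(dist <= radius):
                tmp[i].append(int(j))
        # merge step: fold the chunk's results into the main lists
        # (std::move in the Cython implementation, a copy here)
        for i in range(n_queries):
            main[i].extend(tmp[i])
    return main

rng = np.random.RandomState(0)
X = rng.rand(5, 3)    # query vectors
Y = rng.rand(50, 3)   # training vectors
res = radius_query_parallel_on_Y(X, Y, radius=0.5)
```

Because chunks are processed in order, the merged indices come out sorted per query; the real implementation sorts separately when `sort_results=True`.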
Introduce PairwiseDistancesReduction.is_usable_for, encapsulating
the test used for branching logic.

Also remove unused code made obsolete by the new implementation.
@jjerphan
Owner Author
jjerphan commented Jul 12, 2021

Edit: cd0634e introduced proper memory management.

Quick report regarding memory usage:

As of b8fe6e1, it looks like some data isn't freed after calls to RadiusNeighborhood.compute alone.
The size of this data is proportional to the number of query vectors.

[Plots of memory usage for n_calls=100, with radius ∈ {0, 10000} and return_distance ∈ {False, True}]

It looks like the numpy arrays do not free their buffers; one can verify this by applying this patch:

diff --git a/sklearn/metrics/_parallel_reductions.pyx b/sklearn/metrics/_parallel_reductions.pyx
index 57f3146664..95b60b2cd6 100644
--- a/sklearn/metrics/_parallel_reductions.pyx
+++ b/sklearn/metrics/_parallel_reductions.pyx
@@ -1052,3 +1052,6 @@ cdef class RadiusNeighborhood(PairwiseDistancesReduction):
         # by a numpy array then.
         del self.neigh_distances
         return _coerce_vectors_to_np_nd_arrays(self.neigh_indices)
+
+    def indices(self):
+        return np.asarray(deref(self.neigh_indices))
\ No newline at end of file

and observing that RadiusNeighborhood.indices still returns valid results after the numpy arrays have been deleted:

In [1]: %run script_given_bellow.py
In [2]: dists, indices = rn.compute(return_distance=return_distance, sort_results=True, strategy="parallel_
   ...: on_Y")

In [3]: indices
Out[3]: 
array([array([4, 2, 5, 3, 0, 1, 6, 7, 8, 9]),
       array([6, 2, 5, 9, 1, 4, 3, 0, 8, 7]),
       array([4, 0, 9, 1, 3, 2, 7, 5, 6, 8]), ...,
       array([6, 8, 0, 3, 7, 9, 4, 1, 2, 5]),
       array([6, 1, 4, 2, 0, 7, 8, 3, 5, 9]),
       array([8, 9, 1, 6, 5, 3, 4, 7, 0, 2])], dtype=object)

In [4]: del indices

In [5]: rn.indices()
Out[5]: 
array([[4, 2, 5, ..., 7, 8, 9],
       [6, 2, 5, ..., 0, 8, 7],
       [4, 0, 9, ..., 5, 6, 8],
       ...,
       [6, 8, 0, ..., 1, 2, 5],
       [6, 1, 4, ..., 3, 5, 9],
       [8, 9, 1, ..., 7, 0, 2]])

numpy's memory allocator is probably different, and proper handling is needed, as suggested by this response.

Script
from sklearn.metrics._parallel_reductions import RadiusNeighborhood
import numpy as np
import psutil
import pandas as pd
import gc
import matplotlib.pyplot as plt

if __name__ == "__main__":
    memory_info = []
    mem_info = psutil.Process().memory_full_info()
    memory_info.append(mem_info)
    rng = np.random.RandomState(0)
    X = rng.rand(10000, 100)
    Y = rng.rand(10, 100)
    mem_info = psutil.Process().memory_full_info()
    memory_info.append(mem_info)

    n_calls = 100
    radius = 10000
    return_distance = True

    for i in range(n_calls):
        rn = RadiusNeighborhood.get_for(X, Y, radius=radius)
        # Results are discarded
        rn.compute(return_distance=return_distance, sort_results=True, strategy="parallel_on_Y")
        mem_info = psutil.Process().memory_full_info()
        print(f"Iteration {i} / {n_calls}", mem_info)
        memory_info.append(mem_info)

    gc.collect()
    mem_info = psutil.Process().memory_full_info()
    memory_info.append(mem_info)

    title = "Memory usage on calls to RadiusNeighborhood.compute"
    title += f"\n X.shape={X.shape} -- Y.shape={Y.shape} " \
             f"radius={radius} -- return_distance={return_distance}"

    df = pd.DataFrame(data=memory_info)
    df[["data", "rss", "uss"]].plot(title=title,
                                    xlabel="Number of calls",
                                    ylabel="Size of segment (in bytes)",
                                    figsize=(16, 9), fontsize=10)
    plt.savefig(f"/tmp/n_calls={n_calls}_radius={radius}_return_distance={return_distance}.png")
    plt.show()

jjerphan added 3 commits July 12, 2021 16:17
StdVectorSentinel makes proper life-cycle
management of std::vectors' buffers possible.

Duplication seems needed, as fused types
can't be used as attributes.

It's possible to obtain a missing symbol
(`_ZSt28__throw_bad_array_new_lengthv`) at
runtime.

This is unrelated to the implementation here,
and there are issues reporting the problem, e.g.:
cython/cython#4218.

A temporary workaround:
stan-dev/pystan#294 (comment)
Also removes _finalise_compute.
@jjerphan jjerphan force-pushed the pairwise_aggregation_cython-radius branch from f69536c to 7c713a1 on July 12, 2021 14:30
@jjerphan jjerphan merged commit 7ea6daa into pairwise_aggregation_cython Jul 13, 2021
@jjerphan jjerphan deleted the pairwise_aggregation_cython-radius branch February 23, 2022 09:06