Add RadiusNeighborhood as a PairwiseDistancesReduction by jjerphan · Pull Request #3 · jjerphan/scikit-learn

Add RadiusNeighborhood as a PairwiseDistancesReduction #3


Merged

Conversation

jjerphan
Owner
@jjerphan jjerphan commented Jul 5, 2021

Reference Issues/PRs

Built on top of scikit-learn#20254

What does this implement/fix? Explain your changes.

Add a new reduction for querying neighbors within a given radius (similar to RadiusNeighborsMixin._radius_neighbors_reduce_func), accumulating results in nested std::vector buffers; see this snippet.
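To make the shape of this reduction concrete, here is a minimal NumPy sketch (names are illustrative, not the PR's actual API) of a radius-neighbors query: for each query vector, collect the indices and distances of all training vectors lying within `radius`. The results are ragged (each query may have a different number of neighbors), which is why the Cython implementation uses growable nested vectors rather than a rectangular array.

```python
import numpy as np

def radius_neighbors(X, Y, radius):
    """For each row of Y, return distances and indices of rows of X within `radius`."""
    neigh_distances = []
    neigh_indices = []
    for y in Y:
        # Euclidean distances from this query to every training vector
        dist = np.sqrt(((X - y) ** 2).sum(axis=1))
        mask = dist <= radius
        neigh_indices.append(np.flatnonzero(mask))
        neigh_distances.append(dist[mask])
    return neigh_distances, neigh_indices

rng = np.random.RandomState(0)
X = rng.rand(100, 3)  # training vectors
Y = rng.rand(5, 3)    # query vectors
dists, indices = radius_neighbors(X, Y, radius=0.5)
```

Each entry of `dists`/`indices` is a separate 1-D array, mirroring the per-query `std::vector` buffers in the Cython code.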

Any other comments?

jjerphan added 8 commits July 5, 2021 11:29
Casts are necessary because the APIs exposed
via scipy.linalg.cython_blas do not reflect
the const qualifier on arguments
This creates a double free, as the buffers
are tied to vectors attached to the instance.

When the RadiusNeighborhood instance gets
deleted, so do the buffers, and the content
of the returned numpy arrays makes no sense.

It also crashes when one deallocates the numpy
arrays.

The final buffers' lifetime must not be tied
to the instance's vectors.
To solve the problem given above, we
dynamically allocate vectors which won't be
freed themselves; their buffers eventually
will be, as their ownership gets transferred
to numpy arrays.
This way, buffers are recreated and not freed.
@ogrisel
ogrisel commented Jul 5, 2021

Could you check that this does not cause a memory leak by calling RadiusNeighbors' query method in a loop, discarding the results each time, and monitoring psutil.Process().memory_info().rss every 100 iterations of the loop, for instance?

Try a big enough test set with k large enough such that the resulting neighbors data is around 1MB in memory for instance. The training set does not need to be large.

@ogrisel ogrisel left a comment


Exciting! Here is some feedback:

@jjerphan
Owner Author
jjerphan commented Jul 5, 2021

Thanks for the review! Note that as of now, it's still WIP, with some not-so-nice tactical shortcuts taken to see if it can be made functional; it can then be cleaned up without too much difficulty, should it be made functional.

jjerphan and others added 18 commits July 7, 2021 18:03
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
It gets automatically called then.
This reworks the synchronisation out of the first
"with nogil, parallel" context, allowing more
flexibility.

For instance, this allows removing the lock previously
needed for ArgKmin's synchronisation, as we now
parallelize on samples.

This adds a simple merge of vectors which should not
work.
Use std::move to copy data from temporary vector
to the main ones.
Co-authored-by: Jérémie du Boisberranger
<jeremiedbb@users.noreply.github.com>
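The merge step these commits describe can be sketched in plain Python (names and chunking are illustrative assumptions, not the PR's API): in the parallel_on_Y strategy, each worker scans one chunk of the training set and records, for every query, the training indices within `radius` in its own temporary lists; the temporaries are then merged into the main per-query lists. Where this sketch uses `list.extend`, the Cython code moves the thread-local `std::vector`s with `std::move` to avoid copying.

```python
import numpy as np

def radius_query_parallel_on_Y(X, Y, radius, n_chunks=4):
    """Toy sequential model of the parallel_on_Y strategy's merge step."""
    n_queries = X.shape[0]
    main = [[] for _ in range(n_queries)]  # final per-query neighbor lists
    for Y_chunk in np.array_split(np.arange(Y.shape[0]), n_chunks):
        # thread-local temporaries: one list per query
        tmp = [[] for _ in range(n_queries)]
        for j in Y_chunk:
            dist = np.sqrt(((X - Y[j]) ** 2).sum(axis=1))
            for i in np.flatnonzero(dist <= radius):
                tmp[i].append(int(j))
        # merge step: fold the chunk's results into the main lists
        # (std::move in the Cython implementation, a copy here)
        for i in range(n_queries):
            main[i].extend(tmp[i])
    return main

rng = np.random.RandomState(0)
X = rng.rand(5, 3)    # query vectors
Y = rng.rand(50, 3)   # training vectors
res = radius_query_parallel_on_Y(X, Y, radius=0.5)
```

Because chunks are processed in order, the merged indices come out sorted per query; the real implementation sorts separately when `sort_results=True`.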
Introduce PairwiseDistancesReduction.is_usable_for, encapsulating
the test used for branching logic.

Also remove unused code made obsolete by the new implementation.
@jjerphan
Owner Author
jjerphan commented Jul 12, 2021

Edit: cd0634e introduced proper memory management.

Quick report regarding memory usage:

As of b8fe6e1, it looks like some data isn't freed after calls to RadiusNeighborhood.compute alone.
The size of this data is proportional to the number of query vectors.

[Plots of memory usage for n_calls=100, with radius ∈ {0, 10000} and return_distance ∈ {False, True}]

It looks like the numpy arrays do not free their buffers; one can verify this by applying this patch:

diff --git a/sklearn/metrics/_parallel_reductions.pyx b/sklearn/metrics/_parallel_reductions.pyx
index 57f3146664..95b60b2cd6 100644
--- a/sklearn/metrics/_parallel_reductions.pyx
+++ b/sklearn/metrics/_parallel_reductions.pyx
@@ -1052,3 +1052,6 @@ cdef class RadiusNeighborhood(PairwiseDistancesReduction):
         # by a numpy array then.
         del self.neigh_distances
         return _coerce_vectors_to_np_nd_arrays(self.neigh_indices)
+
+    def indices(self):
+        return np.asarray(deref(self.neigh_indices))
\ No newline at end of file

and observing that RadiusNeighborhood.indices still returns valid results after the numpy arrays have been deleted:

In [1]: %run script_given_bellow.py
In [2]: dists, indices = rn.compute(return_distance=return_distance, sort_results=True, strategy="parallel_
   ...: on_Y")

In [3]: indices
Out[3]: 
array([array([4, 2, 5, 3, 0, 1, 6, 7, 8, 9]),
       array([6, 2, 5, 9, 1, 4, 3, 0, 8, 7]),
       array([4, 0, 9, 1, 3, 2, 7, 5, 6, 8]), ...,
       array([6, 8, 0, 3, 7, 9, 4, 1, 2, 5]),
       array([6, 1, 4, 2, 0, 7, 8, 3, 5, 9]),
       array([8, 9, 1, 6, 5, 3, 4, 7, 0, 2])], dtype=object)

In [4]: del indices

In [5]: rn.indices()
Out[5]: 
array([[4, 2, 5, ..., 7, 8, 9],
       [6, 2, 5, ..., 0, 8, 7],
       [4, 0, 9, ..., 5, 6, 8],
       ...,
       [6, 8, 0, ..., 1, 2, 5],
       [6, 1, 4, ..., 3, 5, 9],
       [8, 9, 1, ..., 7, 0, 2]])

numpy's memory allocator is probably different, and proper handling is needed, as suggested by this response.

Script
from sklearn.metrics._parallel_reductions import RadiusNeighborhood
import numpy as np
import psutil
import pandas as pd
import gc
import matplotlib.pyplot as plt

if __name__ == "__main__":
    memory_info = []
    mem_info = psutil.Process().memory_full_info()
    memory_info.append(mem_info)
    rng = np.random.RandomState(0)
    X = rng.rand(10000, 100)
    Y = rng.rand(10, 100)
    mem_info = psutil.Process().memory_full_info()
    memory_info.append(mem_info)

    n_calls = 100
    radius = 10000
    return_distance = True

    for i in range(n_calls):
        rn = RadiusNeighborhood.get_for(X, Y, radius=radius)
        # Results are discarded
        rn.compute(return_distance=return_distance, sort_results=True, strategy="parallel_on_Y")
        mem_info = psutil.Process().memory_full_info()
        print(f"Iteration {i} / {n_calls}", mem_info)
        memory_info.append(mem_info)

    gc.collect()
    mem_info = psutil.Process().memory_full_info()
    memory_info.append(mem_info)

    title = "Memory usage on calls to RadiusNeighborhood.compute"
    title += f"\n X.shape={X.shape} -- Y.shape={Y.shape} " \
             f"radius={radius} -- return_distance={return_distance}"

    df = pd.DataFrame(data=memory_info)
    df[["data", "rss", "uss"]].plot(title=title,
                                    xlabel="Number of calls",
                                    ylabel="Size of segment (in bytes)",
                                    figsize=(16, 9), fontsize=10)
    plt.savefig(f"/tmp/n_calls={n_calls}_radius={radius}_return_distance={return_distance}.png")
    plt.show()

jjerphan added 3 commits July 12, 2021 16:17
StdVectorSentinel makes proper life-cycle
management of std::vectors' buffers possible.

Duplication seems needed, as fused types
can't be used as attributes.

It's possible to obtain a missing symbol
(`_ZSt28__throw_bad_array_new_lengthv`) at
runtime.

This is unrelated to the implementation here,
and there are issues reporting the problem, e.g.:
cython/cython#4218.

A temporary workaround:
stan-dev/pystan#294 (comment)
Also removes _finalise_compute.
@jjerphan jjerphan force-pushed the pairwise_aggregation_cython-radius branch from f69536c to 7c713a1 on July 12, 2021 14:30
@jjerphan jjerphan merged commit 7ea6daa into pairwise_aggregation_cython Jul 13, 2021
@jjerphan jjerphan deleted the pairwise_aggregation_cython-radius branch February 23, 2022 09:06