[WIP] Refactor fibonacci tree, use in Ball Tree #11541

vyasr · 2018-07-15T22:27:14Z

Address Issue #597, which requests a refactoring of the existing fibonacci tree used for Dijkstra's algorithm in graph_nearest_neighbor.py in order to accelerate the Ball Tree in sklearn.neighbors.

This pull request is currently a work in progress. The fibonacci tree has been extracted, but the ball tree code still needs to be modified.

jjerphan · 2021-10-06T14:45:37Z

Hi @vyasr,

Do you still have some time to work on this PR?

vyasr · 2021-10-12T03:05:28Z

This PR almost certainly needs to be restarted from scratch given how long it's been, but I'm interested and can certainly take a look.

jjerphan · 2021-10-18T07:36:14Z

This PR almost certainly needs to be restarted from scratch given how long it's been, but I'm interested and can certainly take a look.

Please, do if you are.

Personally, I am interested in seeing what Fibonacci Heaps have to offer for various implementations.

jjerphan · 2022-08-01T12:30:35Z

Thank for exploring this, @vyasr.

In retrospective, I do not think Fibonacci heaps are relevant for scikit-learn.

Fibonacci heaps are theoretically speaking interesting construction over simple binary heaps: some operations are $\mathcal{O}(1)$ instead of $\mathcal{O}(\log n)$ (especially insertion, which is the only operation that is being used for KD-Tree and Ball-Tree).

But technically, working with nodes which aren't contiguous in memory (the case of Fibonacci heaps) is much more costly and inefficient than working with a single structure of contiguous arrays which have been allocated once (the case of the current binary heaps used in scikit-learn originally implemented by @jakevdp): this implementation of binary heaps is optimal, much easier to understand and maintain as it is a single fused-typed function. Moreover, graph algorithms previously implemented in sklearn/utils/graph_shortest_path.pyx (which could eventually have used this Fibonacci Heaps) have been removed in #20531.

Hence, I am closing this PR. Still, you might be interested in scipy/scipy#16696.

vyasr · 2022-08-01T20:01:49Z

@jjerphan thanks for following up. Yes I think you are right. I briefly started working on this after our last messages, and when I saw that the graph algorithms had been removed I wondered the same. IIRC there's also some issues because the ball tree implementation uses two different binary heaps for two different purposes (I believe a fixed-size one for tracking nearest neighbors and a variable-size one for performance BFS/DFS) and the needs and performance characteristics of each are different.

If there's future interest in this speedup I suspect that looking at pairing heaps for each of these uses separately might make sense. Pairing heaps are likely a better choice in practice than Fibonacci heaps given the operations in use and our interest in real-world performance, not just better scaling if it comes at the cost of larger prefactors. How they'll compare to a binary heap is less of a given, but if all we're doing is inserting then we may not see much performance gain by switching. On the plus side, the pairing heap implementation is far simpler than a Fibonacci heap (not quite as simple as a binary heap, but quite close) so it wouldn't be so painful to test out.

jjerphan · 2022-08-01T20:31:11Z

Thanks for your comment, @vyasr.

IIRC there's also some issues because the ball tree implementation uses two different binary heaps for two different purposes (I believe a fixed-size one for tracking nearest neighbors and a variable-size one for performance BFS/DFS) and the needs and performance characteristics of each are different.

That's exact. To add precision, both KDTree and BallTreeuse:

NeighborsHeaps, fixed-sized max-heaps (implemented using Structures of Arrays) for tracking the $k$-neighbors:

scikit-learn/sklearn/neighbors/_binary_tree.pxi

Lines 499 to 518 in 6e45a71

    
           cdef class NeighborsHeap: 
        
               """A max-heap structure to keep track of distances/indices of neighbors 
        
               This implements an efficient pre-allocated set of fixed-size heaps 
        
               for chasing neighbors, holding both an index and a distance. 
        
               When any row of the heap is full, adding an additional point will push 
        
               the furthest point off the heap. 
        
               Parameters 
        
               ---------- 
        
               n_pts : int 
        
                   the number of heaps to use 
        
               n_nbrs : int 
        
                   the size of each heap. 
        
               """ 
        
               cdef cnp.ndarray distances_arr 
        
               cdef cnp.ndarray indices_arr 
        
               cdef DTYPE_t[:, ::1] distances 
        
               cdef ITYPE_t[:, ::1] indices

NodeHeap, a custom heap for exploring the tree nodes for the BFS strategy (implemented using a Cython extension type with gil-bound methods and an Array of Structures)

scikit-learn/sklearn/neighbors/_binary_tree.pxi

Lines 633 to 648 in 6e45a71

    
           cdef class NodeHeap: 
        
               """NodeHeap 
        
               This is a min-heap implementation for keeping track of nodes 
        
               during a breadth-first search.  Unlike the NeighborsHeap above, 
        
               the NodeHeap does not have a fixed size and must be able to grow 
        
               as elements are added. 
        
               Internally, the data is stored in a simple binary heap which meets 
        
               the min heap condition: 
        
                   heap[i].val < min(heap[2 * i + 1].val, heap[2 * i + 2].val) 
        
               """ 
        
               cdef cnp.ndarray data_arr 
        
               cdef NodeHeapData_t[::1] data 
        
               cdef ITYPE_t n

I think, @jakevdp originally thought of improving the NeighborsHeaps (i.e. the max-heaps, used in both DFS and BFS)

If there's future interest in this speedup I suspect that looking at pairing heaps for each of these uses separately might make sense. Pairing heaps are likely a better choice in practice than Fibonacci heaps given the operations in use and our interest in real-world performance, not just better scaling if it comes at the cost of larger prefactors. How they'll compare to a binary heap is less of a given, but if all we're doing is inserting then we may not see much performance gain by switching. On the plus side, the pairing heap implementation is far simpler than a Fibonacci heap (not quite as simple as a binary heap, but quite close) so it wouldn't be so painful to test out.

That's exact. If you have time and are interested, feel free to have a look — I alternatively can.

Small parenthesis: the NodeHeap actually can be improved (or even replaced) using adapted (C++) data structures. If we manage to, we could make the method, and the whole BFS strategy nogil.

Refactor fibonacci tree

7782a91

vyasr mentioned this pull request Jul 15, 2018

Fibonacci Heaps for Ball Tree #597

Closed

TomDLT changed the title ~~Refactor fibonacci tree, use in Ball Tree~~ [WIP] Refactor fibonacci tree, use in Ball Tree Jul 16, 2018

TomDLT added the Performance label Jul 16, 2018

amueller added the Stalled label Aug 5, 2019

github-actions bot added the module:utils label Mar 2, 2020

Base automatically changed from master to main January 22, 2021 10:50

thomasjpfan added the cython label Apr 13, 2021

jjerphan closed this Aug 1, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

[WIP] Refactor fibonacci tree, use in Ball Tree #11541

[WIP] Refactor fibonacci tree, use in Ball Tree #11541

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

[WIP] Refactor fibonacci tree, use in Ball Tree #11541

[WIP] Refactor fibonacci tree, use in Ball Tree #11541

Uh oh!

Conversation

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!