8000 [WIP] Refactor fibonacci tree, use in Ball Tree by vyasr · Pull Request #11541 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

[WIP] Refactor fibonacci tree, use in Ball Tree #11541

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

vyasr
Copy link
@vyasr vyasr commented Jul 15, 2018

Address Issue #597, which requests a refactoring of the existing fibonacci tree used for Dijkstra's algorithm in graph_nearest_neighbor.py in order to accelerate the Ball Tree in sklearn.neighbors.

This pull request is currently a work in progress. The fibonacci tree has been extracted, but the ball tree code still needs to be modified.

@TomDLT TomDLT changed the title Refactor fibonacci tree, use in Ball Tree [WIP] Refactor fibonacci tree, use in Ball Tree Jul 16, 2018
Base automatically changed from master to main January 22, 2021 10:50
@jjerphan
Copy link
Member
jjerphan commented Oct 6, 2021

Hi @vyasr,

Do you still have some time to work on this PR?

@vyasr
Copy link
Author
vyasr commented Oct 12, 2021

This PR almost certainly needs to be restarted from scratch given how long it's been, but I'm interested and can certainly take a look.

@jjerphan
Copy link
Member

This PR almost certainly needs to be restarted from scratch given how long it's been, but I'm interested and can certainly take a look.

Please, do if you are.

Personally, I am interested in seeing what Fibonacci Heaps have to offer for various implementations.

@jjerphan
Copy link
Member
jjerphan commented Aug 1, 2022

Thank for exploring this, @vyasr.

In retrospective, I do not think Fibonacci heaps are relevant for scikit-learn.

Fibonacci heaps are theoretically speaking interesting construction over simple binary heaps: some operations are $\mathcal{O}(1)$ instead of $\mathcal{O}(\log n)$ (especially insertion, which is the only operation that is being used for KD-Tree and Ball-Tree).

But technically, working with nodes which aren't contiguous in memory (the case of Fibonacci heaps) is much more costly and inefficient than working with a single structure of contiguous arrays which have been allocated once (the case of the current binary heaps used in scikit-learn originally implemented by @jakevdp): this implementation of binary heaps is optimal, much easier to understand and maintain as it is a single fused-typed function. Moreover, graph algorithms previously implemented in sklearn/utils/graph_shortest_path.pyx (which could eventually have used this Fibonacci Heaps) have been removed in #20531.

Hence, I am closing this PR. Still, you might be interested in scipy/scipy#16696.

@jjerphan jjerphan closed this Aug 1, 2022
@vyasr
Copy link
Author
vyasr commented Aug 1, 2022

@jjerphan thanks for following up. Yes I think you are right. I briefly started working on this after our last messages, and when I saw that the graph algorithms had been removed I wondered the same. IIRC there's also some issues because the ball tree implementation uses two different binary heaps for two different purposes (I believe a fixed-size one for tracking nearest neighbors and a variable-size one for performance BFS/DFS) and the needs and performance characteristics of each are different.

If there's future interest in this speedup I suspect that looking at pairing heaps for each of these uses separately might make sense. Pairing heaps are likely a better choice in practice than Fibonacci heaps given the operations in use and our interest in real-world performance, not just better scaling if it comes at the cost of larger prefactors. How they'll compare to a binary heap is less of a given, but if all we're doing is inserting then we may not see much performance gain by switching. On the plus side, the pairing heap implementation is far simpler than a Fibonacci heap (not quite as simple as a binary heap, but quite close) so it wouldn't be so painful to test out.

@jjerphan
Copy link
Member
jjerphan commented Aug 1, 2022

Thanks for your comment, @vyasr.

IIRC there's also some issues because the ball tree implementation uses two different binary heaps for two different purposes (I believe a fixed-size one for tracking nearest neighbors and a variable-size one for performance BFS/DFS) and the needs and performance characteristics of each are different.

That's exact. To add precision, both KDTree and BallTreeuse:

  • NeighborsHeaps, fixed-sized max-heaps (implemented using Structures of Arrays) for tracking the $k$-neighbors:

    cdef class NeighborsHeap:
    """A max-heap structure to keep track of distances/indices of neighbors
    This implements an efficient pre-allocated set of fixed-size heaps
    for chasing neighbors, holding both an index and a distance.
    When any row of the heap is full, adding an additional point will push
    the furthest point off the heap.
    Parameters
    ----------
    n_pts : int
    the number of heaps to use
    n_nbrs : int
    the size of each heap.
    """
    cdef cnp.ndarray distances_arr
    cdef cnp.ndarray indices_arr
    cdef DTYPE_t[:, ::1] distances
    cdef ITYPE_t[:, ::1] indices

  • NodeHeap, a custom heap for exploring the tree nodes for the BFS strategy (implemented using a Cython extension type with gil-bound methods and an Array of Structures)

    cdef class NodeHeap:
    """NodeHeap
    This is a min-heap implementation for keeping track of nodes
    during a breadth-first search. Unlike the NeighborsHeap above,
    the NodeHeap does not have a fixed size and must be able to grow
    as elements are added.
    Internally, the data is stored in a simple binary heap which meets
    the min heap condition:
    heap[i].val < min(heap[2 * i + 1].val, heap[2 * i + 2].val)
    """
    cdef cnp.ndarray data_arr
    cdef NodeHeapData_t[::1] data
    cdef ITYPE_t n

I think, @jakevdp originally thought of improving the NeighborsHeaps (i.e. the max-heaps, used in both DFS and BFS)

If there's future interest in this speedup I suspect that looking at pairing heaps for each of these uses separately might make sense. Pairing heaps are likely a better choice in practice than Fibonacci heaps given the operations in use and our interest in real-world performance, not just better scaling if it comes at the cost of larger prefactors. How they'll compare to a binary heap is less of a given, but if all we're doing is inserting then we may not see much performance gain by switching. On the plus side, the pairing heap implementation is far simpler than a Fibonacci heap (not quite as simple as a binary heap, but quite close) so it wouldn't be so painful to test out.

That's exact. If you have time and are interested, feel free to have a look — I alternatively can.


Small parenthesis: the NodeHeap actually can be improved (or even replaced) using adapted (C++) data structures. If we manage to, we could make the method, and the whole BFS strategy nogil.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants
0