[MRG] Neighbors refactor by larsmans · Pull Request #4217 · scikit-learn/scikit-learn

[MRG] Neighbors refactor #4217


Closed · larsmans wants to merge 10 commits into scikit-learn:master from larsmans:neighbors-refactor

Conversation

@larsmans (Member) commented Feb 7, 2015

Big refactor of the nearest neighbors/space partitioning code.

Currently, KD-tree and ball tree include (textually) the source of their base class (binary_tree.pxi) so that the code in that module gets compiled twice. The code in this PR merges the modules and compiles the base class once.

Also: removal of workarounds for old Cython and NumPy versions, and tiny optimizations (less Python C-API calling).

Preliminary to optimizing radius neighbor queries as promised in #4157.
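For readers unfamiliar with the include pattern, here is a minimal sketch of the before/after layout (file contents abbreviated and hypothetical; only the include mechanism is taken from the PR description):

    # --- Old layout: the shared implementation is textually included,
    # --- so the C compiler sees and compiles it once per tree module.

    # ball_tree.pyx
    include "binary_tree.pxi"   # entire base implementation pasted here

    # kd_tree.pyx
    include "binary_tree.pxi"   # ...and pasted (compiled) a second time

    # --- Merged layout: one module, with the shared code compiled once.
    cdef class BinaryTree:
        pass                    # shared build/query machinery

    cdef class BallTree(BinaryTree):
        pass                    # only the ball-tree-specific parts

    cdef class KDTree(BinaryTree):
        pass                    # only the KD-tree-specific parts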

@larsmans force-pushed the neighbors-refactor branch 3 times, most recently from 1b5e83e to 7fb3b60 on February 7, 2015
Saves ~40kLOC of generated C code.
@larsmans (Member, Author) commented Feb 7, 2015

There: -42,563 lines of code.

@coveralls commented

Coverage Status: coverage decreased (-0.0%) to 95.02% when pulling f04ecb4 on larsmans:neighbors-refactor into 4629366 on scikit-learn:master.


@larsmans changed the title from [WIP] Neighbors refactor to [MRG] Neighbors refactor on Feb 7, 2015
@larsmans (Member, Author) commented Feb 7, 2015

Tests pass. @jakevdp @MechCoder @jnothman?

@@ -1,5 +1,6 @@
#!python
cimport numpy as np
import numpy as np
Member review comment on this diff:
Is a Python import valid in a .pxd file? I thought that it doesn't actually do anything.
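For context, a sketch of the distinction (not this diff's file): cimport pulls in compile-time declarations and emits no runtime import, while a plain import binds the Python module object at runtime.

    cimport numpy as cnp    # compile-time: cnp.ndarray, cnp.float64_t, ...
    import numpy as np      # runtime: np.zeros, np.asarray, ...

    cnp.import_array()      # initialize numpy's C API before buffer use

    def zeros_buffer(int n):
        # The Python-level np is needed to create the array; the
        # C-level cnp is needed to type it.
        cdef cnp.ndarray[cnp.float64_t, ndim=1] out = np.zeros(n)
        return out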

@jakevdp (Member) commented Feb 8, 2015

I have one concern: when I first wrote this code, I did the literal inclusion junk because I found that due to the virtual table architecture of Cython classes, the compiled Cython couldn't inline methods from derived classes, and this led to a performance penalty.

I see that in the refactor, you've undone this and made the utilities which are specific to BallTree/KDTree into class methods (this is actually how I first implemented the code!). That's definitely better OO design, and it would be the way to go in C++ or Python, but I worry about the performance implications in Cython. Can newer versions of Cython inline methods from virtual tables now? If not, I fear that this is going to be slower than the old version – and perhaps a lot slower in some cases.

I'd like to see some comprehensive before/after benchmarks before this is merged. You may be able to make use of some of the scripts in https://github.com/jakevdp/BinaryTree – that's the repository where I experimented for several months before opening the scikit-learn PR with this functionality. Somewhere in that history there's some code that looks a lot more like this refactor 😄
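For reference, a self-contained sketch of the dispatch pattern being discussed (names hypothetical; not the PR's code):

    cdef class Base:
        cdef double rdist(self, double a, double b):
            # Reached through the class's vtable; when subclasses may
            # override it, the C compiler generally cannot inline this.
            return (a - b) * (a - b)

    cdef class Derived(Base):
        cdef double rdist(self, double a, double b):
            return (a - b) * (a - b)

    cdef inline double rdist_flat(double a, double b):
        # Module-level cdef function: a direct, easily inlined C call.
        return (a - b) * (a - b)

    def bench(int n):
        cdef Base t = Derived()
        cdef double acc = 0
        cdef int i
        for i in range(n):
            acc += t.rdist(i, i + 1)      # virtual dispatch per call
            acc += rdist_flat(i, i + 1)   # inlinable direct call
        return acc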

@jnothman (Member) commented Feb 8, 2015

Given #4213, can we please rename dist_metrics -> _dist_metrics?

@larsmans (Member, Author) commented Feb 8, 2015

@jakevdp Here are the graphs from benchmarks/bench_plot_neighbors.py in the scikit-learn repo. Before:

[benchmark plots: digits-before, dense-before]

After:

[benchmark plots: digits-after, dense-after]

I don't see massive differences. Note that I moved all of the recursive algorithms to the top level (they were cdef methods) precisely because of the virtual method call overhead.
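A minimal illustration of that move (hypothetical names): the recursion becomes a module-level cdef function that takes the tree as an argument, so each recursive step is a direct C call rather than a vtable lookup.

    cdef class Tree:
        cdef int depth
        def __init__(self, int depth):
            self.depth = depth

    cdef long _count(Tree tree, int level):
        # Direct C recursion; no virtual dispatch on each call.
        if level >= tree.depth:
            return 1
        return 1 + _count(tree, level + 1) + _count(tree, level + 1)

    def count_nodes(Tree tree):
        return _count(tree, 0)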

@jakevdp (Member) commented Feb 8, 2015

Great! Thanks for taking a look at that. It looks like the two "after" plots are duplicates: do you have the times for the dense data "after"?

@larsmans (Member, Author) commented Feb 8, 2015

@jakevdp My bad; I've uploaded the right picture now. Performance differences are still very hard to spot.

@jakevdp (Member) commented Feb 8, 2015

Performance differences are still very hard to spot.

Indeed! It looks really good. Definitely a much cleaner implementation.

@jakevdp (Member) commented Feb 8, 2015

I just noticed that the benchmarks do nothing with the dual tree implementation. That was one of the things I recall being very slow with the derived class pattern. Would you mind checking that as well?
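Something along these lines would exercise the dual-tree path specifically (sizes arbitrary); KDTree.query takes a dualtree flag:

    import numpy as np
    from timeit import timeit
    from sklearn.neighbors import KDTree

    rng = np.random.RandomState(0)
    X = rng.rand(20000, 3)
    tree = KDTree(X)

    # Compare single-tree and dual-tree traversal on the same queries.
    for dual in (False, True):
        t = timeit(lambda: tree.query(X, k=5, dualtree=dual), number=3)
        print("dualtree=%s: %.2f s per run" % (dual, t / 3))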

@amueller (Member) commented:

Also it would be cool if we could have a more direct comparison. Comparing the current plots is really hard. Some pandas magic should do the trick to combine the two datasets ;)
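For example, something along these lines (file and column names made up):

    import pandas as pd

    # Assume each benchmark run dumps a tidy CSV with columns:
    # method, n_samples, seconds
    before = pd.read_csv("bench_before.csv")
    after = pd.read_csv("bench_after.csv")

    merged = before.merge(after, on=["method", "n_samples"],
                          suffixes=("_before", "_after"))
    merged["speedup"] = merged["seconds_before"] / merged["seconds_after"]
    print(merged.sort_values("speedup"))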

@MuellerSeb commented:
Any news on that?

@rth (Member) commented Apr 7, 2020

A small part of this PR was extracted in #15097, but generally, continuing this would require significant work. Now that scikit-learn supports precomputed sparse distances (including ones produced by other, faster ANN libraries for high-dimensional (>100 features) datasets), there is a question of how much effort we should spend on this binary tree implementation.

Though of course, if someone is interested in contributing a PR to improve the situation, they would be very welcome.
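For reference, the precomputed sparse route looks roughly like this (here the sparse graph is built with scikit-learn itself, but an ANN library could produce it instead):

    import numpy as np
    from sklearn.neighbors import NearestNeighbors, kneighbors_graph

    rng = np.random.RandomState(0)
    X = rng.rand(1000, 128)              # high-dimensional data

    # Sparse distance matrix: each row stores distances to 10 neighbors.
    graph = kneighbors_graph(X, n_neighbors=10, mode="distance")

    nn = NearestNeighbors(n_neighbors=5, metric="precomputed")
    nn.fit(graph)
    dist, ind = nn.kneighbors(graph)     # reads the stored distances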

@jjerphan (Member) commented:
After trying to rebase this PR via #22745, reflecting on the changes that are left in the rebase (mostly refactoring private methods into private functions), and taking into account the reported benchmark results, I do not think it is worth integrating the proposed changes.

Here are, in my opinion, several reasons:

  • Firstly, as stated by @larsmans (the author of this PR) in [MRG] Neighbors refactor #4217 (comment), the performance improvements are relatively small, and it is not clear that they hold consistently.
  • Secondly, this PR proposes "unusual" modifications (e.g. reducing method dispatch overhead by converting methods to functions). IMO, overhead due to the language should be addressed by the language, not by users working around its limitations (especially since those limitations may get fixed in the future). In this regard, performance improvements are better obtained by other means, namely by:
    • using appropriate data structures and algorithms (the ones used here look appropriate for these implementations)
    • using reference containers such as the ones from libcpp instead of reimplementing them in Cython (see the sketch after this list)
    • releasing the GIL further upstream (if possible)
    • changing the execution pattern, possibly using lower-level parallelism with OpenMP via Cython or task-scheduling libraries (if possible)
  • Thirdly, refactoring this piece of code, which has not moved in years, might come with unexpected or weird scenarios that maintainers want to avoid as much as possible. Such scenarios include deserialization issues, performance regressions due to changing an algorithm that has edge cases, and breaking other projects that have been built against some of scikit-learn's private Cython API.
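To make the container and parallelism bullets concrete, here is a toy sketch (not the tree code) of a libcpp container together with a GIL-releasing OpenMP loop:

    # distutils: language = c++
    # (compile with OpenMP flags for prange to actually parallelize)
    from cython.parallel cimport prange
    from libcpp.vector cimport vector

    def row_sums(double[:, ::1] X):
        cdef Py_ssize_t i, j, k, n = X.shape[0], m = X.shape[1]
        cdef vector[double] sums = vector[double](n)
        # The GIL is released for the whole loop; rows are handled by
        # OpenMP threads.
        for i in prange(n, nogil=True):
            for j in range(m):
                sums[i] += X[i, j]
        out = []
        for k in range(n):
            out.append(sums[k])
        return out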

Still, I think some ideas from this PR can be picked up. I am thinking of:

  • using inheritance, now that Cython supports polymorphism efficiently
  • typing variables in implementations and signatures

What do you think?

@jjerphan (Member) commented:
If there are no objections, I assume this can be closed.

Thank you for exploring these improvements, some of which have been integrated individually, @larsmans!

@jjerphan closed this Nov 16, 2022