8000 ENH Add `float32` implementations for `BallTree` and `KDTree` by OmarManzoor · Pull Request #25914 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

ENH Add float32 implementations for BallTree and KDTree #25914

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 47 commits into from
Aug 1, 2023

Conversation

OmarManzoor
Copy link
Contributor
@OmarManzoor OmarManzoor commented Mar 20, 2023

Reference Issues/PRs

Fixes #15474

What does this implement/fix? Explain your changes.

  • Adds support for float32 in BallTree and KDTree using Tempita.
  • The BinaryTree is also modified to support float32 using Tempita.

ℹ️ Note that those new implementations for float32 are tested but aren't usable from the KDTree and BallTree public API because the implementations for float64 need to be decoupled from those API. This will be done in another PR.

Any other comments?

CC: @jjerphan

@jjerphan
Copy link
Member

Hi Omar, thank you for looking into it.

Adding support for float32 in binary trees is major work.

I think we might want to proceed in several steps using a feature branch.

This way, we could create PRs to:

  • port those implementations to float32
  • add tests on float32
  • add support for float32 in code path of estimators using BallTree and KDTree

What do people think?

@OmarManzoor
Copy link
Contributor Author

@jjerphan Could you kindly check the error and let me know what I am missing? It is an error related to
ImportError: dynamic module does not define module export function (PyInit__binary_tree)

@jjerphan
Copy link
Member

This stems from pxi sources having an association Cython extension registered when compiling scikit-learn.

Note I am not able to make the test fail on my end, but I think this issue can be fixed by performing this change:

diff --git i/setup.py w/setup.py
index f5522600f..a07506013 100755
--- i/setup.py
+++ w/setup.py
@@ -497,10 +497,10 @@ def configure_extension_modules():
                 # `source` is a Tempita file
                 tempita_sources.append(source)
 
-                # Do not include pxd files that were generated by tempita
-                if os.path.splitext(new_source_path)[-1] == ".pxd":
-                    continue
-                sources.append(new_source_path)
+                # Do not include header files (".pxd") and include files ("pxi")
+                # that were generated by Tempita.
+                if os.path.splitext(new_source_path)[-1] not in (".pxd", ".pxi"):
+                    sources.append(new_source_path)
 
             gen_from_templates(tempita_sources)

@OmarManzoor
Copy link
Contributor Author

Hi Omar, thank you for looking into it.

Adding support for float32 in binary trees is major work.

I think we might want to proceed in several steps using a feature branch.

This way, we could create PRs to:

  • port those implementations to float32
  • add tests on float32
  • add support for float32 in code path of estimators using BallTree and KDTree

What do people think?

How about we include the cython changes and basic tests for ball_tree and kd_tree in one PR and then a follow up PR can be used to treat any underlying estimators which make use of ball_tree or kd_tree?

@jjerphan
Copy link
Member
jjerphan commented Mar 21, 2023

We also can.

I think I would wait for other maintainers' point of view for the relevance of this PR.

Copy link
Member
@jjerphan jjerphan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are a few notes.

This increase the size of binary outputs, but I think this is acceptable.

Details

On 3043ea6 (main):

sklearn/neighbors/_partition_nodes.cpython-311-x86_64-linux-gnu.so: 29656 bytes
sklearn/neighbors/_kd_tree.cpython-311-x86_64-linux-gnu.so: 510104 bytes
sklearn/neighbors/_ball_tree.cpython-311-x86_64-linux-gnu.so: 515280 bytes

On 301afc9 (this branch):

sklearn/neighbors/_partition_nodes.cpython-311-x86_64-linux-gnu.so: 34192 bytes
sklearn/neighbors/_kd_tree.cpython-311-x86_64-linux-gnu.so: 709496 bytes
sklearn/neighbors/_ball_tree.cpython-311-x86_64-linux-gnu.so: 710944 bytes

Hence:

  • 15% binary size increase for _partition_nodes
  • 39% binary size increase for _kd_tree
  • 37% binary size increase for _ball_tree

Also, could you add a changelog?

@jjerphan jjerphan changed the title ENH add float32 support in BallTree and KDTree ENH BallTree and KDTree support for float32 Mar 22, 2023
jjerphan
jjerphan previously approved these changes Mar 22, 2023
Copy link
Member
@jjerphan jjerphan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM once the last comments are treated.

Thank you, @OmarManzoor!

@jjerphan
Copy link
Member

I think this PR is ready for a review, @OmarManzoor.

What do you think?

@OmarManzoor
Copy link
Contributor Author

I think this PR is ready for a review, @OmarManzoor.

What do you think?

Well I think there might be two items remaining here:

  • We still need to adjust the relevant return types like density and distance using Tempita so that the dtype is preserved as you have suggested in two of the tests.
  • It might be a good idea to have similar tests for test_quad_tree as well.

What do you think?

@OmarManzoor OmarManzoor marked this pull request as ready for review March 24, 2023 13:57
Copy link
Contributor
@Micky774 Micky774 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

API and implementation look good as discussed. Some minor notes to clean up tests a bit. Thanks!

@jjerphan
Copy link
Member
jjerphan commented Jul 5, 2023

@Micky774: what do you think of requiring a third reviewer's approval for this PR?

If you or another maintainer is OK with our two approvals, I let you merge this PR.

Edit: I added a note in the description of the PR to mention that those implementations aren't usable from the API yet.

@Micky774
Copy link
Contributor
Micky774 commented Jul 7, 2023

@Micky774: what do you think of requiring a third reviewer's approval for this PR?

Sure. It's a novel pattern, so extra consideration won't hurt (plus it's not a super urgent change).

General ping for anyone w/ bandwidth to review this (mainly concerning API) @scikit-learn/core-devs

For those that are interested, see this comment for a cleaner diff than with the github UI.

@Micky774
Copy link
Contributor

@ogrisel Wanted to go ahead and ping you in case you were interested in discussing the path we're taking here!

Copy link
Member
@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mixing the usage of pxi files and Tempita does make this part of the code base a more complicated. I do not see any great alternatives, so I am okay with moving forward with this PR.

@haiatn
Copy link
Contributor
haiatn commented Jul 29, 2023

I see the linter check failed, but I didn't understand why?

@jjerphan
Copy link
Member

There are sometimes sporadic failures due to the network or the CI infrastructure. I reran the job and it passed.

@OmarManzoor
Copy l 10000 ink
Contributor Author

@thomasjpfan Can this be merged?

Copy link
Member
@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR does not actually use the new KDTree32 or BallTree32 in non-testing code.

@Micky774 @OmarManzoor @jjerphan For transparency, can one of you open an issue regarding the roadmap for what work needs to be done after this PR is merged?

{{endfor}}


class BallTree(BallTree64):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The BallTree docs are missing:

Suggested change
class BallTree(BallTree64):
class BallTree(BallTree64):
__doc__ = CLASS_DOC.format(BinaryTree="BallTree")

(Looking over CLASS_DOC, it only requires BinaryTree)

{{endfor}}


class KDTree(KDTree64):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here regarding __doc__.

@thomasjpfan thomasjpfan merged commit 1bd831a into scikit-learn:main Aug 1, 2023
@Micky774
Copy link
Contributor
Micky774 commented Aug 1, 2023

Congratulations @OmarManzoor 🥳! Thank you for your work.

@jjerphan
Copy link
Member
jjerphan commented Aug 1, 2023

Congratulations, @OmarManzoor!

@OmarManzoor
Copy link
Contributor Author

@Micky774 @jjerphan Thank you for all the guidance!

@OmarManzoor OmarManzoor deleted the ball_and_kd_tree_float32 branch August 1, 2023 17:54
9Y5 pushed a commit to 9Y5/scikit-learn that referenced this pull request Aug 2, 2023
…-learn#25914)

Co-authored-by: Omar Salman <omar.salman@arbisoft>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
REDVM pushed a commit to REDVM/scikit-learn that referenced this pull request Nov 16, 2023
…-learn#25914)

Co-authored-by: Omar Salman <omar.salman@arbisoft>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support for float32 in KDTree and BallTree
5 participants
0