8000 ENH Introduce `PairwiseDistancesReduction` and `PairwiseDistancesArgKmin` (feature branch) by jjerphan · Pull Request #22134 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

ENH Introduce PairwiseDistancesReduction and PairwiseDistancesArgKmin (feature branch) #22134

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Feb 17, 2022

Conversation

jjerphan
Copy link
Member
@jjerphan jjerphan commented Jan 6, 2022

Supersedes #21462.
Relates to #22587.

Task list as mentioned in #21462 (comment):



Results

Using this script:

On scikit-learn 1.0:

speed_up_100000_100000_log
speed_up_1000000_10000_log

This PR:

1
2

See #21462 for reference.

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
@thomasjpfan
Copy link
Member

I am okay with merging this into main and have a follow up PR hook everything up. This was the approach we took with 8000 the loss module: #20567

@lorentzenchr
Copy link
Member
lorentzenchr commented Jan 13, 2022

As this code is reviewed, we can merge it anytime (after merge conflicts are resolved).
@jjerphan Is there anything else you would like to have in this branch (where separate PR to main would be more difficult)?

@jjerphan
Copy link
Member Author

Following #21462 (comment), the last thing that is missing is interfacing those implementations with scikit-learn interfaces, adapting and completing the existing tests.

I was planning to stick to the plan, doing another sub-PR as initially planned not to have "dead-code" integrated.

But no strong opinion: I am also OK doing a separate pull-request as mentioned by @thomasjpfan.

@thomasjpfan
Copy link
Member

I was planning to stick to the plan, doing another sub-PR as initially planned not to have "dead-code" integrated.

No strong opinion either. We can stick with the original plan.

This means we need one more PR into pairwise-distances-argkmin to hook everything up?

@jjerphan
Copy link
Member Author

No strong opinion either. We can stick with the original plan.

I believe this is the best approach for posterity.

This means we need one more PR into pairwise-distances-argkmin to hook everything up?

That's it. I'll do it tomorrow.

@ogrisel
Copy link
Member
ogrisel commented Jan 14, 2022

I am also fine with either approaches. Whatever works for you best @jjerphan.

To get the coverage report greener, it should be possible to skip coverage in some branch of the openblas safe compat code:

https://app.codecov.io/gh/scikit-learn/scikit-learn/compare/22134/diff

@ogrisel
Copy link
Member
ogrisel commented Jan 14, 2022

Also test_pairwise_distances_argkmin has a special branch for "haversine" but "haversine" is not part of the CDIST_PAIRWISE_DISTANCES_REDUCTION_COMMON_METRICS list.

Either it should be added to the list or that branch should be removed from the test.

https://app.codecov.io/gh/scikit-learn/scikit-learn/compare/22134/diff#D5L329

@jjerphan jjerphan changed the title ENH Pairwise Distances ArgKmin (feature branch) ENH Introduce PairwiseDistancesReductions and PairwiseDistancesReductionsArgKmin (feature branch) Jan 28, 2022
@jjerphan jjerphan changed the title ENH Introduce PairwiseDistancesReductions and PairwiseDistancesReductionsArgKmin (feature branch) ENH Introduce PairwiseDistancesReduction and PairwiseDistancesArgKmin (feature branch) Jan 28, 2022
* Forward pairwise_dist_chunk_size in the configuration

* Flip finalized results for PairwiseDistancesArgKmin

The previous would have made the code more complex
by introducing some boilerplate for the interface plugs.

Having it this way actually simplifies the code.

This also removes the haversine branch for
test_pairwise_distances_argkmin

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

* Plug PairwiseDistancesArgKmin as a back-end

* Adapt test accordingly

* Add whats_new entry

* Change input validation order for kneighbors

* Remove duplicated test_neighbors_distance_metric_deprecation

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

* Adapt the documentation

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

* Add mahalanobis case to test fixtures

* Correct whats_new entry

* CLN Remove unneeded private metric attribute

This was needed when 'fast_sqeuclidean' and 'fast_euclidean'
were present to choose the best implementation based on the user
specification.

Those metric have been removed since then, making this attribute
useless.

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

* TST Assert FutureWarning instead of DeprecationWarning in
test_neighbors_metrics

* MAINT Add use_pairwise_dist_activate to scikit-learn config

* TST Add a test for the 'brute' backends' results' consistency

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

* fixup! MAINT Add use_pairwise_dist_activate to scikit-learn config

* fixup! fixup! MAINT Add use_pairwise_dist_activate to scikit-learn config

* TST Filter FutureWarning for WMinkowskiDistance

* MAINT pin numpydoc in arm for now (#22292)

* fixup! TST Filter FutureWarning for WMinkowskiDistance

* Revert keywords arguments removal for the GEMM trick for 'euclidean'

* MAINT pin max numpydoc for now (#22286)

* Add 'haversine' to CDIST_PAIRWISE_DISTANCES_REDUCTION_COMMON_METRICS

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

* fixup! Add 'haversine' to CDIST_PAIRWISE_DISTANCES_REDUCTION_COMMON_METRICS

* Apply suggestions from code review

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

* MAINT Document some config parameters for maintenance

Also rename one of them.

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

* FIX Support and test one of 'sqeuclidean' specification

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

* FIX Various typos fix and correct haversine

'haversine' is not supported by cdist.

* Directly use get_config

* CLN Apply comments from review

Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Co-authored-by: Jérémie du Boisberranger
<jeremiedbb@users.noreply.github.com>

* Motivate swapped returned values

Co-authored-by: Jérémie du Boisberranger
<jeremiedbb@users.noreply.github.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

* TST Remove mahalanobis from test fixtures

Co-authored-by: Jérémie du Boisberranger <jeremiedbb@users.noreply.github.com>

* MNT Add comment regaduction functions' signatures

Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

* TST Complete test for `pairwise_distance_{argmin,argmin_min}` (#22371)

* DOC Add sub-pull requests to the whats_new entry
@lorentzenchr
Copy link
Member

This needs to by synched with main and then ... Is there anything left or can those 2k loc be merged? I guess, the honor would be yours, @jjerphan.

@jjerphan
Copy link
Member Author

Is there anything left or can those 2k loc be merged?

Nothing that I can remember of.

@lorentzenchr
Copy link
Member

Is there anything left or can those 2k loc be merged?

Nothing that I can remember of.

In that case, let's wait one or a few days, if anybody has any further comment. Eventually

  1. put it out of draft modus
  2. the honor is yours @jjerphan (if you like and are available to push the merge button)

@jjerphan jjerphan marked this pull request as ready for review February 10, 2022 19:54
Copy link
Member
@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit, otherwise LGTM

Copy link
Member
@lorentzenchr lorentzenchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@lorentzenchr lorentzenchr merged commit 6a16763 into main Feb 17, 2022
@lorentzenchr lorentzenchr deleted the pairwise-distances-argkmin branch February 17, 2022 09:38
@GaelVaroquaux
Copy link
Member

Woot!! This is a big deal!

Congratulations to everyone involved!!!

@GaelVaroquaux
Copy link
Member

Thanks to @jjerphan for the hard work, and to the reviewers!!

@jjerphan
Copy link
Member Author

Coming back from holidays ⛰️


Thanks to everyone involved! This was a big piece which paves the way to smaller PRs, which reviews hopefully would be eased.

Thanks to @jjerphan for the hard work, and to the reviewers!!

You're welcome.

thomasjpfan added a commit to thomasjpfan/scikit-learn that referenced this pull request Mar 1, 2022
…min` (feature branch) (scikit-learn#22134)

* MAINT Introduce Pairwise Distances Reductions private submodule  (scikit-learn#22064)

* MAINT Introduce FastEuclideanPairwiseArgKmin  (scikit-learn#22065)

* fixup! Merge branch 'main' into pairwise-distances-argkmin

Remove duplicated Bunch

* MAINT Plug `PairwiseDistancesArgKmin` as a back-end (scikit-learn#22288)

* Forward pairwise_dist_chunk_size in the configuration

* Flip finalized results for PairwiseDistancesArgKmin

The previous would have made the code more complex
by introducing some boilerplate for the interface plugs.

Having it this way actually simplifies the code.

This also removes the haversine branch for
test_pairwise_distances_argkmin

* Plug PairwiseDistancesArgKmin as a back-end

* Adapt test accordingly

* Add whats_new entry

* Change input validation order for kneighbors

* Remove duplicated test_neighbors_distance_metric_deprecation

* Adapt the documentation

* Add mahalanobis case to test fixtures

* Correct whats_new entry

* CLN Remove unneeded private metric attribute

This was needed when 'fast_sqeuclidean' and 'fast_euclidean'
were present to choose the best implementation based on the user
specification.

Those metric have been removed since then, making this attribute
useless.

* TST Assert FutureWarning instead of DeprecationWarning in
test_neighbors_metrics

* MAINT Add use_pairwise_dist_activate to scikit-learn config

* TST Add a test for the 'brute' backends' results' consistency

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

* fixup! MAINT Add use_pairwise_dist_activate to scikit-learn config

* fixup! fixup! MAINT Add use_pairwise_dist_activate to scikit-learn config

* TST Filter FutureWarning for WMinkowskiDistance

* MAINT pin numpydoc in arm for now (scikit-learn#22292)

* fixup! TST Filter FutureWarning for WMinkowskiDistance

* Revert keywords arguments removal for the GEMM trick for 'euclidean'

* MAINT pin max numpydoc for now (scikit-learn#22286)

* Add 'haversine' to CDIST_PAIRWISE_DISTANCES_REDUCTION_COMMON_METRICS

* fixup! Add 'haversine' to CDIST_PAIRWISE_DISTANCES_REDUCTION_COMMON_METRICS

* Apply suggestions from code review

* MAINT Document some config parameters for maintenance

Also rename one of them.

* FIX Support and test one of 'sqeuclidean' specification

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

* FIX Various typos fix and correct haversine

'haversine' is not supported by cdist.

* Directly use get_config

* CLN Apply comments from review

* Motivate swapped returned values

* TST Remove mahalanobis from test fixtures

* MNT Add comment regaduction functions' signatures

* TST Complete test for `pairwise_distance_{argmin,argmin_min}` (scikit-learn#22371)

* DOC Add sub-pull requests to the whats_new entry

* DOC place comment inside functions

* DOC move up whatsnew entry

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Jérémie du Boisberranger <jeremiedbb@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants
0