8000 MNT use `scipy.sparse.csgraph.shortest_path` in Isomap by TomDLT · Pull Request #20531 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

MNT use scipy.sparse.csgraph.shortest_path in Isomap #20531

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Aug 3, 2021

Conversation

TomDLT
Copy link
Member
@TomDLT TomDLT commented Jul 14, 2021

Takes over #14557
Fixes #14010 #8352
Closes #14557

From original pull-request:

As far as I can tell the sklearn.utils.graph_shortest_path is the same implementation originally in 2011 by Jake Vanderplas as scipy.sparse.csgraph.shortest_path (available since v0.11), except that the scipy version (unlike the one in scikit-learn) has received several improvements since.

As mentioned in the discussion in #14557, this pull-request introduces an issue with Isomap, since our own version of the graph shortest path would (wrongly) change infinite values into zeros. Infinite values can happen when there is more than one connected component in the neighbors graph. Changing to scipy's function does not change infinite values into zeros, so we could add this step back after calling scipy, but this step is wrong in the first place. Here, I propose to raises an error instead. The question about how to best add magic to Isomap to avoid raising an error is a separate issue that should not block this PR.

edit: I actually added some magic to connect disconnected components on the neighbors graph, to avoid raising errors.

@glemaitre
Copy link
Member

I assume that the remaining error is raised because we don't make any check_array on X. I think that we delegate the data validation to the NearestNeighbors (and set n_features_in_. I assume that we should most probably call _validate_data then.

Copy link
Member
@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member
@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for continuing that PR @TomDLT !

A few comments below, but generally LGTM.

@TomDLT
Copy link
Member Author
TomDLT commented Aug 2, 2021

Thanks for the reviews.

Now, instead of copying code from agglomerative clustering into isomap, both estimators use the same new private function _fix_connected_components, which adds connections on a graph to get a single connected component. This function is now in sklearn.utils.graph, with a few tests.

Copy link
Member
@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks @TomDLT !

@rth rth merged commit 81165ca into scikit-learn:main Aug 3, 2021
sakibguy added a commit to sakibguy/scikit-learn that referenced this pull request Aug 4, 2021
MNT use `scipy.sparse.csgraph.shortest_path` in Isomap (scikit-learn#20531)
jjerphan added a commit to jjerphan/scikit-learn that referenced this pull request Aug 4, 2021
@ogrisel ogrisel deleted the shortest_path branch August 4, 2021 12:49
samronsin pushed a commit to samronsin/scikit-learn that referenced this pull request Nov 30, 2021
…20531)

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
@taha-yassine
8000
Copy link

Is there a particular reason why connecting connected components is prohibited when metric='precomputed' ? _fix_connected_components is normaly able to handle such case the way it is implemented.

@TomDLT
Copy link
Member Author
TomDLT commented Dec 6, 2021

With metric="precomputed", the algorithm does not have access to the original input data X, but only the precomputed neighbors graph. Adding connection to this graph is not possible without having access to X.

@taha-yassine
Copy link
taha-yassine commented Dec 7, 2021

In the case where metric='precomputed', X is the distance matrix both in Isomap._fit_transform and _fix_connected_components. _fix_connected_components only needs computed distances anyways so it shouldn't matter if the original data points aren't available. Maybe I'm missing something.

@TomDLT
Copy link
Member Author
TomDLT commented Dec 7, 2021

Thanks for insisting, you are right. There are two possibilities with metric='precomputed':

  • X contains a precomputed sparse neighbors graph. In this case, the missing distances cannot be computed, and _fix_connected_components does not work. The code raises a RuntimeError.
  • X contains a precomputed dense distance matrix. In this case, all pairwise distances are available, and _fix_connected_components does work. The code should not raise an error in this case. I have open a pull-request to fix this (FIX Isomap with precomputed distances and disconnected graph #21915).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Strange output from sklearn.manifold.Isomap
4 participants
0