8000 API for getting DBSCAN-like clusterings out of OPTICS with `fit_predict` · Issue #12044 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

API for getting DBSCAN-like clusterings out of OPTICS with fit_predict #12044

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jnothman opened this issue Sep 8, 2018 · 76 comments · Fixed by #19024
Closed

API for getting DBSCAN-like clusterings out of OPTICS with fit_predict #12044

jnothman opened this issue Sep 8, 2018 · 76 comments · Fixed by #19024
Labels
Enhancement help wanted Moderate Anything that requires some knowledge of conventions and best practices

Comments

@jnothman
Copy link
Member
jnothman commented Sep 8, 2018

Currently we have an interface for OPTICS with custom method extract_dbscan. This is good for usability and visibility of the functionality, but means that a generic parameter search tool (like GridSearchCV) can't use OPTICS to perform DBSCAN at various eps.

This would involve adding an eps parameter which, when None, would use the default OPTICS clustering; when not None would use extract_dbscan. But we would also need to retain the model across multiple fits...

Here are two alternative interfaces:

  • Add a warm_start parameter (like many classifiers, regressors, but uncharted territory for clusterers). When True, and fit or fit_predict is called, the current reachability_, ordering_ and core_distances_ would be kept, but a different final clustering step would be used to output / store labels_.
  • Add a memory parameter, like in hierarchical clustering. This would cache the mapping from parameters to reachability_, ordering_ and core_distances_ using a joblib.Memory.

I think the first option sounds more appropriate.

@jnothman jnothman added Enhancement Moderate Anything that requires some knowledge of conventions and best practices help wanted labels Sep 8, 2018
@Sriharsha-hatwar
Copy link
Contributor

Hey @jnothman I would like to take up this, is there any basic requirement that is already there in the code base that I need to be acquainted with ?

@adrinjalali
Copy link
Member

@jnothman, I think #12036 is kinda related to this one.

@jnothman
Copy link
Member Author

Well, I wouldn't think of this as a good first issue, @Sriharsha-hatwar. Please try contribute to something more straightforward until you are familiar with our contribution process.

@adrinjalali, yes it is... I'm yet to read that.

@ghost
Copy link
ghost commented Sep 12, 2018

I have worked on a documentation issue earlier. Can I take this issue? If yes, I will go through the issue and code to make sure that I can handle it.

@jnothman
Copy link
Member Author

Do you feel comfortable with what this issue is suggesting?

@ghost
Copy link
ghost commented Sep 13, 2018

I am not very comfortable yet but I am trying to read the background to understand it further. Meanwhile, if someone else pops up to fix this issue they can take it.

@adrinjalali
Copy link
Member

Doesn't this apply to all extraction methods? The warm_start would apply to the first stage (_extract_optics_ordering), and the rest can use the calculated values independent of the extraction method.

I guess what I'm asking, is that, should we then add all parameters related to extract methods to the fit method?

@jnothman
Copy link
Member Author
jnothman commented Sep 16, 2018 via email

@qinhanmin2014
Copy link
Member

I think we need to make decision here ASAP since we've promised OPTICS in 0.21.

My main points for API design are as follows:
(1) Users don't need to calculate RD/CD again if they want to extract clusters with different methods/different parameters.
(2) If possible, provide users with a way to only calculate RD/CD (e.g., method=None). Actually, OPTICS itself only returns RD/CD.

Regarding warm_start, the problem is, if we reuse RD/CD, we need to ensure that users pass in the same X. Is it possible for warm_start?

Maybe an alternative solution is not including fit_predict and predict here, so we'll have fit (calculates RD/CD), extract_dbscan, extract_.... But the problem here is that it seems strange that a cluster algorithm doesn't have both fit_predict and predict.

@qinhanmin2014 qinhanmin2014 added this to the 0.21 milestone Oct 15, 2018
@jnothman
Copy link
Member Author
jnothman commented Oct 15, 2018 via email

@adrinjalali
Copy link
Member

I also think OPTICS as a class in clustering should satisfy the usual API requirements.

Once we have an extract_"method name" function for the three extraction methods, users can use the calculated RD/CD from OPTICS to extract clusters without having to run fit.

We can add a public calculate_OPTICS_RD_CD function or whatever and not run the clustering part of the algorithm. However, I wouldn't worry about this part since all the extract methods are of O(n). We almost have that function already anyway.

We can have a naive warm_start for now, which would basically ignore X if RD/CD are present already. I'm yet to read the other warm_start examples we have in the code base, but do they check for having the same X as the input as well?

I'd see the hashing/caching mechanism as a separate project, which would affect probably most classes we have. Since we already do have OPTICS pretty modular, I guess adopting the mechanism would be easy here.

@jnothman
Copy link
Member Author
jnothman commented Oct 15, 2018 via email

@qinhanmin2014
Copy link
Member

I think we need to provide fit_predict and labels_ for API consistency and just note that this is something we do poorly.

I have to admit that I can't understand the importance of (such a strict) API consistency now. Following Joel's decision, I think I'm fine with things like #12087 (i.e., provide a extract_method parameter so that we can provide labels_ after fit/fit_predict, and provide function extract_XXX to users so that they can easily try different methods without calculating RD again.)
WDYT? @jnothman

@qinhanmin2014
Copy link
Member

And maybe we can provide extract_method = None (i.e., doesn't return labels after fit, raise an error for fit_predict). This will enable users to get CD/RD quickly and make it easier for us to benchmark the algorithm.

@adrinjalali
Copy link
Member

I find having an extract_RD function or something more elegant than extract_method=None and breaking the API.

@qinhanmin2014
Copy link
Member

breaking the API.

Why will extract_method = None break the API? I think it's acceptable that when certain param is set to certain value, certain attribute will not be available (e.g., store_cv_values in RidgeClassifierCV).

I find having an extract_RD function

How will you implement it?

My solution is something like:

clust = OPTICS(extract_method=...)
# calculate CD/RD, return labels_ of specific extraction method
clust.fit(X)
# try different extraction methods/different parameters
# without calculating CD/RD again
clust.extract_XXX(...=...)

@adrinjalali
Copy link
Member

To me, OPTICS is a "clustering" algorithm, and I expect to be able to use any such model in a pipeline where using any clustering is apt. For instance, I would expect to be able to easily use it in a grid search, with a Silhouette score to optimize hyperparameters. That means I expect the object to be in a valid state after a call to fit, which for a clustering object means to have labels_ set.

store_cv_values in RidgeClassifierCV doesn't break the API, cause not many (if any) other modules depend on that variable to be set to function properly; in contrast, a scoring function would expect a clustering object to have a valid labels_ property, hence not having it breaks the API.

That said, I personally find it odd that not all clusterings have predict function implemented. It feels odd to me to rely on labels_ rather than getting the results from a predict function. But as long as there's no such common functionality, I think enforcing some common interface (labels_ in this case) is a sensible option.

My suggestion in the case of OPTICS is the following:

clust = OPTICS(extract_method=...)
# calculate CD/RD, store labels_ of specific extraction method
clust.fit(X)
# don't run any extraction method, just calculate CD/RD/ordering:
clust.calculate_CD_RD_ordering(X)
# try different extraction methods/different parameters
# without calculating CD/RD again
clust.extract_XXX(...=...)

@qinhanmin2014
Copy link
Member

My suggestion in the case of OPTICS is the following:

I see, seems that both solutions are similar. I think they're all acceptable from my side.

@qinhanmin2014
Copy link
Member

Honestly I'm not sure whether it's a good idea to (and how to) assign labels to each point in extraci_xi and extract_sqlnk. I think @adrinjalali 's method for extract_xi in #12077 is different from the method we already have in extract_sqlnk and both methods ignore the hierarchical structure of the clusters.

@adrinjalali
Copy link
Member

In #12077, the _xi_cluster extracts the hierarchy of clusters, then _extract_xi_labels extract the labels from the hierarchy. Right now it assigns labels according to what would be minimum=True in R's extractXi. Implementing the equivalent of minimum=False is almost trivial. Same goes for extract_sqlnk.

Another aspect to assigning labels is trimming the tree at a certain level, which we can also easily implement.

P.S. I'm waiting for #12087 to get in before I go over the code in #12077 for a cleanup/code improvements.

@qinhanmin2014
Copy link
Member

Right now it assigns labels according to what would be minimum=True in R's extractXi.

Thanks @adrinjalali , I'll take some time to look at R, but I think we should use a same strategy to extract labels in extract_xi and extract_sqlnk.

Regarding #12087, I'd like to see #12421 in first to avoid unnecessary conflicts.
One thing bother me is the correctness of extract_sqlnk (See #12375, apart from some mysterious params, I also find a bug). I'm now considering whether we should revert extract_sqlnk (in 0.21).

@espg
Copy link
Contributor
espg commented Oct 21, 2018

This may not have much bearing on sklearn's API, but I would point out that the OPTICS algorithm itself is defined to just produce Reachability and Ordering, not labels... of course, clustering modules in sklearn are expected to produce labels, hence why we default to an extraction

@qinhanmin2014
Copy link
Member

This may not have much bearing on sklearn's API, but I would point out that the OPTICS algorithm itself is defined to just produce Reachability and Ordering, not labels...

This is why I don't like to provide labels in fit :) But seems that Joel cares about API consistency, so maybe we'll still provide labels after fit.

@jnothman
Copy link
Member Author
jnothman commented Oct 23, 2018 via email

@qinhanmin2014
Copy link
Member

great, I’m also considering removing sqlnk these days
I’m worring about 7, do you have any reference to assign labels in this way?

@espg
Copy link
Contributor
espg commented Feb 25, 2019

@jnothman I think that API summary looks good. For item 6, I'd recommend compute_optics_graph

@qinhanmin2014
Copy link
Member

I'm going to review #12087 shortly.

Who's going to finish it? @adrinjalali or someone else?
Do we need a branch to hold the OPTICS changes?

I’m worrying about 7, do you have any reference to assign labels in this way?

E.g., if there's a big cluster and a small cluster (e.g., (1, 10), (4, 5)), then all the points not in the small cluster (i.e., 1-3, 6-10) will be tagged as noise?

@jnothman
Copy link
Member Author
jnothman commented Feb 26, 2019 via email

@qinhanmin2014
Copy link
Member

With small clusters and noise the idea is that the user needs to be guided in how parameters control minimal cluster size.

Is it possible? This seems magical :)
And I guess you are not answering my question? My question is whether it's reasonable to assign labels in this way.

But I was very sleep deprived yesterday!

I'm surprised when I saw you editing the wiki in the late night. Wish you a good night's sleep!

@jnothman
Copy link
Member Author

I think that the one I mean to review is actually #12077 (extract_xi). #12087 needs to be reshaped to match the above parameter names, functional design, etc.

Should we rename 'xi' to 'xi_steep'?

@jnothman
Copy link
Member Author

I'm surprised when I saw you editing the wiki in the late night. Wish you a good night's sleep!

Ahh... You see too much. I was awake a couple of hours in the night. Read a short story. Updated a wiki. Etc :)

@qinhanmin2014
Copy link
Member

I think that the one I mean to review is actually #12077 (extract_xi). #12087 needs to be reshaped to match the above parameter names, functional design, etc.

Maybe we should settle the API first?

And another question @jnothman I guess we don't need the optics function since we're going to provide compute_optics_order + cluster_optics_{METHOD} right?

@qinhanmin2014
Copy link
Member

Should we rename 'xi' to 'xi_steep'?

Both solutions are OK from my side.

@adrinjalali
Copy link
Member

@qinhanmin2014 #12087 resolves a bunch of these issues. Could you check that PR and let us know what you think in terms of API design?

@jnothman
Copy link
Member Author
jnothman commented Feb 26, 2019 via email

@qinhanmin2014
Copy link
Member
F438

I guess we can close?

@jnothman
Copy link
Member Author
jnothman commented Mar 6, 2019

We do not yet have it working with fit_predict, nor did we yet land on a solution for that afaik, so essentially this issue is not resolved.

@jnothman jnothman reopened this Mar 6, 2019
@qinhanmin2014
Copy link
Member

We do not yet have it working with fit_predict, nor did we yet land on a solution for that afaik, so essentially this issue is not resolved.

@jnothman We support fit_predict (through ClusterMixin) now?

@jnothman
Copy link
Member Author
jnothman commented Mar 7, 2019

The point is to easily be able to get multiple clusterings without recomputing optics.

@qinhanmin2014
Copy link
Member

The point is to easily be able to get multiple clusterings without recomputing optics.

Hmm, I guess we're going to rely on compute_optics_graph + cluster_optics_XXX?

@adrinjalali
Copy link
Member

Hmm, I guess we're going to rely on compute_optics_graph + cluster_optics_XXX?

That's one way, the other way is to have fit + warm_start.

@qinhanmin2014
Copy link
Member

That's one way, the other way is to have fit + warm_start.

Seems that I wrongly assume that you've decided to rely on compute_optics_graph + cluster_optics_XXX. I'm OK to support warm start (with examples to demonstrate its usage).

@jnothman
Copy link
Member Author
jnothman commented Mar 7, 2019 via email

@qinhanmin2014
Copy link
Member

The point of this issue is that there should be a class-based approach.

What's the application scenario? @jnothman @adrinjalali

@jnothman
Copy link
Member Author
jnothman commented Mar 12, 2019 via email

@qinhanmin2014
Copy link
Member

The goal would be to allow the user to efficiently evaluate many different eps, as they would in grid search for supervised learning.

But GridSearchCV will still recompute RD/CD multiple times? I'm still unable to understand why a warm_start parameter will make GridSearchCV more efficient.

Untagging this from 0.21. Retag if you disagree.

@qinhanmin2014 qinhanmin2014 removed this from the 0.21 milestone Mar 12, 2019
@jnothman
Copy link
Member Author
jnothman commented Mar 12, 2019 via email

@GaelVaroquaux
Copy link
Member
GaelVaroquaux commented Mar 12, 2019 via email

@jnothman
Copy link
Member Author
jnothman commented Mar 12, 2019 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement help wanted Moderate Anything that requires some knowledge of conventions and best practices
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants
0