API for getting DBSCAN-like clusterings out of OPTICS with `fit_predict` #12044

jnothman · 2018-09-08T23:13:49Z

Currently we have an interface for OPTICS with custom method extract_dbscan. This is good for usability and visibility of the functionality, but means that a generic parameter search tool (like GridSearchCV) can't use OPTICS to perform DBSCAN at various eps.

This would involve adding an eps parameter which, when None, would use the default OPTICS clustering; when not None would use extract_dbscan. But we would also need to retain the model across multiple fits...

Here are two alternative interfaces:

Add a warm_start parameter (like many classifiers, regressors, but uncharted territory for clusterers). When True, and fit or fit_predict is called, the current reachability_, ordering_ and core_distances_ would be kept, but a different final clustering step would be used to output / store labels_.
Add a memory parameter, like in hierarchical clustering. This would cache the mapping from parameters to reachability_, ordering_ and core_distances_ using a joblib.Memory.

I think the first option sounds more appropriate.

The text was updated successfully, but these errors were encountered:

Sriharsha-hatwar · 2018-09-10T04:27:04Z

Hey @jnothman I would like to take up this, is there any basic requirement that is already there in the code base that I need to be acquainted with ?

adrinjalali · 2018-09-10T11:21:48Z

@jnothman, I think #12036 is kinda related to this one.

jnothman · 2018-09-12T04:09:53Z

Well, I wouldn't think of this as a good first issue, @Sriharsha-hatwar. Please try contribute to something more straightforward until you are familiar with our contribution process.

@adrinjalali, yes it is... I'm yet to read that.

ghost · 2018-09-12T04:27:21Z

I have worked on a documentation issue earlier. Can I take this issue? If yes, I will go through the issue and code to make sure that I can handle it.

jnothman · 2018-09-12T05:46:24Z

Do you feel comfortable with what this issue is suggesting?

ghost · 2018-09-13T12:50:05Z

I am not very comfortable yet but I am trying to read the background to understand it further. Meanwhile, if someone else pops up to fix this issue they can take it.

adrinjalali · 2018-09-15T12:00:55Z

Doesn't this apply to all extraction methods? The warm_start would apply to the first stage (_extract_optics_ordering), and the rest can use the calculated values independent of the extraction method.

I guess what I'm asking, is that, should we then add all parameters related to extract methods to the fit method?

jnothman · 2018-09-16T00:20:06Z

To the class, not the fit method, but yes. I hope this is indeed an appropriate use of warm_start

qinhanmin2014 · 2018-10-15T01:17:25Z

I think we need to make decision here ASAP since we've promised OPTICS in 0.21.

My main points for API design are as follows:
(1) Users don't need to calculate RD/CD again if they want to extract clusters with different methods/different parameters.
(2) If possible, provide users with a way to only calculate RD/CD (e.g., method=None). Actually, OPTICS itself only returns RD/CD.

Regarding warm_start, the problem is, if we reuse RD/CD, we need to ensure that users pass in the same X. Is it possible for warm_start?

Maybe an alternative solution is not including fit_predict and predict here, so we'll have fit (calculates RD/CD), extract_dbscan, extract_.... But the problem here is that it seems strange that a cluster algorithm doesn't have both fit_predict and predict.

jnothman · 2018-10-15T07:03:19Z

I think we need to provide fit_predict and labels_ for API consistency and just note that this is something we do poorly. (I hope we are already aware that we have a problem of efficiently reusing sufficient statistics with different aggregates through a constant interface. We have seen similar API problems in scoring, feature importances, grid searches over transformation pipelines, etc. This is part of why a tool like dask frames computation in terms of a directed acyclic graph.) I think the strange thing about reusing warm_start here is the notion that we often want warm_start=True for maximum usability. We could also consider providing the caching of X -> Rd/cd without providing an option, but this would be a new design for scikit-learn. Checking whether it's an identical X is something we could achieve approximately with a hash of the data if we really want. Hashing would be relatively slow but not nearly as slow as recomputing optics.

adrinjalali · 2018-10-15T08:24:59Z

I also think OPTICS as a class in clustering should satisfy the usual API requirements.

Once we have an extract_"method name" function for the three extraction methods, users can use the calculated RD/CD from OPTICS to extract clusters without having to run fit.

We can add a public calculate_OPTICS_RD_CD function or whatever and not run the clustering part of the algorithm. However, I wouldn't worry about this part since all the extract methods are of O(n). We almost have that function already anyway.

We can have a naive warm_start for now, which would basically ignore X if RD/CD are present already. I'm yet to read the other warm_start examples we have in the code base, but do they check for having the same X as the input as well?

I'd see the hashing/caching mechanism as a separate project, which would affect probably most classes we have. Since we already do have OPTICS pretty modular, I guess adopting the mechanism would be easy here.

jnothman · 2018-10-15T09:29:28Z

Caching would not affect most other classes, in part because they will often be run with all sorts of parameters changed. joblib.Memory can do the caching, but does unwanted things like actually writing the data to dish... No, I don't think we have generally needed/wanted to check for identical data in other warm_start cases. But warm_start has defaulted to false with the assumption that the user knows what they're doing if they switch it on. In particular, using warm_start makes sense for varying some parameters and not others, but this tends to be very sparsely documented.... I think providing both functions and a method parameter is a great start in terms of API. We can then memoise with or without warm_start without breaking API.

qinhanmin2014 · 2018-10-20T08:38:09Z

I think we need to provide fit_predict and labels_ for API consistency and just note that this is something we do poorly.

I have to admit that I can't understand the importance of (such a strict) API consistency now. Following Joel's decision, I think I'm fine with things like #12087 (i.e., provide a extract_method parameter so that we can provide labels_ after fit/fit_predict, and provide function extract_XXX to users so that they can easily try different methods without calculating RD again.)
WDYT? @jnothman

qinhanmin2014 · 2018-10-20T09:10:43Z

And maybe we can provide extract_method = None (i.e., doesn't return labels after fit, raise an error for fit_predict). This will enable users to get CD/RD quickly and make it easier for us to benchmark the algorithm.

adrinjalali · 2018-10-20T10:25:18Z

I find having an extract_RD function or something more elegant than extract_method=None and breaking the API.

qinhanmin2014 · 2018-10-20T11:46:11Z

breaking the API.

Why will extract_method = None break the API? I think it's acceptable that when certain param is set to certain value, certain attribute will not be available (e.g., store_cv_values in RidgeClassifierCV).

I find having an extract_RD function

How will you implement it?

My solution is something like:

clust = OPTICS(extract_method=...)
# calculate CD/RD, return labels_ of specific extraction method
clust.fit(X)
# try different extraction methods/different parameters
# without calculating CD/RD again
clust.extract_XXX(...=...)

adrinjalali · 2018-10-20T12:32:07Z

To me, OPTICS is a "clustering" algorithm, and I expect to be able to use any such model in a pipeline where using any clustering is apt. For instance, I would expect to be able to easily use it in a grid search, with a Silhouette score to optimize hyperparameters. That means I expect the object to be in a valid state after a call to fit, which for a clustering object means to have labels_ set.

store_cv_values in RidgeClassifierCV doesn't break the API, cause not many (if any) other modules depend on that variable to be set to function properly; in contrast, a scoring function would expect a clustering object to have a valid labels_ property, hence not having it breaks the API.

That said, I personally find it odd that not all clusterings have predict function implemented. It feels odd to me to rely on labels_ rather than getting the results from a predict function. But as long as there's no such common functionality, I think enforcing some common interface (labels_ in this case) is a sensible option.

My suggestion in the case of OPTICS is the following:

clust = OPTICS(extract_method=...)
# calculate CD/RD, store labels_ of specific extraction method
clust.fit(X)
# don't run any extraction method, just calculate CD/RD/ordering:
clust.calculate_CD_RD_ordering(X)
# try different extraction methods/different parameters
# without calculating CD/RD again
clust.extract_XXX(...=...)

qinhanmin2014 · 2018-10-20T12:45:01Z

My suggestion in the case of OPTICS is the following:

I see, seems that both solutions are similar. I think they're all acceptable from my side.

qinhanmin2014 · 2018-10-21T02:21:58Z

Honestly I'm not sure whether it's a good idea to (and how to) assign labels to each point in extraci_xi and extract_sqlnk. I think @adrinjalali 's method for extract_xi in #12077 is different from the method we already have in extract_sqlnk and both methods ignore the hierarchical structure of the clusters.

adrinjalali · 2018-10-21T08:45:48Z

In #12077, the _xi_cluster extracts the hierarchy of clusters, then _extract_xi_labels extract the labels from the hierarchy. Right now it assigns labels according to what would be minimum=True in R's extractXi. Implementing the equivalent of minimum=False is almost trivial. Same goes for extract_sqlnk.

Another aspect to assigning labels is trimming the tree at a certain level, which we can also easily implement.

P.S. I'm waiting for #12087 to get in before I go over the code in #12077 for a cleanup/code improvements.

qinhanmin2014 · 2018-10-21T08:58:58Z

Right now it assigns labels according to what would be minimum=True in R's extractXi.

Thanks @adrinjalali , I'll take some time to look at R, but I think we should use a same strategy to extract labels in extract_xi and extract_sqlnk.

Regarding #12087, I'd like to see #12421 in first to avoid unnecessary conflicts.
One thing bother me is the correctness of extract_sqlnk (See #12375, apart from some mysterious params, I also find a bug). I'm now considering whether we should revert extract_sqlnk (in 0.21).

espg · 2018-10-21T15:37:39Z

This may not have much bearing on sklearn's API, but I would point out that the OPTICS algorithm itself is defined to just produce Reachability and Ordering, not labels... of course, clustering modules in sklearn are expected to produce labels, hence why we default to an extraction

qinhanmin2014 · 2018-10-21T15:49:09Z

This may not have much bearing on sklearn's API, but I would point out that the OPTICS algorithm itself is defined to just produce Reachability and Ordering, not labels...

This is why I don't like to provide labels in fit :) But seems that Joel cares about API consistency, so maybe we'll still provide labels after fit.

jnothman · 2018-10-23T07:02:38Z

So let's provide an optics_distances function that just produces RD/CD. Or an optics function for same but rename the class to OpticsClustering or OPTICSClustering. The main point of the class interface for estimators is to provide API consistency around diverse underlying functional forms. It is appropriate to make the estimator API for OPTICS satisfy the clustering API by default.

qinhanmin2014 · 2019-02-25T15:55:07Z

great, I’m also considering removing sqlnk these days
I’m worring about 7, do you have any reference to assign labels in this way?

espg · 2019-02-25T22:38:59Z

@jnothman I think that API summary looks good. For item 6, I'd recommend compute_optics_graph

qinhanmin2014 · 2019-02-26T07:51:41Z

I'm going to review #12087 shortly.

Who's going to finish it? @adrinjalali or someone else?
Do we need a branch to hold the OPTICS changes?

I’m worrying about 7, do you have any reference to assign labels in this way?

E.g., if there's a big cluster and a small cluster (e.g., (1, 10), (4, 5)), then all the points not in the small cluster (i.e., 1-3, 6-10) will be tagged as noise?

jnothman · 2019-02-26T08:07:39Z

Adrin is focusing on NOCATS, but Assia will get his/my mentorship on #12087. I have found it hard to concentrate on any one thing at the sprint so far... But I was very sleep deprived yesterday! With small clusters and noise the idea is that the user needs to be guided in how parameters control minimal cluster size.

qinhanmin2014 · 2019-02-26T08:18:43Z

With small clusters and noise the idea is that the user needs to be guided in how parameters control minimal cluster size.

Is it possible? This seems magical :)
And I guess you are not answering my question? My question is whether it's reasonable to assign labels in this way.

But I was very sleep deprived yesterday!

I'm surprised when I saw you editing the wiki in the late night. Wish you a good night's sleep!

jnothman · 2019-02-26T09:31:08Z

I think that the one I mean to review is actually #12077 (extract_xi). #12087 needs to be reshaped to match the above parameter names, functional design, etc.

Should we rename 'xi' to 'xi_steep'?

jnothman · 2019-02-26T09:32:11Z

I'm surprised when I saw you editing the wiki in the late night. Wish you a good night's sleep!

Ahh... You see too much. I was awake a couple of hours in the night. Read a short story. Updated a wiki. Etc :)

qinhanmin2014 · 2019-02-26T09:43:19Z

I think that the one I mean to review is actually #12077 (extract_xi). #12087 needs to be reshaped to match the above parameter names, functional design, etc.

Maybe we should settle the API first?

And another question @jnothman I guess we don't need the optics function since we're going to provide compute_optics_order + cluster_optics_{METHOD} right？

qinhanmin2014 · 2019-02-26T09:51:38Z

Should we rename 'xi' to 'xi_steep'?

Both solutions are OK from my side.

adrinjalali · 2019-02-26T10:03:57Z

@qinhanmin2014 #12087 resolves a bunch of these issues. Could you check that PR and let us know what you think in terms of API design?

jnothman · 2019-02-26T17:30:32Z

Yes, no need for the optics function.

qinhanmin2014 · 2019-03-04T05:18:42Z

F438

I guess we can close?

jnothman · 2019-03-06T17:50:56Z

We do not yet have it working with fit_predict, nor did we yet land on a solution for that afaik, so essentially this issue is not resolved.

qinhanmin2014 · 2019-03-07T11:40:17Z

We do not yet have it working with fit_predict, nor did we yet land on a solution for that afaik, so essentially this issue is not resolved.

@jnothman We support fit_predict (through ClusterMixin) now?

jnothman · 2019-03-07T12:06:29Z

The point is to easily be able to get multiple clusterings without recomputing optics.

qinhanmin2014 · 2019-03-07T12:30:01Z

The point is to easily be able to get multiple clusterings without recomputing optics.

Hmm, I guess we're going to rely on compute_optics_graph + cluster_optics_XXX?

adrinjalali · 2019-03-07T12:56:50Z

Hmm, I guess we're going to rely on compute_optics_graph + cluster_optics_XXX?

That's one way, the other way is to have fit + warm_start.

qinhanmin2014 · 2019-03-07T13:15:06Z

That's one way, the other way is to have fit + warm_start.

Seems that I wrongly assume that you've decided to rely on compute_optics_graph + cluster_optics_XXX. I'm OK to support warm start (with examples to demonstrate its usage).

jnothman · 2019-03-07T21:01:30Z

The point of this issue is that there should be a class-based approach.

qinhanmin2014 · 2019-03-12T04:01:13Z

The point of this issue is that there should be a class-based approach.

What's the application scenario? @jnothman @adrinjalali

jnothman · 2019-03-12T06:56:26Z

While the xi method considers variable density in extraction, optics allows the user to do multiple dbscan extractions, i.e. fixed density. The goal would be to allow the user to efficiently evaluate many different eps, as they would in grid search for supervised learning. This is admittedly quite low priority as long as parameter search for clustering remains on the backburner.

qinhanmin2014 · 2019-03-12T07:47:20Z

The goal would be to allow the user to efficiently evaluate many different eps, as they would in grid search for supervised learning.

But GridSearchCV will still recompute RD/CD multiple times? I'm still unable to understand why a warm_start parameter will make GridSearchCV more efficient.

Untagging this from 0.21. Retag if you disagree.

jnothman · 2019-03-12T09:29:36Z

Definitely fine to clear milestone

GaelVaroquaux · 2019-03-12T20:08:05Z

For hierarchical clustering (where computing the tree is much more expensive than cutting it for a given number of clusters), my lab has been successfully using joblib memory to answer such needs.

jnothman · 2019-03-12T22:30:04Z

Yes, adding a memory parameter is one of the options here, and perhaps the simplest and most consistent with other clustering, i.e. hierarchical.

jnothman added Enhancement Moderate Anything that requires some knowledge of conventions and best practices help wanted labels Sep 8, 2018

adrinjalali mentioned this issue Sep 15, 2018

[MRG+1] OPTICS fit uses the selected extract_method parameter #12087

Merged

jnothman mentioned this issue Oct 14, 2018

OPTICS Return all the clusters found by the extraction algorithm #12376

Closed

qinhanmin2014 added this to the 0.21 milestone Oct 15, 2018

qinhanmin2014 closed this as completed Mar 4, 2019

jnothman reopened this Mar 6, 2019

qinhanmin2014 removed this from the 0.21 milestone Mar 12, 2019

frankier mentioned this issue Dec 17, 2020

ENH Add memory=joblib.Memory param to OPTICS #19024

Merged

rth closed this as completed in #19024 Aug 6, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

API for getting DBSCAN-like clusterings out of OPTICS with `fit_predict` #12044

API for getting DBSCAN-like clusterings out of OPTICS with `fit_predict` #12044

API for getting DBSCAN-like clusterings out of OPTICS with fit_predict #12044

API for getting DBSCAN-like clusterings out of OPTICS with fit_predict #12044

Comments

API for getting DBSCAN-like clusterings out of OPTICS with `fit_predict` #12044

API for getting DBSCAN-like clusterings out of OPTICS with `fit_predict` #12044