API for getting DBSCAN-like clusterings out of OPTICS with fit_predict
#12044
Comments
Hey @jnothman, I would like to take this up. Is there any basic requirement already in the code base that I need to be acquainted with? |
Well, I wouldn't think of this as a good first issue, @Sriharsha-hatwar. Please try to contribute to something more straightforward until you are familiar with our contribution process. @adrinjalali, yes it is... I'm yet to read that. |
I have worked on a documentation issue earlier. Can I take this issue? If yes, I will go through the issue and code to make sure that I can handle it. |
Do you feel comfortable with what this issue is suggesting? |
I am not very comfortable yet but I am trying to read the background to understand it further. Meanwhile, if someone else pops up to fix this issue they can take it. |
Doesn't this apply to all extraction methods? The warm_start would apply to the first stage (computing reachability_, ordering_ and core_distances_). I guess what I'm asking is: should we then add all parameters related to the extract methods to the class or to fit? |
To the class, not the fit method, but yes. I hope this is indeed an
appropriate use of warm_start
|
I think we need to make a decision here ASAP since we've promised OPTICS in 0.21. My main points for API design are as follows: regarding ... Maybe an alternative solution is not including ... |
I think we need to provide fit_predict and labels_ for API consistency and
just note that this is something we do poorly. (I hope we are already aware
that we have a problem of efficiently reusing sufficient statistics with
different aggregates through a constant interface. We have seen similar API
problems in scoring, feature importances, grid searches over transformation
pipelines, etc. This is part of why a tool like dask frames computation in
terms of a directed acyclic graph.)
I think the strange thing about reusing warm_start here is the notion that
we often want warm_start=True for maximum usability. We could also consider
providing the caching of X -> Rd/cd without providing an option, but this
would be a new design for scikit-learn.
Checking whether it's an identical X is something we could achieve
approximately with a hash of the data if we really want. Hashing would be
relatively slow but not nearly as slow as recomputing optics.
|
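For concreteness, a minimal sketch of the hashing idea, assuming joblib.hash as the hash function (already a scikit-learn dependency); compute_graph here is a hypothetical stand-in for whatever produces the expensive OPTICS outputs:

```python
from joblib import hash as joblib_hash

_cache = {"X_hash": None, "graph": None}

def optics_graph_cached(X, compute_graph):
    """Recompute the expensive OPTICS stage only when X appears to change.

    compute_graph stands in for whatever produces reachability / core
    distances / ordering. The "same X" check is approximate (via hashing),
    which is slow-ish but far cheaper than rerunning OPTICS itself.
    """
    X_hash = joblib_hash(X)
    if _cache["X_hash"] != X_hash:
        _cache["graph"] = compute_graph(X)
        _cache["X_hash"] = X_hash
    return _cache["graph"]
```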
I also think OPTICS as a class in ... Once we have an ..., we can add a public ... We can have a naive ... I'd see the hashing/caching mechanism as a separate project, which would probably affect most classes we have. Since we already have OPTICS pretty modular, I guess adopting the mechanism would be easy here. |
Caching would not affect most other classes, in part because they will
often be run with all sorts of parameters changed. joblib.Memory can do the
caching, but does unwanted things like actually writing the data to disk...
No, I don't think we have generally needed/wanted to check for identical
data in other warm_start cases. But warm_start has defaulted to false with
the assumption that the user knows what they're doing if they switch it on.
In particular, using warm_start makes sense for varying some parameters and
not others, but this tends to be very sparsely documented....
I think providing both functions and a method parameter is a great start in
terms of API. We can then memoise with or without warm_start without
breaking API.
|
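For illustration, roughly what the class side of "functions plus a method parameter" could look like from the user's perspective; the parameter names below (cluster_method, eps) are the ones that later shipped in scikit-learn 0.21+, used here only to make the proposal concrete:

```python
import numpy as np
from sklearn.cluster import OPTICS

X = np.random.RandomState(0).uniform(size=(200, 2))

# One estimator class; the method parameter selects the final extraction step,
# while the expensive reachability computation is shared by both methods.
xi_labels = OPTICS(min_samples=10, cluster_method="xi").fit_predict(X)
db_labels = OPTICS(min_samples=10, cluster_method="dbscan", eps=0.5).fit_predict(X)
```

The matching module-level extraction functions would cover the "functions" half of the proposal; a usage sketch of those appears further down.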
I have to admit that I can't understand the importance of (such a strict) API consistency now. Following Joel's decision, I think I'm fine with things like #12087 (i.e., provide a ...). |
And maybe we can provide ... |
I find having an ... |
Why will ...
How will you implement it? My solution is something like: ...
|
To me, ...
That said, I personally find it odd that not all clusterings have ... My suggestion in the case of ...
|
I see, seems that both solutions are similar. I think they're all acceptable from my side. |
Honestly I'm not sure whether it's a good idea to (and how to) assign labels to each point in ... |
In #12077, the ... Another aspect to assigning labels is trimming the tree at a certain level, which we can also easily implement. P.S. I'm waiting for #12087 to get in before I go over the code in #12077 for cleanup/code improvements. |
Thanks @adrinjalali, I'll take some time to look at R, but I think we should use the same strategy to extract labels in extract_xi and extract_sqlnk. Regarding #12087, I'd like to see #12421 go in first to avoid unnecessary conflicts. |
This may not have much bearing on sklearn's API, but I would point out that the OPTICS algorithm itself is defined to just produce Reachability and Ordering, not labels... of course, clustering modules in sklearn are expected to produce labels, hence why we default to an extraction method. |
This is why I don't like to provide labels in ... |
So let's provide an optics_distances function that just produces RD/CD. Or
an optics function for same but rename the class to OpticsClustering or
OPTICSClustering. The main point of the class interface for estimators is
to provide API consistency around diverse underlying functional forms. It
is appropriate to make the estimator API for OPTICS satisfy the clustering
API by default.
|
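A sketch of the "function that just produces RD/CD" idea. compute_optics_graph is the helper that eventually appeared in scikit-learn; treat the exact keyword-only signature below as an assumption based on later releases rather than anything fixed at the time of this discussion:

```python
import numpy as np
from sklearn.cluster import compute_optics_graph

X = np.random.RandomState(0).uniform(size=(200, 2))

# Produce only the OPTICS outputs (ordering, core distances, reachability,
# predecessor), with no labels attached.
ordering, core_distances, reachability, predecessor = compute_optics_graph(
    X,
    min_samples=10,
    max_eps=np.inf,
    metric="minkowski",
    p=2,
    metric_params=None,
    algorithm="auto",
    leaf_size=30,
    n_jobs=None,
)
```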
great, I’m also considering removing sqlnk these days |
@jnothman I think that API summary looks good. For item 6, I'd recommend ... |
Who's going to finish it? @adrinjalali or someone else?
E.g., if there's a big cluster and a small cluster (e.g., (1, 10), (4, 5)), then all the points not in the small cluster (i.e., 1-3, 6-10) will be tagged as noise? |
Adrin is focusing on NOCATS, but Assia will get his/my mentorship on
#12087. I have found it hard to concentrate on any one thing at the sprint
so far... But I was very sleep deprived yesterday!
With small clusters and noise the idea is that the user needs to be guided
in how parameters control minimal cluster size.
|
Is it possible? This seems magical :)
I was surprised when I saw you editing the wiki late at night. Wish you a good night's sleep! |
Ahh... You see too much. I was awake a couple of hours in the night. Read a short story. Updated a wiki. Etc :) |
Maybe we should settle the API first? And another question, @jnothman: I guess we don't need the optics function? |
Both solutions are OK from my side. |
@qinhanmin2014 #12087 resolves a bunch of these issues. Could you check that PR and let us know what you think in terms of API design? |
Yes, no need for the optics function.
|
I guess we can close? |
We do not yet have it working with fit_predict, nor have we yet landed on a solution for that AFAIK, so essentially this issue is not resolved. |
@jnothman We support fit_predict (through ClusterMixin) now? |
The point is to easily be able to get multiple clusterings without recomputing optics. |
Hmm, I guess we're going to rely on ... |
That's one way, the other way is to have fit + warm_start. |
It seems that I wrongly assumed that you've decided to rely on ... |
The point of this issue is that there should be a class-based approach.
|
What's the application scenario? @jnothman @adrinjalali |
While the xi method considers variable density in extraction, optics allows
the user to do multiple dbscan extractions, i.e. fixed density. The goal
would be to allow the user to efficiently evaluate many different eps, as
they would in grid search for supervised learning.
This is admittedly quite low priority as long as parameter search for
clustering remains on the backburner.
|
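Concretely, the grid-search-style usage this enables: one expensive fit, then many cheap eps extractions. Names follow the scikit-learn API that later landed; the silhouette score is just an arbitrary stand-in for whatever criterion a parameter search would use:

```python
import numpy as np
from sklearn.cluster import OPTICS, cluster_optics_dbscan
from sklearn.metrics import silhouette_score

X = np.random.RandomState(0).uniform(size=(300, 2))
opt = OPTICS(min_samples=10).fit(X)        # expensive stage, done once

for eps in (0.1, 0.2, 0.3, 0.5):           # cheap stage, repeated per eps
    labels = cluster_optics_dbscan(
        reachability=opt.reachability_,
        core_distances=opt.core_distances_,
        ordering=opt.ordering_,
        eps=eps,
    )
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    if n_clusters >= 2:                    # silhouette needs at least two labels
        print(eps, n_clusters, silhouette_score(X, labels))
```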
But ... Untagging this from 0.21. Retag if you disagree. |
Definitely fine to clear milestone
|
For hierarchical clustering (where computing the tree is much more
expensive than cutting it for a given number of clusters), my lab has
been successfully using joblib memory to answer such needs.
|
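For reference, a small sketch of that existing pattern using AgglomerativeClustering's memory parameter (the cache location below is an arbitrary choice): the merge tree is cached by joblib, so varying the number of clusters only re-cuts it.

```python
import numpy as np
from joblib import Memory
from sklearn.cluster import AgglomerativeClustering

X = np.random.RandomState(0).uniform(size=(300, 2))
memory = Memory(location="/tmp/sklearn_tree_cache", verbose=0)

# The expensive merge tree is computed once and cached on disk by joblib;
# changing n_clusters only changes where the cached tree is cut.
for n_clusters in (2, 3, 5, 8):
    labels = AgglomerativeClustering(
        n_clusters=n_clusters, memory=memory, compute_full_tree=True
    ).fit_predict(X)
```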
Yes, adding a memory parameter is one of the options here, and perhaps
the simplest and most consistent with other clustering, i.e.
hierarchical.
|
Currently we have an interface for OPTICS with custom method extract_dbscan. This is good for usability and visibility of the functionality, but means that a generic parameter search tool (like GridSearchCV) can't use OPTICS to perform DBSCAN at various eps.

This would involve adding an eps parameter which, when None, would use the default OPTICS clustering; when not None, would use extract_dbscan. But we would also need to retain the model across multiple fits...

Here are two alternative interfaces:

1. A warm_start parameter (like many classifiers, regressors, but uncharted territory for clusterers). When True, and fit or fit_predict is called, the current reachability_, ordering_ and core_distances_ would be kept, but a different final clustering step would be used to output / store labels_.
2. A memory parameter, like in hierarchical clustering. This would cache the mapping from parameters to reachability_, ordering_ and core_distances_ using a joblib.Memory.

I think the first option sounds more appropriate.
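Purely as a hypothetical sketch of option 1: no warm_start parameter exists on OPTICS, so its effect is faked here by subclassing, only to show what the interface would buy the user (one expensive graph computation, then cheap re-extractions as eps changes):

```python
import numpy as np
from sklearn.cluster import OPTICS, cluster_optics_dbscan

class WarmStartOPTICS(OPTICS):
    """Illustrative only: keep the computed graph across fits and rerun
    just the DBSCAN-like extraction when eps changes. Not scikit-learn API."""

    def fit(self, X, y=None):
        if not hasattr(self, "reachability_"):
            super().fit(X)                  # expensive stage, done once
        if self.eps is not None:            # cheap re-extraction for the current eps
            self.labels_ = cluster_optics_dbscan(
                reachability=self.reachability_,
                core_distances=self.core_distances_,
                ordering=self.ordering_,
                eps=self.eps,
            )
        return self

X = np.random.RandomState(0).uniform(size=(300, 2))
model = WarmStartOPTICS(min_samples=10, cluster_method="dbscan", eps=0.5)
labels_a = model.fit_predict(X)    # computes reachability_/ordering_/core_distances_ and labels
model.eps = 0.25
labels_b = model.fit_predict(X)    # reuses the stored graph, only re-extracts labels
```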