OPTICS Return all the clusters found by the extraction algorithm · Issue #12376 · scikit-learn/scikit-learn

Closed
qinhanmin2014 opened this issue Oct 14, 2018 · 10 comments · Fixed by #12077

Comments

@qinhanmin2014
Member

In OPTICS, the automatic extraction algorithm actually returns all possible clusters, but currently we only provide users with the clusters on the leaves. I guess we can provide users with an option to get all these clusters. (R provides a clusters_xi attribute.)
ping @espg @adrinjalali (I think it's even more important for extract_Xi)

@adrinjalali
Member

Both extract_sqlnk and extract_xi provide a hierarchy of clusters. I've already put return_clusters as a parameter to extract_xi in #12077. We can also extract the cluster hierarchy from the tree constructed by sqlnk.

I'd say we can add store_cluster_hierarchy (or something) to the OPTICS's init, and store the hierarchy in fit accordingly.

@qinhanmin2014
Member Author

Thanks. I've gone through your PR, and this issue is for the so-called sqlnk.
We've corrected our RD and I'll submit a PR to correct CD in a couple of days. At this point, an important thing in my mind is to decide the API design of OPTICS. If we use extract_method, then users who want to extract clusters with different methods will need to refit the model (i.e., compute CD and RD again).
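
The refit concern can be made concrete with a small sketch. It assumes the ordering, RD, and CD have already been computed once, and it uses a plain DBSCAN-style cut at a threshold `eps` as a stand-in extraction step; this is neither sqlnk nor xi. The function name `extract_eps_labels` and the toy arrays are hypothetical and not part of scikit-learn.

```python
NOISE = -1


def extract_eps_labels(ordering, reachability, core_distances, eps):
    """Label points with a DBSCAN-style cut at eps over a precomputed ordering.

    Hypothetical helper: it only reads the precomputed arrays, so calling it
    again with another eps does not recompute any distances.
    """
    labels = [NOISE] * len(ordering)
    current = -1
    for idx in ordering:
        if reachability[idx] > eps:
            # Not reachable from the previous points at this eps.
            if core_distances[idx] <= eps:
                current += 1          # the point starts a new cluster
                labels[idx] = current
            # otherwise the point stays noise
        else:
            labels[idx] = current     # the point joins the current cluster
    return labels


# Toy values standing in for the output of a single fit.
inf = float("inf")
ordering = [0, 1, 2, 3, 4, 5]
reachability = [inf, 0.5, 0.6, 2.0, 0.4, 0.5]
core_distances = [0.5, 0.5, 0.6, 0.4, 0.4, 0.5]

tight = extract_eps_labels(ordering, reachability, core_distances, eps=1.0)
loose = extract_eps_labels(ordering, reachability, core_distances, eps=3.0)
print(tight)  # [0, 0, 0, 1, 1, 1]
print(loose)  # [0, 0, 0, 0, 0, 0]
```

Both calls reuse the same arrays, which is the property an API with a separate extraction step would keep and a fit-time extract_method parameter would lose.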

@qinhanmin2014
Member Author

Another thing is the way we extract labels for each sample from the hierarchy structure. In sqlnk, we only consider the leaves, and I doubt whether that's correct. E.g., I guess there might be some unexpected behavior if one leaf of a node is pruned. AFAIK, R dbscan also considers the parent node. I'll need some time to think about it carefully, and your insights on these questions are deeply appreciated.

@jnothman
Member
jnothman commented Oct 14, 2018 via email

@adrinjalali
Member

The R OPTICS extract_xi has an option to choose whether it uses only the smallest clusters or also the parent clusters when assigning the labels. The _xi_cluster in #12077 puts the clusters from smallest to largest in the reported cluster list; that's why _extract_xi_labels works fine. It'll be easy to also include the larger clusters in labeling if we want to. But that can also be added later as an enhancement, as include_larger_clusters or something defaulting to False, without breaking backward compatibility.
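
A minimal sketch of labeling from a smallest-to-largest cluster list follows. It assumes each cluster is an inclusive `(start, end)` pair of positions in the ordering and that labels are indexed by those positions. The function `labels_from_clusters` is hypothetical, and the `include_larger_clusters` branch is only one possible reading of that option, not the behavior of #12077 or of R.

```python
NOISE = -1


def labels_from_clusters(n_samples, clusters, include_larger_clusters=False):
    """Assign flat labels from nested clusters listed smallest first.

    clusters: inclusive (start, end) positions in the ordering (assumed layout).
    """
    labels = [NOISE] * n_samples
    next_label = 0
    for start, end in clusters:
        span = range(start, end + 1)
        if include_larger_clusters:
            # Only points not yet claimed by a smaller cluster get this label.
            targets = [i for i in span if labels[i] == NOISE]
        elif any(labels[i] != NOISE for i in span):
            # A smaller cluster already sits inside this one: skip it.
            targets = []
        else:
            targets = list(span)
        if targets:
            for i in targets:
                labels[i] = next_label
            next_label += 1
    return labels


clusters = [(0, 2), (4, 6), (0, 7)]  # two leaves, then their parent
print(labels_from_clusters(9, clusters))
# [0, 0, 0, -1, 1, 1, 1, -1, -1]
print(labels_from_clusters(9, clusters, include_larger_clusters=True))
# [0, 0, 0, 2, 1, 1, 1, 2, -1]
```

Because smaller clusters come first, the default branch never needs to look ahead: any cluster that overlaps an already labeled point must be a parent.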

@jnothman
Member

Return the leaves but store the hierarchy on the estimator??? Return at a specified max depth according to some parameter?
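
One way to picture returning clusters at a specified max depth, as a sketch only: it assumes the stored hierarchy is a list of unique, properly nested inclusive `(start, end)` intervals over the ordering, and the function `clusters_at_max_depth` is hypothetical. Cutting at depth `d` keeps the clusters at depth `d` plus any leaves that are shallower than `d`.

```python
def clusters_at_max_depth(clusters, max_depth):
    """Cut a nested cluster hierarchy at max_depth (root clusters have depth 0)."""

    def encloses(outer, inner):
        # True when inner lies inside outer and the two are different clusters.
        return outer != inner and outer[0] <= inner[0] and inner[1] <= outer[1]

    selected = []
    for c in clusters:
        depth = sum(encloses(other, c) for other in clusters)
        is_leaf = not any(encloses(c, other) for other in clusters)
        if depth == max_depth or (depth < max_depth and is_leaf):
            selected.append(c)
    return sorted(selected)


hierarchy = [(0, 9), (0, 4), (6, 9), (0, 1), (3, 4)]
print(clusters_at_max_depth(hierarchy, 0))  # [(0, 9)]
print(clusters_at_max_depth(hierarchy, 1))  # [(0, 4), (6, 9)]
print(clusters_at_max_depth(hierarchy, 2))  # [(0, 1), (3, 4), (6, 9)]
```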

@jnothman
Member
jnothman commented Oct 16, 2018

Or prune to a specified n_clusters

@adrinjalali
Member

It's more like calculating the labels_ according to the leaves rather than returning the leaves. And yes, storing the hierarchy in the estimator. Does "???" mean you really don't like it? I guess I've seen it in other estimators, like store_covariances or something.

We can start by calculating the labels_ according to the leaves (or smallest found clusters), then add storing the hierarchy, and helper functions and parameters to fine tune extraction of labels_ from the hierarchy.

@jnothman
Member
jnothman commented Oct 16, 2018 via email

@espg
Contributor
espg commented Oct 19, 2018

We should absolutely provide a way to encapsulate multiple cluster hierarchies. I don't know how to do it, especially within the current sklearn API, but I certainly support enabling it as a feature...
