8000 OPTICS extraction methods · Issue #12036 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content
OPTICS extraction methods #12036
Closed
Closed
@adrinjalali

Description

@adrinjalali

Related

This is related to some issues we discussed and encountered in:

Intro

To continue the discussion, it is important to distinguish between two main tasks included in the OPTICS module/class:

  • calculate the reachability plot, which is almost equivalent to a dendrogram as shown in [2].
  • extract clusters

There's only one way/definition to extract the reachability plot the OPTICS way [1]. But there are multiple ways to extract the clusters, or the hierarchy of clusters from the reachability plot produced by OPTICS. The original OPTICS article [1] suggests two methods to extract the clusters:

  • extract clusters based on a hard threshold on eps [1, Fig. 8]. This method is often referred to as extract_dbscan, since there exists a DBSCAN with a set of parameters which gives the same result.
  • extract clusters based on the steepness observed on the reachability plot [1, Sec. 4]. The main parameter to this method is ξ (xi), and the method is often called extract_xi (R, ELKI).

There are other methods to extract clusters from a given reachability plot (or a dendrogram), and the one included in our OPTICS's fit right now is the one introduced in [2] which is well cited. Let's call this method extract_SQLNK (for the lack of a name, taking the first character of the authors' surnames). The article argues that it produces better clusters than extract_xi, with some empirically found default values for the parameters.

Current Status

At the moment, calculating the reachability plot is taken from @espg's implementation [3], and extract_SQLNK (not with this name though) is taken from @amyxzhang's implementatoin [4].

Although the OPTICS class includes an extract_dbscan method, it's not the one used in the fit function. The fit function calculates the reachability plot and then uses extract_SQLNK to extract the clusters.

Right not the code does not include an implementation of extract_xi.

Proposal

We need 3 extract_* methods corresponding to the above 3 methods. Two of them are already implemented. We only need to implement the extract_xi.

The constructor takes a parameter as extract_method which specifies which extract method should be called in fit.

The extract_ 5FBF * methods run fast (O(n)) once the reachability plot is calculated. So it makes sense to have them as public functions and let the user call them after fit. The fit function uses the output of one of these extract_* functions to set the labels_. Calling an extract_* function explicitly, won't change any attribute of the class.

The extract_xi and extract_SQLNK methods are allowed to have fixes and additions on top of what's proposed in the original articles, with the same spirit that it's done in @amyxzhang's implementation and in R [5]:

The used algorithm was originally contributed by the ELKI framework, but contains a set of fixes.

@amyxzhang, @espg: is the above analysis sound to you? Am I missing something?
@jnothman, @amueller, @rth, @qinhanmin2014 (and others), what do you think?

[1] OPTICS: Ordering Points To Identify the Clustering Structure
[2] Automatic Extraction of Clusters from Hierarchical Clustering Representations
[3] https://github.com/espg/OPTICS
[4] https://github.com/amyxzhang/OPTICS-Automatic-Clustering
[5] https://rdrr.io/cran/dbscan/man/optics.html

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions

      0