Description
Related
This is related to some issues we discussed and encountered in:
- OPTICS detecting the wrong outlier (Issue OPTICS detecting the wrong outlier #11677)
- fixes OPTICS split_points detected as NOISE and last point not detected as outlier (PR [WIP] fixes OPTICS split_points detected as NOISE and last point not detected as outlier. #11857)
Intro
To continue the discussion, it is important to distinguish between two main tasks included in the OPTICS
module/class:
- calculate the reachability plot, which is almost equivalent to a dendrogram as shown in [2].
- extract clusters
There's only one way/definition to extract the reachability plot the OPTICS way [1]. But there are multiple ways to extract the clusters, or the hierarchy of clusters from the reachability plot produced by OPTICS. The original OPTICS article [1] suggests two methods to extract the clusters:
- extract clusters based on a hard threshold on
eps
[1, Fig. 8]. This method is often referred to asextract_dbscan
, since there exists a DBSCAN with a set of parameters which gives the same result. - extract clusters based on the steepness observed on the reachability plot [1, Sec. 4]. The main parameter to this method is ξ (xi), and the method is often called
extract_xi
(R, ELKI).
There are other methods to extract clusters from a given reachability plot (or a dendrogram), and the one included in our OPTICS's fit right now is the one introduced in [2] which is well cited. Let's call this method extract_SQLNK
(for the lack of a name, taking the first character of the authors' surnames). The article argues that it produces better clusters than extract_xi
, with some empirically found default values for the parameters.
Current Status
At the moment, calculating the reachability plot is taken from @espg's implementation [3], and extract_SQLNK
(not with this name though) is taken from @amyxzhang's implementatoin [4].
Although the OPTICS class includes an extract_dbscan
method, it's not the one used in the fit
function. The fit
function calculates the reachability plot and then uses extract_SQLNK
to extract the clusters.
Right not the code does not include an implementation of extract_xi
.
Proposal
We need 3 extract_*
methods corresponding to the above 3 methods. Two of them are already implemented. We only need to implement the extract_xi
.
The constructor takes a parameter as extract_method
which specifies which extract method should be called in fit
.
The extract_
5FBF
*
methods run fast (O(n)) once the reachability plot is calculated. So it makes sense to have them as public functions and let the user call them after fit
. The fit
function uses the output of one of these extract_*
functions to set the labels_
. Calling an extract_*
function explicitly, won't change any attribute of the class.
The extract_xi
and extract_SQLNK
methods are allowed to have fixes and additions on top of what's proposed in the original articles, with the same spirit that it's done in @amyxzhang's implementation and in R [5]:
The used algorithm was originally contributed by the ELKI framework, but contains a set of fixes.
@amyxzhang, @espg: is the above analysis sound to you? Am I missing something?
@jnothman, @amueller, @rth, @qinhanmin2014 (and others), what do you think?
[1] OPTICS: Ordering Points To Identify the Clustering Structure
[2] Automatic Extraction of Clusters from Hierarchical Clustering Representations
[3] https://github.com/espg/OPTICS
[4] https://github.com/amyxzhang/OPTICS-Automatic-Clustering
[5] https://rdrr.io/cran/dbscan/man/optics.html