-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
OPTICS extraction methods #12036
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
@adrinjalali This is a good overview, and I think that it captures the OPTICS algorithm and implementation well. The issue that you opened #11929 deals with a test that currently serves two purposes:
Currently, the test in #11929 fails due to numerical precision issues when using 32bit installs. Switching to 50 (instead of 250) points per cluster fixes numerical issue between builds, but breaks the second purpose listed above. The test is still valuable for the first purpose above (i.e., ensuring our implementation doesn't 'drift' without our knowledge). I'm not sure how important the second point is for us... but if it is needed, we may want to check the reference 'chemometria' implementation. The python implementation from Brian Clowers is a port of the original code (written in MATLAB) to python. The port doesn't have any unit tests or other checks, so differences at lower point counts might be a bug in the port that isn't present in the original code... |
I'm confused about the relevance of your comment to the present issue,
@espg, which is not at all about the correctness of the current
implementation, but is proposing to also implement xi extraction from
reachability, and to access labels for all extraction methods through the
standard interfaces rather than a custom extract_* method.
I don't know how valuable the xi extraction method is... Do you have some
sense of whether it is entirely superseded by sqlnk, and how tricky an
implementation would be?
|
The xi extraction method is the one included in R and ELKI. In a sense, I suppose it's the one people who are familiar with OPTICS, expect to have as the standard extraction method. It is not superseded by sqlnk IMO. sqlnk detects better hierarchies when clusters of higher densities are embodied inside clusters of lower densities, but it labels more points as noise if we take the lowest level of the hierarchy (which we do by default). It also happened to us at the beginning in #11677, that we were comparing the results of our implementation with R, which uses xi or just DBSCAN, and of course the results are not comparable. From what I understand, it should be easier to implement than sqlnk, and we have the R implementation in the R package. I'm guessing it shouldn't be too hard to port it to python. |
Most R packages are not released under a compatible licence, and indeed
dbscan is GPL and we cannot use its code without asking the authors of any
relevant code to relicense it more permissively.
|
Does that mean I need to implement the |
Yes, I think porting by looking at the code would be counted as a
derivative work and would have to be GPL licensed
|
Question: Which one is preferred?
|
The union, I think. Please list the respective parameters, though
|
Related
This is related to some issues we discussed and encountered in:
Intro
To continue the discussion, it is important to distinguish between two main tasks included in the
OPTICS
module/class:There's only one way/definition to extract the reachability plot the OPTICS way [1]. But there are multiple ways to extract the clusters, or the hierarchy of clusters from the reachability plot produced by OPTICS. The original OPTICS article [1] suggests two methods to extract the clusters:
eps
[1, Fig. 8]. This method is often referred to asextract_dbscan
, since there exists a DBSCAN with a set of parameters which gives the same result.extract_xi
(R, ELKI).There are other methods to extract clusters from a given reachability plot (or a dendrogram), and the one included in our OPTICS's fit right now is the one introduced in [2] which is well cited. Let's call this method
extract_SQLNK
(for the lack of a name, taking the first character of the authors' surnames). The article argues that it produces better clusters thanextract_xi
, with some empirically found default values for the parameters.Current Status
At the moment, calculating the reachability plot is taken from @espg's implementation [3], and
extract_SQLNK
(not with this name though) is taken from @amyxzhang's implementatoin [4].Although the OPTICS class includes an
extract_dbscan
method, it's not the one used in thefit
function. Thefit
function calculates the reachability plot and then usesextract_SQLNK
to extract the clusters.Right not the code does not include an implementation of
extract_xi
.Proposal
We need 3
extract_*
methods corresponding to the above 3 methods. Two of them are already implemented. We only need to implement theextract_xi
.The constructor takes a parameter as
extract_method
which specifies which extract method should be called infit
.The
extract_*
methods run fast (O(n)) once the reachability plot is calculated. So it makes sense to have them as public functions and let the user call them afterfit
. Thefit
function uses the output of one of theseextract_*
functions to set thelabels_
. Calling anextract_*
function explicitly, won't change any attribute of the class.The
extract_xi
andextract_SQLNK
methods are allowed to have fixes and additions on top of what's proposed in the original articles, with the same spirit that it's done in @amyxzhang's implementation and in R [5]:@amyxzhang, @espg: is the above analysis sound to you? Am I missing something?
@jnothman, @amueller, @rth, @qinhanmin2014 (and others), what do you think?
[1] OPTICS: Ordering Points To Identify the Clustering Structure
[2] Automatic Extraction of Clusters from Hierarchical Clustering Representations
[3] https://github.com/espg/OPTICS
[4] https://github.com/amyxzhang/OPTICS-Automatic-Clustering
[5] https://rdrr.io/cran/dbscan/man/optics.html
The text was updated successfully, but these errors were encountered: