8000 OPTICS extraction methods · Issue #12036 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

OPTICS extraction methods #12036

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
adrinjalali opened this issue Sep 7, 2018 · 8 comments · Fixed by #12087
Closed

OPTICS extraction methods #12036

adrinjalali opened this issue Sep 7, 2018 · 8 comments · Fixed by #12087

Comments

@adrinjalali
Copy link
Member

Related

This is related to some issues we discussed and encountered in:

Intro

To continue the discussion, it is important to distinguish between two main tasks included in the OPTICS module/class:

  • calculate the reachability plot, which is almost equivalent to a dendrogram as shown in [2].
  • extract clusters

There's only one way/definition to extract the reachability plot the OPTICS way [1]. But there are multiple ways to extract the clusters, or the hierarchy of clusters from the reachability plot produced by OPTICS. The original OPTICS article [1] suggests two methods to extract the clusters:

  • extract clusters based on a hard threshold on eps [1, Fig. 8]. This method is often referred to as extract_dbscan, since there exists a DBSCAN with a set of parameters which gives the same result.
  • extract clusters based on the steepness observed on the reachability plot [1, Sec. 4]. The main parameter to this method is ξ (xi), and the method is often called extract_xi (R, ELKI).

There are other methods to extract clusters from a given reachability plot (or a dendrogram), and the one included in our OPTICS's fit right now is the one introduced in [2] which is well cited. Let's call this method extract_SQLNK (for the lack of a name, taking the first character of the authors' surnames). The article argues that it produces better clusters than extract_xi, with some empirically found default values for the parameters.

Current Status

At the moment, calculating the reachability plot is taken from @espg's implementation [3], and extract_SQLNK (not with this name though) is taken from @amyxzhang's implementatoin [4].

Although the OPTICS class includes an extract_dbscan method, it's not the one used in the fit function. The fit function calculates the reachability plot and then uses extract_SQLNK to extract the clusters.

Right not the code does not include an implementation of extract_xi.

Proposal

We need 3 extract_* methods corresponding to the above 3 methods. Two of them are already implemented. We only need to implement the extract_xi.

The constructor takes a parameter as extract_method which specifies which extract method should be called in fit.

The extract_* methods run fast (O(n)) once the reachability plot is calculated. So it makes sense to have them as public functions and let the user call them after fit. The fit function uses the output of one of these extract_* functions to set the labels_. Calling an extract_* function explicitly, won't change any attribute of the class.

The extract_xi and extract_SQLNK methods are allowed to have fixes and additions on top of what's proposed in the original articles, with the same spirit that it's done in @amyxzhang's implementation and in R [5]:

The used algorithm was originally contributed by the ELKI framework, but contains a set of fixes.

@amyxzhang, @espg: is the above analysis sound to you? Am I missing something?
@jnothman, @amueller, @rth, @qinhanmin2014 (and others), what do you think?

[1] OPTICS: Ordering Points To Identify the Clustering Structure
[2] Automatic Extraction of Clusters from Hierarchical Clustering Representations
[3] https://github.com/espg/OPTICS
[4] https://github.com/amyxzhang/OPTICS-Automatic-Clustering
[5] https://rdrr.io/cran/dbscan/man/optics.html

@espg
Copy link
Contributor
espg commented Sep 10, 2018

@adrinjalali This is a good overview, and I think that it captures the OPTICS algorithm and implementation well. The issue that you opened #11929 deals with a test that currently serves two purposes:

  1. Verifies that optics reachability values are canonically consistent across code updates. Any changes to the reachability graph may or may not change extraction results that are checked in other tests; however, as you note, there is only one way to extract the reachability values, and we would like to know if they change or break when updating the OPTICS algorithm in sklearn.

  2. Verifies that optics reachability values are consistent between implementations. Since there is only one way/definition to build the graph, independent implementations should have values that agree to within machine precision. In practice, they may be different for a number of reasons-- e.g. some implementations define a neighborhood for search and process points in that neighborhood before defining a new neighborhood, while others rebuild the neighborhood per each point.

Currently, the test in #11929 fails due to numerical precision issues when using 32bit installs. Switching to 50 (instead of 250) points per cluster fixes numerical issue between builds, but breaks the second purpose listed above. The test is still valuable for the first purpose above (i.e., ensuring our implementation doesn't 'drift' without our knowledge).

I'm not sure how important the second point is for us... but if it is needed, we may want to check the reference 'chemometria' implementation. The python implementation from Brian Clowers is a port of the original code (written in MATLAB) to python. The port doesn't have any unit tests or other checks, so differences at lower point counts might be a bug in the port that isn't present in the original code...

@jnothman
Copy link
Member
jnothman commented Sep 12, 2018 via email

@adrinjalali
Copy link
Member Author
adrinjalali commented Sep 12, 2018

The xi extraction method is the one included in R and ELKI. In a sense, I suppose it's the one people who are familiar with OPTICS, expect to have as the standard extraction method.

It is not superseded by sqlnk IMO. sqlnk detects better hierarchies when clusters of higher densities are embodied inside clusters of lower densities, but it labels more points as noise if we take the lowest level of the hierarchy (which we do by default).

It also happened to us at the beginning in #11677, that we were comparing the results of our implementation with R, which uses xi or just DBSCAN, and of course the results are not comparable.

From what I understand, it should be easier to implement than sqlnk, and we have the R implementation in the R package. I'm guessing it shouldn't be too hard to port it to python.

@jnothman
Copy link
Member
jnothman commented Sep 13, 2018 via email

@adrinjalali
Copy link
Member Author

Does that mean I need to implement the extractXi method without looking at the R code? The relevant part is in R (I initially by mistake thought it's in the C++ part), and I'm reimplementing it in Python. I'm genuinely not sure if this violates the license. Of course I'm also more than happy for us to explicitly ask the package maintainers/authors.

@jnothman
Copy link
Member
jnothman commented Sep 13, 2018 via email

@adrinjalali
Copy link
Member Author

Question: Which one is preferred? OPTICS init should take:

  • the union of parameters to all extract methods
  • take **kwargs to pass to them
  • take a extract_params: dict to pass to the extract methods

@jnothman
Copy link
Member
jnothman commented Sep 16, 2018 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants
0