OPTICS test_reach_dists inappropriate #12090
Interesting. Our policy seems appropriate, although more computationally complex.
Should the test still be valid in the min_samples=2 case?
I guess it will still be valid when min_samples=2 (not sure though), because when min_samples=2 the reachability distance reduces to the plain distance between points, and we'll always move on to the closest point (when there are multiple tied points, we'll always choose the one with the smallest index).
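A minimal sketch of why that holds, under the paper's definitions (the helper below is illustrative, not sklearn's API): with min_samples=2 the core distance of a point o is the distance to its single nearest neighbor, so no other point can be closer, and max(core_dist(o), dist(o, p)) collapses to dist(o, p).

```python
import numpy as np

def reach_dist(X, o, p, min_samples=2):
    # distances from point o to every point; position o itself is 0.0
    d = np.linalg.norm(X - X[o], axis=1)
    # core distance: (min_samples-1)-th smallest distance excluding self,
    # i.e. index min_samples-1 of the sorted array that includes the 0.0
    core = np.sort(d)[min_samples - 1]
    # with min_samples=2, core == nearest-neighbor distance <= d[p],
    # so the max() always returns the plain pairwise distance d[p]
    return max(core, d[p])
```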
So as long as we don't encounter this edge case, the two algorithms should return the same RD, which I guess would be the case for our randomly generated data. I don't think that renders the test invalid, it just means we should avoid this edge case and document it in the test.
How can you prove that in our (randomly) generated dataset, there will never be two points with the same RD (I think it's impossible unless min_samples=2)? I don't think it's an edge case. After picking point 0 at the beginning, there are at least (n_samples-1) points with the same RD, and the two implementations will choose different points unless the first point with the smallest RD is also the closest to point 0.
And yes, maybe we can carefully construct a test to avoid the difference, but honestly I don't like the idea. We compare our implementation with an existing implementation, but it seems that the output of our implementation differs even on the example provided by that existing implementation.
Awesome point @qinhanmin2014, fair enough, although I'm not sure why our generated RDs are identical to theirs. What's your proposal here?
I'd rather regard it as a coincidence. Actually, I'm not able to reproduce the test (i.e., I get different results even with the current test).
Maybe remove the test. I'm not sure at this point. It's good to compare with other implementations, but I think we need to ensure that the implementation we compare against behaves the same way as ours.
Regardless of which implementation we're comparing with (or even if we do compare with another implementation), I think that we want this test. All of the other tests check the output of the clusters-- this is the only test that directly verifies that the reachability distances aren't changing as people update code. I think that we need a test that verifies that our reachability distances are invariant from one release to the next...
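A rough sketch of the shape such a regression test could take, assuming the current sklearn API (the dataset, parameters, and frozen values are placeholders, not real reference output):

```python
import numpy as np
from sklearn.cluster import OPTICS

def test_reach_dists_stable():
    # fixed seed so the dataset, and hence the expected values, never change
    X = np.random.RandomState(0).rand(20, 2)
    clust = OPTICS(min_samples=5).fit(X)
    # in a real test, expected_reach would be hard-coded numbers copied
    # from a trusted release; derived here only to keep the sketch runnable
    expected_reach = clust.reachability_.copy()
    np.testing.assert_allclose(clust.reachability_, expected_reach)
```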
So what's your proposal here @espg?
Things are getting more and more complex here. For the above example:
We get another result from R dbscan:
I'm not immediately able to figure out how R works.
I think that the R implementation is the more appropriate one to focus on, as it is better tested and in general use. We get near-identical results for reachability, but not for order, and it seems to be based on how they are expanding their neighborhood queries. Our implementation uses the next nearest point. The paper referenced by @kno10 states that (for the R version):
I don't think that this is a sensible way for the algorithm to run, as it makes reproducibility extremely difficult to verify... but I'd like to see if there's a way to force comparison between R and our implementation that bypasses this quirk. Perhaps we can produce an example that ensures that the priority queue in the R version stays full, and verify that the reachability distances are the same for the same given queue, if that is in fact the source of the disagreement.
I think another thing we need to consider is whether it is worthwhile to sacrifice performance here.
All implementations I know do not actually choose at random, but the next unprocessed point (that is also how the original pseudo-code does this). The formulation "at random" does not imply uniform random sampling - just that there is no "provable" logic for which point to use next, so we may choose any.
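A tiny sketch of that convention (the boolean mask is illustrative): when the seed list is empty, take the lowest-index unprocessed point as the next starting object, which is deterministic and cheap rather than random.

```python
import numpy as np

def next_start_point(processed):
    # 'processed' is a boolean mask over all points; scanning in index
    # order matches how the paper's pseudo-code iterates over the database
    unprocessed = np.flatnonzero(~processed)
    return int(unprocessed[0]) if unprocessed.size else None
```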
@kno10 So you mean when there are multiple points with the same reachability-distance, we should choose the point with the smallest index? Do you have some references/implementations to support this approach? Thanks in advance.
@qinhanmin2014 , I think that's how it's done in ELKI's OPTICSHeap implementation, which uses
@kno10, as far as I see, you're the main author of the OPTICS-related files in ELKI. As you see, there are many details of the algorithm which are not directly mentioned in the original paper, and since we'd like not to diverge from the other implementations (mainly ELKI and R), we end up checking those code bases. But since both those packages are released under the AGPL, we can't directly port them, and since I'm not sure what counts as deriving from that code base and what not, I feel uncomfortable whenever I go and check something from that implementation (#12036 (comment), 8af9364). I guess ideally, from
A) Choosing the next unprocessed point on an empty heap: see Figure 5 in the original OPTICS paper. It prefers points in index order as starting points, but there is no particular reason for this except that it is very cheap. So this actually is in the paper.

B) Sorting candidates by reachability, then index: it obviously is not necessary to do it this way; it is just an obvious approach to make results less random. In fact, I don't much like assuming that every object has an integer index, once you think about working with dynamic data (and not just indexes).

As for direct porting - I consider it valuable to see different interpretations, in particular of that Xi approach, because it does have some ambiguity. Maybe there is a different interpretation that works better. IMHO it is more important to follow certain data structure / efficiency considerations (such as the memory issue, and queries - what is actually in the paper) than the more subtle interpretation of what is not clear in the paper.
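For point B), a minimal illustration of the (reachability, index) key trick using Python's heapq (not ELKI's actual code): tuples compare lexicographically, so equal reachabilities fall back to the smaller index.

```python
import heapq

heap = []
heapq.heappush(heap, (0.5, 7))   # (reachability, point index)
heapq.heappush(heap, (0.5, 3))   # same reachability, smaller index
heapq.heappush(heap, (0.2, 9))

print(heapq.heappop(heap))       # (0.2, 9): smallest reachability wins
print(heapq.heappop(heap))       # (0.5, 3): tie broken by smaller index
```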
Thanks @kno10 |
I'm still confused as to why the two different implementations differ in their output. As @kno10 mentions:
That's true. It's true for our implementation as well; we expand the cluster order for a given seed until the list is empty, and if the list is empty, we take the next unprocessed point in the dataset as the starting point for a new seed to expand. The problem is that this only happens if the seed list is empty-- and if we look at simple examples containing only a few points with max_bound / eps set either high or to inf, we should still be getting the same cluster order and reachability distances between the implementations. What I mean is, we should be expanding a single seed that will contain all data points.

My guess from looking at the output of R versus our implementation is that both calculate the reachability distance the same way, but they seem to produce a different cluster order; and as soon as you change the ordering output, there will be differences in reachability because the distance calculated will be between different points-- i.e., the reachability distances are likely valid for the output point order, but the orders are divergent. I'd really like to know why the algorithms differ in their output ordering.

This comment shows runs of both algorithms on a small dataset, with an eps set to '10' so that all points are considered within the neighborhood for the first and subsequent calls of 'OrderSeeds'. This is the original pseudo-code for 'OrderSeeds'. Note that the pseudo-code doesn't contain an invariant on the order of the candidates for the first time it's run; but the paper includes this text, on the same page:
This implies to me that whatever is happening in the algorithm, the output from this call (i.e., OrderSeeds.next(), which defines currentObject and hence the ordering of the graph) is going to be the point or object that has the lowest reachability as calculated by OrderSeeds. As I mention here: #11677 (comment)
My question is: why? The pseudo-code for OrderSeeds defines

What is the logic for the R version choosing point 15 in the referenced dataset?
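For concreteness, here is a hedged Python rendering of the OrderSeeds update step as I read the paper's pseudo-code (names follow the paper, a plain dict stands in for the priority queue; this is not sklearn's actual code):

```python
import math

def order_seeds_update(seeds, reach_dist, neighbors, center_core_dist,
                       dists, processed):
    # seeds: dict mapping point -> current reachability (a priority map);
    # dists[p]: distance from the current center object to point p
    for p in neighbors:
        if processed[p]:
            continue
        new_reach = max(center_core_dist, dists[p])
        # insert p, or decrease its key if the new path is better;
        # an 'undefined' reachability is represented as math.inf here
        if new_reach < reach_dist.get(p, math.inf):
            reach_dist[p] = new_reach
            seeds[p] = new_reach
```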
I've been sitting here reading my analysis above and editing (and re-editing) the last part of the text for the past quarter hour. I'm beginning to get the sinking feeling that both R and sklearn are implementing the algorithm wrong. I'm fairly certain that point 15 isn't the appropriate choice for the 2nd member of the reachability graph; I can't think of any reason that it would be.

However, looking at the pseudo-code from the original paper, I'm at a bit of a loss to explain why point 13 has a reachability distance of 0.29266360833. Both R and sklearn agree on this. They seem to be getting that number as the core_distance of point 5, which is what R and ELKI term the predecessor of 13. The problem is that point 5, when queried from the first point of the dataset, has a distance of 2.703934... so calling max(2.703934, 0.29266360833) should yield 2.7 and not 0.29. I find myself wondering why that distance isn't 0.333208 instead... @kno10 is it possible that both of us are getting the reachability wrong for point 13?
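For reference, the paper defines the reachability distance of p with respect to a predecessor o as max(core-dist(o), dist(o, p)); both arguments involve o, so if I read the definition right, the value for point 13 via predecessor 5 hinges on dist(5, 13), not on dist(0, 5):

```python
# reachability-distance(p, o) per the OPTICS paper; o is the predecessor
# from whose neighborhood p is reached (not the start of the ordering)
def reachability_distance(core_dist_o, dist_o_p):
    return max(core_dist_o, dist_o_p)
```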
(the filter is necessary to change indexing to start at 0) gives
so you can see the ID preference is currently also doing the wrong thing in ELKI - it is preferring the larger IDs at the same reachability. If I fix this bug, I get:
that should be the results you get when you do the algorithm with pen and paper, but I didn't double-check it again. As you can see, there can be quite some variability in "correct" results depending on the order in which you process the tied results, because points within the core radius will be tied - i.e., we see a lot of ties.

Nevertheless, there is a mismatch in reachdist. You can see the first points (3 and 6) have reachdist 0.3332 = the core distance of point 0, their predecessor. In the second run, point 12 later became reachable at a lower distance, but initially (cf. the first run) it also had 0.3332, the core distance of 0.

As a very simple unit test, you can verify that the reachability of the second line must be the core distance of the first line. (ELKI does not store the core distances anymore, because we don't use them anywhere later, but it would be trivial to add.)
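A sketch of that invariant against sklearn's public attributes (reachability_, core_distances_, and ordering_ are assumed from the sklearn API; this is the property a fixed implementation should satisfy, not a claim about the current buggy one):

```python
import numpy as np
from sklearn.cluster import OPTICS

X = np.random.RandomState(0).rand(30, 2)
clust = OPTICS(min_samples=5).fit(X)

first, second = clust.ordering_[0], clust.ordering_[1]
# the second point in cluster order is reached from the first, and the best
# reachability any point can get from the first is its core distance
assert np.isclose(clust.reachability_[second],
                  clust.core_distances_[first])
```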
@qinhanmin2014 I think that the discussion and analysis have established the following:
Once we fix this #12123 issue in the main body of the OPTICS code, we can either:
I don't think the R implementation is broken. As far as I can tell, it was tested against ELKI. It doesn't use a heap, but tries to pick objects in the same order as ELKI:
The R implementation chooses point 14 as the 1st ordered point in the reachability graph (i.e., as the first point after the initialization point, the initialization point being point 0 and having an infinite reachability distance). As you have pointed out, there are multiple correct choices for the 2nd point due to ties:
I don't think that that is up for debate-- and the output from @adrinjalali in this #11677 (comment) shows that the indices of the first two points in the R ordering are [0, 14, ...].
As far as I can tell, it's likely the output code you used that produced a wrong table. Instead of printing them as is, print reachability[order], coredistance[order], etc.
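A toy illustration of that reindexing (all values made up): indexing the per-point arrays by the cluster order makes the printed rows line up with the ordering rather than with raw object IDs.

```python
import numpy as np

reachability = np.array([np.inf, 0.4, 0.9, 0.3])  # indexed by object id
order = np.array([0, 3, 1, 2])                    # cluster order

print(reachability[order])  # [inf 0.3 0.4 0.9] -- now in cluster order
```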
@kno10, really good point! Silly me.
Note that the same indexing confusion exists in the sklearn API... (and I would expect that this will frequently come up with users: they expect reachability and core distances to be in cluster order, not data order - OPTICS usually uses cluster order, although I agree that most of the time object order is more convenient!) A typical side effect of not having a type-safe API and of using integers as identifiers. @adrinjalali in Python you also need to take the cluster order as index to better interpret the results in #11677 (comment). If you go to the bottom table ("The results kinda seem closer now!") and reorder it based on the cluster order, then the second line does get the right reachability distance.
Yeah, true. For me, one thing which helped was the distinction between
Yes, this is confusing. Before reading the test of R dbscan, I failed to recognize that RD/CD from R is in cluster order. Maybe we should document our convention here. At this point I'm not persuaded to use cluster order instead of object order.
I'm fine with both ways, or maybe we can combine them (construct a small dataset and compare with other implementations). A couple of things on my mind currently:
I believe a good example and explicit documentation of the storage layout are necessary. ("Reachabilities, indexed by object number, not in cluster order; use `r.reachability_[r.clusterorder_]` to access in cluster order", etc.) Sorting the result will not work reliably enough. Unless you fix the ordering using the IDs on ties, the results can also vary in their reachability distance, because once a point is added to the output, its reachability is not updated afterwards anymore. Depending on the exploration order, points may have more than one path with different reachabilities. If you'd want to fix that, go for HDBSCAN* instead. Always choosing the lowest ID on ties should make the result deterministic, though (except for numerical differences; see, e.g., #12128)
I've submitted #12132 to update the doc, so we can focus on the implementation here. |
I looked into the code a bit, and I think there is another major flaw in the way the next point is chosen. Precisely, the error that I am referring to is here:

scikit-learn/sklearn/cluster/optics_.py, lines 506 to 507 in 0c0a9e8
In this snippet, the next point is chosen only from the neighbors of the current point, not from all unprocessed points.

Here is a simple test case, try it on paper:

scikit-learn/sklearn/cluster/optics_.py, lines 467 to 469 in 0c0a9e8
then you'll see that you get points that are unprocessed, but reachable! Unit test: 5de53e1
The current code may expand the wrong point, because it only considers the neighbors of the current point, not all currently unprocessed points (which may have a better reachability). Using the distance from the latest point as a tiebreaker then does not work anymore, because it might not yet be computed for all unprocessed points when using an index. If we choose the first on ties, we get the same result as R and ELKI. But the order of points in "unproc" is also unstable, so we cannot rely on the first smallest to have the smallest index. This commit still contains two solutions:
1. a fixed cython quick_scan (but not sure if indexes are always a long?)
2. a numpy solution, but I don't know how cheap and fast masked arrays are
If benchmarking shows that option 2 is faster, you can remove _optics_inner.pyx and un-import quick_scan.
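A minimal sketch of the selection rule this commit describes (array names are illustrative): mask out processed points and take the argmin over all remaining reachabilities, which also breaks ties by the smallest index because argmin returns the first minimum. The sketch assumes at least one unprocessed point has a finite reachability; the empty-queue case is handled separately by picking the next unprocessed index.

```python
import numpy as np

def next_unprocessed(reachability, processed):
    # consider *all* unprocessed points, not just neighbors of the current
    # point; the masked argmin returns the lowest index on tied minima
    masked = np.ma.array(reachability, mask=processed)
    return int(masked.argmin())
```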
And a fix is here:
This one also stores the predecessor, and compares results against ELKI's result. Again, 32 bit may (see the old test) yield some numeric issues. Also, using "euclidean" instead of "minkowski" breaks the test at least with the numpy version because of the numeric instability of the faster euclidean distance in sklearn (cf. #9354 and others). The softened
@kno10 Thank you for diving into this and finding the source of the error; I appreciate it. I should have a bit of time on Monday to look into the patches and run a few benchmarks to see the impact of one fix versus the other.
...I'm still running benchmarks on some larger datasets, but so far it looks like the cython performance gain is pretty marginal (i.e., about 20% faster at best, and tied with numpy in some cases). It may be worth removing quick_scan and going with the pure python module.
@kno10 Do you have time to submit a PR to fix the RD/CD/order of our OPTICS? You can also add your test (with ELKI) in the same PR. Since our euclidean distance is sometimes problematic, I'm fine with using other distance metrics. It's not urgent; we just want to know your plan. Thanks in advance.
@qinhanmin2014 I can submit a PR for the above branch. I was waiting for @espg's final feedback on whether to keep the cython module, or drop it.
@kno10 I think we can drop the cython module, and take the pure python solution that implements predecessor...
* Add a test case for OPTICS bug (closes #12134)
* ENH Fix processing order in OPTICS. See #12090
I'm beginning to suspect the correctness of the test. It seems that our implementation and the referenced implementation are not the same. In our implementation, when there are multiple points with the same reachability-distance, we choose the point which is closest to the current point. In the referenced implementation, the point with the smallest index is chosen. See the example below (taken from the referenced implementation):
We get the same CD, but a different order (e.g., after choosing point 0, points 1, 3, 4 & 6 have the same RD; our implementation chooses point 3 because it's closest to point 0, while the referenced implementation chooses point 1 because it has the smallest index), and thus different RD (the RD of the current point depends only on previously processed points)
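A hypothetical illustration of the two tie-breaking rules (indices and distances are made up to mirror the example): one rule picks the tied candidate nearest to the current point, the other the smallest index.

```python
import numpy as np

candidates = np.array([1, 3, 4, 6])            # tied on RD after point 0
dist_to_cur = np.array([2.0, 0.5, 1.2, 0.9])   # distances to current point

ours = candidates[np.argmin(dist_to_cur)]      # closest point   -> 3
referenced = candidates[np.argmin(candidates)] # smallest index  -> 1
```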
Correct me if I'm wrong @jnothman @espg @adrinjalali