
OPTICS test_reach_dists inappropriate #12090


Closed
qinhanmin2014 opened this issue Sep 16, 2018 · 39 comments · Fixed by #12357
@qinhanmin2014
Member

I'm beginning to doubt the correctness of the test. It seems that our implementation and the referenced implementation are not the same. In our implementation, when there are multiple points with the same reachability-distance, we choose the point closest to the current point. In the referenced implementation, the point with the smallest index is chosen. See the example below (taken from the referenced implementation):

import numpy as np
from sklearn.cluster import OPTICS

X = np.array([[ 15.,  70.], [ 31.,  87.], [ 45.,  32.], [ 32.,  83.],
              [ 26.,  50.], [  7.,  31.], [ 43.,  97.]])
###
# result from our implementation
clust = OPTICS(min_samples=5)
clust.fit(X)
clust.core_distances_
# array([38.89730068, 37.33630941, 52.63078947, 33.54101966, 33.54101966,
#        57.69748695, 49.979996  ])
clust.reachability_
# array([        inf, 33.54101966, 33.54101966, 38.89730068, 33.54101966,
#        33.54101966, 33.54101966])
clust.ordering_
# array([0, 3, 1, 6, 4, 2, 5])
###
# result from referenced implementation
RD, CD, order = optics(X, 4)
CD
# array([38.89730068, 37.33630941, 52.63078947, 33.54101966, 33.54101966,
#        57.69748695, 49.979996  ])
RD
# array([ 0.        , 38.89730068, 33.54101966, 37.33630941, 33.54101966,
#        33.54101966, 33.54101966])
order
# [0, 1, 3, 4, 2, 5, 6]

We get the same CD but a different order (e.g., after choosing point 0, points 1, 3, 4, and 6 have the same RD; our implementation chooses point 3 because it is closest to point 0, while the referenced implementation chooses point 1 because it has the smallest index), and thus a different RD (the RD of the current point depends only on previously processed points).
Correct me if I'm wrong @jnothman @espg @adrinjalali
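
To make the difference concrete, here is a minimal sketch of the two tie-breaking policies (hypothetical helper functions and variable names, not the actual code of either implementation; reach holds reachability-distances, dist_to_current the distances from the point just processed):

def next_point_by_index(reach, unprocessed):
    # referenced implementation: smallest RD, ties broken by smallest index
    return min(unprocessed, key=lambda i: (reach[i], i))

def next_point_by_distance(reach, unprocessed, dist_to_current):
    # our implementation: smallest RD, ties broken by distance to current point
    return min(unprocessed, key=lambda i: (reach[i], dist_to_current[i]))

With points 1, 3, 4, and 6 tied as above, the first helper returns 1 and the second returns 3, which is exactly the divergence shown in the output.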

@jnothman
Member
jnothman commented Sep 16, 2018 via email

@qinhanmin2014
Member Author

Should the test still be valid in the min_samples=2 case?

I guess it will be valid when min_samples=2 (not sure though), because in that case the reachability-distance reduces to the plain distance between points and we'll always move on to the closest point (when there are multiple tied points, we'll always choose the one with the smallest index).

@adrinjalali
Member

So as long as we don't encounter this edge case, the two algorithms should return the same RD, which I guess would be the case for our randomly generated data. I don't think that renders the test invalid; it just means we should avoid this edge case and document it in the test.

@qinhanmin2014
Member Author

So as long as we don't encounter this edge case, the two algorithms should return the same RD, which I guess would be the case for our randomly generated data.

How can you prove that in our (randomly) generated dataset there will never be two points with the same RD (I think that's impossible to guarantee unless min_samples=2)? I don't think it's an edge case: after picking point 0 at the beginning, there are at least (min_samples-1) points tied at the same RD (the core distance of point 0), and the two implementations will choose different points unless the tied point with the smallest index happens to be the one closest to point 0.

@qinhanmin2014
Member Author

And yes, maybe we can carefully construct a test that avoids the difference, but honestly I don't like the idea. We compare our implementation with an existing implementation, but it seems that our output differs even on the example provided by that existing implementation.

@adrinjalali
Member

Awesome point @qinhanmin2014, fair enough, although I'm not sure why our generated RDs are identical to theirs. What's your proposal here?
Also, as far as checking the RD generated by the algorithm goes, I don't think the test is invalid. Maybe what's invalid is comparing it to another implementation that is incompatible with ours.

@qinhanmin2014
Member Author

Although I'm not sure why our generated RDs are identical to theirs.

I'd rather regard it as a coincidence. Actually, I'm not able to reproduce the test (i.e., I get different results even with the current test).

What's your proposal here?

Maybe remove the test. I'm not sure at this point. It's good to compare with other implementations, but I think we need to ensure that the implementation we compare against behaves the same as ours.

@espg
Contributor
espg commented Sep 17, 2018

Regardless of which implementation we're comparing with (or even whether we compare with another implementation at all), I think we want this test. All of the other tests check the cluster output-- this is the only test that directly verifies that the reachability distances aren't changing as people update the code. I think we need a test that verifies that our reachability distances are invariant from one release to the next...

@qinhanmin2014
Member Author

So what's your proposal here @espg ?
Are there any references/packages to support your implementation (i.e., choosing the closest point when there are multiple points with the same RD)?

@qinhanmin2014
Member Author

Things are getting more and more complex here. For the above example:

X = np.array([[ 15.,  70.], [ 31.,  87.], [ 45.,  32.], [ 32.,  83.],
              [ 26.,  50.], [  7.,  31.], [ 43.,  97.]])

We get yet another result from R's dbscan package:

# CD is the same
# RD from referenced implementation
# 0. , 38.89730068, 33.54101966, 37.33630941, 33.54101966...
# RD from our implementation
# inf, 33.54101966, 33.54101966, 38.89730068, 33.54101966 ...
# RD from R
# Inf 33.5410196624968 33.5410196624968 33.5410196624968 38.8973006775534...
# order from referenced implementation
# 0, 1, 3, 4, 2, 5, 6
# order from our implementation
# 0, 3, 1, 6, 4, 2, 5
# order from R
# 0 6 4 5 3 2 1

I'm not able to immediately figure out how the R version works.

@espg
Contributor
espg commented Sep 17, 2018

I think that the R implementation is the more appropriate one to focus on, as it is better tested and in general use. We get near-identical results for reachability, but not for order, and the difference seems to come from how they expand their neighborhood queries. Our implementation uses the next nearest point. The paper referenced by @kno10 states that (for the R version):

OPTICS uses a priority queue for processing points. Initially the queue is empty, and if a maximum distance εmax is used, it may become empty again. If empty, an unprocessed point is chosen at random with a reachability of ∞.

I don't think that this is a sensible way for the algorithm to run, as it makes reproducibility extremely difficult to verify... but I'd like to see if there's a way to force a comparison between R and our implementation that bypasses this quirk. Perhaps we can produce an example that ensures that the priority queue in the R version stays full, and verify that the reachability distances are the same for the same given queue, if that is in fact the source of the disagreement.

@qinhanmin2014
Member Author

Our policy seems appropriate

I think another thing we need to consider is whether it's worthwhile to sacrifice performance here.

@kno10
Contributor
kno10 commented Sep 17, 2018

All implementations I know do not actually choose at random, but the next unprocessed point (that is also how the original pseudocode does this). The formulation "at random" does not imply uniform random sampling - just that there is no "provable" logic for which point to use next, so we may choose any.
Similarly, on the heap, you may want to order by (reachability, index) instead of just reachability to get more stable results for unit testing.
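
For illustration, a tiny Python sketch of this suggestion (not ELKI's actual code): pushing (reachability, index) tuples onto a heap makes ties pop in deterministic index order, since tuples compare lexicographically.

import heapq

heap = []
for idx, reach in [(3, 33.54), (1, 33.54), (6, 33.54), (4, 37.33)]:
    heapq.heappush(heap, (reach, idx))

while heap:
    reach, idx = heapq.heappop(heap)
    print(idx, reach)  # 1, 3, 6 (tied at 33.54, in index order), then 4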

@qinhanmin2014
Member Author

Similarly, on the heap, you may want to order by (reachability, index) instead of just reachability to get more stable results for unit testing.

@kno10 So you mean that when there are multiple points with the same reachability-distance, we should choose the point with the smallest index? Do you have any references/implementations to support this approach? Thanks in advance.

@adrinjalali
Member

@qinhanmin2014 , I think that's how it's done in ELKI's OPTICSHeap implementation, which uses OPTICSHeapEntry, a Comparable with the following:

  public int compareTo(OPTICSHeapEntry o) {
    // order primarily by reachability
    if(this.reachability < o.reachability) {
      return -1;
    }
    if(this.reachability > o.reachability) {
      return +1;
    }
    // ties on reachability fall back to comparing the object IDs
    // (note the negation, which prefers the larger ID)
    return -DBIDUtil.compare(objectID, o.objectID);
  }

@adrinjalali
Member

@kno10 , as far as I can see, you're the main author of the OPTICS-related files in ELKI. As you can see, there are many details of the algorithm which are not directly mentioned in the original paper, and since we'd like not to diverge from the other implementations (mainly ELKI and R), we end up checking those code bases.

But since both those packages are released under AGPL, we can't directly port them, and since I'm not sure what counts as deriving from that code base and what not, I feel uncomfortable whenever I go and check something from that implementation (#12036 (comment), 8af9364).

I guess ideally, from scikit-learn's perspective, you and other authors would give us permission to use the code. What do you think?

@kno10
Contributor
kno10 commented Sep 19, 2018

A) choosing the next unprocessed point on an empty heap: See Figure 5 in the original OPTICS paper. It prefers points in index order as starting points. But there is no particular reason for this except that it is very cheap. So this actually is in the paper.

B) sorting candidates by reachability, then index: it obviously is not necessary to do it this way; it is just an obvious approach to make the results less random. In fact, I don't really like assuming that every object has an integer index, once you start thinking about working with dynamic data (and not just indexes).
But in ELKI we have both a heap-based and a list-based approach. By adding this tie breaker, both produce the same result, which makes unit testing easier, in particular if you replace the heap implementation. It makes the heap return objects in a deterministic order.

As for direct porting - I consider it valuable to see different interpretations, in particular of the Xi approach, because it does have some ambiguity. Maybe there is a different interpretation that works better. IMHO it is more important to follow certain data structure / efficiency considerations (such as the memory issue, and the queries - what is actually in the paper) than to worry about subtler interpretations of what is not clear in the paper.

@qinhanmin2014
Member Author

Thanks @kno10
My point here is that it's not worthwhile to sacrifice performance to find the closest point (which I suppose will not improve the result of the algorithm much). The two approaches @kno10 describes are more reasonable than ours, I think.
We might run some benchmarks to compare the performance of the different ways of picking the next point when there are multiple points with the same RD (i.e., choosing the point with the smallest index vs. the point closest to the current point).
Any thoughts @espg ?

@espg
Contributor
espg commented Sep 19, 2018

I'm still confused as to why the two different implementations differ in their output. As @kno10 mentions:

All implementations I know do not actually choose at random, but the next unprocessed point (that is also how the original pseudocode does this).

That's true. It's true for our implementation as well; we expand the cluster order for a given seed until the list is empty, and if the list is empty, we take the next unprocessed point in the dataset as the starting point for a new seed to expand.

The problem is that this only happens if the seed list is empty-- and if we look at simple examples containing only a few points, with max_bound / eps set either high or to inf, we should still be getting the same cluster order and reachability distances from both implementations. What I mean is, we should be expanding a single seed that will contain all data points.

My guess from looking at the output of R versus our implementation is that both calculate the reachability distance the same way. But they seem to produce a different cluster order, and as soon as the ordering changes, there will be differences in reachability because the distances calculated will be between different points-- i.e., the reachability distances are likely valid for the output point order, but the orders are divergent.

I'd really like to know why the algorithms differ in their output ordering. This comment shows runs of both algorithms on a small dataset, with eps set to 10 so that all points are considered within the neighborhood for the first and subsequent calls of OrderSeeds:

#11677 (comment)

This is the original pseudo-code for 'OrderSeeds':

[figure: OrderSeeds pseudo-code from the original OPTICS paper]

Note that the pseudo-code doesn't contain an invariant on the order of the candidates for the first time it's run; but, the paper includes this text, on the same page:

The objects contained in OrderSeeds are sorted by their reachability-distance to the closest core object from which they have been directly density-reachable.

This implies to me that whatever is happening in the algorithm, the output from this call:

[figure: main loop pseudo-code from the original OPTICS paper]

i.e., OrderSeeds.next(), which defines currentObject and hence the ordering of the graph, is going to be the point or object that has the lowest reachability as calculated by OrderSeeds.

As I mention here: #11677 (comment)

R version goes to point 15, with coordinates (0.178222896030858, 0.028963671017735), and a distance of 1.76044. Sklearn version goes to point 13, with coordinates (−0.0392695355504, 1.94034418314), and a much closer distance of 0.17871.

My question is: why? The pseudo-code for OrderSeeds defines c_dist as the core-distance of the query point, which both algorithms take to be the first point in the dataset, and whose core-distance both agree is ~0.333208. With the epsilon distance being 10 and all points unprocessed, the new_r_dist of every point in the dataset will be the max of its distance to the first point (i.e., the query point) and ~0.333208. For point 13, that's max(0.17871, 0.333208) = 0.333208. For point 15, that's max(1.76044, 0.333208) = 1.76044.
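
To make the arithmetic concrete, here is the OrderSeeds update described above as a sketch (variable names are illustrative, not the library's; values are the ones quoted in this comment):

# new_r_dist = max(c_dist, dist(query, p)) per the OrderSeeds pseudo-code
c_dist = 0.333208                       # core distance of the query (first) point
new_r_dist_13 = max(c_dist, 0.17871)    # = 0.333208
new_r_dist_15 = max(c_dist, 1.76044)    # = 1.76044
# point 13 gets the smaller reachability, so it should be ordered next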

What is the logic for the R version choosing point 15 in the referenced dataset?

@espg
Contributor
espg commented Sep 19, 2018

I've been sitting here reading my analysis above and editing (and re-editing) the last part of the text for the past quarter hour.

I'm beginning to get the sinking feeling that both R and sklearn are implementing the algorithm wrong.

I'm fairly certain that point 15 isn't the appropriate choice for the 2nd member of the reachability graph. I can't think of any reason that it would be.

However, looking at the pseudo-code from the original paper, I'm at a bit of a loss to explain why point 13 has a reachability distance of 0.29266360833. Both R and Sklearn agree on this. They seem to be getting that number as the core_distance of point 5, which is what R and ELKI term the predecessor of 13. The problem is that point 5, when queried from the first point of the dataset, has a distance of 2.703934... so calling max(2.703934, 0.29266360833) should yield 2.7, not 0.29. I find myself wondering why that distance isn't 0.333208 instead...

@kno10 is it possible that both of us are getting the reachability wrong for point 13?

@kno10
Contributor
kno10 commented Sep 20, 2018
java -jar elki.jar cli -dbc.in test.csv  -dbc.filter FixedDBIDsFilter
-algorithm clustering.optics.OPTICSHeap -optics.minpts 5

(the filter is necessary to change indexing to start at 0) gives

ID=0 -0.089691454662498 1.76889309153948 reachdist=Infinity predecessor=null
ID=12 -0.039269535550381 1.94034418313686 reachdist=0.33320865127238436 predecessor=0
ID=9 -0.013878701211967 2.00049377768281 reachdist=0.1787116301108633 predecessor=12
ID=6 0.070795472927173 1.88000741803561 reachdist=0.1787116301108633 predecessor=12
ID=3 -0.113037567424629 2.10128286921271 reachdist=0.1787116301108633 predecessor=12
ID=5 0.013242028438109 0.209081920524915 reachdist=1.5632038226555907 predecessor=0
ID=14 0.178222896030858 0.028963671017735 reachdist=0.25182014195891567 predecessor=5
ID=11 0.098175277746366 0.047723730261362 reachdist=0.244256566906986 predecessor=14
ID=8 0.198447393665293 0.195465164222325 reachdist=0.18234616092253633 predecessor=11
ID=2 0.158784533120882 0.003580671801523 reachdist=0.18234616092253633 predecessor=11
ID=13 1.89603310230511 0.079220327029965 reachdist=1.7015610774674026 predecessor=8
ID=10 2.04176507507926 -0.245170638784613 reachdist=0.35562242138363587 predecessor=13
ID=7 1.97603019758282 0.158963820029007 reachdist=0.35562242138363587 predecessor=13
ID=4 1.9919748243449 0.043226515453962 reachdist=0.35562242138363587 predecessor=13
ID=1 2.01848491846467 0.087860458092127 reachdist=0.29266360833062055 predecessor=4

So you can see the ID preference is currently also doing the wrong thing in ELKI - it is preferring the larger IDs at the same reachability. If I fix this bug, I get:

ID=0 -0.089691454662498 1.76889309153948 reachdist=Infinity predecessor=null
ID=3 -0.113037567424629 2.10128286921271 reachdist=0.33320865127238436 predecessor=0
ID=6 0.070795472927173 1.88000741803561 reachdist=0.33320865127238436 predecessor=0
ID=9 -0.013878701211967 2.00049377768281 reachdist=0.2876758801474613 predecessor=6
ID=12 -0.039269535550381 1.94034418313686 reachdist=0.24369335528038538 predecessor=9
ID=5 0.013242028438109 0.209081920524915 reachdist=1.5632038226555907 predecessor=0
ID=2 0.158784533120882 0.003580671801523 reachdist=0.25182014195891567 predecessor=5
ID=8 0.198447393665293 0.195465164222325 reachdist=0.25182014195891567 predecessor=5
ID=11 0.098175277746366 0.047723730261362 reachdist=0.19594080978232742 predecessor=8
ID=14 0.178222896030858 0.028963671017735 reachdist=0.18234616092253633 predecessor=11
ID=13 1.89603310230511 0.079220327029965 reachdist=1.7015610774674026 predecessor=8
ID=1 2.01848491846467 0.087860458092127 reachdist=0.35562242138363587 predecessor=13
ID=4 1.9919748243449 0.043226515453962 reachdist=0.33384379158361543 predecessor=1
ID=7 1.97603019758282 0.158963820029007 reachdist=0.29266360833062055 predecessor=4
ID=10 2.04176507507926 -0.245170638784613 reachdist=0.29266360833062055 predecessor=4

Those should be the results you get when you run the algorithm with pen and paper, but I didn't double-check.

As you can see, there can be quite some variability in "correct" results depending on the order in which you process the tied results, because points within the core radius will be tied - i.e., we see a lot of ties.

Nevertheless, there is a mismatch in reachdist. You can see the first points (3 and 6) have reachdist 0.3332 = core distance of point 0, their predecessor. In the second run, point 12 later became reachable at a lower distance, but initially (c.f. first run) it also had 0.3332, the core distance of 0.

As a very simple unit test, you can verify that the reachability of the second line must be the core-distance of the first line. (ELKI does not store the core distances anymore, because we don't use them anywhere later, but it would be trivial to add).
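
In sklearn terms, that sanity check might look like the following sketch (using the estimator's existing reachability_, core_distances_ and ordering_ attributes; the assertion should hold once the processing order is fixed):

import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.RandomState(0)
X = rng.rand(30, 2)
clust = OPTICS(min_samples=5).fit(X)
first, second = clust.ordering_[0], clust.ordering_[1]
# the reachability of the second point in the cluster order must equal
# the core distance of the first point
assert np.isclose(clust.reachability_[second], clust.core_distances_[first])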

@espg
Contributor
espg commented Sep 20, 2018

So what's your proposal here @espg ?
Is there any reference/packages to support your implementation (i.e., choose the closest point when there're multiple point with the same RD)?

@qinhanmin2014 I think that the discussion and analysis has found out the following:

  • Our current implementation of OPTICS has a bug (thanks @kno10) in calculating the reachability distances, or at least in calculating the first distance.
  • The reference implementation that we were comparing to has the same bug, so we can't use it.
  • The R implementation does not have that bug in the reachability distances, but has a different bug affecting the ordering of the points, so we cannot use it as a reference either.
  • There are multiple orderings that are appropriate depending on how you deal with ties. (Note that the R version is still wrong despite this, as its sorting is not a valid sort in the presence of ties.)

Once we fix this #12123 issue in the main body of the OPTICS code, we can either:

  • Compare with ELKI. ELKI seems to be doing things right with both the cluster order and the reachability distances. We may have to sort the results for the test as multiple output orders are allowed by the algorithm.
  • Compute a small dataset by hand, and compare against that (i.e. 10 or 15 points)

@kno10
Contributor
kno10 commented Sep 20, 2018

I don't think the R implementation is broken. As far as I can tell it was tested against ELKI.

It doesn't use a heap, but tries to pick objects in the same order as ELKI:
https://github.com/mhahsler/dbscan/blob/4fc0072693d339bf7f0e5f4d86aee96e386ab675/src/R_optics.cpp#L147-L151
If that selection doesn't work, please file a bug with @mhahsler.

@espg
Contributor
espg commented Sep 20, 2018

The R implementation chooses point 14 as the 1st ordered point in the reachability graph (i.e., as the first point after the initialization point, the initialization point being point 0 and having an infinite reachability distance).

As you have pointed out, there are multiple correct choices for the 2nd point due to ties:

  • Point 3 is a valid choice for the first ordered point in the reachability graph.

  • Point 12 is a valid choice for the first ordered point in the reachability graph.

  • Point 14 isn't a valid choice for the first ordered point in the reachability graph.

I don't think that is up for debate-- and the output from @adrinjalali in #11677 (comment) shows that the indices of the first two points in the R ordering are [0, 14, ...].

@kno10
Contributor
kno10 commented Sep 20, 2018

As far as I can tell, it's likely the output code you used that produced a wrong table.

Instead of printing the arrays as-is, print reachability[order], coredistance[order], etc.; then the results will make sense. In particular, the cluster numbers will be consecutive...

@adrinjalali
Member

@kno10, really good point! Silly me.
@espg, here's the table, and the corrected code for reference:

id reachdist coredist predecessor cluster
1 Inf 0.333208651272379 NA 1
13 0.333208651272379 0.17871163011086 1 1
10 0.17871163011086 0.243693355280387 13 1
7 0.17871163011086 0.287675880147458 13 1
4 0.17871163011086 0.333208651272379 13 1
6 1.56320382265559 0.251820141958916 1 2
15 0.251820141958916 0.244256566906985 6 2
12 0.244256566906985 0.182346160922536 15 2
9 0.182346160922536 0.195940809782328 12 2
3 0.182346160922536 0.251820141958916 12 2
14 1.7015610774674 0.355622421383634 9 3
11 0.355622421383634 0.409445643425411 14 3
8 0.355622421383634 0.409445643425411 14 3
5 0.355622421383634 0.292663608330619 14 3
2 0.292663608330619 0.333843791583614 5 3
set.seed(2)
n <- 15
x <- cbind(
  x = c(0, 2, 0) + rnorm(n, sd=0.1),
  y = c(2, 0, 0) + rnorm(n, sd=0.1)
)

write.csv(x, file='/tmp/data.csv')

install.packages('dbscan')
library(dbscan)

res = optics(x, eps = 10, minPts = 5)
plot(res)
plot(x, col = "grey")
polygon(x[res$order,])
cl <- extractDBSCAN(res, eps_cl = .5)
cl$cluster
plot(cl)
hullplot(x, cl)

mat = matrix(c((1:n)[cl$order], cl$reachdist[cl$order],
               cl$coredist[cl$order], cl$predecessor[cl$order],
               cl$cluster[cl$order]), ncol=5)
colnames(mat)<- c('id', 'reachdist', 'coredist', 'predecessor', 'cluster')
write.csv(mat, file='/tmp/r.csv')

@kno10
Contributor
kno10 commented Sep 21, 2018

Note that the same indexing confusion exists in the sklearn API... (and I would expect this to come up frequently with users who expect reachability and core distances to be in cluster order rather than data order - OPTICS usually uses cluster order, although I agree that most of the time object order is more convenient!) It's a typical side effect of an API that is not type-safe and uses plain integers as identifiers.

@adrinjalali in Python you also need to index by the cluster order to better interpret the results in #11677 (comment)

If you go to the bottom table ("The results kinda seem closer now!") and reorder it based on the cluster order, then the second line does get the right reachability distance.
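
Concretely, in sklearn that reindexing is a one-liner; a sketch, assuming clust is a fitted OPTICS estimator:

# reachability_ and core_distances_ are stored in data (object) order;
# index them with ordering_ to view them in cluster order, like R's output
reach_in_cluster_order = clust.reachability_[clust.ordering_]
core_in_cluster_order = clust.core_distances_[clust.ordering_]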

@adrinjalali
Member

Yeah, true. For me, one thing that helped was the distinction between reachability and reachability_plot (same for the other variables). We could introduce *_plot_ (or similar) parameters for our public attributes, to avoid the confusion.

@qinhanmin2014
Member Author

Note that the same indexing confusion exists in the sklearn API... (and I would expect this to come up frequently with users who expect reachability and core distances to be in cluster order rather than data order - OPTICS usually uses cluster order, although I agree that most of the time object order is more convenient!) It's a typical side effect of an API that is not type-safe and uses plain integers as identifiers.

Yes, this is confusing. Before reading the tests of R's dbscan, I failed to recognize that the RD/CD from R are in cluster order. Maybe we can document our convention here. At this point I'm not persuaded to use cluster order instead of object order.

Once we fix this #12123 issue in the main body of the OPTICS code, we can either:
Compare with ELKI. ELKI seems to be doing things right with both the cluster order and the reachability distances. We may have to sort the results for the test as multiple output orders are allowed by the algorithm.
Compute a small dataset by hand, and compare against that (i.e. 10 or 15 points)

I'm fine with both ways, or maybe we can combine them (construct a small dataset and compare with other implementations). A couple of things on my mind currently:
(1) I'd prefer to compare CD, RD, and order (if possible).
(2) min_samples=2 does not seem sufficient to me (at the very least we need another min_samples value as well).

@kno10
Contributor
kno10 commented Sep 22, 2018

I believe a good example and explicit documentation of the storage layout are necessary. ("Reachabilities are indexed by object number, not in cluster order; use `r.reachability_[r.clusterorder_]` to access them in cluster order", etc.)

Sorting the result will not work reliably enough. Unless you fix the ordering using the IDs on ties, the results can also vary in their reachability distance, because once a point is added to the output, its reachability is not updated afterwards. Depending on the exploration order, points may be reachable along more than one path with different reachabilities. If you want to fix that, go for HDBSCAN* instead. Always choosing the lowest ID on ties should make the result deterministic, though (except for numerical differences; see, e.g., #12128).

min_samples=2 is not a reasonable choice for OPTICS (nor for DBSCAN). With this parameter, you effectively disable the density part of OPTICS: you are doing single-link clustering and constructing the minimum spanning tree. There are probably better approaches for that particular scenario; and in a unit test it means you are not validating that the reachability is computed correctly. For trivial reasons, with minPts=2, d(x, y) >= coredist(x), and therefore reachability(y <- x) = max(d(y, x), coredist(x)) = d(x, y) - you will not notice an error in computing reachabilities.
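
A quick numeric illustration of that degeneracy (a standalone sketch, not part of any test suite): with min_samples=2 the core distance of x is just its 1-NN distance, so d(x, y) >= coredist(x) holds for every other y by construction.

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X = rng.rand(20, 2)
# coredist with min_samples=2: distance to the single nearest neighbor
core = NearestNeighbors(n_neighbors=2).fit(X).kneighbors(X)[0][:, 1]
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
np.fill_diagonal(D, np.inf)
# max(d(x, y), coredist(x)) == d(x, y) for all y != x: the reachability
# computation is never actually exercised
assert np.all(D >= core[:, None])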

@qinhanmin2014
Member Author

I've submitted #12132 to update the doc, so we can focus on the implementation here.

@kno10
Contributor
kno10 commented Sep 22, 2018

I looked into the code a bit, and I think there is another major flaw in the way the next point is chosen.

To be precise, the error that I am referring to is here:

return (unproc[quick_scan(np.take(self.reachability_, unproc),
                          dists)])

In this snippet, the next point is chosen only from unproc. The quick_scan method prefers the point with the smallest distance to the current point. At first that seemed like an interesting idea to me (it should drive the algorithm toward more local exploration on ties, so why not?) BUT then I realized that we do not always have these distances when using an index - only for the neighbors of the current point.
OPTICS, however, chooses from all currently unprocessed seed points.

Here is a simple test case; try it on paper: [[0], [10], [-10], [25]]. It should process the points as 0, 1, 2, 3 (because 2 is reachable from 0 at reachdist 10), but the current code does 0, 1, 3, 2, and actually has to restart at point 2. If you add an assert self.reachability_[point] == np.inf after this check:

for point in range(X.shape[0]):
    if processed[point]:
        continue

then you'll see that you get points that are unprocessed, but reachable!

Unit test: 5de53e1
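
The shape of the fix is sketched below (a sketch of the masked-array variant mentioned in the commits that follow, not the exact patch): choose the next point among all unprocessed points by reachability, rather than among the current point's neighbors.

import numpy as np

def next_unprocessed(reachability, processed):
    # mask out processed points and take the argmin of reachability over
    # the rest; on ties, argmin returns the first (smallest) index,
    # matching R and ELKI
    masked = np.ma.MaskedArray(reachability, mask=processed)
    return int(masked.argmin())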

kno10 added a commit to kno10/scikit-learn that referenced this issue Sep 22, 2018
The current code may expand the wrong point, because it only considers
the neighbors of the current point, not all currently unprocessed points
(which may have a better reachability).

Using the distance from the latest point as tiebreaker then does not
work anymore, because it might not yet be computed for all unprocessed
points when using an index.

If we choose the first on ties, we get the same result as R and ELKI.

But the order of points in "unproc" is also unstable, so we cannot rely
on the first smallest to have the smallest index.

This commit still contains two solutions:

1. a fixed cython quick_scan (but not sure if indexes are always a long?)
2. a numpy solution, but I don't know how cheap and fast masked arrays are

If benchmarking shows that scikit-learn#2 is faster, you can remove _optics_inner.pyx
and un-import quick_scan.
@kno10
Contributor
kno10 commented Sep 23, 2018

And a fix is here:
https://github.com/kno10/scikit-learn/commits/patch-optics
But there are two alternatives, so you'll need to benchmark and pick one.

  1. a fixed cython quick_scan (but not sure if indexes are always a long - may need to use some other type for 32 bit compatibility?)
  2. a numpy argmin solution that does not need the .pyx anymore, but I don't know how cheap and fast masked arrays are. But probably not worse than the quick_scan preparation code, I guess.

This one also stores the predecessor, and compares the results against ELKI's output. Again, 32-bit builds may (see the old test) exhibit some numeric issues. Also, using "euclidean" instead of "minkowski" breaks the test, at least with the numpy version, because of the numeric instability of the faster euclidean distance in sklearn (c.f. #9354 and others). The softened isclose tie-detection in quick_scan makes up for these errors IIRC.
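
For reference, the euclidean-vs-minkowski instability mentioned above can be reproduced directly (a small sketch; c.f. #9354):

import numpy as np
from sklearn.metrics import pairwise_distances

# the fast "euclidean" path expands ||x-y||^2 = ||x||^2 - 2 x.y + ||y||^2,
# which loses precision for nearby points; "minkowski" with p=2 computes
# the coordinate differences directly
X = np.array([[1.0, 1.0], [1.0 + 1e-8, 1.0 + 1e-8]])
print(pairwise_distances(X, metric="euclidean")[0, 1])       # inaccurate
print(pairwise_distances(X, metric="minkowski", p=2)[0, 1])  # ~1.4142e-08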

@espg
Contributor
espg commented Sep 23, 2018

@kno10 Thank you for diving into this and finding the source of the error; I appreciate it. I should have a bit of time on Monday to look into the patches and run a few benchmarks to see the impact of one fix versus the other.

kno10 added a commit to kno10/scikit-learn that referenced this issue Sep 23, 2018
@espg
Contributor
espg commented Sep 25, 2018

...I'm still running benchmarks on some larger datasets, but so far it looks like the cython performance gain is pretty marginal (i.e., about 20% faster at best, and tied with numpy in some cases). It may be worth removing quick_scan and going with the pure python module.

@qinhanmin2014
Member Author

@kno10 Do you have time to submit a PR to fix the RD/CD/order of our OPTICS? You can also add your test (against ELKI) in the same PR. Since our euclidean distance is sometimes problematic, I'm fine with using other distance metrics. It's not urgent; we just want to know your plan. Thanks in advance.

@kno10
Contributor
kno10 commented Oct 7, 2018

@qinhanmin2014 I can submit a PR for the above branch. I was waiting for @espg's final feedback on whether to keep the cython module or drop it.

@espg
Contributor
espg commented Oct 7, 2018

@kno10 I think we can drop the cython module, and take the pure python solution that implements predecessor...

@qinhanmin2014
Member Author

@kno10 I think we can drop the cython module, and take the pure python solution that implements predecessor...

So let's drop the current cython module. @kno10 Feel free to add a new one if you need one.

kno10 added a commit to kno10/scikit-learn that referenced this issue Oct 11, 2018

kno10 added a commit to kno10/scikit-learn that referenced this issue Oct 13, 2018
qinhanmin2014 pushed a commit that referenced this issue Oct 14, 2018
* Add a test case for OPTICS bug (closes #12134)

* ENH Fix processing order in OPTICS. See #12090

The current code may expand the wrong point, because it only considers
the neighbors of the current point, not all currently unprocessed points
(which may have a better reachability).

Using the distance from the latest point as tiebreaker then does not
work anymore, because it might not yet be computed for all unprocessed
points when using an index.

If we choose the first on ties, we get the same result as R and ELKI.

But the order of points in "unproc" is also unstable, so we cannot rely
on the first smallest to have the smallest index.

Instead of the cython quick_scan, we now use numpy masked arrays.
anuragkapale pushed a commit to anuragkapale/scikit-learn that referenced this issue Oct 23, 2018

xhluca pushed a commit to xhluca/scikit-learn that referenced this issue Apr 28, 2019