OPTICS detecting the wrong outlier · Issue #11677 · scikit-learn/scikit-learn · GitHub
OPTICS detecting the wrong outlier #11677

Closed
adrinjalali opened this issue Jul 25, 2018 · 39 comments
@adrinjalali
Member

The following example shows OPTICS on a small dataset, with one outlier. It seems that the outlier is incorrectly detected.

>>> from sklearn.cluster import OPTICS
>>> import numpy as np
>>> X = np.array([[1, 2], [2, 2], [2, 3],
...               [80, 70], [80, 80], [250, 800]])
>>> 
>>> 
>>> np.random.seed(0)
>>> n_points_per_cluster = 10
>>> 
>>> C1 = [-5, -2] + .8 * np.random.randn(n_points_per_cluster, 2)
>>> C2 = [4, -1] + .1 * np.random.randn(n_points_per_cluster, 2)
>>> X = np.vstack((C1, C2, np.array([[100, 200]])))
>>> 
>>> clustering = OPTICS(metric="euclidean").fit(X)
>>> np.hstack((X, clustering.labels_[:, np.newaxis]))
array([[ -3.58875812,  -1.67987423,   0.        ],
       [ -4.21700961,  -0.20728544,   0.        ],
       [ -3.50595361,  -2.7818223 ,   0.        ],
       [ -4.23992927,  -2.12108577,   0.        ],
       [ -5.08257508,  -1.6715212 ,   0.        ],
       [ -4.88476514,  -0.83658119,   0.        ],
       [ -4.39116982,  -1.90265999,   0.        ],
       [ -4.64490941,  -1.73306054,   0.        ],
       [ -3.80473674,  -2.16412661,   0.        ],
       [ -4.74954584,  -2.68327659,   0.        ],
       [  3.74470102,  -0.93463814,  -1.        ],
       [  4.08644362,  -1.0742165 ,   1.        ],
       [  4.22697546,  -1.14543657,   1.        ],
       [  4.00457585,  -1.01871839,   1.        ],
       [  4.15327792,  -0.85306412,   1.        ],
       [  4.01549474,  -0.96218375,   1.        ],
       [  3.91122143,  -1.19807965,   1.        ],
       [  3.96520879,  -0.9843651 ,   1.        ],
       [  4.12302907,  -0.87976202,   1.        ],
       [  3.96126732,  -1.03023028,   1.        ],
       [100.        , 200.        ,   1.        ]])
>>> clustering # doctest: +NORMALIZE_WHITESPACE
OPTICS(algorithm='ball_tree', leaf_size=30, max_bound=inf, maxima_ratio=0.75,
    metric='euclidean', metric_params=None, min_cluster_size_ratio=0.005,
    min_maxima_ratio=0.001, min_samples=5, n_jobs=1, p=2,
    rejection_ratio=0.7, significant_min=0.003, similarity_threshold=0.4)
@adrinjalali
Member Author

ping @GaelVaroquaux

@jnothman
Member
jnothman commented Jul 25, 2018 via email

@espg
Contributor
espg commented Jul 25, 2018

This looks to be a bug in _automatic_cluster. I'm guessing that it occurs when a leaf in the reachability tree is too small and is merged into the adjacent leaf instead of being marked as noise / removed. I'll see if I can find a fix (I think around lines 709 or 725).

@jnothman
Member

Thank you both

@jnothman jnothman added the Bug label Jul 26, 2018
@jnothman jnothman added this to the 0.20 milestone Jul 26, 2018
@adrinjalali
Member Author

(continuing from #3846)
@qinhanmin2014 I don't think I can come up with a bug-free example. Check the following:

>>> from sklearn.cluster import OPTICS
>>> import numpy as np
>>> 
>>> # Generate sample data
>>> 
>>> np.random.seed(0)
>>> for _ in range(10):
...     n_points_per_cluster = 250
...     C1 = [-50, -20] + .08 * np.random.randn(n_points_per_cluster, 2)
...     C2 = [40, -10] + .01 * np.random.randn(n_points_per_cluster, 2)
...     C3 = [10, -20] + .02 * np.random.randn(n_points_per_cluster, 2)
...     C4 = [-20, 30] + .03 * np.random.randn(n_points_per_cluster, 2)
...     C5 = [30, -20] + .06 * np.random.randn(n_points_per_cluster, 2)
...     C6 = [50, 60] + .02 * np.random.randn(n_points_per_cluster, 2)
...     X = np.vstack((C1, C2, C3, C4, C5, C6))
...     clustering = OPTICS(min_samples=9, rejection_ratio=0.5)
...     # Run the fit
...     clustering.fit(X)
...     print(np.sum(clustering.labels_ == -1))
5
5
5
5
5
5
5
5
5
5

I really don't think the data keeps having exactly 5 outliers. I also checked the example provided here and the algorithm detects way too many outliers IMO.

@jnothman
Member
jnothman commented Jul 31, 2018 via email

@espg
Contributor
espg commented Jul 31, 2018

@jnothman I'm on travel and won't be able to dedicate time to this until next week. The bug @adrinjalali first flagged in this thread needs to be investigated, but the more recent example above doesn't look incorrect to me. The clusters are widely spaced, so OPTICS evaluates all points as belonging to a given label; the 5 outliers are one point per transition to the next cluster (the boundary point where the reachability distance spikes).

OPTICS calculates reachability distances starting from the first point to all other points, then removes the current point from future consideration and recursively repeats. That means the reachability distances will be low for all points in the cluster under consideration except for the last one: since all of its neighbors have already been removed, its reachability distance is based on the distance to the next cluster (instead of the intra-cluster distance), which is quite high in the case above.

If you run the same test with DBSCAN, you'll get a varying number of outliers depending on what the eps value is set to -- generally much higher for eps under 0.08, and zero outliers if you set eps greater than twice that. The fact that OPTICS is only predicting one outlier per cluster means it's generally mapping the clusters better than DBSCAN (with less information, since eps doesn't need to be set), so this is a feature, not a bug.
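For illustration, here is a small sketch of that eps sweep on the six-cluster data from the snippet above (the specific eps values are arbitrary):

import numpy as np
from sklearn.cluster import DBSCAN

np.random.seed(0)
n_points_per_cluster = 250
C1 = [-50, -20] + .08 * np.random.randn(n_points_per_cluster, 2)
C2 = [40, -10] + .01 * np.random.randn(n_points_per_cluster, 2)
C3 = [10, -20] + .02 * np.random.randn(n_points_per_cluster, 2)
C4 = [-20, 30] + .03 * np.random.randn(n_points_per_cluster, 2)
C5 = [30, -20] + .06 * np.random.randn(n_points_per_cluster, 2)
C6 = [50, 60] + .02 * np.random.randn(n_points_per_cluster, 2)
X = np.vstack((C1, C2, C3, C4, C5, C6))

# count the points DBSCAN marks as noise for a few eps values
for eps in [0.05, 0.08, 0.16, 0.32]:
    labels = DBSCAN(eps=eps, min_samples=9).fit(X).labels_
    print(eps, np.sum(labels == -1))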

I can see that the expected behavior for OPTICS would be zero outliers for the case(s) above, and that might be achievable with some change to auto_extract... I'll take a look at this next week and see if there is a way to handle boundary points better.

@jnothman
Member
jnothman commented Jul 31, 2018 via email

@qinhanmin2014
Member

Maybe we can compare our implementation with R, or with the implementation from the original author if available. Tagging as a blocker.

@rth
Member
rth commented Aug 1, 2018

Is this issue worth delaying the RC release for? It doesn't sound like it will be solved in the next few days... Maybe add this bug fix either in the final 0.20.0 or in 0.20.1?

@adrinjalali
Member Author

Here's a result from the R package (dbscan)

For the code:

set.seed(2)
n <- 400
x <- cbind(
  x = c(0, 2, 0, 2) + rnorm(n, sd=0.1),
  y = c(2, 0, 0, 2) + rnorm(n, sd=0.1)
)

library(dbscan)

res = optics(x)
#predict(res, x)
plot(res)
plot(x, col = "grey")
polygon(x[res$order,])
cl <- extractDBSCAN(res, eps_cl = .17)
cl$cluster
plot(cl)
hullplot(x, cl)

These are the calculated clusters (no outliers):

> cl$cluster
  [1] 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3
 [60] 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2
[119] 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1
[178] 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
[237] 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3
[296] 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2
[355] 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

And the following plots:

[figure: ordering plot]
[figure: reachability plot]
[figure: convex hulls of the extracted clusters]

@espg
Contributor
espg commented Aug 7, 2018

@adrinjalali can you please compare the reachability plot from R with the output from this implementation of OPTICS? I'm 95% sure that the issue is in the auto_extract function, but it would be good to be positive, and I'm not really familiar with R. If the bug is in auto_extract, the calculated and ordered reachability distances should be identical to within machine precision from the two implementations.
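For example, once both reachability arrays are exported (the file names below are just placeholders), the check could be as simple as:

import numpy as np

# each file holds the reachability values of one implementation,
# already arranged in that implementation's OPTICS ordering
r_reach = np.loadtxt('/tmp/r_reachability.csv')
py_reach = np.loadtxt('/tmp/py_reachability.csv')

# skip the first (infinite) entry; the rest should agree to within
# numerical precision if only the extraction step differs
print(np.allclose(r_reach[1:], py_reach[1:]))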

@adrinjalali
Member Author
adrinjalali commented Aug 8, 2018

@espg unfortunately that doesn't seem to be the case; they're pretty different.

Here's the R code I used to create the data and get the reachability:

set.seed(2)
n <- 100
x <- cbind(
  x = c(0, 2, 0, 2) + rnorm(n, sd=0.1),
  y = c(2, 0, 0, 2) + rnorm(n, sd=0.1)
)

write.csv(x, file='/tmp/data.csv')

library(dbscan)

res = optics(x)
#predict(res, x)
plot(res)
plot(x, col = "grey")
polygon(x[res$order,])
cl <- extractDBSCAN(res, eps_cl = .17)
cl$cluster
plot(cl)
hullplot(x, cl)

write.csv(matrix(c(cl$order, cl$reachdist), ncol=2), file='/tmp/r.csv')

And here's the python code to apply OPTICS on the same data:

import pandas as pd
import numpy as np
from sklearn.cluster import OPTICS
data = pd.read_csv('/tmp/data.csv')
data = data[['x', 'y']]
clustering = OPTICS(min_samples=5)
clustering.fit(data)
np.savetxt("/tmp/py.csv",
           np.transpose(np.array([clustering.ordering_,
                                  clustering.reachability_])))

Putting the two generated files together, here's what you'd see.

| python order | python reachability | R order | R reachability |
| --- | --- | --- | --- |
| 0 | inf | 1 | Inf |
| 52 | 0.07329448816232 | 57 | Inf |
| 56 | 0.071566498480906 | 73 | Inf |
| 12 | 0.086188714113748 | 53 | Inf |
| 72 | 0.030358668261071 | 13 | 0.03035866826107 |
| 40 | 1.58645318232263 | 89 | 0.091487092011897 |
| 88 | 0.064431760283397 | 49 | 0.064431760283397 |
| 48 | 0.08831105629047 | 41 | 0.088311056290458 |
| 4 | 0.115609686828167 | 25 | 0.115609686828167 |
| 24 | 0.077812333737916 | 5 | 0.077812333737916 |
| 68 | 0.079401452929697 | 69 | 0.079401452929697 |
| 64 | 0.088171163308115 | 65 | 0.088171163308108 |
| 28 | 0.073131402725988 | 81 | 0.073131402725989 |
| 80 | 0.094438516074332 | 29 | 0.094438516074335 |
| 36 | 1.57660870727407 | 37 | 0.070793201620562 |
| 76 | 1.56178634568536 | 77 | 0.108586894358706 |
| 44 | 0.102916640800507 | 45 | 0.102916640800511 |
| ... | ... | ... | ... |
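(For reference, a rough sketch of how the two generated files can be put side by side; the column names here are just for display:)

import numpy as np
import pandas as pd

py = np.loadtxt('/tmp/py.csv')   # columns: ordering_, reachability_
r = pd.read_csv('/tmp/r.csv')    # columns: row name, order, reachdist

side_by_side = pd.DataFrame({
    'python order': py[:, 0].astype(int),
    'python reachability': py[:, 1],
    'R order': r.iloc[:, 1],
    'R reachability': r.iloc[:, 2],
})
print(side_by_side.head(17))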

Does this help?

@espg
Contributor
espg commented Aug 8, 2018

This is interesting, especially given that the R version is finding multiple instances of inf reachability past the first point. I'm still unclear on the R code that you posted-- I see you set an eps threshold (0.17), but I don't see you define minPts anywhere. What is R using or defaulting to-- it looks like maybe two points per cluster? Also, there is no max_bound set on our implementation of OPTICS, so the results won't be directly comparable...

Can you rerun with the R version using minPts of 5, and an eps of 10.0, and then do the same for OPTICS but with max_bound set to 10.0? If you provide a link to the R results and generated data, I can spin up OPTICS directly to compare against that.

Thanks @adrinjalali

@adrinjalali
Member Author

To come up with a smaller example, here's the R code (it also includes installing the required package, so you can just run it if you're curious):

set.seed(2)
n <- 15
x <- cbind(
	x = c(0, 2, 0) + rnorm(n, sd=0.1),
	y = c(2, 0, 0) + rnorm(n, sd=0.1)
)

write.csv(x, file='/tmp/data.csv')

install.packages('dbscan')
library(dbscan)

res = optics(x, eps = 10, minPts = 5)
plot(res)
plot(x, col = "grey")
polygon(x[res$order,])
cl <- extractDBSCAN(res, eps_cl = .5)
cl$cluster
plot(cl)
hullplot(x, cl)

mat = matrix(c(cl$order, cl$reachdist,
			   cl$coredist, cl$predecessor,
			   cl$cluster), ncol=5)
colnames(mat)<- c('order', 'reachdist', 'coredist', 'predecessor', 'cluster')
write.csv(mat, file='/tmp/r.csv')

And the attributes of the object are now:

> cl
OPTICS ordering/clustering for 15 objects.
Parameters: minPts = 5, eps = 10, eps_cl = 0.5, xi = NA
The clustering contains 3 cluster(s) and 0 noise points.

1 2 3 
5 5 5 

Available fields: order, reachdist, coredist, predecessor, minPts, eps, eps_cl, xi, cluster

The resulting matrix would be:

  order reachdist coredist predecessor cluster
1 1 Inf 0.333208651272379 NA 1
2 13 0.292663608330619 0.333843791583614 5 3
3 10 0.182346160922536 0.251820141958916 12 2
4 7 0.17871163011086 0.333208651272379 13 1
5 4 0.355622421383634 0.292663608330619 14 3
6 6 1.56320382265559 0.251820141958916 1 2
7 15 0.17871163011086 0.287675880147458 13 1
8 12 0.355622421383634 0.409445643425412 14 3
9 9 0.182346160922536 0.195940809782328 12 2
10 3 0.17871163011086 0.243693355280387 13 1
11 14 0.355622421383634 0.409445643425412 14 3
12 11 0.244256566906985 0.182346160922536 15 2
13 8 0.333208651272379 0.17871163011086 1 1
14 5 1.7015610774674 0.355622421383634 9 3
15 2 0.251820141958916 0.244256566906985 6 2

Here's the data used in this example:

x y
-0.089691454662498 1.76889309153948
2.01848491846467 0.087860458092127
0.158784533120882 0.003580671801523
-0.113037567424629 2.10128286921271
1.9919748243449 0.043226515453962
0.013242028438109 0.209081920524915
0.070795472927173 1.88000741803561
1.97603019758282 0.158963820029007
0.198447393665293 0.195465164222325
-0.013878701211967 2.00049377768281
2.04176507507926 -0.245170638784613
0.098175277746366 0.047723730261362
-0.039269535550381 1.94034418313686
1.89603310230511 0.079220327029965
0.178222896030858 0.028963671017735

And the python code on this same data:

import pandas as pd
import numpy as np
from sklearn.cluster import OPTICS
data = pd.read_csv('/tmp/data.csv')
data = data[['x', 'y']]
clustering = OPTICS(min_samples=5, max_bound=10)
clustering.fit(data)
np.savetxt("/tmp/py.csv",
           np.transpose(np.array([clustering.ordering_,
                                  clustering.reachability_])))

would result in:

ordering reachability core_distances labels
0 inf 0.333208651272384 0
12 0.29266360833062 0.333843791583615 -1
9 0.182346160922537 0.251820141958916 -1
3 0.178711630110863 0.333208651272384 0
6 0.355622421383636 0.29266360833062 -1
5 1.56320382265559 0.251820141958916 -1
11 0.178711630110863 0.287675880147461 0
2 0.29266360833062 0.409445643425412 -1
14 0.182346160922537 0.195940809782328 -1
8 0.178711630110863 0.243693355280386 0
13 0.292663608330621 0.409445643425412 -1
4 0.251820141958916 0.182346160922537 -1
1 0.333208651272384 0.178711630110863 0
7 1.7015610774674 0.355622421383636 -1
10 0.182346160922537 0.244256566906986 -1

The results kinda seem closer now!

@espg
Contributor
espg commented Aug 9, 2018

Huh, interesting. The core distances all agree. My initial impression is that the R version is wrong here. The starting point has coordinates (−0.0896914546625, 1.76889309154), and the algorithm should jump to the next closest neighbor. The R version goes to point 15, with coordinates (0.178222896030858, 0.028963671017735), at a distance of 1.76044. The sklearn version goes to point 13, with coordinates (−0.0392695355504, 1.94034418314), at a much closer distance of 0.17871.

Both algorithms agree on the reachability (and core) distance for point 13, but I can't figure out why the R version places it at number 8 in the ordered list instead of number 2. The points should be ordered according to cluster; points 1, 13, 10, 7, and 4 are all identified as a cluster in both R and sklearn, but I'd expect them to be adjacent in the ordering scheme (they are in sklearn; they are not at all in R). I'm kind of curious how R got the labeling correct given that they aren't in contiguous order -- I guess it has something to do with this predecessor label?
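A quick check of the two distances quoted above, using the coordinates from the data table posted earlier (indices are 1-based, as in the R output):

import numpy as np

start = np.array([-0.089691454662498, 1.76889309153948])  # point 1
p15 = np.array([0.178222896030858, 0.028963671017735])    # point 15, R's second point
p13 = np.array([-0.039269535550381, 1.94034418313686])    # point 13, sklearn's second point

print(np.linalg.norm(start - p15))  # ~1.76044
print(np.linalg.norm(start - p13))  # ~0.17871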

I'm reviewing the source paper to see if there is something I'm missing for the core algorithm here...

@jnothman
Member
jnothman commented Aug 9, 2018 via email

@adrinjalali
Member Author

I'm not used to Weka; this is what I got out of it:

=== Run information ===

Scheme:       weka.clusterers.OPTICS -E 10.0 -M 5 -A "weka.core.EuclideanDistance -R first-last" -F -no-gui -db-output .
Relation:     data-weka.filters.unsupervised.attribute.Remove-R1
Instances:    15
Attributes:   2
              x
              y
Test mode:    evaluate on training data


=== Clustering model (full training set) ===

OPTICS clustering results
============================================================================================

Clustered DataObjects: 15
Number of attributes: 2
Epsilon: 10.0; minPoints: 5
Write results to file: yes
Distance-type: 
Number of generated clusters: 0
Elapsed time: .0

( 0.) -0.089691,1.768893                        -->  c_dist: 0.142        r_dist: UNDEFINED   
(12.) -0.03927,1.940344                         -->  c_dist: 0.077        r_dist: 0.142
( 3.) -0.113038,2.101283                        -->  c_dist: 0.142        r_dist: 0.077
( 6.) 0.070795,1.880007                         -->  c_dist: 0.127        r_dist: 0.077
( 9.) -0.013879,2.000494                        -->  c_dist: 0.105        r_dist: 0.077
( 5.) 0.013242,0.209082                         -->  c_dist: 0.111        r_dist: 0.666
(11.) 0.098175,0.047724                         -->  c_dist: 0.079        r_dist: 0.111
(14.) 0.178223,0.028964                         -->  c_dist: 0.108        r_dist: 0.079
( 8.) 0.198447,0.195465                         -->  c_dist: 0.086        r_dist: 0.079
( 2.) 0.158785,0.003581                         -->  c_dist: 0.111        r_dist: 0.079
(13.) 1.896033,0.07922                          -->  c_dist: 0.154        r_dist: 0.789
( 1.) 2.018485,0.08786                          -->  c_dist: 0.142        r_dist: 0.154
(10.) 2.041765,-0.245171                        -->  c_dist: 0.175        r_dist: 0.142
( 7.) 1.97603,0.158964                          -->  c_dist: 0.175        r_dist: 0.142
( 4.) 1.991975,0.043227                         -->  c_dist: 0.125        r_dist: 0.142



Time taken to build model (full training data) : 0.01 seconds

=== Model and evaluation on training set ===

Clustered Instances


Unclustered instances : 15

@jnothman
Member
jnothman commented Aug 9, 2018 via email

@adrinjalali
Member Author

The latest examples I used for R and Python in the results I posted here were also with the same 15 points; I guess that shows the difference. I was trying to make the example as small as I could.

@jnothman
Member
jnothman commented Aug 9, 2018

Oh, I must have missed some comments. Weka looks quite different in its distances?

@amueller
Member

What's the status of this?

@espg
Contributor
espg commented Aug 16, 2018

I took a look at the original code for automatic extraction from an OPTICS ordered list, and the same bug is present. You can decouple building the graph from extracting clusters from it-- if you use a different OPTICS implementation to build the graph on the points in this example and pass it to the unmodified auto_extract code from @amyxzhang, the same bug crops up. In other words, I've isolated where the bug is, but I don't really know how to fix it.

@espg
Contributor
espg commented Aug 17, 2018

Following up on @jnothman's request in #11838, here is the pseudocode from the reference paper's implementation.

Converting the reachability plot to a dendrogram:

[figure: pseudocode for converting a reachability plot into a dendrogram]

Main algorithm with tree trimming:

[figure: pseudocode of the main algorithm with tree trimming]

@amyxzhang
amyxzhang commented Aug 17, 2018

Hey all, sorry I just got notice of this issue today! It's been a SUPER long time since I looked at this code but I hope this explanation helps.

Basically, the way the algorithm works (as described in the pseudocode above) is that, given a reachability plot, it searches for local maxima and iteratively splits the plot at them, starting from the highest local maximum. A local maximum is a point on the reachability plot whose neighbors to the left and right are both lower, and where the resulting split would not create two groups that are too small (according to the given thresholds).
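As a rough sketch of that criterion (this is not the actual _automatic_cluster code; min_group_size stands in for the size thresholds mentioned above):

def find_local_maxima(reachability, min_group_size):
    # indices whose neighbours on both sides are lower and whose split
    # would leave two groups that are each large enough
    maxima = []
    for i in range(1, len(reachability) - 1):
        is_local_max = (reachability[i] > reachability[i - 1]
                        and reachability[i] > reachability[i + 1])
        split_large_enough = (i >= min_group_size
                              and len(reachability) - i - 1 >= min_group_size)
        if is_local_max and split_large_enough:
            maxima.append(i)
    # splitting proceeds from the highest local maximum downwards
    return sorted(maxima, key=lambda i: reachability[i], reverse=True)

Note that a point at the very end of the plot can never be returned here, since the loop skips the endpoints -- which is exactly what happens to the outlier in the first example.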

So the bug in the first script by @adrinjalali occurs because that outlier is at the end of the plot and thus doesn't count as a local maximum given the criteria. Some code could be written to consider rejecting points at the ends of the reachability plot based on some threshold, but I don't think that's currently part of the algorithm.

You'll notice that in both the 1st and 2nd script by @adrinjalali, there are middle points in the reachability plots that don't get added to a cluster even though they don't seem like outliers in the scatterplot. @espg is right that this is expected behavior of the algorithm as it stands - those points are splitpoints to divide a cluster in two and don't get added to the new right-hand cluster. Again, this could be revisited using some sort of threshold that looks at the average distance of the cluster to the right of the splitpoint.

There was also a point raised about too many outliers. That depends on the various parameters passed to the extraction algorithm. Also, the auto-extractor currently returns only the leaves of the resulting cluster tree, as opposed to all nodes, which leads to more outliers since it only returns the most densely packed clusters. You could, if you wanted to, slice the tree at a particular depth and get a different set of clusters (much as you would in DBSCAN); see the sketch below.
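For instance, slicing at a fixed depth could look roughly like this, assuming a simple node object with a children list (this is only an illustration, not the internal tree structure):

def clusters_at_depth(node, depth):
    # return the nodes found at the given depth of the cluster tree;
    # a branch that bottoms out earlier contributes its leaf instead
    if depth == 0 or not node.children:
        return [node]
    clusters = []
    for child in node.children:
        clusters.extend(clusters_at_depth(child, depth - 1))
    return clusters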

In the end, I don't think these issues are showstoppers if there's a deployment going out today.

@adrinjalali
Member Author

Thanks @amyxzhang.

If I may, I have one worry regarding labelling those split points as outliers. In practice, with a large enough amount of data, people are unlikely to investigate why a data point is labelled -1, and will simply count it as an outlier.

Here's an undesirable scenario: a data scientist working in a fraud detection department is looking for algorithms that find outliers, in order to raise their risk value and report them as potential fraud. Since outliers and split points are labelled the same way, the people whose data happen to be split points could be treated unfairly as potential fraud cases.

That's why I would argue that those split points should either be included in the clusters, or be labelled specifically as split points (-2 as opposed to -1, for instance).

I know that in an ideal world the above scenario wouldn't happen, but unfortunately, once an algorithm gets into scikit-learn, it's likely to be used without careful consideration of its details.

That's of course just my one cent contribution to the discussion.

@amyxzhang

Yes @adrinjalali, I agree that something could be done about the issues, maybe the strategies I mentioned above, as a final pass over the clusters perhaps. I think there would need to be testing to consider edge cases though. It may be trickier than anticipated with noisier data (lots of potential splitpoints, some of which are discarded). Back when I was working on this, I was mainly interested in the spatial boundary delineation of highly dense areas, so the issue did not come up.

@jnothman
Member
jnothman commented Aug 19, 2018 via email

@jnothman
Member
jnothman commented Aug 19, 2018 via email

@amyxzhang

Right, the implementation follows the algorithm, which doesn't give guidance for splitpoints and outlier points at the very end of the reachability plot. I think some small tweaks, with sufficient testing, can resolve these issues to make these examples conform to expectations.

@jnothman
Member
jnothman commented Aug 19, 2018 via email

@adrinjalali
Member Author

I'm reading the original paper, and I'll get into the details of the implementation (for my own curiosity). Then I'll probably be able to fix some of these issues. But if anybody has the time and is already familiar with the code/algorithm, please don't wait for me. Anyhow, I'll try my best.

@espg
Contributor
espg commented Aug 19, 2018

It may be possible to deal with the split points by lumping them into the left-hand cluster when their core distance is small. Since OPTICS removes points from consideration once they've been processed, and always picks the next closest point, the last point in a cluster tends to have a large reachability distance and often gets marked as noise. Since we know this, and read the graph from left to right, we could mark these points as cluster members. This should work for cases that have no noise; for cases that do have noise (i.e., where the split point is a noise point not close to any cluster), I'd expect the core distance to be close to the reachability distance. For the inverse case (where the split point is part of the cluster), the reachability distance should be large, but the core distance should be much smaller. If we screen on the ratio of those two numbers (see the sketch after this list):

  • We shouldn't have to calculate any new distances (core and reachability are stored already)
  • We shouldn't have to post process (we can do in the loop / tree build process)
  • We don't have to add new label definitions (i.e., -2 for split points)
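A minimal sketch of that screen (the 1.5 threshold here is arbitrary; attaching the point to the left-hand cluster would happen in the tree-build loop):

def split_point_is_cluster_member(reachability_dist, core_dist, ratio=1.5):
    # a split point whose reachability is much larger than its core distance
    # is most likely the last member of the cluster to its left; if the two
    # are comparable, it is more plausibly genuine noise
    return core_dist > 0 and reachability_dist / core_dist > ratio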

The other path I see is doing some sort of post processing of the split points after the routine finishes (i.e., by labeling split points as -2, and then doing nearest neighbor lookups to determine new labels as either cluster members or noise)...but my guess is that this second approach will be more error prone, especially in cases where the split point is noise, but has no other noise points nearby to infer correct labeling.

@adrinjalali
Member Author

@espg that sounds reasonable. I've put a version of it in the PR #11857 (threshold hard coded as 1.5).

Please excuse all the debugging prints for now; the patch is around line 675.

I'll try to test it in different scenarios and also fix the "last point being a NOISE" issue.

@kno10
Contributor
kno10 commented Sep 10, 2018

Boundary points on spikes in OPTICS plots are a bit tricky to handle.
That is why ELKI uses the predecessor (and R copied that from ELKI).

Erich Schubert, Michael Gertz:
Improving the Cluster Structure Extracted from OPTICS Plots. LWDA 2018: 318-329

@adrinjalali
Member Author

Thanks @kno10. Is it the dbscan package you're talking about? As far as I know, R only includes the original paper's methods (DBSCAN and Xi), with some fixes. Which package are you talking about? I've used the dbscan package as a reference.

@kno10
Contributor
kno10 commented Sep 10, 2018

@adrinjalali The dbscan package states: "The OPTICSXi R implementation was directly ported from the ELKI framework's Java implementation". But the same technique can (and should) be used with any other plot-based extraction technique.
The article mentioned above http://ceur-ws.org/Vol-2191/paper37.pdf explains why the "last" point of a valley may or may not belong to the cluster, and how to use the predecessor information to distinguish these two cases. If you always include the last point, you'll get "spike" artifacts.
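A rough sketch of that predecessor correction (illustrative only, not ELKI's actual code): once a candidate cluster has been cut out of the reachability plot, trailing points whose predecessor lies outside the cluster were reached "across" the spike and should be dropped.

def trim_trailing_points(ordering, predecessor, start, end):
    # shrink the cluster interval [start, end] (positions in the OPTICS
    # ordering) while the last point's predecessor is not itself a member
    members = set(ordering[start:end + 1])
    while end > start:
        last = ordering[end]
        pred = predecessor[last]
        if pred in members and pred != last:
            break  # reached from inside the valley: keep it
        members.discard(last)  # reached across the spike: drop it
        end -= 1
    return start, end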

@adrinjalali
Member Author

Update

Related outstanding

@jnothman jnothman modified the milestones: 0.20, 0.21 Sep 17, 2018
kno10 added a commit to kno10/scikit-learn that referenced this issue Sep 20, 2018
The reachability of the second point *must* be the core-distance of the first point.
Either that point was reached from the first, or both have core distance infinity.

Therefore, the current OPTICS code is incorrect. See also: scikit-learn#11677
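(As an aside, the invariant described in that commit can be spelled out directly against the fitted attributes; below is a small sketch on the two-blob data from earlier in the thread. On the implementation discussed here, the two numbers may not match, which is the point of the commit.)

import numpy as np
from sklearn.cluster import OPTICS

np.random.seed(0)
X = np.vstack(([-5, -2] + .8 * np.random.randn(10, 2),
               [4, -1] + .1 * np.random.randn(10, 2)))
clustering = OPTICS(min_samples=5).fit(X)

first, second = clustering.ordering_[0], clustering.ordering_[1]
# either the second point in the ordering was reached from the first, so its
# reachability equals the first point's core distance, or both are infinite
print(clustering.reachability_[second], clustering.core_distances_[first])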
@adrinjalali
Member Author

Closing this, as the extraction algorithm that had these issues has been removed.
