OPTICS detecting the wrong outlier #11677

adrinjalali · 2018-07-25T08:20:27Z

The following example shows OPTICS on a small dataset, with one outlier. It seems that the outlier is incorrectly detected.

>>> from sklearn.cluster import OPTICS
>>> import numpy as np
>>> X = np.array([[1, 2], [2, 2], [2, 3],
...               [80, 70], [80, 80], [250, 800]])
>>> 
>>> 
>>> np.random.seed(0)
>>> n_points_per_cluster = 10
>>> 
>>> C1 = [-5, -2] + .8 * np.random.randn(n_points_per_cluster, 2)
>>> C2 = [4, -1] + .1 * np.random.randn(n_points_per_cluster, 2)
>>> X = np.vstack((C1, C2, np.array([[100, 200]])))
>>> 
>>> clustering = OPTICS(metric="euclidean").fit(X)
>>> np.hstack((X, clustering.labels_[:, np.newaxis]))
array([[ -3.58875812,  -1.67987423,   0.        ],
       [ -4.21700961,  -0.20728544,   0.        ],
       [ -3.50595361,  -2.7818223 ,   0.        ],
       [ -4.23992927,  -2.12108577,   0.        ],
       [ -5.08257508,  -1.6715212 ,   0.        ],
       [ -4.88476514,  -0.83658119,   0.        ],
       [ -4.39116982,  -1.90265999,   0.        ],
       [ -4.64490941,  -1.73306054,   0.        ],
       [ -3.80473674,  -2.16412661,   0.        ],
       [ -4.74954584,  -2.68327659,   0.        ],
       [  3.74470102,  -0.93463814,  -1.        ],
       [  4.08644362,  -1.0742165 ,   1.        ],
       [  4.22697546,  -1.14543657,   1.        ],
       [  4.00457585,  -1.01871839,   1.        ],
       [  4.15327792,  -0.85306412,   1.        ],
       [  4.01549474,  -0.96218375,   1.        ],
       [  3.91122143,  -1.19807965,   1.        ],
       [  3.96520879,  -0.9843651 ,   1.        ],
       [  4.12302907,  -0.87976202,   1.        ],
       [  3.96126732,  -1.03023028,   1.        ],
       [100.        , 200.        ,   1.        ]])
>>> clustering # doctest: +NORMALIZE_WHITESPACE
OPTICS(algorithm='ball_tree', leaf_size=30, max_bound=inf, maxima_ratio=0.75,
    metric='euclidean', metric_params=None, min_cluster_size_ratio=0.005,
    min_maxima_ratio=0.001, min_samples=5, n_jobs=1, p=2,
    rejection_ratio=0.7, significant_min=0.003, similarity_threshold=0.4)

The text was updated successfully, but these errors were encountered:

adrinjalali · 2018-07-25T08:42:34Z

ping @GaelVaroquaux

jnothman · 2018-07-25T09:56:38Z

perhaps our tests weren't strong enough @espg

espg · 2018-07-25T15:43:54Z

This looks to be a bug in _automatic_cluster. I'm guessing that it occurs when the leaf in the reachability tree is too small, and is merged to the adjacent leaf, instead of being marked as noise / removed. I'll see if I can find a fix (I think around lines 709 or 725).

jnothman · 2018-07-25T21:52:50Z

Thank you both

adrinjalali · 2018-07-31T15:35:35Z

(continuing from #3846)
@qinhanmin2014 I don't think I can come up with a bug-free example. Check the following:

>>> from sklearn.cluster import OPTICS
>>> import numpy as np
>>> 
>>> # Generate sample data
>>> 
>>> np.random.seed(0)
>>> for _ in range(10):
...     n_points_per_cluster = 250
...     C1 = [-50, -20] + .08 * np.random.randn(n_points_per_cluster, 2)
...     C2 = [40, -10] + .01 * np.random.randn(n_points_per_cluster, 2)
...     C3 = [10, -20] + .02 * np.random.randn(n_points_per_cluster, 2)
...     C4 = [-20, 30] + .03 * np.random.randn(n_points_per_cluster, 2)
...     C5 = [30, -20] + .06 * np.random.randn(n_points_per_cluster, 2)
...     C6 = [50, 60] + .02 * np.random.randn(n_points_per_cluster, 2)
...     X = np.vstack((C1, C2, C3, C4, C5, C6))
...     clustering = OPTICS(min_samples=9, rejection_ratio=0.5)
...     # Run the fit
...     clustering.fit(X)
...     print(np.sum(clustering.labels_ == -1))
5
5
5
5
5
5
5
5
5
5

I really don't think the data keeps having exactly 5 outliers. I also checked the example provided here and the algorithm detects way too many outliers IMO.

jnothman · 2018-07-31T22:31:35Z

Thanks for investigating, Adrin. @espg are you working on this? I fear we may need to retract OPTICS from 0.20 if this is not resolved soon.

espg · 2018-07-31T23:52:53Z

@jnothman I'm on travel and won't be able to dedicate time on this until next week. The bug @adrinjalali first flagged in this thread needs to be investigated, but the recent example above doesn't seem to be incorrect to me. The clusters are widely spaced, so OPTICS evaluates all points as belonging to a given label; the 5 outliers are one point per transition to the next cluster (boundary point where the reachability distance spikes). OPTICS calculates reachability distances starting with the first point to all other points, and then removes the current point from future consideration and recursively repeats. That means that the reachability distances will be low for all points in the cluster under consideration except for the last point on the cluster-- since all it's neighbors have been removed, it will have a reachability distance based on distance to the next cluster (instead of inter-cluster distance), which is quite high in the case above. If you run the same test with DBSCAN, you'll get varying outliers depending on what the eps value is set to, but generally much higher in number for eps under 0.08, and then zero outliers if you set eps greater than twice that... the fact that OPTICS is only predicting one outlier per cluster means it's generally mapping the clusters better then DBSCAN (with less information, since eps doesn't need to be set), so this is a feature, not a bug.

I could see the expected behavior for OPTICS would be to have zero outliers for the case(s) above, and that might be achievable with some change to auto_extract...I'll take a look at this next week and see if there is a way to better handle boundary points.

jnothman · 2018-07-31T23:58:46Z

Thanks!!

qinhanmin2014 · 2018-08-01T00:16:46Z

Maybe we can compare our implementation with R, or implementation from the original author if available. Tagging blocker.

rth · 2018-08-01T08:29:02Z

Is this issue worth delaying the RC release for? It doesn't sound like it will be solved in the next few days... Maybe add this bug fix either in the final 0.20.0 or 0.20.1?

adrinjalali · 2018-08-01T08:49:33Z

Here's a result from the R package (dbscan)

For the code:

set.seed(2)
n <- 400
x <- cbind(
  x = c(0, 2, 0, 2) + rnorm(n, sd=0.1),
  y = c(2, 0, 0, 2) + rnorm(n, sd=0.1)
)

library(dbscan)

res = optics(x)
#predict(res, x)
plot(res)
plot(x, col = "grey")
polygon(x[res$order,])
cl <- extractDBSCAN(res, eps_cl = .17)
cl$cluster
plot(cl)
hullplot(x, cl)

These are the calculated clusters (no outliers):

> cl$cluster
  [1] 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3
 [60] 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2
[119] 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1
[178] 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
[237] 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3
[296] 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2
[355] 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

And the following plots:

espg · 2018-08-07T20:51:40Z

@adrinjalali can you please compare the reachability plot from R with the output from this implementation of OPTICS? I'm 95% sure that the issue is in the auto_extract function, but it would be good to be positive, and I'm not really familiar with R. If the bug is in auto_extract, the calculated and ordered reachability distances should be identical to within machine precision from the two implementations.

adrinjalali · 2018-08-08T15:56:17Z

@espg unfortunately it doesn't seem to be the case. They're pretty different.

Here's the R code I used to create the data and get the reachability:

set.seed(2)
n <- 100
x <- cbind(
  x = c(0, 2, 0, 2) + rnorm(n, sd=0.1),
  y = c(2, 0, 0, 2) + rnorm(n, sd=0.1)
)

write.csv(x, file='/tmp/data.csv')

library(dbscan)

res = optics(x)
#predict(res, x)
plot(res)
plot(x, col = "grey")
polygon(x[res$order,])
cl <- extractDBSCAN(res, eps_cl = .17)
cl$cluster
plot(cl)
hullplot(x, cl)

write.csv(matrix(c(cl$order, cl$reachdist), ncol=2), file='/tmp/r.csv')

And here's the python code to apply OPTICS on the same data:

import pandas as pd
import numpy as np
from sklearn.cluster import OPTICS
data = pd.read_csv('/tmp/data.csv')
data = data[['x', 'y']]
clustering = OPTICS(min_samples=5)
clustering.fit(data)
np.savetxt("/tmp/py.csv",
           np.transpose(np.array([clustering.ordering_,
                                  clustering.reachability_])))

Putting the two generated files together, here's what you'd see.

python order	python reachability	R order	R reachability
0	inf	1	Inf
52	0.07329448816232	57	Inf
56	0.071566498480906	73	Inf
12	0.086188714113748	53	Inf
72	0.030358668261071	13	0.03035866826107
40	1.58645318232263	89	0.091487092011897
88	0.064431760283397	49	0.064431760283397
48	0.08831105629047	41	0.088311056290458
4	0.115609686828167	25	0.115609686828167
24	0.077812333737916	5	0.077812333737916
68	0.079401452929697	69	0.079401452929697
64	0.088171163308115	65	0.088171163308108
28	0.073131402725988	81	0.073131402725989
80	0.094438516074332	29	0.094438516074335
36	1.57660870727407	37	0.070793201620562
76	1.56178634568536	77	0.108586894358706
44	0.102916640800507	45	0.102916640800511
...	...	...	...

Does this help?

espg · 2018-08-08T17:47:18Z

This is interesting, especially given that the R version is finding multiple instances of inf reachability past the first point. I'm still unclear on the R code that you posted-- I see you set and eps threshold (0.17), but I don't see you define minPts anywhere. What is R using or defaulting to-- it looks like maybe two points per cluster? Also, there is not max_bound set on our implementation of OPTICS, so the results won't be directly comparable...

Can you rerun with the R version using minPts of 5, and an eps of 10.0, and then do the same for OPTICS but with max_bound set to 10.0? If you provide a link to the R results and generated data, I can spin up OPTICS directly to compare against that.

Thanks @adrinjalali

adrinjalali · 2018-08-08T21:48:32Z

Trying to come up with a smaller example, here's the R code (it also includes installing the required package, you can just run it if you're curious)

set.seed(2)
n <- 15
x <- cbind(
	x = c(0, 2, 0) + rnorm(n, sd=0.1),
	y = c(2, 0, 0) + rnorm(n, sd=0.1)
)

write.csv(x, file='/tmp/data.csv')

install.packages('dbscan')
library(dbscan)

res = optics(x, eps = 10, minPts = 5)
plot(res)
plot(x, col = "grey")
polygon(x[res$order,])
cl <- extractDBSCAN(res, eps_cl = .5)
cl$cluster
plot(cl)
hullplot(x, cl)

mat = matrix(c(cl$order, cl$reachdist,
			   cl$coredist, cl$predecessor,
			   cl$cluster), ncol=5)
colnames(mat)<- c('order', 'reachdist', 'coredist', 'predecessor', 'cluster')
write.csv(mat, file='/tmp/r.csv')

And the attributes of the object are now:

> cl
OPTICS ordering/clustering for 15 objects.
Parameters: minPts = 5, eps = 10, eps_cl = 0.5, xi = NA
The clustering contains 3 cluster(s) and 0 noise points.

1 2 3 
5 5 5 

Available fields: order, reachdist, coredist, predecessor, minPts, eps, eps_cl, xi, cluster

The resulting matrix would be:

	order	reachdist	coredist	predecessor	cluster
1	1	Inf	0.333208651272379	NA	1
2	13	0.292663608330619	0.333843791583614	5	3
3	10	0.182346160922536	0.251820141958916	12	2
4	7	0.17871163011086	0.333208651272379	13	1
5	4	0.355622421383634	0.292663608330619	14	3
6	6	1.56320382265559	0.251820141958916	1	2
7	15	0.17871163011086	0.287675880147458	13	1
8	12	0.355622421383634	0.409445643425412	14	3
9	9	0.182346160922536	0.195940809782328	12	2
10	3	0.17871163011086	0.243693355280387	13	1
11	14	0.355622421383634	0.409445643425412	14	3
12	11	0.244256566906985	0.182346160922536	15	2
13	8	0.333208651272379	0.17871163011086	1	1
14	5	1.7015610774674	0.355622421383634	9	3
15	2	0.251820141958916	0.244256566906985	6	2

Here's the data used in this example:

x	y
-0.089691454662498	1.76889309153948
2.01848491846467	0.087860458092127
0.158784533120882	0.003580671801523
-0.113037567424629	2.10128286921271
1.9919748243449	0.043226515453962
0.013242028438109	0.209081920524915
0.070795472927173	1.88000741803561
1.97603019758282	0.158963820029007
0.198447393665293	0.195465164222325
-0.013878701211967	2.00049377768281
2.04176507507926	-0.245170638784613
0.098175277746366	0.047723730261362
-0.039269535550381	1.94034418313686
1.89603310230511	0.079220327029965
0.178222896030858	0.028963671017735

And the python code on this same data:

import pandas as pd
import numpy as np
from sklearn.cluster import OPTICS
data = pd.read_csv('/tmp/data.csv')
data = data[['x', 'y']]
clustering = OPTICS(min_samples=5, max_bound=10)
clustering.fit(data)
np.savetxt("/tmp/py.csv",
           np.transpose(np.array([clustering.ordering_,
                                  clustering.reachability_])))

would result in:

ordering	reachability	core_distances	labels
0	inf	0.333208651272384	0
12	0.29266360833062	0.333843791583615	-1
9	0.182346160922537	0.251820141958916	-1
3	0.178711630110863	0.333208651272384	0
6	0.355622421383636	0.29266360833062	-1
5	1.56320382265559	0.251820141958916	-1
11	0.178711630110863	0.287675880147461	0
2	0.29266360833062	0.409445643425412	-1
14	0.182346160922537	0.195940809782328	-1
8	0.178711630110863	0.243693355280386	0
13	0.292663608330621	0.409445643425412	-1
4	0.251820141958916	0.182346160922537	-1
1	0.333208651272384	0.178711630110863	0
7	1.7015610774674	0.355622421383636	-1
10	0.182346160922537	0.244256566906986	-1

The results kinda seem closer now!

espg · 2018-08-09T00:00:38Z

Huh, interesting. The core distances all agree. My initial impression is that the R version is wrong here. The starting point has coordinates (−0.0896914546625, 1.76889309154), and the algorithm should jump to the next closest neighbor... R version goes to point 15, with coordinates (0.178222896030858, 0.028963671017735), and a distance of 1.76044. Sklearn version goes to point 13, with coordinates (−0.0392695355504, 1.94034418314), and a much closer distance of 0.17871. Both algorithms agree on the reachability (and core) distance for point 13, but I can't figure out why the R version orders it number 8 in the ordered list instead of number 2. The points should be ordered according to cluster; points 1, 13, 10, 7, and 4 are all identified as a cluster in both R and Sklearn, but I'd expect them to be adjacent in the ordering scheme (they are in sklearn, they are not at all in R...kinda curious how they got the labeling correct since they aren't in contiguous order, I guess something to do with this predecessor label?).

I'm reviewing the source paper to see if there is something I'm missing for the core algorithm here...

jnothman · 2018-08-09T00:26:37Z

Worth comparing to Weka/ELKI? I really should think about how we can better facilitate cross-language comparisons for new implementations.

adrinjalali · 2018-08-09T08:23:29Z

I'm not used to weka, this is what I got out of it:

=== Run information ===

Scheme:       weka.clusterers.OPTICS -E 10.0 -M 5 -A "weka.core.EuclideanDistance -R first-last" -F -no-gui -db-output .
Relation:     data-weka.filters.unsupervised.attribute.Remove-R1
Instances:    15
Attributes:   2
              x
              y
Test mode:    evaluate on training data


=== Clustering model (full training set) ===

OPTICS clustering results
============================================================================================

Clustered DataObjects: 15
Number of attributes: 2
Epsilon: 10.0; minPoints: 5
Write results to file: yes
Distance-type: 
Number of generated clusters: 0
Elapsed time: .0

( 0.) -0.089691,1.768893                        -->  c_dist: 0.142        r_dist: UNDEFINED   
(12.) -0.03927,1.940344                         -->  c_dist: 0.077        r_dist: 0.142
( 3.) -0.113038,2.101283                        -->  c_dist: 0.142        r_dist: 0.077
( 6.) 0.070795,1.880007                         -->  c_dist: 0.127        r_dist: 0.077
( 9.) -0.013879,2.000494                        -->  c_dist: 0.105        r_dist: 0.077
( 5.) 0.013242,0.209082                         -->  c_dist: 0.111        r_dist: 0.666
(11.) 0.098175,0.047724                         -->  c_dist: 0.079        r_dist: 0.111
(14.) 0.178223,0.028964                         -->  c_dist: 0.108        r_dist: 0.079
( 8.) 0.198447,0.195465                         -->  c_dist: 0.086        r_dist: 0.079
( 2.) 0.158785,0.003581                         -->  c_dist: 0.111        r_dist: 0.079
(13.) 1.896033,0.07922                          -->  c_dist: 0.154        r_dist: 0.789
( 1.) 2.018485,0.08786                          -->  c_dist: 0.142        r_dist: 0.154
(10.) 2.041765,-0.245171                        -->  c_dist: 0.175        r_dist: 0.142
( 7.) 1.97603,0.158964                          -->  c_dist: 0.175        r_dist: 0.142
( 4.) 1.991975,0.043227                         -->  c_dist: 0.125        r_dist: 0.142



Time taken to build model (full training data) : 0.01 seconds

=== Model and evaluation on training set ===

Clustered Instances


Unclustered instances : 15

jnothman · 2018-08-09T08:26:17Z

thanks! but that's not with the same 100 instances is it? it says 15

adrinjalali · 2018-08-09T08:50:47Z

The latest example I used for R and Python in the results I put here were also with the same 15 points. I guess it shows the difference. I was trying to make the example as small as I can.

jnothman · 2018-08-09T09:21:44Z

Oh, I must have missed some comments. Weka looks quite different in its distances?

amueller · 2018-08-16T21:30:47Z

what's the status of this?

espg · 2018-08-16T21:46:18Z

I took a look at the original code for automatic extraction from an OPTICS ordered list, and the same bug is present. You can decouple the build of the graph from the graph extraction-- so if you use a different OPTICS extraction function to build the graph on the points in this example and pass it to the unmodified auto_extract code from @amyxzhang, the same bug crops up. In other words, I've isolated where the bug is, but I don't really know how to fix it.

espg · 2018-08-17T19:31:04Z

Following up from @jnothman request in #11838 , here is the reference paper implementation pseudo code.

Converting reachability plot to dendrogram:

Main algorithm with tree trimming:

amyxzhang · 2018-08-17T23:00:47Z

Hey all, sorry I just got notice of this issue today! It's been a SUPER long time since I looked at this code but I hope this explanation helps.

Basically the way the algorithm works (as described in the above pseudocode) is that, given a reachability plot, it searches for local maximas to iteratively split the plot, starting from the highest local maxima. A local maxima is a point on the reachability plot where the point to the left is lower and the point to the right is lower and the split that would occur does not create two groups that are now too small (according to given thresholds).

So the bug in the first script by @adrinjalali is because that outlier is at the end of the plot and thus doesn't count as a local maxima given the criteria. Some code could be written to consider potentially rejecting points at the ends of the reachability plot based on some threshold but I don't think it's currently part of the algorithm.

You'll notice that in both the 1st and 2nd script by @adrinjalali, there are middle points in the reachability plots that don't get added to a cluster even though they don't seem like outliers in the scatterplot. @espg is right that this is expected behavior of the algorithm as it stands - those points are splitpoints to divide a cluster in two and don't get added to the new right-hand cluster. Again, this could be revisited using some sort of threshold that looks at the average distance of the cluster to the right of the splitpoint.

There was also a point raised about too many outliers. That depends on the various parameters input towards the extraction algorithm. Also currently, the auto-extractor is only returning leaves of the resulting cluster tree as opposed to all nodes, leading to more outliers, since it's only returning the most densely packed clusters. You could if you wanted to, slice the tree at a particular depth and get a different set of clusters (much as you would in DBSCAN).

In the end, I don't think these issues are showstoppers if there's a deployment going out today.

adrinjalali · 2018-08-18T12:46:00Z

Thanks @amyxzhang.

If I may, I have one worry regarding labelling those split points as outliers. In practice, with large enough amount of data, people are likely to not investigate why a data point is labelled -1, and simply count them as an outlier.

Here's an undesirable scenario: a data scientist working in a fraud detection department is looking for algorithms to find outliers to raise their risk value, and report them as potential fraud. Since the outliers and split points are labelled the same way, the people whose data happen to be the split points, would potentially be treated unfairly as potential fraud.

That's why I would argue that those split points should either be included in the clusters, or be labelled specifically as a split point (-2 as opposed to -1 for instance).

I know in an ideal world the above scenario wouldn't happen, but unfortunately it's likely that once an algorithm gets into scikit-learn, it's used w/o careful consideration of the details of the algorithm.

That's of course just my one cent contribution to the discussion.

amyxzhang · 2018-08-18T18:09:11Z

Yes @adrinjalali, I agree that something could be done about the issues, maybe the strategies I mentioned above, as a final pass over the clusters perhaps. I think there would need to be testing to consider edge cases though. It may be trickier than anticipated with noisier data (lots of potential splitpoints, some of which are discarded). Back when I was working on this, I was mainly interested in the spatial boundary delineation of highly dense areas, so the issue did not come up.

jnothman · 2018-08-19T01:04:49Z

I've not yet had the time to understand the details here. But would it be correct to say that the implementation is faithful to the reference paper, but that there is an ambiguity in the algorithm that could be handled in a way that would be friendlier to ?some users.

jnothman · 2018-08-19T01:04:54Z

?

amyxzhang · 2018-08-19T02:07:34Z

Right, the implementation follows the algorithm, which doesn't give guidance for splitpoints and outlier points at the very end of the reachability plot. I think some small tweaks, with sufficient testing, can resolve these issues to make these examples conform to expectations.

jnothman · 2018-08-19T02:17:52Z

well, fwiw it would be very good to get this changed before the imminent release if we can come up with specified functionality ..

adrinjalali · 2018-08-19T14:26:52Z

I'm reading the original paper, and I'll get into the details of the implementation (for my own curiosity). Then I'll probably be able to fix some of these issues. But if anybody has the time and is already familiar with the code/algorithm, please don't wait for me. Anyhow, I'll try my best.

espg · 2018-08-19T18:12:19Z

It may be possible to deal with the split points by lumping them into the left-hand objects if their core distance is small. Since OPTICS removes points from consideration once they've been processed, and always picks the next closest point, the last point in a cluster tends to have large reachability distances and often gets marked as noise. Since we know this, and read the graph from left to right, we could have these points marked as cluster members. This should work for cases that have no noise; for cases that have noise (i.e., where the split point is a noise point not close to any cluster), I'd expect that the core distance would be closer to the reachability distance. For the inverse case (where the spit point is part of the cluster), the reachability distance should be large, but the core distance should be much smaller. If we screen this based off of a ratio of those two numbers:

We shouldn't have to calculate any new distances (core and reachability are stored already)
We shouldn't have to post process (we can do in the loop / tree build process)
We don't have to add new label definitions (i.e., -2 for split points)

The other path I see is doing some sort of post processing of the split points after the routine finishes (i.e., by labeling split points as -2, and then doing nearest neighbor lookups to determine new labels as either cluster members or noise)...but my guess is that this second approach will be more error prone, especially in cases where the split point is noise, but has no other noise points nearby to infer correct labeling.

adrinjalali · 2018-08-19T19:15:41Z

@espg that sounds reasonable. I've put a version of it in the PR #11857 (threshold hard coded as 1.5).

Please excuse all the debugging prints for now, the patch is around line 675.

I'll try to test it in different scenarios and also fix the "last point being a NOISE" issue.

kno10 · 2018-09-10T00:09:17Z

Boundary points on spikes in OPTICS plots are a bit tricky to handle.
That is why ELKI uses the predecessor (and R copied that from ELKI).

Erich Schubert, Michael Gertz:
Improving the Cluster Structure Extracted from OPTICS Plots. LWDA 2018: 318-329

adrinjalali · 2018-09-10T07:39:58Z

Thanks @kno10. Is it the DBSCAN package you're talking about? As far as I know, R only includes the original paper's method (DBSCAN and Xi), with some fixes. Which package are you talking about? I've used the dbscan package as reference.

kno10 · 2018-09-10T09:27:44Z

@adrinjalali The dbscan package states: "The OPTICSXi R implementation was directly ported from the ELKI framework's Java implementation". But the same technique can (and should) be used with any other plot-based extraction technique.
The article mentioned above http://ceur-ws.org/Vol-2191/paper37.pdf explains why the "last" point of a valley may or may not belong to the cluster, and how to use the predecessor information to distinguish these two cases. If you always include the last point, you'll get "spike" artifacts.

adrinjalali · 2018-09-10T11:27:36Z

Update

The issue with the multiple inf reachability values, PR [MRG+1] OPTICS correctly handle multiple infs in reachability array. #12029 fixes the issue (needs a second review)
The issue with the metric parameter, almost fixed, it still doesn't support sparse input, issue OPTICS metric param docstring and parameter handling needs improvements #12009, OPTICS should support sparse matrices and precomputed distances #11982, PR [MRG+1] Fix optics metric issues (DOC and precomputed) #12028 (needs a second review)
Handling the split points borrowing the xi-steep concept from the original paper: PR [WIP] OPTICS include split points in clusters if not noise #12049 (needs review)
Compare floats in quick_scan with isclose: PR [MRG+1] OPTICS Change quick_scan floating point comparison to isclose #11929 (waiting to be merged)

Related outstanding

Make the reachability_ values consistent with a reference implementation (See [MRG+1] OPTICS Change quick_scan floating point comparison to isclose #11929 (comment))
Either add a parameter related to maxima_ratio check, or compare with all members of the cluster (See [WIP] fixes OPTICS split_points detected as NOISE and last point not detected as outlier. #11857 (comment))

The reachability of the second point *must* be the core-distance of the first point. Either that point was reached from the first, or both have core distance infinity. Therefore, the current OPTICS code is incorrect. See also: scikit-learn#11677

adrinjalali · 2019-03-21T18:17:31Z

Closing this as the sqlnk algorithm having these issues is removed.

jnothman added the Bug label Jul 26, 2018

jnothman added this to the 0.20 milestone Jul 26, 2018

adrinjalali mentioned this issue Jul 31, 2018

Add examples to class docs #3846

Closed

23 tasks

qinhanmin2014 added the Blocker label Aug 1, 2018

jnothman mentioned this issue Aug 16, 2018

DOC Release dates for 0.20RC #11838

Merged

adrinjalali mentioned this issue Aug 20, 2018

[WIP] fixes OPTICS split_points detected as NOISE and last point not detected as outlier. #11857

Closed

adrinjalali mentioned this issue Sep 7, 2018

OPTICS extraction methods #12036

Closed

adrinjalali mentioned this issue Sep 10, 2018

[WIP] OPTICS include split points in clusters if not noise #12049

Closed

This was referenced Sep 12, 2018

Remove OPTICS from 0.20 #12053

Merged

[MRG+2] OPTICS: add extract_xi method #12077

Merged

jnothman modified the milestones: 0.20, 0.21 Sep 17, 2018

espg mentioned this issue Sep 19, 2018

OPTICS test_reach_dists inappropriate #12090

Closed

kno10 mentioned this issue Sep 20, 2018

Add (failing) test for OPTICS #12123

Closed

kno10 mentioned this issue Oct 11, 2018

[MRG+1] Fix OPTICS processing order #12357

Merged

adrinjalali closed this as completed Mar 21, 2019

Uh oh!

OPTICS detecting the wrong outlier #11677

OPTICS detecting the wrong outlier #11677

Comments

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Update

Related outstanding

Uh oh!

Uh oh!