-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
OPTICS detecting the wrong outlier #11677
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
ping @GaelVaroquaux |
perhaps our tests weren't strong enough @espg
|
This looks to be a bug in |
Thank you both |
(continuing from #3846)
I really don't think the data keeps having exactly 5 outliers. I also checked the example provided here and the algorithm detects way too many outliers IMO. |
Thanks for investigating, Adrin. @espg are you working on this? I fear we
may need to retract OPTICS from 0.20 if this is not resolved soon.
|
@jnothman I'm on travel and won't be able to dedicate time on this until next week. The bug @adrinjalali first flagged in this thread needs to be investigated, but the recent example above doesn't seem to be incorrect to me. The clusters are widely spaced, so OPTICS evaluates all points as belonging to a given label; the 5 outliers are one point per transition to the next cluster (boundary point where the reachability distance spikes). OPTICS calculates reachability distances starting with the first point to all other points, and then removes the current point from future consideration and recursively repeats. That means that the reachability distances will be low for all points in the cluster under consideration except for the last point on the cluster-- since all it's neighbors have been removed, it will have a reachability distance based on distance to the next cluster (instead of inter-cluster distance), which is quite high in the case above. If you run the same test with DBSCAN, you'll get varying outliers depending on what the eps value is set to, but generally much higher in number for eps under 0.08, and then zero outliers if you set eps greater than twice that... the fact that OPTICS is only predicting one outlier per cluster means it's generally mapping the clusters better then DBSCAN (with less information, since eps doesn't need to be set), so this is a feature, not a bug. I could see the expected behavior for OPTICS would be to have zero outliers for the case(s) above, and that might be achievable with some change to auto_extract...I'll take a look at this next week and see if there is a way to better handle boundary points. |
Thanks!!
|
Maybe we can compare our implementation with R, or implementation from the original author if available. Tagging blocker. |
Is this issue worth delaying the RC release for? It doesn't sound like it will be solved in the next few days... Maybe add this bug fix either in the final 0.20.0 or 0.20.1? |
Here's a result from the For the code:
These are the calculated clusters (no outliers):
And the following plots: |
@adrinjalali can you please compare the reachability plot from R with the output from this implementation of OPTICS? I'm 95% sure that the issue is in the |
@espg unfortunately it doesn't seem to be the case. They're pretty different. Here's the
And here's the
Putting the two generated files together, here's what you'd see.
Does this help? |
This is interesting, especially given that the R version is finding multiple instances of Can you rerun with the R version using minPts of 5, and an eps of 10.0, and then do the same for OPTICS but with max_bound set to 10.0? If you provide a link to the R results and generated data, I can spin up OPTICS directly to compare against that. Thanks @adrinjalali |
Trying to come up with a smaller example, here's the
And the attributes of the object are now:
The resulting matrix would be:
Here's the data used in this example:
And the
would result in:
The results kinda seem closer now! |
Huh, interesting. The core distances all agree. My initial impression is that the R version is wrong here. The starting point has coordinates (−0.0896914546625, 1.76889309154), and the algorithm should jump to the next closest neighbor... R version goes to point 15, with coordinates (0.178222896030858, 0.028963671017735), and a distance of 1.76044. Sklearn version goes to point 13, with coordinates (−0.0392695355504, 1.94034418314), and a much closer distance of 0.17871. Both algorithms agree on the reachability (and core) distance for point 13, but I can't figure out why the R version orders it number 8 in the ordered list instead of number 2. The points should be ordered according to cluster; points 1, 13, 10, 7, and 4 are all identified as a cluster in both R and Sklearn, but I'd expect them to be adjacent in the ordering scheme (they are in sklearn, they are not at all in R...kinda curious how they got the labeling correct since they aren't in contiguous order, I guess something to do with this predecessor label?). I'm reviewing the source paper to see if there is something I'm missing for the core algorithm here... |
Worth comparing to Weka/ELKI?
I really should think about how we can better facilitate cross-language
comparisons for new implementations.
|
I'm not used to weka, this is what I got out of it:
|
thanks! but that's not with the same 100 instances is it? it says 15
|
The latest example I used for R and Python in the results I put here were also with the same 15 points. I guess it shows the difference. I was trying to make the example as small as I can. |
Oh, I must have missed some comments. Weka looks quite different in its distances? |
what's the status of this? |
I took a look at the original code for automatic extraction from an OPTICS ordered list, and the same bug is present. You can decouple the build of the graph from the graph extraction-- so if you use a different OPTICS extraction function to build the graph on the points in this example and pass it to the unmodified auto_extract code from @amyxzhang, the same bug crops up. In other words, I've isolated where the bug is, but I don't really know how to fix it. |
Hey all, sorry I just got notice of this issue today! It's been a SUPER long time since I looked at this code but I hope this explanation helps. Basically the way the algorithm works (as described in the above pseudocode) is that, given a reachability plot, it searches for local maximas to iteratively split the plot, starting from the highest local maxima. A local maxima is a point on the reachability plot where the point to the left is lower and the point to the right is lower and the split that would occur does not create two groups that are now too small (according to given thresholds). So the bug in the first script by @adrinjalali is because that outlier is at the end of the plot and thus doesn't count as a local maxima given the criteria. Some code could be written to consider potentially rejecting points at the ends of the reachability plot based on some threshold but I don't think it's currently part of the algorithm. You'll notice that in both the 1st and 2nd script by @adrinjalali, there are middle points in the reachability plots that don't get added to a cluster even though they don't seem like outliers in the scatterplot. @espg is right that this is expected behavior of the algorithm as it stands - those points are splitpoints to divide a cluster in two and don't get added to the new right-hand cluster. Again, this could be revisited using some sort of threshold that looks at the average distance of the cluster to the right of the splitpoint. There was also a point raised about too many outliers. That depends on the various parameters input towards the extraction algorithm. Also currently, the auto-extractor is only returning leaves of the resulting cluster tree as opposed to all nodes, leading to more outliers, since it's only returning the most densely packed clusters. You could if you wanted to, slice the tree at a particular depth and get a different set of clusters (much as you would in DBSCAN). In the end, I don't think these issues are showstoppers if there's a deployment going out today. |
Thanks @amyxzhang. If I may, I have one worry regarding labelling those split points as outliers. In practice, with large enough amount of data, people are likely to not investigate why a data point is labelled -1, and simply count them as an outlier. Here's an undesirable scenario: a data scientist working in a fraud detection department is looking for algorithms to find outliers to raise their risk value, and report them as potential fraud. Since the outliers and split points are labelled the same way, the people whose data happen to be the split points, would potentially be treated unfairly as potential fraud. That's why I would argue that those split points should either be included in the clusters, or be labelled specifically as a split point (-2 as opposed to -1 for instance). I know in an ideal world the above scenario wouldn't happen, but unfortunately it's likely that once an algorithm gets into scikit-learn, it's used w/o careful consideration of the details of the algorithm. That's of course just my one cent contribution to the discussion. |
Yes @adrinjalali, I agree that something could be done about the issues, maybe the strategies I mentioned above, as a final pass over the clusters perhaps. I think there would need to be testing to consider edge cases though. It may be trickier than anticipated with noisier data (lots of potential splitpoints, some of which are discarded). Back when I was working on this, I was mainly interested in the spatial boundary delineation of highly dense areas, so the issue did not come up. |
I've not yet had the time to understand the details here. But would it be
correct to say that the implementation is faithful to the reference paper,
but that there is an ambiguity in the algorithm that could be handled in a
way that would be friendlier to ?some users.
|
?
|
Right, the implementation follows the algorithm, which doesn't give guidance for splitpoints and outlier points at the very end of the reachability plot. I think some small tweaks, with sufficient testing, can resolve these issues to make these examples conform to expectations. |
well, fwiw it would be very good to get this changed before the imminent
release if we can come up with specified functionality ..
|
I'm reading the original paper, and I'll get into the details of the implementation (for my own curiosity). Then I'll probably be able to fix some of these issues. But if anybody has the time and is already familiar with the code/algorithm, please don't wait for me. Anyhow, I'll try my best. |
It may be possible to deal with the split points by lumping them into the left-hand objects if their core distance is small. Since OPTICS removes points from consideration once they've been processed, and always picks the next closest point, the last point in a cluster tends to have large reachability distances and often gets marked as noise. Since we know this, and read the graph from left to right, we could have these points marked as cluster members. This should work for cases that have no noise; for cases that have noise (i.e., where the split point is a noise point not close to any cluster), I'd expect that the core distance would be closer to the reachability distance. For the inverse case (where the spit point is part of the cluster), the reachability distance should be large, but the core distance should be much smaller. If we screen this based off of a ratio of those two numbers:
The other path I see is doing some sort of post processing of the split points after the routine finishes (i.e., by labeling split points as -2, and then doing nearest neighbor lookups to determine new labels as either cluster members or noise)...but my guess is that this second approach will be more error prone, especially in cases where the split point is noise, but has no other noise points nearby to infer correct labeling. |
Boundary points on spikes in OPTICS plots are a bit tricky to handle.
|
Thanks @kno10. Is it the DBSCAN package you're talking about? As far as I know, R only includes the original paper's method (DBSCAN and Xi), with some fixes. Which package are you talking about? I've used the |
@adrinjalali The |
The reachability of the second point *must* be the core-distance of the first point. Either that point was reached from the first, or both have core distance infinity. Therefore, the current OPTICS code is incorrect. See also: scikit-learn#11677
Closing this as the |
The following example shows
OPTICS
on a small dataset, with one outlier. It seems that the outlier is incorrectly detected.The text was updated successfully, but these errors were encountered: