Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
[WIP] fixes OPTICS split_points detected as NOISE and last point not detected as outlier. #11857
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] fixes OPTICS split_points detected as NOISE and last point not detected as outlier. #11857
Changes from all commits
7dbea25
a0a2d4e
18d5f9e
6dbea77
53a933e
9728b96
c6dd608
e314b65
3b9cbfe
879178c
d52669b
1505879
67ade15
8ff84fd
078afb3
a2cf913
15d9d7f
29a1f0f
1c683a0
d1ce12d
8bb29cf
748f8c2
c40306f
846979f
e654ea6
7ab9dcd
52edc47
58d278c
9d9a4e4
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is regarding having an outlier in the last position of
ordering_
. This checks for that the same way it checks if a split point is an outlier, since it would be a split point if it wasn't the last point in theordering_
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems fine to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed
reachability_ordering
toordering
since the same ordering applies (and is calculated) for both arrays (reachability and core_distances).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@espg, could you comment here, please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@espg, we're still not sure why the code doesn't include all the points here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure, but I think it was to allow for outliers (i.e. if you have an inf or split point). The original code was hierarchical, so you could have nested clusters belonging to the same parent cluster. My guess (and it is a guess) is that the check ratio addresses that case for nested clusters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I wrote this code, I recall it was because I was dealing with many closely nested candidate clusters, where the higher-level candidates in the cluster tree were being rejected for the lower-level candidates that were more tightly formed. I added this bit so I could tune the algorithm to be more inclusive of slightly further away points in the leaf clusters.
For the general purpose of having this in the library, this portion could probably be removed, as I don't think it's part of the original algorithm (@adrinjalali can confirm). This was the only place where I made any additional checks as compared to the paper.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I read the paper yesterday (here for reference), and this is not a part of the main algorithm. However, I wouldn't be opposed to including this as a parameter, with
1
as the default value as is in the paper. We may need to alter the algorithm a bit for the split points issue anyway, and I know the other implementations (at least in R, although not of this algorithm, but the original OPTICS), don't strictly follow the publication and they may include some additions. But that's a different issue and I'm still working on how to articulate my thoughts on it the best I can.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tests if exactly 6 clusters are detected and no NOISE.