-
-
Notifications
You must be signed in to change notification settings - Fork 26k
OPTICS #11547
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
OPTICS #11547
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Equivalent results to DBSCAN, but allows execution on arbitrarily large datasets. After initial construction, allows multiple 'scans' to quickly extract DBSCAN clusters at variable epsilon distances
Shows example usage of OPTICS to extract clustering structure
Mainly issues in long lines, long comments
OPTICS object, with fit method. Documentation updates. extract method
new plot example that matches the updated API updated n_cluster attribute with reruns of extract removed scaling factor on first ‘fit’ run
should pass unit test now?
Scales eps to give stable results from first input distance. Extraction above scaled eps is not allowed; extraction below scaled eps but greater than input distance prints stability warning but still will run clustering extraction. All distances below initial input distance are stable; no warning printed
Fixed noise points from being initialized as type ‘core point’ Fixed initialization for first ‘fit’ call Decoupled eps and eps_prime (deep copy) Matched plot example to same random state as dbscan Added second plot to show ‘extract’ usage
eps is not modified in init; kwargs fix
Why do I have to do this?
Fit now returns self; test fix for unit test fail in ball tree (asarray problem)
labels to labels_
Temporary fix until balltree can be updated to deal with sparse matrices.
Using ‘check_array’
Removed extraneous lines and comments (old commented out code has been removed)
Checking module compatibility with current sklearn build
Added the unit tests in test_optics for fit, extract, and declaration. Should bring coverage to ~100%. Additionally, fixed a small bug that cropped up in the extract function during testing.
Removed unused imports, added to get warning.
see title
Result should be 100% coverage
All public methods now have doc strings; forcing a rebuild of OPTICS so that build tests pass (last round failed due to external module)
99% pep8 now… line 138 isn’t, but reads better with the long variable names
Includes general description, discussion of algorithm output, and comparison with DBSCAN. References and implementation notes are included
Following suggestion from jnothman for doing nneighbors queries enmass. Added OPTICS to cluster init
Also decouples n_jobs -- can set n_jobs for just the kneighbors lookup, while keeping pairwise lookups to single job
also renamed core_dists to core_distances.
narrative changes to tests. Normalized reachability distances for significant_min parameter. Cleaned up plots; small changes to documentation with optics_.py
also minor documentation updates
since labels are 0 indexed, max of labels is (1 - total number of clusters). We can't take len(set(clusters)) because of noise (will be 4, not 3).
Fixed float division, condensed variable names to be shorter, renamed bools to True and False, removed un-needed code block. Still need to add tests for cluster tree extraction :-(
coverage should be complete now; fixed minor bug; removed un-needed check.
also fixed documentation link
entry was supposed to be '1.0' not '10'; the test is supposed to posit 3 clusters, 2 of which are too small and are merged. Old version posited 4 clusters, two of which were merged as intended, and two of which were discarded (cluster merging requires one of the clusters to be large enough to be an independent cluster; with 4 instead of three, this case did not happen for either of the first two clusters).
merge conflict in clustering.rst in previous commit; this push updates the doctest values to current correct values, and resolves the conflict
Restructured documentation. Small unit test fixes. Added test to ensure clustering metrics between OPTICS dbscan_extract and DBSCAN are within 2% of each other.
renamed reachability_ordering --> ordering for consistency
changes unit tests to check for specific error message (instead of 'a' error); minor updates to documentation. This also fixes a bug in the extract_dbscan function whereby some core points were erroneously marked as periphery... this is fixed by reverting to a previous extraction code block that initalizes all points as core and then demotes noise and periphery points during the extract scan. Parameterized unit test.
New invarient test between optics and dbscan
Vectorized extract dbscan function to see if would improve performance of periphery point labeling; it did not, but the function is vectorized. Changed unit test with min_samples=1 to min_samples=3, as at min_samples=1 the test isn't meaningful (no noise is possible, all points are marked core). Parameterized parity test. Changed parity test to assert ~5% or better mismatch, instead of 5 points (this is needed for larger clusters, as the starting point mismatch effect scales with cluster size).
Merged!!!! |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
A continuation of #1984. Will merge if tests pass.