8000 OPTICS by GaelVaroquaux · Pull Request #11547 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

OPTICS #11547

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 139 commits into from
Jul 16, 2018
Merged

OPTICS #11547

merged 139 commits into from
Jul 16, 2018

Conversation

GaelVaroquaux
Copy link
Member

A continuation of #1984. Will merge if tests pass.

espg and others added 30 commits May 21, 2013 17:26
Equivalent results to DBSCAN, but allows execution on arbitrarily large datasets. After initial construction, allows multiple 'scans' to quickly extract DBSCAN clusters at variable epsilon distances
Shows example usage of OPTICS to extract clustering structure
Mainly issues in long lines, long comments
OPTICS object, with fit method. Documentation updates.
extract method
new plot example that matches the updated API
updated n_cluster attribute with reruns of extract
removed scaling factor on first ‘fit’ run
should pass unit test now?
Scales eps to give stable results from first input distance. Extraction
above scaled eps is not allowed; extraction below scaled eps but
greater than input distance prints stability warning but still will run
clustering extraction. All distances below initial input distance are
stable; no warning printed
Fixed noise points from being initialized as type ‘core point’
Fixed initialization for first ‘fit’ call
Decoupled eps and eps_prime (deep copy)

Matched plot example to same random state as dbscan
Added second plot to show ‘extract’ usage
eps is not modified in init; kwargs fix
Why do I have to do this?
Fit now returns self; test fix for unit test fail in ball tree (asarray
problem)
labels to labels_
Temporary fix until balltree can be updated to deal with sparse
matrices.
Using ‘check_array’
Removed extraneous lines and comments (old commented out code has been
removed)
Checking module compatibility with current sklearn build
Added the unit tests in test_optics for fit, extract, and declaration.
Should bring coverage to ~100%. Additionally, fixed a small bug that
cropped up in the extract function during testing.
Removed unused imports, added to get warning.
Result should be 100% coverage
All public methods now have doc strings; forcing a rebuild of OPTICS so
that build tests pass (last round failed due to external module)
99% pep8 now… line 138 isn’t, but reads better with the long variable
names
Includes general description, discussion of algorithm output, and
comparison with DBSCAN. References and implementation notes are included
Following suggestion from jnothman for doing nneighbors queries enmass.
Added OPTICS to cluster init
espg and others added 27 commits February 13, 2018 15:44
Also decouples n_jobs -- can set n_jobs for just the kneighbors lookup, while keeping pairwise lookups to single job
also renamed core_dists to core_distances.
narrative changes to tests. Normalized reachability distances for significant_min parameter. Cleaned up plots; small changes to documentation with optics_.py
also minor documentation updates
since labels are 0 indexed, max of labels is (1 - total number of clusters). We can't take len(set(clusters)) because of noise (will be 4, not 3).
Fixed float division, condensed variable names to be shorter, renamed bools to True and False, removed un-needed code block. Still need to add tests for cluster tree extraction :-(
coverage should be complete now; fixed minor bug; removed un-needed check.
also fixed documentation link
entry was supposed to be '1.0' not '10'; the test is supposed to posit 3 clusters, 2 of which are too small and are merged. Old version posited 4 clusters, two of which were merged as intended, and two of which were discarded (cluster merging requires one of the clusters to be large enough to be an independent cluster; with 4 instead of three, this case did not happen for either of the first two clusters).
merge conflict in clustering.rst in previous commit; this push updates the doctest values to current correct values, and resolves the conflict
Restructured documentation. Small unit test fixes. Added test to ensure clustering metrics between OPTICS dbscan_extract and DBSCAN are within 2% of each other.
renamed reachability_ordering --> ordering for consistency
changes unit tests to check for specific error message (instead of 'a' error); minor updates to documentation. This also fixes a bug in the extract_dbscan function whereby some core points were erroneously marked as periphery... this is fixed by reverting to a previous extraction code block that initalizes all points as core and then demotes noise and periphery points during the extract scan. Parameterized unit test.
New invarient test between optics and dbscan
Vectorized extract dbscan function to see if would improve performance of periphery point labeling; it did not, but the function is vectorized. Changed unit test with min_samples=1 to min_samples=3, as at min_samples=1 the test isn't meaningful (no noise is possible, all points are marked core). Parameterized parity test. Changed parity test to assert ~5% or better mismatch, instead of 5 points (this is needed for larger clusters, as the starting point mismatch effect scales with cluster size).
@GaelVaroquaux GaelVaroquaux merged commit 7d745ee into scikit-learn:master Jul 16, 2018
@GaelVaroquaux
Copy link
Member Author

Merged!!!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants
0