From 368dad065b8f6beed2c988e73ef1f3e40bed6bd8 Mon Sep 17 00:00:00 2001
From: Vinayak Mehta
Date: Mon, 6 Apr 2015 03:02:46 +0530
Subject: [PATCH 1/2] Listed valid metrics in neighbors.rst

Reordered algorithms
Added doctest
Sorted metrics
Changed example wording
---
 doc/modules/neighbors.rst | 88 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 88 insertions(+)

diff --git a/doc/modules/neighbors.rst b/doc/modules/neighbors.rst
index f9e01b12a92ad..3c8145157256a 100644
--- a/doc/modules/neighbors.rst
+++ b/doc/modules/neighbors.rst
@@ -252,6 +252,62 @@ the lower half of those faces.
   multi-output regression using nearest neighbors.
 
 
+Nearest Centroid Classifier
+===========================
+
+The :class:`NearestCentroid` classifier is a simple algorithm that represents
+each class by the centroid of its members. In effect, this makes it
+similar to the label updating phase of the :class:`sklearn.cluster.KMeans` algorithm.
+It also has no parameters to choose, making it a good baseline classifier. It
+does, however, suffer on non-convex classes, as well as when classes have
+drastically different variances, as equal variance in all dimensions is
+assumed. See Linear Discriminant Analysis (:class:`sklearn.lda.LDA`) and
+Quadratic Discriminant Analysis (:class:`sklearn.qda.QDA`) for more complex
+methods that do not make this assumption. Usage of the default
+:class:`NearestCentroid` is simple:
+
+    >>> from sklearn.neighbors.nearest_centroid import NearestCentroid
+    >>> import numpy as np
+    >>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
+    >>> y = np.array([1, 1, 1, 2, 2, 2])
+    >>> clf = NearestCentroid()
+    >>> clf.fit(X, y)
+    NearestCentroid(metric='euclidean', shrink_threshold=None)
+    >>> print(clf.predict([[-0.8, -1]]))
+    [1]
+
+
+Nearest Shrunken Centroid
+-------------------------
+
+The :class:`NearestCentroid` classifier has a ``shrink_threshold`` parameter,
+which implements the nearest shrunken centroid classifier. In effect, the value
+of each feature for each centroid is divided by the within-class variance of
+that feature. The feature values are then reduced by ``shrink_threshold``. Most
+notably, if shrinking causes a particular feature value to cross zero, it is
+set to zero, removing that feature's influence on the classification entirely.
+This is useful, for example, for removing noisy features.
+
+In the example below, using a small shrink threshold increases the accuracy of
+the model from 0.81 to 0.82.
+
+.. |nearest_centroid_1| image:: ../auto_examples/neighbors/images/plot_nearest_centroid_001.png
+   :target: ../auto_examples/neighbors/plot_nearest_centroid.html
+   :scale: 50
+
+.. |nearest_centroid_2| image:: ../auto_examples/neighbors/images/plot_nearest_centroid_002.png
+   :target: ../auto_examples/neighbors/plot_nearest_centroid.html
+   :scale: 50
+
+.. centered:: |nearest_centroid_1| |nearest_centroid_2|
+
+.. topic:: Examples:
+
+  * :ref:`example_neighbors_plot_nearest_centroid.py`: an example of
+    classification using nearest centroid with different shrink thresholds.
+
+.. _approximate_nearest_neighbors:
+
 Nearest Neighbor Algorithms
 ===========================
@@ -427,6 +483,38 @@ and the ``'effective_metric_'`` is in the ``'VALID_METRICS'`` list of
 same order as the number of training points, and that ``leaf_size`` is close
 to its default value of ``30``.
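As an aside for reviewers, the brute-force strategy discussed above amounts to computing all pairwise distances and taking the smallest ones per query. A minimal plain-NumPy sketch (the helper name is invented here, not scikit-learn API):

```python
import numpy as np

def brute_force_knn(X_train, X_query, k):
    # All pairwise Euclidean distances: shape (n_queries, n_train).
    diffs = X_query[:, np.newaxis, :] - X_train[np.newaxis, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Indices of the k smallest distances for each query row.
    return np.argsort(dists, axis=1)[:, :k]

X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]])
X_query = np.array([[0.1, 0.1]])
print(brute_force_knn(X_train, X_query, k=2))  # -> [[0 1]]
```

This O(n_queries * n_train) cost is exactly why the tree-based structures below exist, and why brute force nonetheless wins for small samples.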
+Valid Metrics for Nearest Neighbor Algorithms
+---------------------------------------------
+
+======================== =================================================================
+Algorithm                Valid Metrics
+======================== =================================================================
+**Brute Force**          'braycurtis', 'canberra', 'chebyshev', 'cityblock',
+                         'correlation', 'cosine', 'dice', 'euclidean', 'hamming',
+                         'jaccard', 'kulsinski', 'l1', 'l2', 'mahalanobis',
+                         'manhattan', 'matching', 'minkowski', 'rogerstanimoto',
+                         'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath',
+                         'sqeuclidean', 'wminkowski', 'yule'
+
+**K-D Tree**             'chebyshev', 'cityblock', 'euclidean', 'infinity', 'l1',
+                         'l2', 'manhattan', 'minkowski', 'p'
+
+**Ball Tree**            'braycurtis', 'canberra', 'chebyshev', 'cityblock', 'dice',
+                         'euclidean', 'hamming', 'haversine', 'infinity', 'jaccard',
+                         'kulsinski', 'l1', 'l2', 'mahalanobis', 'manhattan',
+                         'matching', 'minkowski', 'p', 'pyfunc', 'rogerstanimoto',
+                         'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath',
+                         'wminkowski'
+======================== =================================================================
+
+A list of valid metrics for any of the above algorithms can be obtained by
+using their ``valid_metrics`` attribute. For example, valid metrics for
+``KDTree`` can be listed with:
+
+    >>> from sklearn.neighbors import KDTree
+    >>> import numpy as np
+    >>> print(np.sort(KDTree.valid_metrics))  # doctest: +NORMALIZE_WHITESPACE
+    ['chebyshev' 'cityblock' 'euclidean' 'infinity' 'l1' 'l2' 'manhattan'
+     'minkowski' 'p']
+
 Effect of ``leaf_size``
 -----------------------
 As noted above, for small sample sizes a brute force search can be more

From 686a256f972c1d0fb516cef86ad02ee839867b46 Mon Sep 17 00:00:00 2001
From: Vinayak Mehta
Date: Sat, 12 Sep 2015 01:04:23 +0530
Subject: [PATCH 2/2] Reordered algorithms (after fixing merge conflicts)

---
 doc/modules/neighbors.rst | 56 ---------------------------------------
 1 file changed, 56 deletions(-)

diff --git a/doc/modules/neighbors.rst b/doc/modules/neighbors.rst
index 3c8145157256a..bbfc255d04706 100644
--- a/doc/modules/neighbors.rst
+++ b/doc/modules/neighbors.rst
@@ -546,62 +546,6 @@ leaf nodes. The level of this switch can be specified with the parameter
 
 .. _nearest_centroid_classifier:
 
-Nearest Centroid Classifier
-===========================
-
-The :class:`NearestCentroid` classifier is a simple algorithm that represents
-each class by the centroid of its members. In effect, this makes it
-similar to the label updating phase of the :class:`sklearn.KMeans` algorithm.
-It also has no parameters to choose, making it a good baseline classifier. It
-does, however, suffer on non-convex classes, as well as when classes have
-drastically different variances, as equal variance in all dimensions is
-assumed. See Linear Discriminant Analysis (:class:`sklearn.discriminant_analysis.LinearDiscriminantAnalysis`)
-and Quadratic Discriminant Analysis (:class:`sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis`)
-for more complex methods that do not make this assumption. 
Usage of the default -:class:`NearestCentroid` is simple: - - >>> from sklearn.neighbors.nearest_centroid import NearestCentroid - >>> import numpy as np - >>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]]) - >>> y = np.array([1, 1, 1, 2, 2, 2]) - >>> clf = NearestCentroid() - >>> clf.fit(X, y) - NearestCentroid(metric='euclidean', shrink_threshold=None) - >>> print(clf.predict([[-0.8, -1]])) - [1] - - -Nearest Shrunken Centroid -------------------------- - -The :class:`NearestCentroid` classifier has a ``shrink_threshold`` parameter, -which implements the nearest shrunken centroid classifier. In effect, the value -of each feature for each centroid is divided by the within-class variance of -that feature. The feature values are then reduced by ``shrink_threshold``. Most -notably, if a particular feature value crosses zero, it is set -to zero. In effect, this removes the feature from affecting the classification. -This is useful, for example, for removing noisy features. - -In the example below, using a small shrink threshold increases the accuracy of -the model from 0.81 to 0.82. - -.. |nearest_centroid_1| image:: ../auto_examples/neighbors/images/sphx_glr_plot_nearest_centroid_001.png - :target: ../auto_examples/neighbors/plot_nearest_centroid.html - :scale: 50 - -.. |nearest_centroid_2| image:: ../auto_examples/neighbors/images/sphx_glr_plot_nearest_centroid_002.png - :target: ../auto_examples/neighbors/plot_nearest_centroid.html - :scale: 50 - -.. centered:: |nearest_centroid_1| |nearest_centroid_2| - -.. topic:: Examples: - - * :ref:`sphx_glr_auto_examples_neighbors_plot_nearest_centroid.py`: an example of - classification using nearest centroid with different shrink thresholds. - -.. _approximate_nearest_neighbors: - Approximate Nearest Neighbors =============================
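For reviewers trying out the documented behaviour, the centroid rule this patch describes can be reproduced in a few lines of plain NumPy. This sketch (the helper name is invented here, not scikit-learn API) omits the ``shrink_threshold`` step and matches the doctest's prediction:

```python
import numpy as np

def nearest_centroid_predict(X, y, X_query):
    # Represent each class by the mean (centroid) of its members.
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    # Assign each query point the label of its closest centroid.
    d = np.linalg.norm(X_query[:, None, :] - centroids[None, :, :], axis=-1)
    return classes[np.argmin(d, axis=1)]

# Same toy data as the doctest in the patch.
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])
print(nearest_centroid_predict(X, y, np.array([[-0.8, -1.0]])))  # -> [1]
```

Shrinkage would additionally scale each centroid's offset from the overall mean by the within-class variance and soft-threshold it toward zero before the distance computation.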