[MRG + 2 -.5] Listed valid metrics for neighbors algorithms #4525
@@ -252,6 +252,62 @@ the lower half of those faces.

multi-output regression using nearest neighbors.
Nearest Centroid Classifier
===========================

The :class:`NearestCentroid` classifier is a simple algorithm that represents
each class by the centroid of its members. In effect, this makes it
similar to the label updating phase of the :class:`sklearn.KMeans` algorithm.
It also has no parameters to choose, making it a good baseline classifier. It
does, however, suffer on non-convex classes, as well as when classes have
drastically different variances, as equal variance in all dimensions is
assumed. See Linear Discriminant Analysis (:class:`sklearn.lda.LDA`) and
Quadratic Discriminant Analysis (:class:`sklearn.qda.QDA`) for more complex
methods that do not make this assumption. Usage of the default
:class:`NearestCentroid` is simple:

>>> from sklearn.neighbors.nearest_centroid import NearestCentroid
>>> import numpy as np
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> y = np.array([1, 1, 1, 2, 2, 2])
>>> clf = NearestCentroid()
>>> clf.fit(X, y)
NearestCentroid(metric='euclidean', shrink_threshold=None)
>>> print(clf.predict([[-0.8, -1]]))
[1]

Nearest Shrunken Centroid
-------------------------

The :class:`NearestCentroid` classifier has a ``shrink_threshold`` parameter,
which implements the nearest shrunken centroid classifier. In effect, the value
of each feature for each centroid is divided by the within-class variance of
that feature. The feature values are then reduced by ``shrink_threshold``. Most
notably, if a particular feature value crosses zero, it is set
to zero. In effect, this removes the feature from affecting the classification.
This is useful, for example, for removing noisy features.

In the example below, using a small shrink threshold increases the accuracy of
the model from 0.81 to 0.82.
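The shrunken variant can be tried on the same toy data as the doctest above. A minimal sketch; the value ``shrink_threshold=0.1`` is purely illustrative, not tuned (in practice it would be chosen by cross-validation):

```python
import numpy as np
from sklearn.neighbors import NearestCentroid

# Same toy data as the doctest above.
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])

# shrink_threshold=0.1 is an arbitrary illustrative value: each
# centroid's per-feature deviations are soft-thresholded by it.
clf = NearestCentroid(shrink_threshold=0.1)
clf.fit(X, y)
print(clf.predict([[-0.8, -1]]))  # still class 1 on this well-separated data
```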

.. |nearest_centroid_1| image:: ../auto_examples/neighbors/images/plot_nearest_centroid_001.png
   :target: ../auto_examples/neighbors/plot_classification.html
   :scale: 50

.. |nearest_centroid_2| image:: ../auto_examples/neighbors/images/plot_nearest_centroid_002.png
   :target: ../auto_examples/neighbors/plot_classification.html
   :scale: 50

.. centered:: |nearest_centroid_1| |nearest_centroid_2|

.. topic:: Examples:

  * :ref:`example_neighbors_plot_nearest_centroid.py`: an example of
    classification using nearest centroid with different shrink thresholds.

.. _approximate_nearest_neighbors:

Nearest Neighbor Algorithms
===========================
@@ -427,6 +483,38 @@ and the ``'effective_metric_'`` is in the ``'VALID_METRICS'`` list of

same order as the number of training points, and that ``leaf_size`` is
close to its default value of ``30``.
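As a quick sanity check on the role of ``leaf_size``, the sketch below (with arbitrary random data, not from the docs) illustrates that it only affects tree construction and query cost, never which neighbors are returned (ties aside):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X = rng.rand(200, 3)

# leaf_size trades off construction/query speed and memory;
# the neighbors found are identical for every setting.
results = []
for leaf_size in (1, 30, 200):
    nn = NearestNeighbors(n_neighbors=5, algorithm='kd_tree',
                          leaf_size=leaf_size).fit(X)
    _, ind = nn.kneighbors(X[:5])
    results.append(ind)

print(all(np.array_equal(results[0], r) for r in results[1:]))  # True
```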

Valid Metrics for Nearest Neighbor Algorithms
---------------------------------------------
> Can you also add a doctest that shows the …

> This shows the attributes to the users, but also makes sure that a doctest fails in this place if someone modifies the valid metrics.

> I am not sure how to do that :\ (doctest showing …
======================== =================================================================
Algorithm                Valid Metrics
======================== =================================================================
**Brute Force**          'euclidean', 'l2', 'l1', 'manhattan', 'cityblock',
                         'braycurtis', 'canberra', 'chebyshev', 'correlation',
                         'cosine', 'dice', 'hamming', 'jaccard', 'kulsinski',
                         'mahalanobis', 'matching', 'minkowski', 'rogerstanimoto',
                         'russellrao', 'seuclidean', 'sokalmichener',
                         'sokalsneath', 'sqeuclidean', 'yule', 'wminkowski'

**K-D Tree**             'chebyshev', 'euclidean', 'cityblock', 'manhattan', 'infinity',
                         'minkowski', 'p', 'l2', 'l1'

**Ball Tree**            'chebyshev', 'sokalmichener', 'canberra', 'haversine',
                         'rogerstanimoto', 'matching', 'dice', 'euclidean', 'braycurtis',
                         'russellrao', 'cityblock', 'manhattan', 'infinity', 'jaccard',
                         'seuclidean', 'sokalsneath', 'kulsinski', 'minkowski',
                         'mahalanobis', 'p', 'l2', 'hamming', 'l1', 'wminkowski', 'pyfunc'
======================== =================================================================

> (attached to the **K-D Tree** row) We should note that …
> You mean like it has been done in the Usage examples under that table in model_selection? So I should do something like this here?
>
>     >>> from sklearn.neighbors import KDTree
>     >>> print(KDTree.valid_metrics)
>
> Sorry for being noob.

> Yes, and then list the output.

A list of valid metrics for any of the above algorithms can be obtained by using their
``valid_metrics`` attribute. For example, valid metrics for ``KDTree`` can be generated by:
>>> from sklearn.neighbors import KDTree
>>> import numpy as np
>>> print(np.sort(KDTree.valid_metrics))
['chebyshev' 'cityblock' 'euclidean' 'infinity' 'l1' 'l2' 'manhattan'
 'minkowski' 'p']

> @raghavrv I don't think we need …

> Indeed!!
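The table can also be read as a guide to which metric strings a given ``algorithm`` setting accepts. A minimal sketch (the toy data and the choice of ``'chebyshev'``, which appears in every column of the table, are illustrative):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[0., 0.], [0., 1.], [1., 0.], [2., 2.]])

# 'chebyshev' is valid for brute force, K-D Tree and Ball Tree alike,
# so any algorithm setting accepts it; 'haversine', by contrast, is
# listed only for the Ball Tree.
nn = NearestNeighbors(n_neighbors=1, algorithm='ball_tree',
                      metric='chebyshev').fit(X)
dist, ind = nn.kneighbors([[0.9, 0.1]])
print(ind[0][0])  # 2: the nearest point is [1, 0] at Chebyshev distance 0.1
```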

Effect of ``leaf_size``
-----------------------
As noted above, for small sample sizes a brute force search can be more
@@ -458,62 +546,6 @@ leaf nodes. The level of this switch can be specified with the parameter

.. _nearest_centroid_classifier:

Nearest Centroid Classifier
===========================

The :class:`NearestCentroid` classifier is a simple algorithm that represents
each class by the centroid of its members. In effect, this makes it
similar to the label updating phase of the :class:`sklearn.KMeans` algorithm.
It also has no parameters to choose, making it a good baseline classifier. It
does, however, suffer on non-convex classes, as well as when classes have
drastically different variances, as equal variance in all dimensions is
assumed. See Linear Discriminant Analysis (:class:`sklearn.discriminant_analysis.LinearDiscriminantAnalysis`)
and Quadratic Discriminant Analysis (:class:`sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis`)
for more complex methods that do not make this assumption. Usage of the default
:class:`NearestCentroid` is simple:

>>> from sklearn.neighbors.nearest_centroid import NearestCentroid
>>> import numpy as np
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> y = np.array([1, 1, 1, 2, 2, 2])
>>> clf = NearestCentroid()
>>> clf.fit(X, y)
NearestCentroid(metric='euclidean', shrink_threshold=None)
>>> print(clf.predict([[-0.8, -1]]))
[1]

Nearest Shrunken Centroid
-------------------------

The :class:`NearestCentroid` classifier has a ``shrink_threshold`` parameter,
which implements the nearest shrunken centroid classifier. In effect, the value
of each feature for each centroid is divided by the within-class variance of
that feature. The feature values are then reduced by ``shrink_threshold``. Most
notably, if a particular feature value crosses zero, it is set
to zero. In effect, this removes the feature from affecting the classification.
This is useful, for example, for removing noisy features.

In the example below, using a small shrink threshold increases the accuracy of
the model from 0.81 to 0.82.

.. |nearest_centroid_1| image:: ../auto_examples/neighbors/images/sphx_glr_plot_nearest_centroid_001.png
   :target: ../auto_examples/neighbors/plot_nearest_centroid.html
   :scale: 50

.. |nearest_centroid_2| image:: ../auto_examples/neighbors/images/sphx_glr_plot_nearest_centroid_002.png
   :target: ../auto_examples/neighbors/plot_nearest_centroid.html
   :scale: 50

.. centered:: |nearest_centroid_1| |nearest_centroid_2|

.. topic:: Examples:

  * :ref:`sphx_glr_auto_examples_neighbors_plot_nearest_centroid.py`: an example of
    classification using nearest centroid with different shrink thresholds.

.. _approximate_nearest_neighbors:

Approximate Nearest Neighbors
=============================
> Is this now in the right place? There are no deletions...

> Fixed. Yes, as per #4521 (comment) it's in the right place, above the details on the trees and neighbors implementations.