[MRG + 2 -.5] Listed valid metrics for neighbors algorithms by vinayak-mehta · Pull Request #4525 · scikit-learn/scikit-learn
Closed
144 changes: 88 additions & 56 deletions doc/modules/neighbors.rst
@@ -252,6 +252,62 @@ the lower half of those faces.
multi-output regression using nearest neighbors.


Member: Is this now in the right place? There are no deletions...

Contributor Author: Fixed. Yes, as per #4521 (comment), it's in the right place, above the details on the trees and neighbors implementations.

Nearest Centroid Classifier
===========================

The :class:`NearestCentroid` classifier is a simple algorithm that represents
each class by the centroid of its members. In effect, this makes it
similar to the label updating phase of the :class:`sklearn.KMeans` algorithm.
It also has no parameters to choose, making it a good baseline classifier. It
does, however, suffer on non-convex classes, as well as when classes have
drastically different variances, as equal variance in all dimensions is
assumed. See Linear Discriminant Analysis (:class:`sklearn.lda.LDA`) and
Quadratic Discriminant Analysis (:class:`sklearn.qda.QDA`) for more complex
methods that do not make this assumption. Usage of the default
:class:`NearestCentroid` is simple:

>>> from sklearn.neighbors.nearest_centroid import NearestCentroid
>>> import numpy as np
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> y = np.array([1, 1, 1, 2, 2, 2])
>>> clf = NearestCentroid()
>>> clf.fit(X, y)
NearestCentroid(metric='euclidean', shrink_threshold=None)
>>> print(clf.predict([[-0.8, -1]]))
[1]


Nearest Shrunken Centroid
-------------------------

The :class:`NearestCentroid` classifier has a ``shrink_threshold`` parameter,
which implements the nearest shrunken centroid classifier. In effect, the value
of each feature for each centroid is divided by the within-class variance of
that feature. The feature values are then reduced by ``shrink_threshold``. Most
notably, if a particular feature value crosses zero, it is set
to zero. In effect, this removes the feature from affecting the classification.
This is useful, for example, for removing noisy features.
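
Written out (a sketch; this notation is ours, not from the scikit-learn docs), the
shrinkage applied to each variance-scaled centroid component :math:`d` is a
soft-thresholding step:

.. math:: d' = \mathrm{sign}(d)\,\max(|d| - \Delta, 0)

where :math:`\Delta` is ``shrink_threshold``, so any component whose magnitude
falls below :math:`\Delta` is set exactly to zero.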

In the example below, using a small shrink threshold increases the accuracy of
the model from 0.81 to 0.82.
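
A minimal sketch of turning shrinkage on, reusing the toy data from the usage
snippet above (the value ``0.2`` is an arbitrary illustrative choice):

>>> from sklearn.neighbors.nearest_centroid import NearestCentroid
>>> import numpy as np
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> y = np.array([1, 1, 1, 2, 2, 2])
>>> clf = NearestCentroid(shrink_threshold=0.2)
>>> clf.fit(X, y)
NearestCentroid(metric='euclidean', shrink_threshold=0.2)
>>> print(clf.predict([[-0.8, -1]]))
[1]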

.. |nearest_centroid_1| image:: ../auto_examples/neighbors/images/plot_nearest_centroid_001.png
:target: ../auto_examples/neighbors/plot_classification.html
:scale: 50

.. |nearest_centroid_2| image:: ../auto_examples/neighbors/images/plot_nearest_centroid_002.png
:target: ../auto_examples/neighbors/plot_classification.html
:scale: 50

.. centered:: |nearest_centroid_1| |nearest_centroid_2|

.. topic:: Examples:

* :ref:`example_neighbors_plot_nearest_centroid.py`: an example of
classification using nearest centroid with different shrink thresholds.

.. _approximate_nearest_neighbors:

Nearest Neighbor Algorithms
===========================

@@ -427,6 +483,38 @@ and the ``'effective_metric_'`` is in the ``'VALID_METRICS'`` list of
same order as the number of training points, and that ``leaf_size`` is
close to its default value of ``30``.
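
As a quick illustration (a sketch, not part of this diff), the metric actually in
use is exposed after fitting through the ``effective_metric_`` attribute mentioned
in the context above:

>>> from sklearn.neighbors import NearestNeighbors
>>> import numpy as np
>>> X = np.array([[0., 0.], [1., 1.], [2., 2.]])
>>> nn = NearestNeighbors(metric='manhattan').fit(X)
>>> nn.effective_metric_
'manhattan'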

Valid Metrics for Nearest Neighbor Algorithms
---------------------------------------------

Member: Can you also add a doctest that shows the ``valid_metrics`` class attributes? This shows the attributes to the users, but also makes sure that a doctest fails in this place if someone modifies the valid metrics.

Contributor Author: I am not sure how to do that :\ (doctest showing ``valid_metrics`` class attributes) and searching didn't help. Can you point me to an example?

========================  =================================================================
Algorithm                 Valid Metrics
========================  =================================================================
**Brute Force**           'euclidean', 'l2', 'l1', 'manhattan', 'cityblock',
                          'braycurtis', 'canberra', 'chebyshev', 'correlation',
                          'cosine', 'dice', 'hamming', 'jaccard', 'kulsinski',
                          'mahalanobis', 'matching', 'minkowski', 'rogerstanimoto',
                          'russellrao', 'seuclidean', 'sokalmichener',
                          'sokalsneath', 'sqeuclidean', 'yule', 'wminkowski'

**K-D Tree**              'chebyshev', 'euclidean', 'cityblock', 'manhattan', 'infinity',
                          'minkowski', 'p', 'l2', 'l1'

**Ball Tree**             'chebyshev', 'sokalmichener', 'canberra', 'haversine',
                          'rogerstanimoto', 'matching', 'dice', 'euclidean', 'braycurtis',
                          'russellrao', 'cityblock', 'manhattan', 'infinity', 'jaccard',
                          'seuclidean', 'sokalsneath', 'kulsinski', 'minkowski',
                          'mahalanobis', 'p', 'l2', 'hamming', 'l1', 'wminkowski', 'pyfunc'
========================  =================================================================

Member (on the **K-D Tree** row): We should note that modules/generated/sklearn.neighbors.DistanceMetric.html lists the measures for K-D and Ball Trees, along with their arguments. Or else we should just be pointing to that reference here rather than listing measures, so as to avoid duplication.
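
For instance (a sketch, not part of this diff), any metric listed for **Ball Tree**
above can be passed directly when building the tree:

>>> from sklearn.neighbors import BallTree
>>> import numpy as np
>>> rng = np.random.RandomState(0)
>>> X = rng.random_sample((10, 2))
>>> tree = BallTree(X, metric='manhattan')
>>> dist, ind = tree.query(X[:1], k=3)
>>> ind.shape
(1, 3)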

Contributor Author: You mean like it has been done in the Usage examples under that table in model_selection? So I should do something like this here?

>>> from sklearn.neighbors import KDTree
>>> print(KDTree.valid_metrics)

Sorry for being a noob.

Member: Yes, and then list the output.

A list of valid metrics for any of the above algorithms can be obtained by using their
``valid_metrics`` attribute. For example, valid metrics for ``KDTree`` can be generated by:

>>> from sklearn.neighbors import KDTree
>>> import numpy as np
>>> print(np.sort(KDTree.valid_metrics))  # doctest: +ELLIPSIS
['chebyshev' 'cityblock' 'euclidean' 'infinity' 'l1' 'l2' 'manhattan'
 'minkowski' 'p']

Contributor Author: @raghavrv I don't think we need numpy for this doctest; a normal ``sorted`` would work just fine?

Member: Indeed!!
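
A similar check works for the ball tree (a sketch, assuming ``BallTree`` exposes the
same class attribute, as the prose above suggests it does for all the algorithms):

>>> from sklearn.neighbors import BallTree
>>> print(sorted(BallTree.valid_metrics))  # doctest: +ELLIPSIS
['braycurtis', 'canberra', 'chebyshev', ...]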

Effect of ``leaf_size``
-----------------------
As noted above, for small sample sizes a brute force search can be more
@@ -458,62 +546,6 @@ leaf nodes. The level of this switch can be specified with the parameter

.. _nearest_centroid_classifier:

Nearest Centroid Classifier
===========================

The :class:`NearestCentroid` classifier is a simple algorithm that represents
each class by the centroid of its members. In effect, this makes it
similar to the label updating phase of the :class:`sklearn.KMeans` algorithm.
It also has no parameters to choose, making it a good baseline classifier. It
does, however, suffer on non-convex classes, as well as when classes have
drastically different variances, as equal variance in all dimensions is
assumed. See Linear Discriminant Analysis (:class:`sklearn.discriminant_analysis.LinearDiscriminantAnalysis`)
and Quadratic Discriminant Analysis (:class:`sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis`)
for more complex methods that do not make this assumption. Usage of the default
:class:`NearestCentroid` is simple:

>>> from sklearn.neighbors.nearest_centroid import NearestCentroid
>>> import numpy as np
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> y = np.array([1, 1, 1, 2, 2, 2])
>>> clf = NearestCentroid()
>>> clf.fit(X, y)
NearestCentroid(metric='euclidean', shrink_threshold=None)
>>> print(clf.predict([[-0.8, -1]]))
[1]


Nearest Shrunken Centroid
-------------------------

The :class:`NearestCentroid` classifier has a ``shrink_threshold`` parameter,
which implements the nearest shrunken centroid classifier. In effect, the value
of each feature for each centroid is divided by the within-class variance of
that feature. The feature values are then reduced by ``shrink_threshold``. Most
notably, if a particular feature value crosses zero, it is set
to zero. In effect, this removes the feature from affecting the classification.
This is useful, for example, for removing noisy features.

In the example below, using a small shrink threshold increases the accuracy of
the model from 0.81 to 0.82.

.. |nearest_centroid_1| image:: ../auto_examples/neighbors/images/sphx_glr_plot_nearest_centroid_001.png
:target: ../auto_examples/neighbors/plot_nearest_centroid.html
:scale: 50

.. |nearest_centroid_2| image:: ../auto_examples/neighbors/images/sphx_glr_plot_nearest_centroid_002.png
:target: ../auto_examples/neighbors/plot_nearest_centroid.html
:scale: 50

.. centered:: |nearest_centroid_1| |nearest_centroid_2|

.. topic:: Examples:

* :ref:`sphx_glr_auto_examples_neighbors_plot_nearest_centroid.py`: an example of
classification using nearest centroid with different shrink thresholds.

.. _approximate_nearest_neighbors:

Approximate Nearest Neighbors
=============================
