[MERGE] Merge changes from sklearn main by adam2392 · Pull Request #52 · neurodata/scikit-learn · GitHub

[MERGE] Merge changes from sklearn main #52


Merged: 71 commits, Aug 11, 2023
Showing changes from 1 commit
Commits (71)
7de59b2
FIX Correct the initiatialization of `precisions_cholesky_` from `pre…
mchikyt3 Jul 20, 2023
889b829
DOC Added information about space complexity to docs DBSCAN (#26783)
StefanieSenger Jul 20, 2023
b13f69c
DOC Directly import label class in example (#26876)
lucyleeow Jul 21, 2023
399131c
DOC update description of support_vectors_ (#26866)
rprkh Jul 21, 2023
c135445
DOC fix broken links (#26853)
DimitriPapadopoulos Jul 21, 2023
0486033
PERF Pass buffers via pointers in `PairwiseDistancesReductions` routi…
Micky774 Jul 21, 2023
75f3e47
FIX ravel prediction of `PLSRegression` when fitted on 1d `y` (#26602)
Charlie-XIAO Jul 24, 2023
ca51d77
DOC Add docstring DistanceMetric class (#26795)
greyisbetter Jul 24, 2023
d66a384
CI Add summary about failures and errors in most builds (#26847)
lesteve Jul 24, 2023
d991a19
MAINT make sure to test encoders in common tests (#26859)
glemaitre Jul 24, 2023
507095b
DOC Specify primal/dual formulation in LogisticRegression (#26294)
mlondschien Jul 25, 2023
07f6586
MNT SLEP6 move common metadata routing test objects (#26894)
adrinjalali Jul 25, 2023
59048f9
FIX Update pairwise distance function argument names (#26351)
Micky774 Jul 25, 2023
44d4cd4
FIX Allow 0<p<1 for Minkowski metric regardless of X's dtype (#26760)
Shreesha3112 Jul 26, 2023
c2f5782
DOC use the same estimators to demonstrate pipeline construction (#26…
noashin Jul 26, 2023
b6dd04e
DOC example on feature selection using negative `tol` values (#26205)
rprkh Jul 26, 2023
e54f678
MNT Improve robustness of sparse test in `HDBSCAN` (#26889)
Micky774 Jul 27, 2023
9e09e4d
MNT Fixed linting error in `plot_select_from_model_diabetes.py` (#26915)
Micky774 Jul 27, 2023
8f63882
DOC Improve `plot_target_encoder_cross_val.py` example (#26677)
lucyleeow Jul 27, 2023
b8d4f46
FIX fix validation of class_names argument for plot_tree (#26903)
2maz Jul 27, 2023
4094851
ENH Adds support for missing values in Random Forest (#26391)
thomasjpfan Jul 27, 2023
36c5073
MNT (SLEP6) remove other_params from provess_routing (#26909)
adrinjalali Jul 27, 2023
699690f
MAINT Parameters validation for sklearn.cluster.dbscan (#26920)
lpsilvestrin Jul 27, 2023
1090121
DOC Add missing cross validation image alt (#26261)
marekhanus Jul 28, 2023
2b0eef8
DOC Note missing value support as advantage of decision trees (#26928)
fabianegli Jul 28, 2023
dc9d0eb
DOC backticks around X and y in linear_model.rst (#26929)
LukasFolwarczny Jul 30, 2023
fa87f28
DOC update related packages (#26922)
lorentzenchr Jul 30, 2023
3dea102
FIX Disable set_output for label encoders (#26940)
thomasjpfan Jul 31, 2023
dcf0510
FIX Adds more informative error message for OHE (#26931)
thomasjpfan Jul 31, 2023
405a5a0
DOC Fixed typo, added missing comma in plot_forest_hist_grad_boosting…
Tialo Jul 31, 2023
ec41b3e
MNT make type checkers happy with set_{method}_request methods (#26911)
adrinjalali Aug 1, 2023
cd1d432
MNT Raise on set_score_request if SLEP006 is not enabled (#26856)
adrinjalali Aug 1, 2023
d498cda
MNT name it X_train in GradientBoosting (#26959)
lorentzenchr Aug 1, 2023
1bd831a
ENH Add `float32` implementations for `BallTree` and `KDTree` (#25914)
OmarManzoor Aug 1, 2023
fdd3941
FIX Fix inconsistent naming convention for algorithm selection of HDB…
Shreesha3112 Aug 1, 2023
5c4e9a0
ENH Improves memory usage and runtime for gradient boosting (#26957)
thomasjpfan Aug 1, 2023
a16d367
API Allow users to pass `DistanceMetric` objects to `metric` keyword …
Micky774 Aug 1, 2023
43dbec5
MNT Use `curl` instead of `wget` to avoid occasional `SSL` error on C…
Micky774 Aug 1, 2023
b06a7d2
FIX Pop unnecessary elements from `metric_kwargs` in `datasets_pair.p…
Micky774 Aug 2, 2023
b4fcce8
DOC Reduce whitespace above h1 tag (#26787)
thomasjpfan Aug 2, 2023
7c6e1e9
DOC Add random_state to all classifiers in plot_classifier_comparison…
TamaraAtanasoska Aug 2, 2023
672ed45
CI Cross compile wheel macos wheels on github actions (#26985)
thomasjpfan Aug 2, 2023
da09e96
CI Only test latest python version on CirrusCI (#26986)
thomasjpfan Aug 2, 2023
21e63ee
MAINT Remove flake8 mentions/ignore comments (#26988)
lucyleeow Aug 2, 2023
db91568
DOC Corrected changelog entry tag for PR 26765 (#26994)
Micky774 Aug 2, 2023
fde46d6
TST Improves testing for missing value support in random forest (#26939)
thomasjpfan Aug 2, 2023
9c96671
DOC Add 2 related projects for microcontroller export (#26984)
jonnor Aug 2, 2023
594475a
FIX (SLEP6) make Pipeline work with an estimator implementing __len__…
adrinjalali Aug 2, 2023
38a06e4
DOC improve the KNN classifier example (#26993)
glemaitre Aug 3, 2023
5f8d89e
DOC Fix miniforge link with typo in install.rst (#27019)
hiramatsuyuusuke Aug 7, 2023
aa36aac
MNT Fix good Conda versions for updating lockfile (#26908)
maresb Aug 7, 2023
3725ac1
FIX `param_distribution` param of `HalvingRandomSearchCV` accepts li…
StefanieSenger Aug 7, 2023
392c084
MNT Exported `WeightingStrategy` for `*_classmode` reductions (#27030)
Micky774 Aug 8, 2023
05133a5
CI Only run arm tests nightly (#26996)
thomasjpfan Aug 8, 2023
62b9e4a
ENH: Update numpy exceptions imports (#27013)
mtsokol Aug 8, 2023
34c4741
FIX missing_indices were calculated twice in OrdinalEncoder (#27017)
xuefeng-xu Aug 8, 2023
ed01199
MAINT DOC HGBT leave updated if loss is not smooth (#26254)
lorentzenchr Aug 8, 2023
e04b8e7
FIX user keyword missing in v1.4 change log (#27036)
xuefeng-xu Aug 8, 2023
687465f
FIX KNNImputer missing indicator column addition when add_indicator=T…
Shreesha3112 Aug 8, 2023
5ecfa8d
CLN Update var name in `TargetEncoder` to make consistent (#27033)
lucyleeow Aug 8, 2023
7d2da31
ENH Add themes for HTML display. Add dark theme (#26862)
9Y5 Aug 9, 2023
6fa514a
ENH add metadata routing to cross_val* (#26896)
adrinjalali Aug 9, 2023
8e867b3
MNT fix ruff type vs isinstance errors (#27039)
adrinjalali Aug 9, 2023
fcaf0ff
MNT Use assert_no_warnings from numpy.testing (#27031)
thomasjpfan Aug 9, 2023
438b919
FIX potentially redundant marker argument (#27043)
ArturoAmorQ Aug 9, 2023
1b0a51b
Add tests for train_test_split with Array API input (#26855)
betatim Aug 9, 2023
e4efd8b
FIX Fixes set_output with list input (#27044)
thomasjpfan Aug 10, 2023
94a0b4c
DOC Highlight differerence between SVC/R and LinearSVC/R (#26825)
StefanieSenger Aug 10, 2023
1a78993
ENH Gaussian mixture bypassing unnecessary initialization computing (…
jiawei-zhang-a Aug 10, 2023
acf60de
ENH Introduce dtype preservation semantics in `DistanceMetric` object…
Micky774 Aug 10, 2023
1e7a069
Add missing value support for random forests
adam2392 Aug 11, 2023
DOC Add docstring DistanceMetric class (scikit-learn#26795)
greyisbetter authored Jul 24, 2023
commit ca51d77cc28edd57fd22ac2b4a5edf47ce79e3ff
131 changes: 128 additions & 3 deletions sklearn/metrics/_dist_metrics.pyx.tp
@@ -65,6 +65,118 @@ def get_valid_metric_ids(L):
if (val.__name__ in L) or (val in L)]

cdef class DistanceMetric:
"""Uniform interface for fast distance metric functions.

The `DistanceMetric` class provides a convenient way to compute pairwise distances
between samples. It supports various distance metrics, such as Euclidean distance,
Manhattan distance, and more.

The `pairwise` method can be used to compute pairwise distances between samples in
the input arrays. It returns a distance matrix representing the distances between
all pairs of samples.

The :meth:`get_metric` method allows you to retrieve a specific metric using its
string identifier.

Examples
--------
>>> from sklearn.metrics import DistanceMetric
>>> dist = DistanceMetric.get_metric('euclidean')
>>> X = [[1, 2], [3, 4], [5, 6]]
>>> Y = [[7, 8], [9, 10]]
>>> dist.pairwise(X, Y)
array([[ 8.48..., 11.31...],
       [ 5.65...,  8.48...],
       [ 2.82...,  5.65...]])

Available Metrics

The following lists the string metric identifiers and the associated
distance metric classes:

**Metrics intended for real-valued vector spaces:**

==============  ====================  ========  ===============================
identifier      class name            args      distance function
--------------  --------------------  --------  -------------------------------
"euclidean"     EuclideanDistance     -         ``sqrt(sum((x - y)^2))``
"manhattan"     ManhattanDistance     -         ``sum(|x - y|)``
"chebyshev"     ChebyshevDistance     -         ``max(|x - y|)``
"minkowski"     MinkowskiDistance     p, w      ``sum(w * |x - y|^p)^(1/p)``
"seuclidean"    SEuclideanDistance    V         ``sqrt(sum((x - y)^2 / V))``
"mahalanobis"   MahalanobisDistance   V or VI   ``sqrt((x - y)' V^-1 (x - y))``
==============  ====================  ========  ===============================

**Metrics intended for two-dimensional vector spaces:** Note that the haversine
distance metric requires data in the form of [latitude, longitude] and both
inputs and outputs are in units of radians.

============  ==================  ===============================================================
identifier    class name          distance function
------------  ------------------  ---------------------------------------------------------------
"haversine"   HaversineDistance   ``2 arcsin(sqrt(sin^2(0.5*dx) + cos(x1)cos(x2)sin^2(0.5*dy)))``
============  ==================  ===============================================================


**Metrics intended for integer-valued vector spaces:** Though intended
for integer-valued vectors, these are also valid metrics in the case of
real-valued vectors.

=============  ====================  ========================================
identifier     class name            distance function
-------------  --------------------  ----------------------------------------
"hamming"      HammingDistance       ``N_unequal(x, y) / N_tot``
"canberra"     CanberraDistance      ``sum(|x - y| / (|x| + |y|))``
"braycurtis"   BrayCurtisDistance    ``sum(|x - y|) / (sum(|x|) + sum(|y|))``
=============  ====================  ========================================

**Metrics intended for boolean-valued vector spaces:** Any nonzero entry
is evaluated to "True". In the listings below, the following
abbreviations are used:

- N : number of dimensions
- NTT : number of dims in which both values are True
- NTF : number of dims in which the first value is True, second is False
- NFT : number of dims in which the first value is False, second is True
- NFF : number of dims in which both values are False
- NNEQ : number of non-equal dimensions, NNEQ = NTF + NFT
- NNZ : number of nonzero dimensions, NNZ = NTF + NFT + NTT

=================  =======================  ===============================
identifier         class name               distance function
-----------------  -----------------------  -------------------------------
"jaccard"          JaccardDistance          NNEQ / NNZ
"matching"         MatchingDistance         NNEQ / N
"dice"             DiceDistance             NNEQ / (NTT + NNZ)
"kulsinski"        KulsinskiDistance        (NNEQ + N - NTT) / (NNEQ + N)
"rogerstanimoto"   RogersTanimotoDistance   2 * NNEQ / (N + NNEQ)
"russellrao"       RussellRaoDistance       (N - NTT) / N
"sokalmichener"    SokalMichenerDistance    2 * NNEQ / (N + NNEQ)
"sokalsneath"      SokalSneathDistance      NNEQ / (NNEQ + 0.5 * NTT)
=================  =======================  ===============================

**User-defined distance:**

===========  ===============  =======
identifier   class name       args
-----------  ---------------  -------
"pyfunc"     PyFuncDistance   func
===========  ===============  =======

Here ``func`` is a function which takes two one-dimensional numpy
arrays, and returns a distance. Note that in order to be used within
the BallTree, the distance must be a true metric:
i.e. it must satisfy the following properties

1) Non-negativity: d(x, y) >= 0
2) Identity: d(x, y) = 0 if and only if x == y
3) Symmetry: d(x, y) = d(y, x)
4) Triangle Inequality: d(x, y) + d(y, z) >= d(x, z)

Because of the Python object overhead involved in calling the python
function, this will be fairly slow, but it will have the same
scaling as other distances.
"""
@classmethod
def get_metric(cls, metric, dtype=np.float64, **kwargs):
"""Get the given distance metric from the string identifier.
@@ -74,11 +186,24 @@ cdef class DistanceMetric:
Parameters
----------
metric : str or class name
-The distance metric to use
+The string identifier or class name of the desired distance metric.
+See the documentation of the `DistanceMetric` class for a list of
+available metrics.
+
dtype : {np.float32, np.float64}, default=np.float64
-The dtype of the data on which the metric will be applied
+The data type of the input on which the metric will be applied.
+This affects the precision of the computed distances.
+By default, it is set to `np.float64`.
+
**kwargs
-additional arguments will be passed to the requested metric
+Additional keyword arguments that will be passed to the requested metric.
+These arguments can be used to customize the behavior of the specific
+metric.
+
+Returns
+-------
+metric_obj : instance of the requested metric
+An instance of the requested distance metric class.
"""
if dtype == np.float32:
specialized_class = DistanceMetric32
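
As an illustration of the real-valued metric table in the `DistanceMetric` docstring, here is a minimal pure-Python sketch of the weighted Minkowski formula ``sum(w * |x - y|^p)^(1/p)``, applied to the `X` and `Y` arrays from the docstring example. The helper names `minkowski` and `pairwise` are hypothetical and independent of scikit-learn's Cython implementation; they only mirror the formulas as written.

```python
def minkowski(x, y, p=2.0, w=None):
    """Weighted Minkowski distance: sum(w * |x - y|^p)^(1/p)."""
    if w is None:
        w = [1.0] * len(x)  # assumption: unit weights when none are given
    total = sum(wi * abs(xi - yi) ** p for wi, xi, yi in zip(w, x, y))
    return total ** (1.0 / p)


def pairwise(X, Y, dist):
    """Return the len(X)-by-len(Y) matrix of distances dist(x, y)."""
    return [[dist(x, y) for y in Y] for x in X]


X = [[1, 2], [3, 4], [5, 6]]
Y = [[7, 8], [9, 10]]

# p=2 corresponds to the "euclidean" row, p=1 to the "manhattan" row.
D_euclidean = pairwise(X, Y, lambda x, y: minkowski(x, y, p=2))
D_manhattan = pairwise(X, Y, lambda x, y: minkowski(x, y, p=1))

for row in D_euclidean:
    print([round(v, 2) for v in row])
# [8.49, 11.31]
# [5.66, 8.49]
# [2.83, 5.66]
```

The result has one row per sample in `X` and one column per sample in `Y`, which is the shape the docstring describes for `pairwise`.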
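
The docstring requires a user-defined `func` to be a true metric before it is used within the BallTree. A rough way to sanity-check a candidate function against the four listed properties on a finite sample of points is sketched below. The name `first_violation` and the tolerance `tol` are hypothetical; passing this check on a sample does not prove the function is a metric, it can only expose counterexamples.

```python
import itertools
import math


def first_violation(dist, points, tol=1e-12):
    """Return the name of the first violated metric property, or None.

    `points` must contain distinct tuples; only these points are tested.
    """
    for x, y in itertools.product(points, repeat=2):
        d = dist(x, y)
        if d < -tol:
            return "non-negativity"
        # Identity: d(x, y) == 0 if and only if x == y.
        if (d <= tol) != (x == y):
            return "identity"
        if abs(d - dist(y, x)) > tol:
            return "symmetry"
    for x, y, z in itertools.product(points, repeat=3):
        if dist(x, y) + dist(y, z) < dist(x, z) - tol:
            return "triangle inequality"
    return None


def squared_euclidean(x, y):
    """Squared Euclidean distance, which is not a true metric."""
    return math.dist(x, y) ** 2


points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 1.0)]

print(first_violation(math.dist, points))          # None
print(first_violation(squared_euclidean, points))  # triangle inequality
```

The squared Euclidean distance fails on the collinear points (0, 0), (1, 0), (2, 0): the two short hops sum to 2, while the direct distance is 4.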