KNeighborsClassifier: allow p value for Minkowski calculation to be less than 1.0 (and greater than 0) #22811


Closed
raymondj-pace opened this issue Mar 13, 2022 · 28 comments · Fixed by #24750


@raymondj-pace

Describe the workflow you want to enable

I would like to be able to use the KNeighborsClassifier with something like:

neigh = KNeighborsClassifier(n_neighbors=2, p=0.1)

The error you get is:

File ~/opt/anaconda3/envs/tensorflow-3/lib/python3.10/site-packages/sklearn/neighbors/_base.py:395, in NeighborsBase._check_algorithm_metric(self)
    392     effective_p = self.p
    394 if self.metric in ["wminkowski", "minkowski"] and effective_p < 1:
--> 395     raise ValueError("p must be greater or equal to one for minkowski metric")

ValueError: p must be greater or equal to one for minkowski metric

Change the above check so that only non-positive p values are rejected:

if self.metric in ["wminkowski", "minkowski"] and effective_p <= 0:

There is nothing wrong with using a p value > 0 and < 1.

Describe your proposed solution

Don't throw an exception if p is < 1.0.

Calculating the Minkowski distance is perfectly valid for 0 < p < 1, e.g. p = 0.1, 0.2, 0.3, 0.4, etc.

The only requirement is to raise the absolute difference in each dimension to the p-th power, sum over all dimensions, and then take the p-th root of that sum: distance = distance_sum**(1/p).
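
In symbols, this is the standard Minkowski formula (restated here for reference):

d_p(x, y) = (sum_i |x_i - y_i|**p)**(1/p)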

There are many cases where it is desirable to compute the Minkowski distance with 0 < p < 1.

Describe alternatives you've considered, if relevant

Write my own kNN classifier with my own minkowski distance calculator:

def minkowski_distance(row1, row2, p=3):
    # Raise each absolute coordinate difference to the p-th power,
    # sum over all dimensions, then take the p-th root of the sum.
    distance = 0.0
    for i in range(len(row1)):
        distance += abs(row1[i] - row2[i])**p
    return distance**(1/p)
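
For example, a quick check with values worked out by hand:

print(minkowski_distance([0, 0], [1, 1], p=0.5))  # (1 + 1)**2 = 4.0
print(minkowski_distance([0, 0], [1, 1], p=2))    # sqrt(2) ≈ 1.4142...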

Additional context

No response

@raymondj-pace raymondj-pace added Needs Triage and New Feature labels Mar 13, 2022
@jeremiedbb
Member

The issue is that Minkowski with p < 1 is not a metric (no triangle inequality). The question is then: do we want to support similarity measures that are not metrics?

@raymondj-pace
Author

Yes, I agree it's not a metric. My advisor, who spent his entire life studying distance (Euclidean and non-Euclidean), would say that Minkowski with p < 1 has much value. I have a problem where I need to iterate over different values of p for Minkowski, but there is a way around this issue (for p < 1): I can use a callable for my own Minkowski method and pass a parameter like this.

Here "minkowski_distance" is my method...

mink_p = 0.5
step = 0.1

while mink_p <= 2.5:
    
    neigh = KNeighborsClassifier(n_neighbors=1, metric=minkowski_distance, metric_params={"p": mink_p})
    neigh.fit(X_ref_normalized, y_ref)

    ...

    mink_p += step
    mink_p = round(mink_p, 1)
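
For completeness, here is a minimal self-contained version of this workaround (the toy data names are hypothetical). scikit-learn passes each metric_params entry to the callable as a keyword argument, and algorithm="brute" is set explicitly since only brute force is guaranteed correct for a non-metric:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def minkowski_distance(row1, row2, p=2):
    # Callable metrics receive two 1-D arrays and must return one value.
    return np.sum(np.abs(row1 - row2) ** p) ** (1 / p)

X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
y = np.array([0, 1, 1])

neigh = KNeighborsClassifier(n_neighbors=1, algorithm="brute",
                             metric=minkowski_distance, metric_params={"p": 0.5})
neigh.fit(X, y)
print(neigh.predict([[0.9, 0.8]]))  # [1]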

This will solve my original problem, and for anyone else that wants 0 < p < 1 for Minkowski.

I have another very closely related issue for KNeighborsClassifier - if it needs a new issue I can open one.

What about adding a "weights_params" option? Just like in the above code there are "metric" and "metric_params", but for weights there is only weights (for a callable) - no weights_params.

I have a case where I need to iterate over the value of p for the weights method, and I have to do something like this:

for w in [weights_p_5, weights_p1, weights_p1_5, weights_p2, weights_p2_5]:
    
    neigh = KNeighborsClassifier(n_neighbors=k, weights=w)
    neigh.fit(X_ref_normalized, y_ref)
    ...

In this example weights_p_5 is a method that computes the weight as 1/(dist**0.5), and weights_p1_5 computes weights as 1/(dist**1.5).

How about adding a weights_params option, just like there is for metric (both a callable and a dictionary of parameters to pass)? It could look like this:

weight_p = 0.5
step = 0.5

while weight_p <= 2.5:
    
    neigh = KNeighborsClassifier(n_neighbors=1, weights=my_weight_method, weights_params={"weight_param" : weight_p})
    neigh.fit(X_ref_normalized, y_ref)

    ...

    weight_p += step
    weight_p = round(weight_p, 1)

Thanks -Ray

@eschibli
Contributor

The issue is that Minkowski with p < 1 is not a metric (no triangle inequality). The question is then: do we want to support similarity measures that are not metrics?

That seems like a typical example of where it's best to allow that behaviour but raise a warning.

@adrinjalali
Member

@raymondj-pace would you mind sharing some references for us to see how this would be useful and how it would be sensible to apply KNN on a non-metric function like this?

@adrinjalali adrinjalali added Needs Info and removed Needs Triage labels Mar 25, 2022
@raymondj-pace
Author

One is here: Combining Minkowski and Chebyshev (arXiv): https://arxiv.org/pdf/2112.12549

Scroll down to page 69 (p = 0.5, 0.75).

The same diagrams for p < 1 are also on wikipedia as well: https://en.wikipedia.org/wiki/Minkowski_distance

@adrinjalali
Member

@scikit-learn/core-devs are we happy with the literature on p<1 here?

@jjerphan
Member
jjerphan commented Apr 11, 2022

I think I would rather not support it as a DistanceMetric since it is not a distance metric, for the reason given by @jeremiedbb above.

Note that you might still define your own distance using the pyfunc argument to support those cases potentially, see the documentation.
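
For instance, a sketch of that pyfunc approach (assuming scikit-learn >= 1.0, where DistanceMetric lives in sklearn.metrics):

import numpy as np
from sklearn.metrics import DistanceMetric

def mink(x, y, p=0.5):
    # Minkowski-style dissimilarity; not a true metric for p < 1.
    return np.sum(np.abs(x - y) ** p) ** (1 / p)

dist = DistanceMetric.get_metric("pyfunc", func=mink)
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
print(dist.pairwise(X))  # d((0,0),(1,1)) comes out as 4.0 for p=0.5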

@lorentzenchr
Member

This functionality is available via scipy.spatial.distance.cdist(X, Y, "minkowski", p=0.1), and KNeighborsClassifier accepts a callable as metric.
I'm fine with either the current behaviour (not allowing p<1) or at least throwing a warning.
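
Concretely, a sketch of that route using precomputed distances (hypothetical X_train/X_test/y_train names; SciPy's cdist accepts p < 1 as noted above):

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
y_train = np.array([0, 1, 1])
X_test = np.array([[0.9, 0.8]])

# Precompute the (pseudo-)distances and let the classifier consume them.
neigh = KNeighborsClassifier(n_neighbors=1, metric="precomputed")
neigh.fit(cdist(X_train, X_train, "minkowski", p=0.1), y_train)
print(neigh.predict(cdist(X_test, X_train, "minkowski", p=0.1)))  # [1]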

@adrinjalali
Member

So should we document this one instead of implementing it?

@ogrisel
Member
ogrisel commented Sep 5, 2022

I am fine with allowing p<1.0 in KNNClassifier without a warning when using the brute-force method, as it's perfectly fine to compute the data points with the lowest pseudo-metric values.

But we should still raise an exception when using ball-tree, because then the triangle inequality is required to return correct results.

The DistanceMetric subclasses could expose an additional attribute, true_metric or something (with a boolean value), to make this explicit.

@ogrisel
Member
ogrisel commented Sep 5, 2022

I am fine with allowing p<1.0 in KNNClassifier without a warning when using the brute-force method, as it's perfectly fine to compute the data points with the lowest pseudo-metric values.

Actually a pseudo metric would still satisfy the triangular inequality. It's just that d(x, y) = 0 does not imply the identity x = y.

But the KNNClassifier results would still be correct if the neighbors are computed with the brute-force method. I would just document in the docstring that it's not a metric for p<1.0.

@jeremiedbb
Member

Actually a pseudo metric would still satisfy the triangular inequality.

Actually Minkowski with p<1 does not satisfy the triangle inequality (take the points (0,0), (1,1) and (0,1); example from Wikipedia)
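
Spelled out with p = 0.5:

def mink(u, v, p=0.5):
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1 / p)

print(mink((0, 0), (1, 1)))                         # 4.0
print(mink((0, 0), (0, 1)) + mink((0, 1), (1, 1)))  # 2.0 < 4.0, so the inequality fails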

@lorentzenchr
Member

What's our conclusion?

  1. Do not allow p<1 and raise an error (current behaviour).
    Action: Improve error message and close this issue.
  2. Allow it and raise a warning.
    Action: PR to allow p<1 with warnings and tests.

I'll vote for 1. as I haven't seen a convincing use case to allow it, and on top of that, it is available via passing scipy.spatial.distance.cdist(X, Y, "minkowski", p=0.1) to KNeighborsClassifier.

@ogrisel
Member
ogrisel commented Sep 7, 2022

Actually Minkowski with p<1 does not satisfy the triangle inequality (take the points (0,0), (1,1) and (0,1); example from Wikipedia)

Yes, I agree; I was correcting my previous comment, where I made bad use of the word "pseudometric".

Still, brute-force knn is well defined for p<1, so I don't see why we should block it. But I agree that we should prevent running the ball-tree (and even more the kd-tree) algorithm, which relies on the metric/metric_kwargs parameters specifying a true metric in order to return correct results.

@ogrisel
Member
ogrisel commented Sep 7, 2022

I'll vote for 1. as I haven't seen a convincing use case to allow it, and on top of that, it is available via passing scipy.spatial.distance.cdist(X, Y, "minkowski", p=0.1) to KNeighborsClassifier.

  1. is definitely simpler to implement and maintain, but it somewhat artificially limits the potential use cases for this class.

scipy.spatial.distance.cdist(X, Y, "minkowski", p=0.1) is a potential workaround but would not benefit from the optimized / chunked / parallel Cython implementation of the pairwise distance + reduction computation.

@jeremiedbb
Member

I would allow it for brute force and raise for KD/ball-tree, but no strong opinion.

@lorentzenchr
Member

@ogrisel What is your favorite option for which you would vote?

@ogrisel
Member
ogrisel commented Sep 7, 2022

I would allow it for brute force and raise for KD/ball-tree, but no strong opinion.

+1.

@lorentzenchr
Member

I could live with that. What do the others think? @raymondj-pace, @jjerphan, @adrinjalali, @eschibli?

@jjerphan
Member
jjerphan commented Sep 7, 2022

Works for me. 👍

@adrinjalali
Member

Sounds like a good resolution to me.

@RudreshVeerkhare
Contributor

Hi, if this still needs to be implemented, can I work on it? @adrinjalali @jjerphan @lorentzenchr @ogrisel

@jjerphan
Member

Hi @RudreshVeerkhare. Yes, this needs to be implemented and you can work on it!

@RudreshVeerkhare
Contributor

/take

@RudreshVeerkhare
Contributor

Hi @RudreshVeerkhare. Yes, this needs to be implemented and you can work on it!

Thanks @jjerphan 😁, I've started with the setup.
Will communicate regarding further doubts or suggestions. 😄

@RudreshVeerkhare
Contributor

@jjerphan I'm done with the setup.
So before I start writing code, I just want to make sure that I'm on the right track.

Basically, I need to add functionality to allow values of p with 0 < p < 1 only when the algorithm is explicitly set to "brute", and also to raise a warning that Minkowski with p < 1 is not a valid metric.

Is that correct? 🤔

@jjerphan
Member

Yes, based on what was concluded in the discussion starting from #22811 (comment), you are correct, up to a clarification of what to raise.

I think @lorentzenchr meant to raise an error when users set algo="ball_tree" or algo="kd_tree" in this case (and I do think this is a better approach than raising a warning).

The logic which sets _fit_method also needs to be adapted for the case where users leave algo="auto"; it starts here:

if self._fit_method == "auto":
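
A rough standalone sketch of the kind of rule meant here (a hypothetical helper, much simplified; the real dispatch in sklearn/neighbors/_base.py has more branches):

def resolve_fit_method(algorithm, metric, p):
    # Minkowski with p < 1 is not a true metric, so tree-based
    # algorithms, which rely on the triangle inequality, must be refused.
    if metric == "minkowski" and p < 1:
        if algorithm in ("ball_tree", "kd_tree"):
            raise ValueError(
                f"algorithm='{algorithm}' requires a true metric; "
                "Minkowski with p < 1 violates the triangle inequality. "
                "Use algorithm='brute' instead."
            )
        return "brute"  # both 'auto' and 'brute' end up using brute force
    return algorithm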

@RudreshVeerkhare
Contributor

Thanks @jjerphan for the clarification. I'll start working on it and will create a WIP PR.
We can discuss further there...
