Incorrect decision_function in linear_model? #19139
Comments
It is true that the signed distance should be defined as the dot product between the normalized weight vector and the data point; it is defined this way in ESL, p. 130. However, I am not sure whether we want to modify our code or instead modify the documented definition to reflect this particularity.
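Concretely, the ESL-style definition would be something like the following sketch (w and b here stand for one class's coefficient vector and intercept; signed_distance is not an existing sklearn helper):

```python
import numpy as np

def signed_distance(w, b, x):
    # ESL p.130-style signed distance of point x to the hyperplane
    # {x : w @ x + b == 0}: the raw score divided by the norm of w.
    return (np.dot(w, x) + b) / np.linalg.norm(w)
```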
ping @agramfort
Good catch. Indeed, the coef_ attribute is meant to be applied to raw / unscaled / unnormalized features; it's done this way to save computation time. +1 to modify the docstring.
Thanks for looking at this. My initial thought was also to just change the docstring, but the problem may be a bit more serious: if you look a few lines down at how the decision_function scores are used in the 'predict' function you'll see that in the multiclass case:
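Roughly, the multiclass branch does something like this (a sketch, not the verbatim scikit-learn source):

```python
import numpy as np

# Sketch of the multiclass branch (not the verbatim scikit-learn source):
# each sample is assigned the class with the largest decision_function score.
def multiclass_predict(clf, X):
    scores = clf.decision_function(X)    # shape (n_samples, n_classes)
    indices = np.argmax(scores, axis=1)  # index of the highest-scoring class
    return clf.classes_[indices]
```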
This should be assigning each sample the class of the hyperplane it is farthest above. The code above will do this correctly only if the scores actually represent signed distances, which they currently do not. Note that in binary classification only the sign of the distances is used, which is invariant to positive scaling of the scores, so the problem wouldn't show up there.
@agramfort Just to clarify, the problem is not with the features (the columns of the samples matrix X) that coef_ is applied to, e.g. whether they're normalized or not, but with how coef_ and intercept_ are used to compute the scores from the features for each sample. For example, given the way 'predict' uses 'decision_function', I can change the class assigned to a sample by scaling coef_ and the corresponding elements of intercept_, even though such a scaling doesn't change the corresponding hyperplanes.
@stootoon You suggest a bug, but I do not agree. If you fit a multiclass model, e.g. with multinomial logistic regression, you have scaled the features, and then the same scale is applied to all hyperplanes. It would not make sense to just scale one. Convinced?
@agramfort I don't think my example was clear. Here's a toy example of what I meant:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Generate 50 points on the circle in 5 equally-spaced clumps from -pi/2 to pi/2
n_samples_per_class = 10
y = np.outer([-2, -1, 0, 1, 2], np.ones(n_samples_per_class)).flatten().astype(int)
theta = y * np.pi/4 + np.random.randn(*y.shape)*5e-2
X = np.array([np.cos(theta), np.sin(theta)]).T
# Train a multiclass classifier
clf = LinearSVC(fit_intercept=False).fit(X,y)
y_pred_orig = clf.predict(X)
# Predicted label of first sample BEFORE coef scaling
print(f"{y_pred_orig[0]}.") # Prints -2
# KEY STEP: Scale down the coefficients of the first class.
clf.coef_[0]/=10
# Since we learned a model without intercept, this doesn't change
# the corresponding hyperplane, so the assigned classes shouldn't change.
# But they do:
y_pred_pert = clf.predict(X)
print(f"{y_pred_pert[0]}.") # Prints -1 Here's a picture of what's happening. The points are the datapoints colored by their predicted class, and the lines are the coefficient vectors for each class, before (left) and after the coefficient perturbation: You'll see that after the black vector is scaled down (right panel), the bottom cluster of points is assigned to the red class, even though the hyperplane defined by the black vector hasn't changed. So my point is that if an algorithm assigns classes by computing distances to hyperplanes (as indicated by the docstring) then its class assignments shouldn't change when the coefficient vectors defining the normals to those hyperplanes are (positively) scaled, because the corresponding hyperplanes haven't changed. But as the toy example above shows, the class assignments do change. I think the solution to my confusion is that (unlike what the docstring implies) the algorithms in linear_model don't actually assign classes by computing distances to hyperplanes, but rather by comparing the outputs of decision_function as originally written, and these are not actually distances to hyperplanes unless the coefficients happen to be unit norm. So, to avoid confusing anyone else, the docstring for decision_function should be changed to remove mention of distances to hyperplanes. Do you agree? |
If you change a single coefficient, it is quite normal that the assignment will change. If you divide all coefficients by the same factor, the distances will change but not the assignment. There is no bug in this regard.
@glemaitre Agreed, but my point is that the docstring says what's being computed is distances to hyperplanes, and scaling coefficients like I did in the toy example doesn't change distances to hyperplanes. Therefore, if the docstring is to be believed, the labels shouldn't change, but they do. The reason is that what decision_function actually computes is not, in general, a distance to a hyperplane. This, I expect, is by design, so the docstring should be updated to remove mention of distances to hyperplanes. So, in summary: not a bug, but an incorrect docstring.
No, it does. Your scaling changes the vector normal to the hyperplane. I would refer to my above reference in ESL, p. 130, eq. 4.40.
@glemaitre It doesn't. Notice that in the toy example I'm not fitting an intercept, so in the notation of the equation you cite, beta_0 = 0. Then applying a positive scaling to beta, which is what I'm doing, doesn't change the distance, because the denominator gets scaled the same way. That is, if we scale by k > 0, then distance = beta'x / |beta| = (k beta)'x / |k beta|.
My bad, I did not see that we have a model with coef_.shape = (n_classes, n_features). You can scale the columns of coef_, but scaling one row only, or all rows but with a different scale, will break predict. Yes, it's a docstring fix that is needed. PR welcome.
@glemaitre Scaling the coefficients for a single class would leave the assignments unchanged only if, as the docstring claims, decision_function were computing distances to hyperplanes. But as we've discussed, that's not what decision_function is doing.

@agramfort I've just submitted a PR that fixes the docstring and leaves the rest of the code untouched.

Finally, I just wanted to say thanks to you both and the rest of the team for developing sklearn; it's been a real help in my work, as I'm sure it has been for the rest of the community.
Hi folks,
The documentation of decision_function in LinearClassifierMixin says that it 'Predicts confidence scores for samples', where 'the confidence score for a sample is the signed distance of that sample to the hyperplane'. A few lines later, the scores are computed as the scalar product of each sample with each of the coefficient vectors, plus the intercepts. The problem is that unless the coefficient vectors are normalized to have unit norm, the resulting scores will not actually be distances to the corresponding hyperplanes.
For example, in a setting where the intercept is zero, if I double the coefficient vectors, the corresponding hyperplanes remain unchanged, so the distances of each sample to each hyperplane should also remain unchanged. However, the scores as defined above would change; in fact they would double.
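A minimal sketch of that point (any LinearClassifierMixin subclass would do; LogisticRegression is just an example):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# With a zero intercept, doubling coef_ leaves the hyperplane unchanged,
# yet the decision_function "distances" double.
X = np.random.RandomState(0).randn(20, 3)
y = np.array([0] * 10 + [1] * 10)

clf = LogisticRegression(fit_intercept=False).fit(X, y)
scores_before = clf.decision_function(X)

clf.coef_ *= 2
scores_after = clf.decision_function(X)

print(np.allclose(scores_after, 2 * scores_before))  # True: scores doubled
# A true signed distance would be invariant to this scaling:
# (X @ clf.coef_.T).ravel() / np.linalg.norm(clf.coef_)
```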
Am I mistaken or is there a reason why the coefficients are unnormalized?
Thanks,
Sina