Description
Describe the bug
SVC().fit(X, y, w) produces an inconsistent model when the targets y are multiclass and the sample_weight w zeroes out one of the classes.
- Dense X produces incorrect arrays for the support_, n_support_, and dual_coef_ attributes, some with the wrong dimensions.
- Sparse X raises a ValueError when the sparse dual_coef_ is constructed from incompatible arguments.
A warning is emitted (e.g., "class label 0 specified in weight is not found"), but it does not indicate that the arrays on the trained SVC object are incorrect.
This seems to be a case that was not covered by the tests in PR #14286.
Workaround
Replace zero (or negative) weights with a very small positive value such as 1e-16.
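A minimal sketch of that workaround, using the same toy data as the reproduction. The helper name clip_nonpositive_weights and its eps parameter are hypothetical and not part of scikit-learn; the sketch only assumes that a tiny positive weight keeps the sample in the problem so every class is still seen by the solver.

```python
import numpy as np
from sklearn.svm import SVC


def clip_nonpositive_weights(sample_weight, eps=1e-16):
    """Return a copy of the weights with zero or negative entries replaced by eps.

    Hypothetical helper for illustration; eps=1e-16 is the value suggested
    in this report, not a scikit-learn default.
    """
    w = np.asarray(sample_weight, dtype=float)
    return np.where(w > 0.0, w, eps)


X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
y = [0, 1, 2]
w = clip_nonpositive_weights([0.0, 1.0, 1.0])  # class 0 now has weight 1e-16

clf = SVC().fit(X, y, sample_weight=w)
print(clf.n_support_.shape, clf.dual_coef_.shape)
```

This keeps the attribute shapes consistent with classes_, at the cost of letting the near-zero-weight samples take part in the optimization with a negligible influence.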
Steps/Code to Reproduce
import numpy as np
import scipy.sparse as sp  # only needed for the sparse variant
from sklearn.svm import SVC

X = np.array([[0., 0.], [1., 0.], [0., 1.]])  # wrap in sp.csr_matrix(...) for the sparse case
y = [0, 1, 2]
w = [0., 1., 1.]  # class 0 has zero weight
clf = SVC().fit(X, y, w)
Expected Results
The fitted attributes should be
classes_: [0 1 2]
support_: [1 2]
support_vectors_: [[1. 0.]
[0. 1.]]
n_support_: [0 1 1]
dual_coef_: [[0. 0. 0.]
[0. 1. -1.]]
assuming the 'arbitrary' values in dual_coef_ are set to zero.
Actual Results
For dense X, the fitted attributes are actually
classes_: [0 1 2]
support_: [0 1] <-- should be [1 2]
support_vectors_: [[1. 0.]
[0. 1.]]
n_support_: [1 1] <-- should be [0 1 1]
dual_coef_: [[ 1. -1.]] <-- should be [[0. 0. 0.] [0. 1. -1.]]
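The dense-case mismatch can also be detected programmatically. The sketch below checks a fitted SVC against the documented attribute shapes: n_support_ has one entry per class, its entries sum to the number of support vectors, and dual_coef_ has n_classes - 1 rows. The function name svc_shape_report is hypothetical, and the checks are an assumption about which invariants are useful here, not an official scikit-learn validation routine.

```python
import numpy as np
from sklearn.svm import SVC


def svc_shape_report(clf):
    """Compare fitted SVC attributes against the shapes implied by classes_.

    Hypothetical helper for illustration; returns a dict of booleans,
    one per invariant.
    """
    n_classes = len(clf.classes_)
    return {
        # one support-vector count per class
        "n_support_len_ok": len(clf.n_support_) == n_classes,
        # the per-class counts add up to the number of support indices
        "n_support_sum_ok": int(np.sum(clf.n_support_)) == len(clf.support_),
        # dual_coef_ has one row per class minus one
        "dual_coef_rows_ok": clf.dual_coef_.shape[0] == n_classes - 1,
    }


X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
y = [0, 1, 2]

# Unweighted reference fit: all invariants are expected to hold.
reference = SVC().fit(X, y)
report = svc_shape_report(reference)
print(report)
```

Running the same report on the fit from the reproduction (w = [0., 1., 1.]) would flag the n_support_ and dual_coef_ entries shown above on an affected version.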
For sparse X it raises
ValueError: indices and data should have the same size
with traceback
sklearn/svm/_base.py:252, in fit(self, X, y, sample_weight)
--> 252 fit(X, y, sample_weight, solver_type, kernel, random_seed=seed)
sklearn/svm/_base.py:413, in _sparse_fit(self, X, y, sample_weight, solver_type, kernel, random_seed)
--> 413 self.dual_coef_ = sp.csr_matrix(
414 (dual_coef_data, dual_coef_indices, dual_coef_indptr), (n_class, n_SV)
415 )
scipy/sparse/_compressed.py:106, in _cs_matrix.__init__(self, arg1, shape, dtype, copy)
--> 106 self.check_format(full_check=False)
scipy/sparse/_compressed.py:176, in _cs_matrix.check_format(self, full_check)
174 # check index and data arrays
175 if (len(self.indices) != len(self.data)):
--> 176 raise ValueError("indices and data should have the same size")
Versions
System:
python: 3.9.15 | packaged by conda-forge | (main, Nov 22 2022, 08:55:37) [Clang 14.0.6 ]
executable: miniconda3/bin/python
machine: macOS-13.1-x86_64-i386-64bit
Python dependencies:
sklearn: 1.2.0
pip: 22.3.1
setuptools: 65.6.3
numpy: 1.24.1
scipy: 1.10.0
Cython: 0.29.33
pandas: 1.5.2
matplotlib: 3.6.2
joblib: 1.2.0
threadpoolctl: 3.1.0
Built with OpenMP: True
threadpoolctl info:
user_api: blas
internal_api: mkl
prefix: libmkl_rt
filepath: miniconda3/lib/libmkl_rt.dylib
version: 2020.0.4
threading_layer: intel
num_threads: 8
user_api: openmp
internal_api: openmp
prefix: libomp
filepath: miniconda3/lib/libomp.dylib
version: None
num_threads: 16