Description
ValueError: Unknown label type: 'unknown'
is thrown when passing a sparse matrix y to RandomForestClassifier.fit.
The reason is that several numpy functions are called on y:
atleast_1d: scikit-learn/sklearn/ensemble/_forest.py, line 307 in f65cdf2
copy: scikit-learn/sklearn/ensemble/_forest.py, line 543 in f65cdf2
These calls wrap the matrix in an array that does not carry its shape (the whole matrix becomes a single opaque element), which is not the intended result. I created a minimal fix to check that the rest of the code works, and it seems to. Note, however, that there are performance issues, since a dense copy of the sparse matrix is created.
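To make the wrapping behaviour concrete, here is a minimal sketch that calls the same two numpy functions on a small hand-made sparse matrix. The matrix `y_sparse` is a hypothetical stand-in for the rcv1 targets, and this is not the code path in `_forest.py` itself.

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-in for a sparse multilabel-indicator target:
# 2 samples, 3 labels.
y_sparse = sparse.csr_matrix(np.array([[1, 0, 1],
                                       [0, 1, 0]]))

# Neither call converts the matrix to a numeric array; each stores the
# matrix object itself as the single element of an object array.
wrapped = np.atleast_1d(y_sparse)
copied = np.copy(y_sparse)

print(y_sparse.shape)                # shape of the original matrix
print(wrapped.dtype, wrapped.shape)  # object dtype, one element
print(copied.dtype, copied.shape)    # object dtype, zero-dimensional
```

Because the result has object dtype and no longer exposes the (n_samples, n_labels) shape, the later target-type check cannot recognise it as a multilabel indicator.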
I have seen that the sparse multilabel-indicator type has received some attention (see #15333 and related issues), so I would like to know whether you are willing to support this format in this particular case before I start a PR. If not, I think it should at least fail with a more meaningful error.
Steps/Code to Reproduce
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import fetch_rcv1
X, y = fetch_rcv1(return_X_y=True)
rfc = RandomForestClassifier()
rfc.fit(X[:2], y[:2])
Expected Results
No error is thrown, or an error is raised indicating that sparse multilabel-indicator targets are not supported.
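As a caller-side workaround, y can be converted to a dense array before fitting. The sketch below assumes the dense copy fits in memory and uses small synthetic data (the names `X`, `y_sparse` and the chosen sizes are illustrative) rather than rcv1, so it runs without a download. It is not the fix proposed for `_forest.py`.

```python
import numpy as np
from scipy import sparse
from sklearn.ensemble import RandomForestClassifier

# Synthetic sparse features: 6 samples, 4 features (illustrative only).
X = sparse.random(6, 4, density=0.5, format="csr", random_state=0)

# Synthetic sparse multilabel-indicator target: 6 samples, 3 labels.
y_sparse = sparse.csr_matrix(np.array([[1, 0, 1],
                                       [0, 1, 0],
                                       [1, 1, 0],
                                       [0, 0, 1],
                                       [1, 0, 0],
                                       [0, 1, 1]]))

rfc = RandomForestClassifier(n_estimators=5, random_state=0)

# Densify the target explicitly; X may stay sparse.
rfc.fit(X, y_sparse.toarray())

pred = rfc.predict(X)
print(pred.shape)  # one column of predictions per label
```

The trade-off is the same as in the minimal fix mentioned above: the dense copy of y costs memory proportional to n_samples times n_labels.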
Actual Results
In [2]: from sklearn.ensemble import RandomForestClassifier
In [3]: from sklearn.datasets import fetch_rcv1
In [4]: X, y = fetch_rcv1(return_X_y=True)
In [5]: rfc = RandomForestClassifier()
In [6]: rfc.fit(X[:2], y[:2])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-6-3f7ac458bf48> in <module>
----> 1 rfc.fit(X[:2], y[:2])
~/scikit-learn/sklearn/ensemble/_forest.py in fit(self, X, y, sample_weight)
319 self.n_outputs_ = y.shape[1]
320
--> 321 y, expanded_class_weight = self._validate_y_class_weight(y)
322
323 if getattr(y, "dtype", None) != DOUBLE or not y.flags.contiguous:
~/scikit-learn/sklearn/ensemble/_forest.py in _validate_y_class_weight(self, y)
539
540 def _validate_y_class_weight(self, y):
--> 541 check_classification_targets(y)
542
543 y = np.copy(y)
~/scikit-learn/sklearn/utils/multiclass.py in check_classification_targets(y)
167 if y_type not in ['binary', 'multiclass', 'multiclass-multioutput',
168 'multilabel-indicator', 'multilabel-sequences']:
--> 169 raise ValueError("Unknown label type: %r" % y_type)
170
171
ValueError: Unknown label type: 'unknown'
Versions
System:
python: 3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 13:42:17) [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
executable: /Users/aabraham/.conda/envs/sklearn/bin/python
machine: Darwin-19.0.0-x86_64-i386-64bit
Python dependencies:
pip: 19.3.1
setuptools: 42.0.2.post20191203
sklearn: 0.23.dev0
numpy: 1.18.0
scipy: 1.4.1
Cython: 0.29.14
pandas: 0.25.3
matplotlib: 3.1.1
joblib: 0.14.1
Built with OpenMP: False