RandomForestClassifier does not handle sparse multilabel-indicator y · Issue #15958 · scikit-learn/scikit-learn · GitHub
RandomForestClassifier does not handle sparse multilabel-indicator y #15958
Closed
@AlexandreAbraham

Description

RandomForestClassifier.fit raises ValueError: Unknown label type: 'unknown' when y is passed as a sparse matrix.

The reason is that several numpy functions are called on the variable:

These calls wrap the sparse matrix in an array with no shape (a 0-d object array), which is not the intended result. I wrote a minimal fix to check that the rest of the code works, and it seems to. Note, however, that the fix has a performance cost, since it creates a dense copy of the sparse matrix.
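A minimal sketch of the wrapping behaviour, assuming a small hand-built scipy.sparse matrix in place of the RCV1 targets. np.copy appears in the traceback further down; np.asarray is included only for comparison and is not necessarily one of the calls made inside fit.

```python
import numpy as np
from scipy import sparse

# Toy stand-in for a sparse multilabel-indicator y (2 samples, 3 labels).
y_sparse = sparse.csr_matrix(np.array([[1, 0, 1], [0, 1, 0]]))

# NumPy cannot interpret the sparse matrix as numeric data, so it stores
# the whole object as the single element of a 0-d object array.
copied = np.copy(y_sparse)
as_array = np.asarray(y_sparse)

print(copied.shape, copied.dtype)
print(as_array.shape, as_array.dtype)

# An explicit conversion keeps the 2-D indicator layout,
# at the cost of materialising a dense copy.
dense = y_sparse.toarray()
print(dense.shape)
```

The shape and dtype printed for the first two arrays show why later checks on y no longer see a 2-D indicator matrix.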

I have seen that sparse multilabel-indicator targets have received some attention (see #15333 and related), so I would like to know whether you are willing to support this format in this particular case before I start a PR. If not, I think fit should at least fail with a more meaningful error.
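As an illustration of the "more meaningful error" option, here is a rough sketch of an up-front guard. The helper name check_dense_targets and the message wording are hypothetical, not part of scikit-learn; only scipy.sparse.issparse is an existing function.

```python
import numpy as np
from scipy import sparse


def check_dense_targets(y):
    """Hypothetical helper: reject a sparse y with an explicit message."""
    if sparse.issparse(y):
        raise ValueError(
            "sparse multilabel-indicator for y is not supported; "
            "pass a dense array instead, e.g. y.toarray()."
        )
    return y


y_sparse = sparse.csr_matrix(np.array([[1, 0, 1], [0, 1, 0]]))

try:
    check_dense_targets(y_sparse)
    message = None
except ValueError as exc:
    message = str(exc)

print(message)
```

A guard of this kind would have to run before any NumPy call touches y, otherwise the sparse matrix is already wrapped by the time the check happens.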

Steps/Code to Reproduce

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import fetch_rcv1

X, y = fetch_rcv1(return_X_y=True)
rfc = RandomForestClassifier()
rfc.fit(X[:2], y[:2])

Expected Results

No error is thrown, or an error is raised indicating that sparse multilabel-indicator targets are not supported.
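Until sparse targets are handled, one possible user-side workaround is to densify y before calling fit. This is only a sketch on small synthetic data (assumed here so that it runs without downloading RCV1), and it pays the same dense-copy cost mentioned above, which may be prohibitive for a large label matrix.

```python
import numpy as np
from scipy import sparse
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: sparse X and a sparse multilabel-indicator y.
rng = np.random.RandomState(0)
X = sparse.csr_matrix(rng.rand(20, 5))
y_sparse = sparse.csr_matrix((rng.rand(20, 3) > 0.5).astype(int))

rfc = RandomForestClassifier(n_estimators=5, random_state=0)

# Convert only the targets to a dense array; X can stay sparse.
rfc.fit(X, y_sparse.toarray())

pred = rfc.predict(X)
print(pred.shape)
```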

Actual Results

In [2]: from sklearn.ensemble import RandomForestClassifier

In [3]: from sklearn.datasets import fetch_rcv1

In [4]: X, y = fetch_rcv1(return_X_y=True)

In [5]: rfc = RandomForestClassifier()

In [6]: rfc.fit(X[:2], y[:2])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-3f7ac458bf48> in <module>
----> 1 rfc.fit(X[:2], y[:2])

~/scikit-learn/sklearn/ensemble/_forest.py in fit(self, X, y, sample_weight)
    319         self.n_outputs_ = y.shape[1]
    320 
--> 321         y, expanded_class_weight = self._validate_y_class_weight(y)
    322 
    323         if getattr(y, "dtype", None) != DOUBLE or not y.flags.contiguous:

~/scikit-learn/sklearn/ensemble/_forest.py in _validate_y_class_weight(self, y)
    539 
    540     def _validate_y_class_weight(self, y):
--> 541         check_classification_targets(y)
    542 
    543         y = np.copy(y)

~/scikit-learn/sklearn/utils/multiclass.py in check_classification_targets(y)
    167     if y_type not in ['binary', 'multiclass', 'multiclass-multioutput',
    168                       'multilabel-indicator', 'multilabel-sequences']:
--> 169         raise ValueError("Unknown label type: %r" % y_type)
    170 
    171 

ValueError: Unknown label type: 'unknown'

Versions

System:
    python: 3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 13:42:17)  [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
executable: /Users/aabraham/.conda/envs/sklearn/bin/python
   machine: Darwin-19.0.0-x86_64-i386-64bit

Python dependencies:
       pip: 19.3.1
setuptools: 42.0.2.post20191203
   sklearn: 0.23.dev0
     numpy: 1.18.0
     scipy: 1.4.1
    Cython: 0.29.14
    pandas: 0.25.3
matplotlib: 3.1.1
    joblib: 0.14.1

Built with OpenMP: False
