[MRG] Fixes tree and forest classification for non-numeric multi-target by mitar · Pull Request #11458 · scikit-learn/scikit-learn · GitHub

Merged — 5 commits merged into scikit-learn:master from the fix-multi-target branch on Feb 19, 2019

Conversation

@mitar (Contributor) commented on Jul 8, 2018

Fixes #11451.

This fixes the issue that trees and forests cannot predict (although they can fit) non-numeric targets when there are multiple targets.
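A minimal reproduction sketch of the behaviour being fixed (illustrative only, not taken verbatim from #11451; the data is made up):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(20, 4)
# Two string-valued targets per sample.
ys = np.column_stack([
    rng.choice(['foo', 'bar'], 20),
    rng.choice(['spam', 'eggs'], 20),
])

clf = DecisionTreeClassifier().fit(X, ys)  # fitting works
clf.predict(X)  # before this fix this raised an error; afterwards it returns the string labels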

@mitar (Contributor, Author) commented on Oct 9, 2018

Any update on this PR?

@pytest.mark.parametrize('name', FOREST_CLASSIFIERS)
@pytest.mark.parametrize('oob_score', (True, False))
def test_multi_target(name, oob_score):
    check_multi_target(name, oob_score)
Member:

any reason for not having the body of the check_multi_target function directly here?
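For context, check_multi_target is the shared helper added by this PR; its body is not shown in this thread. Pieced together from the snippets quoted further down, it plausibly does something like the following sketch (not the PR's actual code; the ESTIMATORS registry is a hypothetical stand-in for the CLF_TREES / FOREST_CLASSIFIERS registries in the real test modules, and the forest variant's oob_score argument is omitted):

import numpy as np
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier

# Hypothetical stand-in for the registries used in the real test modules.
ESTIMATORS = {
    'DecisionTreeClassifier': DecisionTreeClassifier,
    'ExtraTreeClassifier': ExtraTreeClassifier,
}

def check_multi_target(name):
    # Sketch of the shared helper under discussion; not the PR's actual body.
    X, y = datasets.load_iris(return_X_y=True)
    y = np.array(['foo' if v else 'bar' for v in y]).reshape((y.shape[0], 1))

    # Make multi-target.
    ys = np.hstack([y, y])

    # Try to fit and predict; before this PR, predict raised an error here.
    ESTIMATORS[name]().fit(X, ys).predict(X)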


@pytest.mark.parametrize('name', CLF_TREES)
def test_multi_target(name):
    check_multi_target(name)
Member:

same here.


- return predictions
+ return np.array(predictions).T
Member:

this would try to figure the dtype of the array, right? how much is it slower than the status quo?
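The concern here is about NumPy inferring a common dtype when the per-output prediction arrays are stacked. A small illustration of that inference (the variable names are illustrative, not taken from the diff):

import numpy as np

# One 1-D array of predicted labels per output, possibly with different dtypes.
predictions = [
    np.array(['foo', 'bar', 'foo']),  # string labels for output 0
    np.array([1, 2, 1]),              # integer labels for output 1
]

stacked = np.array(predictions).T  # the dtype must be inferred from the inputs
print(stacked.dtype)               # a common string dtype, e.g. <U21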

@adrinjalali (Member) commented:

You also need to rebase/merge master, you've got conflicts.

Other than that, I'm really not sure if this is a good idea. How many other estimators do we have that support string outputs? I suppose the recommended way is to convert the values before feeding them to estimators. I may be wrong.
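One way to do the conversion being suggested here is to encode each string target column to integers before fitting and decode after predicting. A sketch only; fit_predict_encoded is a hypothetical helper, not a scikit-learn API:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

def fit_predict_encoded(X, ys):
    # Encode each target column separately.
    encoders = [LabelEncoder().fit(col) for col in ys.T]
    ys_enc = np.column_stack([enc.transform(col) for enc, col in zip(encoders, ys.T)])

    clf = RandomForestClassifier(n_estimators=10).fit(X, ys_enc)
    preds = clf.predict(X)

    # Decode each predicted column back to the original labels.
    return np.column_stack([enc.inverse_transform(preds[:, k].astype(int))
                            for k, enc in enumerate(encoders)])

Note that the final np.column_stack still upcasts a mixed string/numeric result to a single string dtype, the same caveat discussed for the estimators themselves below.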

@jnothman (Member) commented:

We support string targets where 1d (i.e. single target). I'm not entirely against supporting string labels in multi-output, but it should be done by making sure that all estimators with multi-output multiclass support, and any metrics, support this case. Let alone the case of mixed numeric and string data. At the moment I can't see that we test multi-output multiclass in common tests at all.
When you think about it in terms of how much support it takes across the board to ensure consistent interfaces, I hope you can understand why we might not want to start.

@jnothman (Member) left a comment:

What does your implementation do with a mix of string and numeric targets?

@rok force-pushed the fix-multi-target branch from e215890 to b26e943 on February 16, 2019 13:08
@rok force-pushed the fix-multi-target branch from 5c5ecff to fcd597a on February 16, 2019 16:48
@rok (Contributor) commented on Feb 16, 2019

I've refactored the tests and the code a bit. The PR is also rebased onto master.

@jnothman: What does your implementation do with a mix of string and numeric targets?

from sklearn.ensemble import ExtraTreesClassifier
import numpy as np

X = np.random.choice([1, 2, 3], (100, 100))

ys = np.array([
    np.random.choice(['foo', 'bar'], (100,)),
    np.random.choice([1, 2, 3], (100,))
])

clf = ExtraTreesClassifier()
clf = clf.fit(X, ys.T)
clf.predict(X)

Returns:

array([['bar', '1'],
       ['bar', '2'],
       ['foo', '1'],
...

Doing the same with a regressor would fail as targets need to be numerical.

The training target array is upcast to a single dtype, and we get an array of the same dtype back from .predict. I feel this is acceptable behavior.

A way to support a real mix of dtypes would be with a structured array, but I don't know if we really want to do that?
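For reference, this is the kind of layout a structured array would give (an illustration of the NumPy feature only; scikit-learn estimators do not accept targets in this form):

import numpy as np

ys = np.array([('foo', 1), ('bar', 2), ('foo', 3)],
              dtype=[('label', 'U3'), ('count', 'i8')])
print(ys['label'])  # string column, dtype <U3
print(ys['count'])  # integer column, dtype int64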

# Make multi-target.
ys = np.hstack([y, y])

# Try to fix and predict.
Member:

fix -> fit?

Contributor:

Fixed :)

y = np.array(['foo' if v else 'bar' for v in y]).reshape((y.shape[0], 1))

# Make multi-target.
ys = np.hstack([y, y])
Member:

try with a string and a numerical column just to be on the safe side in the test?

Contributor:

Done.
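A sketch of how the requested mixed string/numerical target could be built from the snippet above (illustrative; per the commit messages, the final tests switched to the iris data):

import numpy as np

y = np.arange(10) % 2  # assume a 1-D integer target, as in the snippet above
y_str = np.array(['foo' if v else 'bar' for v in y]).reshape((-1, 1))
y_num = y.reshape((-1, 1))

# One string column and one numerical column; np.hstack upcasts both to strings.
ys = np.hstack([y_str, y_num])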

@adrinjalali (Member) commented:

This looks good to me.

Since Joel mentioned it, could you please try adding the same test to the common tests (sklearn/tests/test_common.py) and tell us where we fail on classifiers?

@rok (Contributor) commented on Feb 19, 2019

Thanks for the quick response @adrinjalali!
So that would mean adding multi-target checks here and here. Correct?
We would probably need to exclude certain regressors and classifiers that don't (yet) support multi-target.
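A rough sketch of what such a common check could look like, following the check_*(name, estimator_orig) convention of sklearn/utils/estimator_checks.py (the function name and how it would be registered are hypothetical, not from this PR):

import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris

def check_classifier_multioutput_string_targets(name, estimator_orig):
    # Hypothetical common check: fit on two string-valued targets and make
    # sure predict returns labels of the same shape drawn from the same set.
    X, y = load_iris(return_X_y=True)
    ys = np.column_stack([np.array(['a', 'b', 'c'])[y],
                          np.array(['x', 'y', 'z'])[y]])

    estimator = clone(estimator_orig)
    estimator.fit(X, ys)
    preds = estimator.predict(X)

    assert preds.shape == ys.shape
    assert set(np.unique(preds)) <= set(np.unique(ys))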

@jnothman (Member) left a comment:

I would support doing common tests in a separate PR, ideally remembering to remove these tests as redundant.

@adrinjalali (Member) commented:

So that would mean adding multi-target checks here and here. Correct?

Yes, it'd be awesome if you could follow up on that and open a new PR as Joel suggests. This looks good for now. Merging this, and will follow up on your other [hopefully to be coming soon] PR :)

@adrinjalali merged commit 95993a4 into scikit-learn:master on Feb 19, 2019
@adrinjalali mentioned this pull request on Feb 19, 2019
@mitar deleted the fix-multi-target branch on February 19, 2019 09:52
@rok (Contributor) commented on Feb 19, 2019

Thanks @adrinjalali, @mitar and @jnothman.
I'm creating a new issue #13187 for common tests and will open a PR soon.

xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
…scikit-learn#11458)

* Fixes tree and forest classification for non-numeric multi-target.

Fixes scikit-learn#11451.

* Renaming test functions, adding dtype to predictions array in tree.py.

* Fixing flake8 issue.

* Adding ignore warning to test_forest.py.

* Switching to iris data for tests.
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019
Successfully merging this pull request may close these issues:

An error is thrown when using Random forest classifier multi-target non-numeric classes (#11451)

4 participants