Improvement on Permutation importance example in release highlights · Issue #17313 · scikit-learn/scikit-learn · GitHub


Closed
venkyyuvy opened this issue May 23, 2020 · 2 comments · Fixed by #17331

Comments

@venkyyuvy
Contributor

Describe the issue linked to the documentation

When I look at the example given here, I was confused about why the feature names are not sorted with respect to their importances.

Suggest a potential alternative/fix

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(random_state=0, n_features=5,
                           n_informative=3)
rf = RandomForestClassifier(random_state=0).fit(X, y)
result = permutation_importance(rf, X, y, n_repeats=10, random_state=0,
                                n_jobs=-1)

feature_names = np.array([f'x_{i}' for i in range(X.shape[1])])

fig, ax = plt.subplots()
# Sort the boxes (and their tick labels) by mean importance so the
# labels match the plotted features.
sorted_idx = result.importances_mean.argsort()
ax.boxplot(result.importances[sorted_idx].T,
           vert=False, labels=feature_names[sorted_idx])
ax.set_title("Permutation Importance of each feature")
ax.set_ylabel("Features")
fig.tight_layout()
plt.show()


Also, for clarity, maybe we can set n_redundant=0, emphasising that permutation_importance identifies the 3 informative features precisely.

X, y = make_classification(random_state=0, n_features=5,
                           n_informative=3, n_redundant=0)
rf = RandomForestClassifier(random_state=0).fit(X, y)
result = permutation_importance(rf, X, y, n_repeats=10, random_state=0,
                                n_jobs=-1)

feature_names = np.array([f'x_{i}' for i in range(X.shape[1])])

fig, ax = plt.subplots()
sorted_idx = result.importances_mean.argsort()
ax.boxplot(result.importances[sorted_idx].T,
           vert=False, labels=feature_names[sorted_idx])
ax.set_title("Permutation Importance of each feature")
ax.set_ylabel("Features")
fig.tight_layout()
plt.show()


@jnothman
Member

I agree, the ticklabels are misleading. PR welcome.

I am happy with keeping the redundant features in, but could be persuaded otherwise.

@venkyyuvy
Contributor Author
venkyyuvy commented May 25, 2020

As you know, the results of permutation_importance suffer when features are correlated. Hence, for an intro example (when n_redundant != 0 we will have near-duplicates of the same feature, i.e. very high correlation), do we really have to showcase the case that exhibits this known drawback?
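To make the correlated-features point concrete, here is a minimal sketch of my own (not from the scikit-learn docs): using n_repeated=1 puts an exact copy of an informative feature into the dataset, so permuting either column alone leaves the model with the same signal through the copy, and the reported importance of each column understates how useful the underlying signal is.

```python
# Hypothetical sketch: a duplicated (perfectly correlated) feature
# dilutes permutation importance. Assumes scikit-learn is installed.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# One informative feature plus an exact repeated copy of it.
X, y = make_classification(random_state=0, n_features=2,
                           n_informative=1, n_redundant=0,
                           n_repeated=1, n_clusters_per_class=1)
rf = RandomForestClassifier(random_state=0).fit(X, y)
result = permutation_importance(rf, X, y, n_repeats=10, random_state=0)

# Permuting one column in isolation barely hurts the model, because the
# duplicate column still carries the same information, so the importance
# is split across (and diluted between) the two copies.
print(result.importances_mean)
```

Redundant features generated with n_redundant behave similarly, since they are linear combinations of the informative ones.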
