Option to return full decision paths when predicting with decision trees or random forest #2937
Conversation
If we decide to go in this direction and add such a functionality (I must
Thanks for your contribution @andosa! While I think returning full paths is an interesting feature, I would personally be against adding your example. Also, as @GaelVaroquaux suggested, I wouldn't overload the predict method but would rather create a dedicated new method for computing prediction paths.
Part of the point is that scikit-learn's capability to explain its decisions currently requires a fairly deep understanding of the model internals. If the library will not provide an API for explaining its predictions (model introspection in general, as well as for particular input), then examples are probably a good idea, though I take your point that the analysis is non-standard in this case.
The exact contributions of features in a forest are computed with feature importances, as we currently have them. Computing the contribution of each feature as done in the example of this PR, as if the output value in a tree were an additive function of the input features, really is something non-standard. (I don't say that it may not be useful, though.)
…ading the predict method * Fixed a bug of decision path array being typed int32 instead of SIZE_t
I've lifted the code into a separate function instead of overloading the predict method.
Not sure what you mean by "few situations". If you mean single output only, it will work exactly the same way for multi-output trees, where you would simply have a list of contributions (instead of just one) from each feature: one contribution per output. I do agree that the additive representation of tree predictions is somewhat non-standard, in the sense that it's typically not something found in textbooks. At the same time I feel it is something more people should know about, since it is very useful for understanding the model's behavior on particular data (vs. the static view you get from feature importances).
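As a rough illustration of that additive view (an editorial sketch, not taken from the PR itself): for a single-output regression tree, the prediction for a sample x can be written as

$$\hat{y}(x) \;=\; v(\mathrm{root}) \;+\; \sum_{n \,\in\, \mathrm{path}(x)} \bigl(v(c_n(x)) - v(n)\bigr),$$

where v(n) is the value stored at node n and c_n(x) is the child of n that x falls into. Attributing each term to the feature split on at node n yields one contribution per feature; for multi-output trees each term is simply a vector with one entry per output.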
Conflicts: sklearn/ensemble/forest.py sklearn/tree/_tree.c sklearn/tree/_tree.pyx sklearn/tree/tests/test_tree.py
You say typically. Can you find a reference to this or another technique for quantitative decision path inspection?
On the additive representation of predictions, see for example http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=6642461 But even regardless of this particular technique, the ability to inspect the decision
Hi, I find this feature highly interesting. However, I have the feeling that it would be more appropriate as a public function that takes a fitted/unfitted estimator and calls the "private" method of the … I haven't looked in detail at the current example, but I have the feeling that this should be discussed or made in another pull request. As an example, I would have expected something simpler: a fitted decision tree with the path of one sample in this tree (maybe on iris?).
I'm not sure what you mean. Could you give a usage example, @arjoly?
Sorry for the unclear reasoning. For the functionality in this PR, I would use it like this:
where … Thinking about it, an alternative option could be to improve the apply method.
Keep apply() lightweight, IMO.
Which module would
fwiw, I think I'd prefer to see the method as it is here. Rather, I've not understood the benefit of @arjoly's proposal.
I removed the feature contribution example. Agreed that it's probably a bit too involved and obscure to be in the examples list. The pull request now includes nothing but two conceptual changes that both help with model and prediction inspection:

- an option to return the full decision path (root to leaf) for each prediction, and
- storing values at every node, not just the leaves.

As such, I think the pull request is very straightforward and non-controversial. Anything else I could do to speed up merging?
…nto tree_paths
Conflicts: appveyor.yml doc/modules/cross_validation.rst doc/modules/kernel_approximation.rst doc/modules/model_evaluation.rst doc/modules/sgd.rst doc/modules/unsupervised_reduction.rst doc/whats_new.rst sklearn/__init__.py sklearn/cluster/tests/test_bicluster.py sklearn/cluster/tests/test_hierarchical.py sklearn/cluster/tests/test_spectral.py sklearn/datasets/base.py sklearn/datasets/tests/test_svmlight_format.py sklearn/decomposition/base.py sklearn/decomposition/incremental_pca.py sklearn/decomposition/tests/test_incremental_pca.py sklearn/ensemble/bagging.py sklearn/ensemble/gradient_boosting.py sklearn/ensemble/tests/test_gradient_boosting.py sklearn/ensemble/tests/test_gradient_boosting_loss_functions.py sklearn/ensemble/weight_boosting.py sklearn/feature_extraction/dict_vectorizer.py sklearn/feature_extraction/text.py sklearn/feature_selection/tests/test_feature_select.py sklearn/feature_selection/variance_threshold.py sklearn/gaussian_process/tests/test_gaussian_process.py sklearn/grid_search.py sklearn/linear_model/base.py sklearn/linear_model/cd_fast.c sklearn/linear_model/cd_fast.pyx sklearn/linear_model/coordinate_descent.py sklearn/linear_model/logistic.py sklearn/linear_model/ridge.py sklearn/linear_model/sgd_fast.c sklearn/linear_model/sgd_fast.pyx sklearn/linear_model/stochastic_gradient.py sklearn/linear_model/tests/test_coordinate_descent.py sklearn/linear_model/tests/test_least_angle.py sklearn/linear_model/tests/test_logistic.py sklearn/linear_model/tests/test_sgd.py sklearn/metrics/regression.py sklearn/metrics/tests/test_score_objects.py sklearn/neighbors/base.py sklearn/neighbors/kde.py sklearn/neighbors/nearest_centroid.py sklearn/neighbors/tests/test_nearest_centroid.py sklearn/neighbors/tests/test_neighbors.py sklearn/preprocessing/data.py sklearn/preprocessing/tests/test_data.py sklearn/svm/base.py sklearn/svm/bounds.py sklearn/svm/classes.py sklearn/svm/tests/test_svm.py sklearn/tests/test_common.py sklearn/tests/test_grid_search.py sklearn/tests/test_isotonic.py sklearn/tests/test_pipeline.py sklearn/tree/_tree.c sklearn/tree/_tree.pxd sklearn/tree/_tree.pyx sklearn/tree/tests/test_tree.py sklearn/tree/tree.py sklearn/utils/__init__.py sklearn/utils/extmath.py sklearn/utils/sparsefuncs.py sklearn/utils/testing.py sklearn/utils/tests/test_extmath.py sklearn/utils/tests/test_utils.py sklearn/utils/tests/test_validation.py sklearn/utils/validation.py
Could we revisit this PR? The tree code seems to have stabilized now, and this is a simple but very useful feature for a large number of use cases (and it could be moved to a private function).
if path[i] == -1:
    break
node_id = path[i]
pred += clf.tree_.value[node_id][0][0] - base
two white spaces after "=". Can you run pep8 on the file?
Committed pep8 fixes
# Check data
if getattr(X, "dtype", None) != DTYPE or X.ndim != 2:
    X = array2d(X, dtype=DTYPE)
Can you use the _validate_X_predict method?
Other opinions on this? (@glouppe?)
#def _parallel_predict_paths(trees, X):
#    """Private function used to compute a batch of prediction paths within a job."""
#    return [tree.decision_paths(X) for tree in trees]
Could you remove this?
@andosa if you need some help for the sparse algorithm, I may find some time.
Since 0.17 now has both value recording in the intermediate nodes and the option to return node ids for predictions via apply, the paths can be extracted from those, so an explicit method for that isn't actually needed anymore. I think this one should be closed? I wrote an example of how to extract the paths with current master and decompose every prediction into bias + feature contribution components (such that for each prediction, prediction = bias + feature_1_contribution + ... + feature_n_contribution). A lot of people doing practical data science have found this very useful (and I don't know of any other ML library that exposes this). Do you think it might be worth including in scikit-learn's examples or even in the tree utils (like export_graphviz)?
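The snippet below is a minimal editorial sketch of that decomposition, not code from this PR or from the example mentioned above. It assumes scikit-learn 0.17 or later (so that tree_.value is filled for internal nodes as well as leaves) and a single-output regression tree; the dataset and parameters are chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

data = load_diabetes()
X, y = data.data, data.target
clf = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

def decompose_prediction(clf, x):
    """Walk the decision path of one sample and attribute each change
    in node value to the feature that was split on."""
    tree = clf.tree_
    node = 0                               # start at the root
    bias = tree.value[node][0][0]          # root value: mean of y over the training set
    contributions = np.zeros(x.shape[0])
    while tree.children_left[node] != -1:  # -1 marks a leaf
        feature = tree.feature[node]
        if x[feature] <= tree.threshold[node]:
            child = tree.children_left[node]
        else:
            child = tree.children_right[node]
        # change in node value along the path, credited to the split feature
        contributions[feature] += tree.value[child][0][0] - tree.value[node][0][0]
        node = child
    return bias, contributions

bias, contributions = decompose_prediction(clf, X[0])
# For a regression tree: prediction == bias + sum of per-feature contributions
print(clf.predict(X[:1])[0], bias + contributions.sum())
```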
Since the apply method returns the leaf ids and not the node ids, I still think that there is some place for the decision node path.
But isn't the leaf id the node_id of the leaf? And since you have exactly one path to each leaf, you have a mapping from leaf ids to paths.
Yes, however computing the path in Python is likely to be slow. There are feature extraction methods based on the extraction of node indicator features for boosting and random forest based algorithms.
Computing all the paths is a one-off procedure, just one DFS, after which you have a lookup table that can map leaf id -> path.
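A sketch of that one-off DFS (again an editorial illustration, not from the PR), reusing the fitted `clf` and data `X` from the sketch above: it builds a lookup table from leaf id to the list of node ids on the root-to-leaf path, which apply() can then index into.

```python
def leaf_to_path(tree):
    """One DFS over the tree: map each leaf id to its root-to-leaf node path."""
    paths = {}

    def dfs(node, path):
        path = path + [node]
        if tree.children_left[node] == -1:   # leaf
            paths[node] = path
        else:
            dfs(tree.children_left[node], path)
            dfs(tree.children_right[node], path)

    dfs(0, [])
    return paths

lookup = leaf_to_path(clf.tree_)
leaf_ids = clf.apply(X)                                  # leaf id of every sample
sample_paths = [lookup[leaf] for leaf in leaf_ids]       # node path of every sample
```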
I agree that we should try to avoid too much Cython.
Closing this. I'll create an open issue to have an example showing how to extract prediction paths.
A very useful feature for decision trees is the option to access the full decision path for a prediction, i.e. the path from root to leaf for each decision.
This makes it possible to interpret the model in the context of the data in a very useful way. In particular, it allows one to see exactly why the tree or forest arrived at a particular result, breaking down the prediction into exact components (feature contributions). This is much needed in certain application areas; for example, in credit card fraud detection it is important to understand why the model labels a transaction as fraudulent.
The pull request implements a predict option for random forests and decision trees: when the return_paths keyword argument is set to True, paths are returned instead of predictions. I have not added docstrings yet, since I assume the API might be expected to be different (another method instead of a keyword argument to predict?).
In addition, there is a change to store values at each node, not just the leaves (useful when interpreting the tree).
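For illustration only, a sketch of how the usage proposed here might look. This keyword was never merged into scikit-learn, so the snippet will not run against any released version; the -1 end-of-path sentinel is taken from the example snippet reviewed earlier in the thread.

```python
# Proposed (unmerged) usage sketched from the PR description above:
# predict(..., return_paths=True) would return node-id paths instead of
# predictions. Not a real scikit-learn API.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target
clf = DecisionTreeClassifier().fit(X, y)

paths = clf.predict(X, return_paths=True)     # proposed keyword argument
for node_id in paths[0]:                      # decision path of the first sample
    if node_id == -1:                         # -1 marks the end of the path (see snippet above)
        break
    print(node_id, clf.tree_.value[node_id])  # values stored at every node under this PR
```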