Option to return full decision paths when predicting with decision trees or random forest #2937


Closed
andosa wants to merge 16 commits into scikit-learn:master from andosa:tree_paths

Conversation

@andosa (Author) commented Mar 5, 2014

A very useful feature for decision trees is the option to access the full decision path for a prediction, i.e. the path from root to leaf for each decision.
This makes it possible to interpret the model in the context of the data in a very useful way. In particular, it allows seeing exactly why the tree or forest arrived at a particular result, breaking the prediction down into exact components (feature contributions). This is much needed in certain application areas; for example, in credit card fraud detection it is important to understand why the model labels a transaction as fraudulent.

The pull request implements a predict option for random forest and decision tree: when the return_paths keyword argument is set to True, paths are returned instead of predictions. I have not added docstrings yet, since I assume the API might be expected to be different (another method instead of a keyword argument to predict?).

In addition, there is a change to store values at each node, not just the leaves (useful when interpreting the tree).

@GaelVaroquaux (Member)

> I assume the API might be expected to be different (another method instead of keyword argument to predict?)

If we decide to go in this direction and add such a functionality (I must say that I haven't understood the compelling argument for adding it), I think that it should be in the form of a new method, rather than overloading the predict method.

@glouppe (Contributor) commented Mar 8, 2014

Thanks for your contribution, @andosa!

While I think returning full paths is an interesting feature, I would personally be against adding your example compute_feature_contributions.py. What you call "feature contributions" is obviously something non-standard and seems to only work in a few situations (currently, your code only works for regression or binary classification, for a single output). I would rather let users implement such functionality themselves, assuming they know what they are doing, than maintain this in master.

Also, as @GaelVaroquaux suggested, I wouldn't overload the predict method but would rather create a dedicated new method for computing prediction paths.

@jnothman (Member) commented Mar 8, 2014

Part of the point is that scikit-learn's capability to explain its decisions currently requires a fairly deep understanding of the model internals. If the library will not provide an API for explaining its predictions (model introspection in general, as well as for particular input), then examples are probably a good idea, though I take your point that the analysis is non-standard in this case.

@glouppe (Contributor) commented Mar 9, 2014

> Part of the point is that scikit-learn's capability to explain its decisions currently requires a fairly deep understanding of the model internals. If the library will not provide an API for explaining its predictions (model introspection in general, as well as for particular input), then examples are probably a good idea, though I take your point that the analysis is non-standard in this case.

The exact contributions of features in a forest are computed with feature importances, as we currently have them. Computing the contribution of a feature as done in the example of this PR, as if the output value of a tree were an additive function of the input features, really is something non-standard. (I don't say that it may not be useful, though.)

Commits:
* Lifted path computation into a separate function instead of overloading the predict method
* Fixed a bug of decision path array being typed int32 instead of SIZE_t
@coveralls

Coverage Status

Coverage remained the same when pulling 068f05b on andosa:tree_paths into 4437d56 on scikit-learn:master.

@andosa (Author) commented Mar 11, 2014

I've lifted the code into a separate function instead of overloading the predict method.

> What you call "feature contributions" is obviously something non-standard and seems to only work in a few situations (currently, your code only works for regression or binary classification, for a single output)

Not sure what you mean by "few situations". If you mean single output only: it will work exactly the same way for multi-output trees, where you would simply have a list of contributions (instead of just one) from each feature, one contribution per output.
I could add a utility function that takes the sample matrix as input and returns an array of shape (n_samples, 1 + n_features, n_outputs), each row representing the contribution from each feature (plus the training set mean), summing up to the prediction you would get from the predict method. This would make the example less ad hoc and more reusable.

I do agree that the additive representation of tree predictions is somewhat non-standard, in the sense that it's typically not something found in textbooks. At the same time, I feel it is something more people should know about, since it is very useful for understanding model behavior on particular data (vs. the static view you get from feature importances).
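
For concreteness, here is a minimal sketch of the decomposition described above, written against the public tree_ arrays as they exist in scikit-learn 0.17+ (where values are stored at internal nodes too). feature_contributions is a hypothetical helper for a single-output regression tree, not code from this PR:

import numpy as np

def feature_contributions(estimator, x):
    # Decompose a single-output regression prediction for sample x into
    # the bias (root value) plus one additive contribution per feature.
    t = estimator.tree_
    bias = t.value[0][0][0]             # training set mean at the root
    contribs = np.zeros(t.n_features)
    node = 0
    while t.children_left[node] != -1:  # -1 marks a leaf
        f = t.feature[node]
        if x[f] <= t.threshold[node]:
            child = t.children_left[node]
        else:
            child = t.children_right[node]
        # Attribute the change in node value along this edge to the split feature.
        contribs[f] += t.value[child][0][0] - t.value[node][0][0]
        node = child
    return bias, contribs               # bias + contribs.sum() equals the prediction

Extending this to multiple outputs, as described above, just means keeping one such contribution vector per output.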

Ando Saabas added 3 commits August 25, 2014 14:23
@coveralls

Coverage Status

Coverage decreased (-0.02%) when pulling 7c0e173 on andosa:tree_paths into 4b82379 on scikit-learn:master.

@jnothman (Member)

> I do agree that the additive representation of tree predictions is somewhat non-standard, in the sense that it's typically not something found in textbooks. At the same time, I feel it is something more people should know about, since it is very useful for understanding model behavior on particular data (vs. the static view you get from feature importances).

You say "typically". Can you find a reference for this or another technique for quantitative decision path inspection?

@andosa (Author) commented Aug 27, 2014

On the additive representation of predictions, see for example http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=6642461

But even setting aside this particular technique, the ability to inspect the decision paths (along with the estimates in the internal nodes, in addition to the leaves) is something very useful to have in the data scientist's toolbox.

@arjoly (Member) commented Aug 28, 2014

Hi,

I find this feature highly interesting. However, I have the feeling that it would be more appropriate as a public function that takes a fitted/unfitted estimator and calls the "private" method of the Tree structure, instead of adding a new method to the API.

I haven't looked in detail at the current example, but I have the feeling that it should be discussed, or done in another pull request.

As an example, I would have expected something simpler: a fitted decision tree with the path of one sample in this tree (maybe on iris?).

@jnothman (Member)

> it would be more appropriate as a public function that takes a fitted/unfitted estimator and calls the "private" method of the Tree structure, instead of adding a new method to the API.

I'm not sure what you mean. Could you give a usage example, @arjoly?

@arjoly (Member) commented Aug 28, 2014

Sorry for the unclear reasoning. I will try to clarify.

For the functionality in this PR, I would use it like this:

est = DecisionTreeClassifier(...).fit(X, y)
for s in interesting_samples:
    path = compute_decision_path(est, X[s])
    # Plot decision tree and path

where compute_decision_path would be the new feature added by this PR. Internally, compute_decision_path would call est.tree_.decision_path(X).

Thinking about it, an alternative option could be to improve the apply method.

@jnothman (Member)

Keep apply() lightweight, IMO.


@andosa (Author) commented Aug 28, 2014

Which module would compute_decision_path live in? Something like tree/tree_utils.py?

@jnothman (Member)

FWIW, I think I'd prefer to see the method as it is here; I've not understood the benefit of @arjoly's proposal.


@coveralls

Coverage Status

Coverage decreased (-0.02%) when pulling 6fe3bb7 on andosa:tree_paths into 4b82379 on scikit-learn:master.

@andosa (Author) commented Sep 3, 2014

I removed the feature contribution example. Agreed that it's probably a bit too involved and obscure to be in the examples list.

The pull request now includes nothing but two conceptual changes, both of which help with model and prediction inspection:

  • Values for trained trees are kept for each node, not just for leaves
  • Decision tree and random forest have a method to return paths.

As such, I think the pull request is very straightforward and uncontroversial. Anything else I could do to speed up merging?
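
As a small illustration of the first change (values stored at every node, which is the behavior in scikit-learn 0.17+), tree_.value can be read for internal nodes as well as leaves. A sketch against a current scikit-learn, not code from this PR:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

t = clf.tree_
for node in range(t.node_count):
    kind = "leaf" if t.children_left[node] == -1 else "split"
    # value is populated at every node, not just at the leaves
    print(node, kind, t.value[node][0])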

andosa added 2 commits April 22, 2015 00:27
@andosa (Author) commented Apr 23, 2015

Could we revisit this PR? The tree code seems to have stabilized now, and this is a simple but very useful feature for a large number of use cases (it could be moved to a private function).

@amueller (Member)

@jnothman I think the idea behind @arjoly's proposal is to not clutter the interface of the forest too much with very specialized methods. I can see where he is coming from, but I feel we rarely take that road, and it makes a feature a lot less discoverable. I am undecided on the issue.

if path[i] == -1:
    break
node_id = path[i]
pred += clf.tree_.value[node_id][0][0] - base
Review comment (Member):
two white spaces after "=". Can you run pep8 on the file?

Reply (Author):

Committed pep8 fixes.

@amueller (Member)

@glouppe @arjoly wdyt?


# Check data
if getattr(X, "dtype", None) != DTYPE or X.ndim != 2:
    X = array2d(X, dtype=DTYPE)
Review comment (Member):

Can you use _validate_X_predict?

@arjoly (Member) commented Apr 30, 2015

> @jnothman I think the idea behind @arjoly's proposal is to not clutter the interface of the forest too much with very specialized methods. I can see where he is coming from, but I feel we rarely take that road, and it makes a feature a lot less discoverable. I am undecided on the issue.

Any other opinions on this? (@glouppe?)


#def _parallel_predict_paths(trees, X):
# """Private function used to compute a batch of prediction paths within a job."""
# return [tree.decision_paths(X) for tree in trees]
Review comment (Member):

Could you remove this?

@arjoly (Member) commented Jun 12, 2015

@andosa if you need some help with the sparse algorithm, I may find some time.

@andosa (Author) commented Aug 12, 2015

Since 0.17 now has both value recording in the intermediate nodes and the option to return node ids for predictions via apply, the paths can be extracted from those, so an explicit method for this isn't actually needed anymore. I think this one should be closed?

I wrote an example of how to extract the paths with current master and decompose every prediction into bias + feature contribution components (such that for each prediction, prediction = bias + feature_1_contribution + ... + feature_n_contribution).
Writeup at http://blog.datadive.net/random-forest-interpretation-with-scikit-learn/.
Code at https://github.com/andosa/treeinterpreter.

A lot of people doing practical data science have found this very useful (and I don't know of any other ML library that exposes it). Do you think it might be worth including in scikit-learn's examples, or even in the tree utils (like export_graphviz)?
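
A minimal usage sketch of the linked treeinterpreter package, assuming the ti.predict signature described in the write-up above (it returns a prediction, a bias term, and per-feature contributions):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from treeinterpreter import treeinterpreter as ti  # package linked above

rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = 2 * X[:, 0] + X[:, 1]  # toy additive target

rf = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)
prediction, bias, contributions = ti.predict(rf, X[:5])

# Each prediction decomposes exactly into bias + per-feature contributions.
print(np.allclose(np.ravel(prediction), np.ravel(bias) + contributions.sum(axis=1)))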

@arjoly (Member) commented Aug 13, 2015

Since the apply method returns the leaf ids and not the node ids along the path, I still think that there is some place for the decision path.

@andosa (Author) commented Aug 13, 2015

But isn't the leaf id the node_id of the leaf? And since there is exactly one path to each leaf, you have a mapping from leaf ids to paths.

@arjoly (Member) commented Aug 13, 2015

Yes, however computing the path in Python is likely to be slow. There are also feature extraction methods based on extracting node indicator features for boosting and random-forest-based algorithms.

@andosa (Author) commented Aug 13, 2015

Computing all the paths is a one-off procedure, just one DFS, after which you have a lookup table that maps leaf id -> path.
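
A minimal sketch of that one-off DFS, using only the public tree_ arrays (leaf_paths is a hypothetical helper, not code from this PR):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def leaf_paths(estimator):
    # One DFS over the fitted tree, returning {leaf id: [node ids from root]}.
    t = estimator.tree_
    paths = {}
    stack = [(0, [0])]  # (node id, path from the root)
    while stack:
        node, path = stack.pop()
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # -1 marks a leaf
            paths[node] = path
        else:
            stack.append((left, path + [left]))
            stack.append((right, path + [right]))
    return paths

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
table = leaf_paths(clf)
# apply() gives the leaf id per sample; the table turns it into a full path.
paths = [table[leaf] for leaf in clf.apply(X[:3])]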

@amueller (Member)

I agree that we should try to avoid too much Cython.
I think an example would be nice.

@glouppe (Contributor) commented Oct 19, 2015

Closing this.

I'll open an issue to add an example showing how to extract prediction paths.
