Option to return full decision paths when predicting with decision trees or random forest #2937


Closed
andosa wants to merge 16 commits into scikit-learn:master from andosa:tree_paths

Conversation

@andosa (Author) commented Mar 5, 2014

A very useful feature for decision trees is the option to access the full decision path for a prediction, i.e. the path from root to leaf for each decision.
This makes it possible to interpret the model in the context of the data in a very useful way. In particular, it allows seeing exactly why the tree or forest arrived at a particular result, breaking the prediction down into exact components (feature contributions). This is much needed in certain application areas; for example, in credit card fraud detection it is important to understand why the model labels a transaction as fraudulent.

The pull request implements a predict option for random forest and decision tree: when the return_paths keyword argument is set to True, paths are returned instead of predictions. I have not added docstrings yet, since I assume the API might be expected to be different (another method instead of a keyword argument to predict?).

In addition, there is a change to store values at each node, not just the leaves (useful when interpreting the tree).

@GaelVaroquaux (Member)

> I assume the API might be expected to be different (another method instead of keyword argument to predict?)

If we decide to go in this direction and add such a functionality (I must say that I haven't understood the compelling argument for adding it), I think that it should be in the form of a new method, rather than overloading the predict method.

@glouppe (Contributor) commented Mar 8, 2014

Thanks for your contribution, @andosa!

While I think returning full paths is an interesting feature, I would personally be against adding your example compute_feature_contributions.py. What you call "feature contributions" is obviously something non-standard and seems to only work in a few situations (currently, your code only works for regression or binary classification, for a single output). I would rather let users implement such functionality themselves, assuming they know what they are doing, than maintain this in master.

Also, as @GaelVaroquaux suggested, I wouldn't overload the predict method but would rather create a dedicated new method for computing prediction paths.

@jnothman (Member) commented Mar 8, 2014

Part of the point is that scikit-learn's capability to explain its decisions currently requires a fairly deep understanding of the model internals. If the library will not provide an API for explaining its predictions (model introspection in general, as well as for particular input), then examples are probably a good idea, though I take your point that the analysis is non-standard in this case.

@glouppe (Contributor) commented Mar 9, 2014

> Part of the point is that scikit-learn's capability to explain its decisions currently requires a fairly deep understanding of the model internals. If the library will not provide an API for explaining its predictions (model introspection in general, as well as for particular input), then examples are probably a good idea, though I take your point that the analysis is non-standard in this case.

The exact contributions of features in a forest are computed with feature importances, as we currently have them. Computing the contribution of a feature as done in the example of this PR, as if the output value of a tree were an additive function of the input features, really is something non-standard. (I don't say that it may not be useful, though.)

Commits:
* Lifted path computation into a separate function instead of overloading the predict method
* Fixed a bug of decision path array being typed int32 instead of SIZE_t
@coveralls

Coverage Status

Coverage remained the same when pulling 068f05b on andosa:tree_paths into 4437d56 on scikit-learn:master.

@andosa (Author) commented Mar 11, 2014

I've lifted the code into a separate function instead of overloading the predict method.

> What you call "feature contributions" is obviously something non-standard and seems to only work in a few situations (currently, your code only works for regression or binary classification, for a single output)

Not sure what you mean by "few situations". If you mean single output only: it will work exactly the same way for multi-output trees, where you would simply have a list of contributions (instead of just one) from each feature, one contribution per output.
I could add a utility function that takes the sample matrix as input and returns an array of shape (n_samples, 1 + n_features, n_outputs), each row representing the contribution from each feature (plus the training set mean), summing up to the prediction you would get from the predict method. This would make the example less ad hoc and more reusable.

I do agree that the additive representation of tree predictions is somewhat non-standard, in the sense that it's typically not something found in textbooks. At the same time, I feel it is something more people should know about, since it is very useful for understanding model behavior on particular data (vs. the static view you get from feature importances).
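
For concreteness, here is a minimal sketch of the decomposition described above, written against the public tree_ arrays as they exist in scikit-learn 0.17+ (where values are stored at internal nodes too). feature_contributions is a hypothetical helper for a single-output regression tree, not code from this PR:

import numpy as np

def feature_contributions(estimator, x):
    # Decompose a single-output regression prediction for sample x into
    # the bias (root value) plus one additive contribution per feature.
    t = estimator.tree_
    bias = t.value[0][0][0]             # training set mean at the root
    contribs = np.zeros(t.n_features)
    node = 0
    while t.children_left[node] != -1:  # -1 marks a leaf
        f = t.feature[node]
        if x[f] <= t.threshold[node]:
            child = t.children_left[node]
        else:
            child = t.children_right[node]
        # Attribute the change in node value along this edge to the split feature.
        contribs[f] += t.value[child][0][0] - t.value[node][0][0]
        node = child
    return bias, contribs               # bias + contribs.sum() equals the prediction

Extending this to multiple outputs, as described above, just means keeping one such contribution vector per output.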

Ando Saabas added 3 commits August 25, 2014 14:23
@coveralls

Coverage Status

Coverage decreased (-0.02%) when pulling 7c0e173 on andosa:tree_paths into 4b82379 on scikit-learn:master.

@jnothman (Member)

> I do agree that the additive representation of tree predictions is somewhat non-standard, in the sense that it's typically not something found in textbooks. At the same time, I feel it is something more people should know about, since it is very useful for understanding model behavior on particular data (vs. the static view you get from feature importances).

You say "typically". Can you find a reference for this or another technique for quantitative decision path inspection?

@andosa (Author) commented Aug 27, 2014

On the additive representation of predictions, see for example http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=6642461

But even setting aside this particular technique, the ability to inspect the decision paths (along with the estimates in the internal nodes, in addition to the leaves) is something very useful to have in the data scientist's toolbox.

@arjoly (Member) commented Aug 28, 2014

Hi,

I find this feature highly interesting. However, I have the feeling that it would be more appropriate as a public function that takes a fitted/unfitted estimator and calls the "private" method of the Tree structure, instead of adding a new method to the API.

I haven't looked in detail at the current example, but I have the feeling that it should be discussed, or done in another pull request.

As an example, I would have expected something simpler: a fitted decision tree with the path of one sample in this tree (maybe on iris?).

@jnothman (Member)

> it would be more appropriate as a public function that takes a fitted/unfitted estimator and calls the "private" method of the Tree structure, instead of adding a new method to the API.

I'm not sure what you mean. Could you give a usage example, @arjoly?

@arjoly (Member) commented Aug 28, 2014

Sorry for the unclear reasoning. I will try to clarify.

For the functionality in this PR, I would use it like this:

est = DecisionTreeClassifier(...).fit(X, y)
for s in interesting_samples:
    path = compute_decision_path(est, X[s])
    # Plot decision tree and path

where compute_decision_path would be the new feature added by this PR. Internally, compute_decision_path would call est.tree_.decision_path(X).

Thinking about it, an alternative option could be to improve the apply method.

@jnothman (Member)

Keep apply() lightweight, IMO.


@andosa (Author) commented Aug 28, 2014

Which module would compute_decision_path live in? Something like tree/tree_utils.py?

@jnothman (Member)

FWIW, I think I'd prefer to see the method as it is here; I've not understood the benefit of @arjoly's proposal.


@coveralls

Coverage Status

Coverage decreased (-0.02%) when pulling 6fe3bb7 on andosa:tree_paths into 4b82379 on scikit-learn:master.

@andosa (Author) commented Sep 3, 2014

I removed the feature contribution example. Agreed that it's probably a bit too involved and obscure to be in the examples list.

The pull request now includes nothing but two conceptual changes, both of which help with model and prediction inspection:

  • Values for trained trees are kept for each node, not just for leaves
  • Decision tree and random forest have a method to return paths.

As such, I think the pull request is very straightforward and uncontroversial. Anything else I could do to speed up merging?
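
As a small illustration of the first change (values stored at every node, which is the behavior in scikit-learn 0.17+), tree_.value can be read for internal nodes as well as leaves. A sketch against a current scikit-learn, not code from this PR:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

t = clf.tree_
for node in range(t.node_count):
    kind = "leaf" if t.children_left[node] == -1 else "split"
    # value is populated at every node, not just at the leaves
    print(node, kind, t.value[node][0])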

andosa added 2 commits April 22, 2015 00:27
@andosa (Author) commented Apr 23, 2015

Could we revisit this PR? The tree code seems to have stabilized now, and this is a simple but very useful feature for a large number of use cases (it could be moved to a private function).

@amueller (Member)

@jnothman I think the idea behind @arjoly's proposal is to not clutter the interface of the forest too much with very specialized methods. I can see where he is coming from, but I feel we rarely take that road, and it makes a feature a lot less discoverable. I am undecided on the issue.

if path[i] == -1:
    break
node_id = path[i]
pred += clf.tree_.value[node_id][0][0] - base
Review comment (Member):
two white spaces after "=". Can you run pep8 on the file?

Reply (Author):

Committed pep8 fixes.

@amueller (Member)

@glouppe @arjoly wdyt?


# Check data
if getattr(X, "dtype", None) != DTYPE or X.ndim != 2:
    X = array2d(X, dtype=DTYPE)
Review comment (Member):

Can you use _validate_X_predict?

@arjoly (Member) commented Apr 30, 2015

> @jnothman I think the idea behind @arjoly's proposal is to not clutter the interface of the forest too much with very specialized methods. I can see where he is coming from, but I feel we rarely take that road, and it makes a feature a lot less discoverable. I am undecided on the issue.

Any other opinions on this? (@glouppe?)


#def _parallel_predict_paths(trees, X):
# """Private function used to compute a batch of prediction paths within a job."""
# return [tree.decision_paths(X) for tree in trees]
Review comment (Member):

Could you remove this?

@arjoly (Member) commented Jun 12, 2015

@andosa if you need some help with the sparse algorithm, I may find some time.

@andosa (Author) commented Aug 12, 2015

Since 0.17 now has both value recording in the intermediate nodes and the option to return node ids for predictions via apply, the paths can be extracted from those, so an explicit method for this isn't actually needed anymore. I think this one should be closed?

I wrote an example of how to extract the paths with current master and decompose every prediction into bias + feature contribution components (such that for each prediction, prediction = bias + feature_1_contribution + ... + feature_n_contribution).
Writeup at http://blog.datadive.net/random-forest-interpretation-with-scikit-learn/.
Code at https://github.com/andosa/treeinterpreter.

A lot of people doing practical data science have found this very useful (and I don't know of any other ML library that exposes it). Do you think it might be worth including in scikit-learn's examples, or even in the tree utils (like export_graphviz)?
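
A minimal usage sketch of the linked treeinterpreter package, assuming the ti.predict signature described in the write-up above (it returns a prediction, a bias term, and per-feature contributions):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from treeinterpreter import treeinterpreter as ti  # package linked above

rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = 2 * X[:, 0] + X[:, 1]  # toy additive target

rf = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)
prediction, bias, contributions = ti.predict(rf, X[:5])

# Each prediction decomposes exactly into bias + per-feature contributions.
print(np.allclose(np.ravel(prediction), np.ravel(bias) + contributions.sum(axis=1)))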

@arjoly (Member) commented Aug 13, 2015

Since the apply method returns the leaf ids and not the node ids along the path, I still think that there is some place for the decision path.

@andosa (Author) commented Aug 13, 2015

But isn't the leaf id the node_id of the leaf? And since there is exactly one path to each leaf, you have a mapping from leaf ids to paths.

@arjoly (Member) commented Aug 13, 2015

Yes, however computing the path in Python is likely to be slow. There are also feature extraction methods based on extracting node indicator features for boosting and random-forest-based algorithms.

@andosa (Author) commented Aug 13, 2015

Computing all the paths is a one-off procedure, just one DFS, after which you have a lookup table that maps leaf id -> path.
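
A minimal sketch of that one-off DFS, using only the public tree_ arrays (leaf_paths is a hypothetical helper, not code from this PR):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def leaf_paths(estimator):
    # One DFS over the fitted tree, returning {leaf id: [node ids from root]}.
    t = estimator.tree_
    paths = {}
    stack = [(0, [0])]  # (node id, path from the root)
    while stack:
        node, path = stack.pop()
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # -1 marks a leaf
            paths[node] = path
        else:
            stack.append((left, path + [left]))
            stack.append((right, path + [right]))
    return paths

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
table = leaf_paths(clf)
# apply() gives the leaf id per sample; the table turns it into a full path.
paths = [table[leaf] for leaf in clf.apply(X[:3])]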

@amueller (Member)

I agree that we should try to avoid too much Cython.
I think an example would be nice.

@glouppe (Contributor) commented Oct 19, 2015

Closing this.

I'll open an issue to add an example showing how to extract prediction paths.
