Add custom_range argument for partial dependence #21033
Conversation
For whoever reviews this PR, where in the changelog should I add this change?
doc/whats_new/v1.1.rst
Thank you for starting this PR @freddyaboulton!

I am open to a better name than custom_grid; the dictionary's value is more like a range.
        the partial dependence should be calculated for that feature. The length
        of `custom_grid` must match the length of `features`.
I think the requirement that len(custom_grid) == len(features) is not a great user experience.

I was thinking of extending _grid_from_X to support custom_grid, and if custom_grid is provided for a given feature, then do minimal validation and add it to values.
@thomasjpfan Thank you for the feedback! I changed the name from custom_grid to custom_range.
Force-pushed from 2f96fc0 to f2195ce.
Thank you for the follow-up! I made another pass through the PR.
        values.append(axis)

    return cartesian(values), values
    # Store cartesian in grid of dtype=object to support grids of str/numeric values
    shape = (len(v) for v in values)
This would increase memory overhead for a grid that is all numerical. We should use object dtype only when it is necessary; otherwise numerical ndarrays are good enough.
Done! If there are mixed types (numeric/non-numeric) we use object dtype, else we use the first dtype (the default behavior of cartesian).
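A rough sketch of the dtype choice described above (illustrative only, not the PR's actual code; the values list and the mixed-type check are assumptions). It builds the grid with sklearn.utils.extmath.cartesian, forcing an object-dtype output only when the per-feature grids mix numeric and non-numeric values:

import numpy as np
from sklearn.utils.extmath import cartesian

# Two per-feature grids: one numeric, one categorical (strings).
values = [np.array([0.0, 1.0, 2.0]), np.array(["low", "high"], dtype=object)]

is_non_numeric = [v.dtype.kind in "OUS" for v in values]
if any(is_non_numeric) and not all(is_non_numeric):
    # Mixed numeric / non-numeric grids: build the cartesian product into an
    # object-dtype array so neither side is coerced.
    n_rows = int(np.prod([len(v) for v in values]))
    out = np.empty((n_rows, len(values)), dtype=object)
    grid = cartesian(values, out=out)
else:
    # Homogeneous grids: cartesian keeps the dtype of the first array.
    grid = cartesian(values)

print(grid.dtype, grid.shape)  # object (6, 2)

This keeps an all-numeric grid as a plain numeric ndarray, which avoids the memory overhead mentioned earlier in the thread.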
@@ -321,6 +344,11 @@ def partial_dependence(
        `kind='average'` will be the new default. It is intended to migrate
        from the ndarray output to :class:`~sklearn.utils.Bunch` output.

    custom_range: dict
Suggested change:
- custom_range: dict
+ custom_range: dict, default=None
Given the language of "values" in the docstring, maybe custom_values
would be a better name. What do you think?
Sounds good to me!
Done!
    if isinstance(features, (str, int, float, bool)):
        features = [features]
I do not think features
can be a single bool or a float:
Suggested change:
- if isinstance(features, (str, int, float, bool)):
-     features = [features]
+ if isinstance(features, (str, int)):
+     features = [features]
You are right. Will make this change.
Done!
    custom_range = custom_range or {}
    if isinstance(features, (str, int, float, bool)):
        features = [features]
    custom_range = {
To avoid overwriting custom_range while also reading from custom_range:
Suggested change:
- custom_range = {
+ custom_range_idx = {
Done!
    if isinstance(features, (str, int, float, bool)):
        features = [features]
    custom_range = {
        index: custom_range.get(feature)
I do not think this works when feature is a mask. I do not think we officially support masks, given that the docstring does not mention it.

@glemaitre Was mask support intentional?
If it works with a boolean mask, it was not intentional, as you mentioned. It is not tested, at least.
The boolean mask is tested here:

([True, False, True, False], (3, 10, 10)),
Thank you for pointing this out @thomasjpfan!

I can change the API so that custom_range (or custom_values, as we will call it now) is a list of array-likes rather than a dictionary. The implementation will assume (and the docstring will make clear) that the order of values in custom_values corresponds to the order of the features argument. That way custom_values supports all types of the features argument. Does that sound good?

Perhaps we should also file a separate issue to make clear that boolean masks are allowed, given that they are tested and supported but not documented?
> Perhaps we should also file a separate issue to make clear that boolean masks are allowed, given that they are tested and supported but not documented?

In this case, I think we follow the docs and treat the boolean support as unintentional.

> I can change the API so that custom_range (or custom_values as we will call it now) is a list of array-likes rather than a dictionary.

I think an array-like UX would not be great. Imagine having 20 features and wanting to specify the range for feature 10. custom_values would look like:

custom_values = [None, None, ..., None, [0, 1, 2, 3, 4], None, ..., None]
Having a custom_values argument would solve different problems - I just wanted to ask for this feature. Great that it is almost available! I think a dict would be a good interface to pass the evaluation points. The keys would be a subset of features (and of the corresponding type). So, e.g.,
from plotnine.data import diamonds
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.inspection import PartialDependenceDisplay

# Ordered categoricals to be integer encoded in correct order
ord_vars = ["color", "cut", "clarity"]
ord_levels = [
    ['D', 'E', 'F', 'G', 'H', 'I', 'J'],
    ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal'],
    ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF'],
]

# Modeling
model = make_pipeline(
    ColumnTransformer(
        transformers=[
            ("linear", "passthrough", ["carat"]),
            ("ordered", OrdinalEncoder(categories=ord_levels), ord_vars),
        ],
    ),
    RandomForestRegressor(max_features="sqrt", min_samples_leaf=5),
)
model.fit(diamonds, y=diamonds["price"])

# Does not work yet
PartialDependenceDisplay.from_estimator(
    model,
    diamonds,
    n_jobs=8,
    features=["carat"] + ord_vars,
    custom_values=dict(zip(ord_vars, ord_levels)),
)
@thomasjpfan @mayer79 Thank you for the comments and feedback. I will keep the API as is, then, since the boolean mask is not officially supported!
Force-pushed from 9ffac83 to 701c868.
@thomasjpfan I forgot to mention I addressed the latest batch of comments!
Really cool work! Left a question on whether custom_values can default to an empty dictionary.
@@ -58,6 +58,10 @@ def _grid_from_X(X, percentiles, grid_resolution):
        The number of equally spaced points to be placed on the grid for each
        feature.

    custom_values: dict
        Mapping from column index of X to an array-like of values where
        the partial dependence should be calculated for that feature
nit: period here
"Mapping from column index of X" isn't accurate, it's actually mapping the element number from the list of user-provided features. Not sure if this will have downstream effects.
i.e. If user specifies that they want PDP for feature 5 only, the key generated in the partial_dependence
function will assign this feature as 0, not 5.
I think the docstring is accurate. At this point, X has been subset to the features in the features array with _safe_indexing, and the custom_values mapping is modified so that it maps the column index in the subset array to the array of values.
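A small sketch of the remapping described above (variable names are illustrative, not the PR's exact code): the keys of custom_values are translated from the user-facing feature identifiers to column positions in the subset of X that _grid_from_X actually receives.

# User-provided features argument and per-feature custom grid.
features = ["age", "bmi", "bp"]
custom_values = {"bmi": [18.5, 25.0, 30.0]}

# Keys become column positions of the subset X (here "bmi" -> column 1).
custom_values_for_subset = {
    column: custom_values[feature]
    for column, feature in enumerate(features)
    if feature in custom_values
}
print(custom_values_for_subset)  # {1: [18.5, 25.0, 30.0]}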
@@ -36,7 +36,7 @@
]


def _grid_from_X(X, percentiles, grid_resolution):
def _grid_from_X(X, percentiles, grid_resolution, custom_values):
Is it possible to default custom_values to an empty dict? This avoids the case where a value needs to be passed in as an argument when it isn't necessary.
I believe it needs to default to None, then be replaced by an empty dictionary if it's None.
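For illustration, a minimal sketch of that pattern with a hypothetical helper (not the actual scikit-learn signature): a literal {} default would be created once at definition time and shared across calls, so None is used as the sentinel and replaced inside the function.

def _grid_from_X_sketch(X, percentiles, grid_resolution, custom_values=None):
    # Avoid a mutable default argument: substitute an empty dict per call.
    if custom_values is None:
        custom_values = {}
    # ... build one grid axis per column of X, preferring custom_values[i]
    # when the caller supplied it ...
    return custom_values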
This is a great feature you've added to PDP. Look forward to getting it nailed down and released!
            _safe_indexing(X, feature, axis=1), prob=percentiles, axis=0
        )
        if np.allclose(emp_percentiles[0], emp_percentiles[1]):
        if feature in custom_values:
It would make sense to rename feature to feature_idx since it's not the feature name.
custom_values is also index-based, and perhaps the name should reflect that (since custom_values is also an argument in the partial_dependence function, and has a different structure).
I have another question - does this work with … EDIT: A better way to ask this is whether or not anyone has managed to plot the results from this function with either …
Force-pushed from 53b5a83 to d51cbe6.
    return cartesian(values), values
    # Create a place holder for the cartesian product of the individual grids.
    shape = (len(v) for v in values)
    ix = np.indices(shape)
It's probably a bad idea to create an array with a new dimension for every single feature, because this line will fail for >32 features.
Traceback (most recent call last):
File ".../lib/python3.7/site-packages/sklearn/inspection/_partial_dependence.py", line 521, in partial_dependence
custom_values_idx
File ".../lib/python3.7/site-packages/sklearn/inspection/_partial_dependence.py", line 120, in _grid_from_X
ix = np.indices(shape)
File ".../lib/python3.7/site-packages/numpy/core/numeric.py", line 1777, in indices
res = empty((N,)+dimensions, dtype=dtype)
ValueError: maximum supported dimension for an ndarray is 32, found 792
You're correct, but that limitation exists on the current main branch:
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import partial_dependence
import pytest
data = load_digits()
X = data.data
y = data.target
rf = RandomForestClassifier()
rf.fit(X, y)
with pytest.raises(ValueError, match="maximum supported dimension for an ndarray is 32, found 33"):
    partial_dependence(rf, X, features=list(range(32)))
So if we want to support more than 32-way partial dependence, that's a separate ticket.
…bset of features.
Force-pushed from 1f9d38d to 778203a.
@thomasjpfan Can you please give this another look when you get a chance? Thanks!
Thanks for the update!
    if use_custom_values:
        custom_values = {
            "age": custom_values_helper(age, grid_resolution),
            "bmi": custom_values_helper(bmi, grid_resolution),
test_plot_partial_dependence.py is already one of the longer-duration test files, so I would prefer not to parametrize everything with use_custom_values.

I think test_partial_dependence_pipeline_custom_values and test_grid_from_X cover the new code well.
    # numeric/non-numeric features we use object dtype.
    # Else, we use the first dtype in values,
    # which is the default behavior of the cartesian function.
    dtypes = [arr.dtype for arr in values]
I know your original issue is about inputting strings, but I think we can reduce the scope of this PR by supporting numerical custom_values first and then adding object/str support later.

(Reducing the scope makes it easier to review and usually results in merging faster.)
    feature_range = custom_values[feature]
    if not isinstance(feature_range, np.ndarray):
        feature_range = np.array(feature_range)
np.asarray is a no-op when the input is an ndarray:
Suggested change:
- feature_range = custom_values[feature]
- if not isinstance(feature_range, np.ndarray):
-     feature_range = np.array(feature_range)
+ feature_range = np.asarray(custom_values[feature])
(while np.array will always make a copy)
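A quick illustration of that point (assuming NumPy's default copy behavior):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
assert np.asarray(a) is a                     # no copy: the same object comes back
assert np.array(a) is not a                   # np.array copies by default
assert np.asarray([1.0, 2.0, 3.0]) is not a   # a list is still converted to a new array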
"Grid for feature {} is not a one-dimensional array. Got {}" | ||
" dimensions".format(feature, feature_range.ndim) | ||
) | ||
axis = feature_range |
I'd prefer to error-check before the loop, since checking custom_values is more of an input validation. (It also reduces the amount of indented code.)
custom_values = {k: np.asarray(v) for k, v in custom_values.items()}
if any(v.ndim != 1 for v in custom_values.values()):
    raise ValueError(...)

for feature in range(X.shape[1]):
    if feature in custom_values:
        axis = custom_values[feature]
    else:
        ...
In the error message, I think it's enough to say that custom_values contains non-1d data. The feature index does not correspond to the location in the original X anymore, so it can be confusing.
Also, we can add a new test to test_grid_from_X_error to check the custom_values validation.
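A hedged sketch of what such a test could look like. The test name, the keyword-style call, and the matched message pattern are assumptions, and the custom_values parameter of the private _grid_from_X exists only on this PR's branch, not in released scikit-learn:

import numpy as np
import pytest
from sklearn.inspection._partial_dependence import _grid_from_X

def test_grid_from_X_error_custom_values_not_1d():
    X = np.asarray([[1, 2], [3, 4]])
    with pytest.raises(ValueError, match="one-dimensional"):
        _grid_from_X(
            X,
            percentiles=(0.05, 0.95),
            grid_resolution=100,
            custom_values={0: [[1, 2], [3, 4]]},  # 2-d values should be rejected
        )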
    custom_values_idx = {
        index: custom_values.get(feature)
        for index, feature in enumerate(features)
        if feature in custom_values
    }
I think it's fairly unclear that custom_values_idx maps from the sliced X. I think we can make it more clear by doing:
X_subset = _safe_indexing(X, features_indices, axis=1)
custom_values = custom_values or {}
custom_values_for_X_subset = {...}
grid, values = _grid_from_X(X_subset, ...)
@freddyaboulton, @thomasjpfan I am also interested in this feature. There hasn't been any activity on this PR in over a year, and I would be interested in taking it on.
@thomasjpfan can we switch the source to my branch - https://github.com/stephenpardy/scikit-learn/tree/20890-partial-dependence-custom-grid? I talked to Freddy and I will be taking over the PR - I will address the issues brought up in the last comments.
@freddyaboulton Can you confirm that you are okay with @stephenpardy continuing the work on this PR? @stephenpardy It's best to open another PR that references this one and states that it is superseding this PR. The new PR's opening comment should still link back to the original issue and describe its updates.
Hi @thomasjpfan! Yes, @stephenpardy can take over.
Reference Issues/PRs
Fixes #20890

What does this implement/fix? Explain your changes.
This PR allows users to specify a custom_range of values at which to calculate partial dependence for some or all of the features. The API is custom_range={feature: array-like of grid values}.
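For illustration, a hedged usage sketch of the proposed API (custom_range is not part of released scikit-learn at the time of this discussion, and is renamed custom_values during review, so the exact signature may differ):

from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import partial_dependence

X, y = load_diabetes(return_X_y=True)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Evaluate partial dependence for feature 0 on an explicit grid of values
# instead of the default percentile-based grid of size grid_resolution.
result = partial_dependence(
    model, X, features=[0], custom_range={0: [-0.05, 0.0, 0.05]}
)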
Any other comments?