Add custom_range argument for partial dependence #21033
Conversation
For whoever reviews this PR, where in the changelog should I add this change?
doc/whats_new/v1.1.rst
Thank you for starting this PR @freddyaboulton!

I am open to a better name than custom_grid; the dictionary's value is more like a range.
        the partial dependence should be calculated for that feature. The length
        of `custom_grid` must match the length of `features`.
I think the requirement that len(custom_grid) == len(features) is not a great user experience.

I was thinking of extending _grid_from_X to support custom_grid, and if custom_grid is provided for a given feature, then do minimal validation and add it to values.
@thomasjpfan Thank you for the feedback! I changed the name from custom_grid to custom_range.
Force-pushed from 2f96fc0 to f2195ce.
Thank you for the follow-up! I made another pass through the PR.
        values.append(axis)

    return cartesian(values), values
    # Store cartesian in grid of dtype=object to support grids of str/numeric values
    shape = (len(v) for v in values)
This would increase memory overhead for a grid that is all numerical. We should use object dtype only when it is necessary; otherwise numerical ndarrays are good enough.
Done! If there are mixed types (numeric/non-numeric) we use object dtype, else we use the first dtype (the default behavior of cartesian).
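A rough sketch of the dtype choice described above (illustrative only, not the PR's actual code; the values list and the mixed-type check are assumptions). It builds the grid with sklearn.utils.extmath.cartesian, forcing an object-dtype output only when the per-feature grids mix numeric and non-numeric values:

import numpy as np
from sklearn.utils.extmath import cartesian

# Two per-feature grids: one numeric, one categorical (strings).
values = [np.array([0.0, 1.0, 2.0]), np.array(["low", "high"], dtype=object)]

is_non_numeric = [v.dtype.kind in "OUS" for v in values]
if any(is_non_numeric) and not all(is_non_numeric):
    # Mixed numeric / non-numeric grids: build the cartesian product into an
    # object-dtype array so neither side is coerced.
    n_rows = int(np.prod([len(v) for v in values]))
    out = np.empty((n_rows, len(values)), dtype=object)
    grid = cartesian(values, out=out)
else:
    # Homogeneous grids: cartesian keeps the dtype of the first array.
    grid = cartesian(values)

print(grid.dtype, grid.shape)  # object (6, 2)

This keeps an all-numeric grid as a plain numeric ndarray, which avoids the memory overhead mentioned earlier in the thread.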
@@ -321,6 +344,11 @@ def partial_dependence(
        `kind='average'` will be the new default. It is intended to migrate
        from the ndarray output to :class:`~sklearn.utils.Bunch` output.

    custom_range: dict
Suggested change:
- custom_range: dict
+ custom_range: dict, default=None
Given the language of "values" in the docstring, maybe custom_values
would be a better name. What do you think?
Sounds good to me!
Done!
    if isinstance(features, (str, int, float, bool)):
        features = [features]
I do not think features
can be a single bool or a float:
Suggested change:
- if isinstance(features, (str, int, float, bool)):
-     features = [features]
+ if isinstance(features, (str, int)):
+     features = [features]
You are right. Will make this change.
Done!
    custom_range = custom_range or {}
    if isinstance(features, (str, int, float, bool)):
        features = [features]
    custom_range = {
To avoid overwriting custom_range while also reading from custom_range:
Suggested change:
- custom_range = {
+ custom_range_idx = {
Done!
    if isinstance(features, (str, int, float, bool)):
        features = [features]
    custom_range = {
        index: custom_range.get(feature)
I do not think this works when feature is a mask. I do not think we officially support masks, given that the docstring does not mention it.

@glemaitre Was mask support intentional?
If it works with a boolean mask, it was not intentional, as you mentioned. It is not tested, at least.
The boolean mask is tested here:

([True, False, True, False], (3, 10, 10)),
Thank you for pointing this out @thomasjpfan!

I can change the API so that custom_range (or custom_values, as we will call it now) is a list of array-likes rather than a dictionary. The implementation will assume (and the docstring will make clear) that the order of values in custom_values corresponds to the order of the features argument. That way custom_values supports all types of the features argument. Does that sound good?

Perhaps we should also file a separate issue to make clear that boolean masks are allowed, given that they are tested and supported but not documented?
> Perhaps we should also file a separate issue to make clear that boolean masks are allowed, given that they are tested and supported but not documented?

In this case, I think we follow the docs and treat the boolean support as unintentional.

> I can change the API so that custom_range (or custom_values as we will call it now) is a list of array-likes rather than a dictionary.

I think an array-like UX would not be great. Imagine having 20 features and wanting to specify the range for feature 10. custom_values would look like:

custom_values = [None, None, ..., None, [0, 1, 2, 3, 4], None, ..., None]
Having a custom_values argument would solve different problems - I just wanted to ask for this feature. Great that it is almost available! I think a dict would be a good interface to pass the evaluation points. The keys would be a subset of features (and of the corresponding type). So, e.g.,
from plotnine.data import diamonds
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.inspection import PartialDependenceDisplay

# Ordered categoricals to be integer encoded in correct order
ord_vars = ["color", "cut", "clarity"]
ord_levels = [
    ['D', 'E', 'F', 'G', 'H', 'I', 'J'],
    ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal'],
    ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF'],
]

# Modeling
model = make_pipeline(
    ColumnTransformer(
        transformers=[
            ("linear", "passthrough", ["carat"]),
            ("ordered", OrdinalEncoder(categories=ord_levels), ord_vars),
        ],
    ),
    RandomForestRegressor(max_features="sqrt", min_samples_leaf=5),
)
model.fit(diamonds, y=diamonds["price"])

# Does not work yet
PartialDependenceDisplay.from_estimator(
    model,
    diamonds,
    n_jobs=8,
    features=["carat"] + ord_vars,
    custom_values=dict(zip(ord_vars, ord_levels)),
)
@thomasjpfan @mayer79 Thank you for the comments and feedback. I will keep the API as is, then, since the boolean mask is not officially supported!
Force-pushed from 9ffac83 to 701c868.
@thomasjpfan I forgot to mention I addressed the latest batch of comments!
Really cool work! Left a question on whether custom_values can default to an empty dictionary.
@@ -58,6 +58,10 @@ def _grid_from_X(X, percentiles, grid_resolution):
        The number of equally spaced points to be placed on the grid for each
        feature.

    custom_values: dict
        Mapping from column index of X to an array-like of values where
        the partial dependence should be calculated for that feature
nit: period here
"Mapping from column index of X" isn't accurate, it's actually mapping the element number from the list of user-provided features. Not sure if this will have downstream effects.
i.e. If user specifies that they want PDP for feature 5 only, the key generated in the partial_dependence
function will assign this feature as 0, not 5.
I think the docstring is accurate. At this point, X has been subset to the features in the features array with _safe_indexing, and the custom_values mapping is modified so that it maps the column index in the subset array to the array of values.
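A small sketch of the remapping described above (variable names are illustrative, not the PR's exact code): the keys of custom_values are translated from the user-facing feature identifiers to column positions in the subset of X that _grid_from_X actually receives.

# User-provided features argument and per-feature custom grid.
features = ["age", "bmi", "bp"]
custom_values = {"bmi": [18.5, 25.0, 30.0]}

# Keys become column positions of the subset X (here "bmi" -> column 1).
custom_values_for_subset = {
    column: custom_values[feature]
    for column, feature in enumerate(features)
    if feature in custom_values
}
print(custom_values_for_subset)  # {1: [18.5, 25.0, 30.0]}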
@@ -36,7 +36,7 @@
]


def _grid_from_X(X, percentiles, grid_resolution):
def _grid_from_X(X, percentiles, grid_resolution, custom_values):
Is it possible to default custom_values to an empty dict? This avoids the case where a value needs to be passed in as an argument when it isn't necessary.
I believe it needs to default to None, then be replaced by an empty dictionary if it's None.
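For illustration, a minimal sketch of that pattern with a hypothetical helper (not the actual scikit-learn signature): a literal {} default would be created once at definition time and shared across calls, so None is used as the sentinel and replaced inside the function.

def _grid_from_X_sketch(X, percentiles, grid_resolution, custom_values=None):
    # Avoid a mutable default argument: substitute an empty dict per call.
    if custom_values is None:
        custom_values = {}
    # ... build one grid axis per column of X, preferring custom_values[i]
    # when the caller supplied it ...
    return custom_values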
This is a great feature you've added to PDP. Look forward to getting it nailed down and released!
            _safe_indexing(X, feature, axis=1), prob=percentiles, axis=0
        )
        if np.allclose(emp_percentiles[0], emp_percentiles[1]):
        if feature in custom_values:
It would make sense to rename feature to feature_idx since it's not the feature name.
custom_values is also index-based, and perhaps the name should reflect that (since custom_values is also an argument in the partial_dependence function, and has a different structure).
I have another question - does this work with … EDIT: A better way to ask this is whether or not anyone has managed to plot the results from this function with either …
Force-pushed from 53b5a83 to d51cbe6.
    return cartesian(values), values
    # Create a place holder for the cartesian product of the individual grids.
    shape = (len(v) for v in values)
    ix = np.indices(shape)
It's probably a bad idea to create an array with a new dimension for every single feature, because this line will fail for >32 features.
Traceback (most recent call last):
File ".../lib/python3.7/site-packages/sklearn/inspection/_partial_dependence.py", line 521, in partial_dependence
custom_values_idx
File ".../lib/python3.7/site-packages/sklearn/inspection/_partial_dependence.py", line 120, in _grid_from_X
ix = np.indices(shape)
File ".../lib/python3.7/site-packages/numpy/core/numeric.py", line 1777, in indices
res = empty((N,)+dimensions, dtype=dtype)
ValueError: maximum supported dimension for an ndarray is 32, found 792
You're correct, but that limitation exists on the current main branch:
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import partial_dependence
import pytest
data = load_digits()
X = data.data
y = data.target
rf = RandomForestClassifier()
rf.fit(X, y)
with pytest.raises(ValueError, match="maximum supported dimension for an ndarray is 32, found 33"):
    partial_dependence(rf, X, features=list(range(32)))
So if we want to support more than 32-way partial dependence, that's a separate ticket.
…bset of features.
Force-pushed from 1f9d38d to 778203a.
@thomasjpfan Can you please give this another look when you get a chance? Thanks!
Thanks for the update!
    if use_custom_values:
        custom_values = {
            "age": custom_values_helper(age, grid_resolution),
            "bmi": custom_values_helper(bmi, grid_resolution),
test_plot_partial_dependence.py is already one of the longer-duration test files, so I would prefer not to parametrize everything with use_custom_values.

I think test_partial_dependence_pipeline_custom_values and test_grid_from_X cover the new code well.
    # numeric/non-numeric features we use object dtype.
    # Else, we use the first dtype in values,
    # which is the default behavior of the cartesian function.
    dtypes = [arr.dtype for arr in values]
I know your original issue is about inputting strings, but I think we can reduce the scope of this PR by supporting numerical custom_values first and then adding object/str support later.

(Reducing the scope makes it easier to review and usually results in merging faster.)
    feature_range = custom_values[feature]
    if not isinstance(feature_range, np.ndarray):
        feature_range = np.array(feature_range)
np.asarray is a no-op when the input is an ndarray:
Suggested change:
- feature_range = custom_values[feature]
- if not isinstance(feature_range, np.ndarray):
-     feature_range = np.array(feature_range)
+ feature_range = np.asarray(custom_values[feature])
(while np.array will always make a copy)
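A quick illustration of that point (assuming NumPy's default copy behavior):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
assert np.asarray(a) is a                     # no copy: the same object comes back
assert np.array(a) is not a                   # np.array copies by default
assert np.asarray([1.0, 2.0, 3.0]) is not a   # a list is still converted to a new array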
"Grid for feature {} is not a one-dimensional array. Got {}" | ||
" dimensions".format(feature, feature_range.ndim) | ||
) | ||
axis = feature_range |
I'd prefer to error-check before the loop, since checking custom_values is more of an input validation. (It also reduces the amount of indented code.)
custom_values = {k: np.asarray(v) for k, v in custom_values.items()}
if any(v.ndim != 1 for v in custom_values.values()):
    raise ValueError(...)

for feature in range(X.shape[1]):
    if feature in custom_values:
        axis = custom_values[feature]
    else:
        ...
In the error message, I think it's enough to say that custom_values contains non-1d data. The feature index does not correspond to the location in the original X anymore, so it can be confusing.
Also, we can add a new test to test_grid_from_X_error to check the custom_values validation.
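A hedged sketch of what such a test could look like. The test name, the keyword-style call, and the matched message pattern are assumptions, and the custom_values parameter of the private _grid_from_X exists only on this PR's branch, not in released scikit-learn:

import numpy as np
import pytest
from sklearn.inspection._partial_dependence import _grid_from_X

def test_grid_from_X_error_custom_values_not_1d():
    X = np.asarray([[1, 2], [3, 4]])
    with pytest.raises(ValueError, match="one-dimensional"):
        _grid_from_X(
            X,
            percentiles=(0.05, 0.95),
            grid_resolution=100,
            custom_values={0: [[1, 2], [3, 4]]},  # 2-d values should be rejected
        )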
    custom_values_idx = {
        index: custom_values.get(feature)
        for index, feature in enumerate(features)
        if feature in custom_values
    }
I think it's fairly unclear that custom_values_idx maps from the sliced X. I think we can make it more clear by doing:
X_subset = _safe_indexing(X, features_indices, axis=1)
custom_values = custom_values or {}
custom_values_for_X_subset = {...}
grid, values = _grid_from_X(X_subset, ...)
@freddyaboulton, @thomasjpfan I am also interested in this feature. There hasn't been any activity on this PR in over a year, and I would be interested in taking it on.
@thomasjpfan can we switch the source to my branch - https://github.com/stephenpardy/scikit-learn/tree/20890-partial-dependence-custom-grid? I talked to Freddy and I will be taking over the PR - I will address the issues brought up in the last comments.
@freddyaboulton Can you confirm that you are okay with @stephenpardy continuing the work on this PR? @stephenpardy It's best to open another PR that references this one and states that it is superseding this PR. The new PR's opening comment should still link back to the original issue and describe its updates.
Hi @thomasjpfan! Yes, @stephenpardy can take over.
Reference Issues/PRs
Fixes #20890

What does this implement/fix? Explain your changes.
This PR allows users to specify a custom_range of values at which to calculate partial dependence for some or all of the features. The API is custom_range={feature: array-like of grid values}.
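For illustration, a hedged usage sketch of the proposed API (custom_range is not part of released scikit-learn at the time of this discussion, and is renamed custom_values during review, so the exact signature may differ):

from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import partial_dependence

X, y = load_diabetes(return_X_y=True)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Evaluate partial dependence for feature 0 on an explicit grid of values
# instead of the default percentile-based grid of size grid_resolution.
result = partial_dependence(
    model, X, features=[0], custom_range={0: [-0.05, 0.0, 0.05]}
)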
Any other comments?