Add custom_range argument for partial dependence by freddyaboulton · Pull Request #21033 · scikit-learn/scikit-learn · GitHub

Add custom_range argument for partial dependence #21033


Conversation

@freddyaboulton (Contributor) commented Sep 13, 2021

Reference Issues/PRs

Fixes #20890

What does this implement/fix? Explain your changes.

This PR allows users to specify a custom_range of values at which to calculate the partial dependence for some or all of the features.

The API is custom_range={feature: array-like of grid values}.
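A minimal sketch of the proposed usage (illustrative only; the estimator, data, and grid values are made up, and this keyword only exists on this branch, not in a released scikit-learn):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import partial_dependence

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
est = RandomForestRegressor(random_state=0).fit(X, y)

# Evaluate the partial dependence of feature 0 on a user-chosen grid instead of
# the default percentile-based grid.
results = partial_dependence(
    est, X, features=[0], custom_range={0: [-2, -1, 0, 1, 2]}
)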

Any other comments?

@freddyaboulton (Contributor Author)

For whoever reviews this PR, where in the changelog should I add this change?

@thomasjpfan (Member)

For whoever reviews this PR, where in the changelog should I add this change?

doc/whats_new/v1.1.rst

@thomasjpfan (Member) left a review comment

Thank you for starting this PR, @freddyaboulton!

I am open to a better name than custom_grid; the dictionary's values are more like ranges.

Comment on lines 375 to 376
the partial dependence should be calculated for that feature. The length
of `custom_grid` must match the length of `features`.
Member

I think the requirement that len(custom_grid) == len(features) is not a great user experience.

I was thinking of extending _grid_from_X to support custom_grid: if custom_grid is provided for a given feature, do minimal validation and add it to values.
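Something along these lines, perhaps; a rough, self-contained sketch of the idea rather than the merged implementation (the function name and the simplified percentile logic are placeholders):

import numpy as np

def _grid_from_X_sketch(X, percentiles, grid_resolution, custom_grid):
    # Prefer a user-supplied grid for a feature when one was given; otherwise
    # fall back to the usual percentile-based axis.
    values = []
    for feature in range(X.shape[1]):
        if feature in custom_grid:
            axis = np.asarray(custom_grid[feature])  # minimal validation only
        else:
            col = X[:, feature]
            low, high = np.percentile(col, [p * 100 for p in percentiles])
            axis = np.linspace(low, high, num=grid_resolution)
        values.append(axis)
    return values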

@freddyaboulton freddyaboulton changed the title Add custom_grid argument for partial dependence Add custom_range argument for partial dependence Sep 15, 2021
@freddyaboulton (Contributor Author)

@thomasjpfan Thank you for the feedback! I changed the name from custom_grid to custom_range and changed the implementation so that custom_range can include a subset of the features.

@freddyaboulton freddyaboulton force-pushed the 20890-partial-dependence-custom-grid branch from 2f96fc0 to f2195ce Compare September 17, 2021 16:56
@thomasjpfan (Member) left a review comment

Thank you for the follow-up! I made another pass through the PR.

values.append(axis)

return cartesian(values), values
# Store cartesian in grid of dtype=object to support grids of str/numeric values
shape = (len(v) for v in values)
Member

This would increase memory overhead for a grid that is all numerical. We should only use object dtype when it is necessary; otherwise numerical ndarrays are good enough.

Contributor Author

Done! If there are mixed types (numeric/non-numeric), we use object dtype; otherwise we use the first dtype, which is the default behavior of cartesian.
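A rough illustration of that dtype-selection logic (a sketch under the description above, not the exact diff):

import numpy as np

def _grid_dtype(values):
    # Use object dtype only when the per-feature grids mix numeric and
    # non-numeric values; otherwise keep the first array's dtype, matching
    # the default behavior of sklearn.utils.extmath.cartesian.
    dtypes = [np.asarray(v).dtype for v in values]
    any_non_numeric = any(dt.kind in "OUS" for dt in dtypes)
    any_numeric = any(dt.kind not in "OUS" for dt in dtypes)
    if any_non_numeric and any_numeric:
        return object
    return dtypes[0]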

@@ -321,6 +344,11 @@ def partial_dependence(
`kind='average'` will be the new default. It is intended to migrate
from the ndarray output to :class:`~sklearn.utils.Bunch` output.

custom_range: dict
Member

Suggested change
custom_range: dict
custom_range: dict, default=None

Given the language of "values" in the docstring, maybe custom_values would be a better name. What do you think?

Contributor Author

Sounds good to me!

Contributor Author

Done!

Comment on lines 521 to 497
if isinstance(features, (str, int, float, bool)):
features = [features]
Member

I do not think features can be a single bool or a float:

Suggested change
if isinstance(features, (str, int, float, bool)):
features = [features]
if isinstance(features, (str, int)):
features = [features]

Contributor Author

You are right. Will make this change.

Contributor Author

Done!

custom_range = custom_range or {}
if isinstance(features, (str, int, float, bool)):
features = [features]
custom_range = {
Member

To avoid overwriting custom_range while also reading from it:

Suggested change
custom_range = {
custom_range_idx = {

Contributor Author

Done!

if isinstance(features, (str, int, float, bool)):
features = [features]
custom_range = {
index: custom_range.get(feature)
Member

I do not think this works when feature is a mask. I do not think we officially support masks, given that the docstring does not mention them.

@glemaitre Was mask support intentional?

Member

If it works with a boolean mask, it was not intentional, as you mentioned. At the very least it is not tested.

Member

boolean mask is tested here:

([True, False, True, False], (3, 10, 10)),

@freddyaboulton (Contributor Author) commented Oct 18, 2021

Thank you for pointing this out, @thomasjpfan!

I can change the API so that custom_range (or custom_values, as we will call it now) is a list of array-likes rather than a dictionary. The implementation will assume (and the docstring will make clear) that the order of values in custom_values corresponds to the order of the features argument.

That way custom_values supports every form of the features argument. Does that sound good?

Perhaps we should also file a separate issue to make clear that boolean masks are allowed, given that they are tested and supported but not documented?

@thomasjpfan (Member) commented Oct 21, 2021

Perhaps we should also file a separate issue to make clear that boolean masks are allowed, given that they are tested and supported but not documented?

In this case, I think we should follow the docs; the boolean mask support was unintentional.

I can change the API so that custom_range (or custom_values, as we will call it now) is a list of array-likes rather than a dictionary.

I think an array-like UX would not be great. Imagine having 20 features and wanting to specify the range for feature 10 only. custom_values would look like:

custom_values = [None, None, ..., None, [0, 1, 2, 3, 4], None, ..., None]

Contributor

Having a custom_values argument would solve several problems; I was just about to ask for this feature. Great that it is almost available! I think a dict would be a good interface for passing the evaluation points. The keys would be a subset of features (and of the corresponding type).

So, e.g.,

from plotnine.data import diamonds

from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.inspection import PartialDependenceDisplay

# Ordered categoricals to be integer encoded in correct order
ord_vars = ["color", "cut", "clarity"]
ord_levels = [
    ['D', 'E', 'F', 'G', 'H', 'I', 'J'],
    ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal'],
    ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']
]

# Modeling
model = make_pipeline(
    ColumnTransformer(
        transformers=[
            ("linear", "passthrough", ["carat"]),
            ("ordered", OrdinalEncoder(categories=ord_levels), ord_vars)
        ],
    ),
    RandomForestRegressor(max_features="sqrt", min_samples_leaf=5)
)

model.fit(diamonds, y=diamonds["price"]);

# Does not work yet
PartialDependenceDisplay.from_estimator(
    model,
    diamonds, 
    n_jobs=8,
    features=["carat"] + ord_vars,
    custom_values=dict(zip(ord_vars, ord_levels))
)

Contributor Author

@thomasjpfan @mayer79 Thank you for the comments and feedback. I will keep the API as-is then, since the boolean mask is not officially supported!

@freddyaboulton freddyaboulton force-pushed the 20890-partial-dependence-custom-grid branch 2 times, most recently from 9ffac83 to 701c868 Compare October 27, 2021 16:04
@freddyaboulton (Contributor Author)

@thomasjpfan I forgot to mention I addressed the latest batch of comments!

@bchen1116 (Contributor) left a review comment

Really cool work! I left a question on whether custom_values can default to an empty dictionary.

@@ -58,6 +58,10 @@ def _grid_from_X(X, percentiles, grid_resolution):
The number of equally spaced points to be placed on the grid for each
feature.

custom_values: dict
Mapping from column index of X to an array-like of values where
the partial dependence should be calculated for that feature
Contributor

nit: period here

Contributor

"Mapping from column index of X" isn't accurate, it's actually mapping the element number from the list of user-provided features. Not sure if this will have downstream effects.
i.e. If user specifies that they want PDP for feature 5 only, the key generated in the partial_dependence function will assign this feature as 0, not 5.

Contributor Author

I think the docstring is accurate. At this point, X has already been subset to the columns in the features argument with _safe_indexing, and the custom_values dict has been remapped so that its keys are column indices into that subset array.
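For concreteness, a small sketch of that remapping (the feature names and values are illustrative; the dict comprehension mirrors the one in the diff below):

# The user-facing dict is keyed by feature identifiers; after _safe_indexing
# the keys are remapped to column positions within the subset X.
features = ["carat", "age"]                 # user-provided features argument
custom_values = {"age": [20, 30, 40, 50]}   # user-provided custom values

custom_values_idx = {
    index: custom_values.get(feature)
    for index, feature in enumerate(features)
    if feature in custom_values
}
# custom_values_idx == {1: [20, 30, 40, 50]}: "age" is column 1 of the subset X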

@@ -36,7 +36,7 @@
]


def _grid_from_X(X, percentiles, grid_resolution):
def _grid_from_X(X, percentiles, grid_resolution, custom_values):
Contributor

Is it possible to default custom_values to an empty dict? That avoids having to pass a value for this argument when it isn't necessary.

Contributor

I believe it needs to default to None, then be replaced by an empty dictionary if it's None.
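That is, something like the following (a sketch of the pattern under discussion, not the exact diff):

def _grid_from_X(X, percentiles, grid_resolution, custom_values=None):
    # A None default avoids the mutable-default-argument pitfall; an explicit
    # empty dict then means "no custom values were supplied".
    if custom_values is None:
        custom_values = {}
    ...  # rest of the existing grid-building logic unchanged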

@InterferencePattern (Contributor) left a review comment

This is a great feature you've added to PDP. Looking forward to getting it nailed down and released!

_safe_indexing(X, feature, axis=1), prob=percentiles, axis=0
)
if np.allclose(emp_percentiles[0], emp_percentiles[1]):
if feature in custom_values:
Contributor

It would make sense to rename feature to feature_idx since it's not the feature name.

Contributor

custom_values is also index-based here, and perhaps the name should reflect that (custom_values is also an argument of the partial_dependence function, where it has a different structure).


@InterferencePattern (Contributor) commented Jan 31, 2022

I have another question: does this work with the PartialDependenceDisplay object's required deciles attribute?

EDIT: A better way to ask this is whether anyone has managed to plot the results from this function with either plot_partial_dependence or PartialDependenceDisplay.

return cartesian(values), values
# Create a place holder for the cartesian product of the individual grids.
shape = (len(v) for v in values)
ix = np.indices(shape)
Contributor

It's probably a bad idea to create an array with a new dimension for every single feature, because this line will fail for >32 features.

Traceback (most recent call last):
  File ".../lib/python3.7/site-packages/sklearn/inspection/_partial_dependence.py", line 521, in partial_dependence
    custom_values_idx
  File ".../lib/python3.7/site-packages/sklearn/inspection/_partial_dependence.py", line 120, in _grid_from_X
    ix = np.indices(shape)
  File ".../lib/python3.7/site-packages/numpy/core/numeric.py", line 1777, in indices
    res = empty((N,)+dimensions, dtype=dtype)
ValueError: maximum supported dimension for an ndarray is 32, found 792

Contributor Author

You're correct, but that limitation already exists on the current main branch:

from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import partial_dependence
import pytest

data = load_digits()
X = data.data
y = data.target

rf = RandomForestClassifier()
rf.fit(X, y)

with pytest.raises(ValueError, match="maximum supported dimension for an ndarray is 32, found 33"):
    partial_dependence(rf, X, features=list(range(32)))

So if we want to support more than 32-way partial dependence, that's a separate ticket.

@freddyaboulton freddyaboulton force-pushed the 20890-partial-dependence-custom-grid branch from 1f9d38d to 778203a Compare February 18, 2022 19:37
@freddyaboulton (Contributor Author)

@thomasjpfan Can you please give this another look when you get a chance? Thanks!

@thomasjpfan (Member) left a review comment

Thanks for the update!

if use_custom_values:
custom_values = {
"age": custom_values_helper(age, grid_resolution),
"bmi": custom_values_helper(bmi, grid_resolution),
Member

test_plot_partial_dependence.py is already one of the longer-running test files; I would prefer not to parametrize everything with use_custom_values.

I think test_partial_dependence_pipeline_custom_values and test_grid_from_X cover the new code well.

# numeric/non-numeric features we use object dtype.
# Else, we use the first dtype in values,
# which is the default behavior of the cartesian function.
dtypes = [arr.dtype for arr in values]
Member

I know your original issue is about inputting strings, but I think we can reduce the scope of this PR by supporting numerical custom_values first and then add in object/str support later.

(Reducing the scope makes it easier to review and usually results in merging faster)

Comment on lines +89 to +91
feature_range = custom_values[feature]
if not isinstance(feature_range, np.ndarray):
feature_range = np.array(feature_range)
Member

np.asarray is a no-op when the input is already an ndarray:

Suggested change
feature_range = custom_values[feature]
if not isinstance(feature_range, np.ndarray):
feature_range = np.array(feature_range)
feature_range = np.asarray(custom_values[feature])

(while np.array will always make a copy)
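A quick illustration of the difference:

import numpy as np

a = np.arange(3)
assert np.asarray(a) is a      # no copy when the input is already an ndarray
assert np.array(a) is not a    # np.array copies by default (copy=True)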

"Grid for feature {} is not a one-dimensional array. Got {}"
" dimensions".format(feature, feature_range.ndim)
)
axis = feature_range
Member

I'd prefer to do the error checking before the loop, since checking custom_values is more of an input validation (it also reduces the amount of indented code):

custom_values = {k: np.asarray(v) for k, v in custom_values.items()}
if any(v.ndim != 1 for v in custom_values.values()):
    raise ValueError(...)

for feature in range(X.shape[1]):
    if feature in custom_values:
        axis = custom_values[feature]
    else:
        ...

In the error message, I think it's enough to say that custom_values contains non-1d data. The feature index no longer corresponds to the column's location in the original X, so including it can be confusing.

Also, we can add a new test to test_grid_from_X_error to check the custom_values validation
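For example, a test along these lines (a hypothetical sketch: the test name is made up, and the positional call and error message assume the _grid_from_X signature and message shown earlier in this diff, not the merged code):

import numpy as np
import pytest
from sklearn.inspection._partial_dependence import _grid_from_X


def test_grid_from_X_error_custom_values_not_1d():
    X = np.arange(10).reshape(5, 2)
    # A 2d grid for feature 0 should be rejected by the validation above.
    custom_values = {0: np.arange(4).reshape(2, 2)}
    with pytest.raises(ValueError, match="not a one-dimensional"):
        _grid_from_X(X, (0.05, 0.95), 100, custom_values)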

Comment on lines +504 to +508
custom_values_idx = {
index: custom_values.get(feature)
for index, feature in enumerate(features)
if feature in custom_values
}
Member

I think it's fairly unclear that custom_values_idx maps from the sliced X. I think we can make it clearer by doing:

X_subset = _safe_indexing(X, features_indices, axis=1)

custom_values = custom_values or {}
custom_values_for_X_subset = {...}
grid, values = _grid_from_X(X_subset, ...)

@stephenpardy (Contributor)

@freddyaboulton, @thomasjpfan I am also interested in this feature. There hasn't been any activity on this PR in over a year and I would be interested in taking it on.

@stephenpardy (Contributor)

@thomasjpfan can we switch the source to my branch, https://github.com/stephenpardy/scikit-learn/tree/20890-partial-dependence-custom-grid? I talked to Freddy and I will be taking over the PR; I will address the issues brought up in the last comments.

@thomasjpfan (Member)

@freddyaboulton Can you confirm that you are okay with @stephenpardy continuing the work on this PR?

@stephenpardy It's best to open another PR that references this one and states that it is superseding this PR. The new PR's opening commit should still link back to the original issue and describe its updates.

@freddyaboulton (Contributor Author)

Hi @thomasjpfan! Yes, @stephenpardy can take over.

Successfully merging this pull request may close these issues: Allow users to pass their own grid to partial dependence.