RFC ColumnTransformer input validation and requirements #14251
Comments
Actually, whether adding a column works depends on how the columns were specified:

import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    'boro': ['Manhattan', 'Queens', 'Manhattan', 'Brooklyn', 'Brooklyn', 'Bronx'],
    'salary': [103, 89, 142, 54, 63, 219],
    'vegan': ['No', 'No', 'No', 'Yes', 'Yes', 'No']})

categorical = df.dtypes == object
preprocess = make_column_transformer(
    (StandardScaler(), ~categorical),
    (OneHotEncoder(), categorical))
preprocess.fit_transform(df)

df2 = df.copy()
df2['notused'] = 1
preprocess.transform(df2)
And reordering the columns also passes through the ColumnTransformer, but what is passed on is wrong:

preprocess.transform(df.loc[:, ::-1])

But with column names:

categorical = ['boro', 'vegan']
continuous = ['salary']
preprocess = make_column_transformer(
    (StandardScaler(), continuous),
    (OneHotEncoder(), categorical))
preprocess.fit_transform(df)

df['notused'] = 1
preprocess.transform(df)

works, and reordering columns also works:

preprocess.transform(df.loc[:, ::-1])
Similarly, reordering columns should work if you use names for indexing, but not if you use boolean masks. That's ... interesting ... behavior, I would say.
Four approaches I could think of:
As I said above, if
From a user with an (admittedly ancient) usability background who very recently tripped over their own assumptions on named columns and their implications:
From this very subjective list, I would distill that:
My main wish would be for the whole Pipeline/Transformer/Estimator API to be as consistent as possible in the choice of which of the rules that @amueller laid out above should be in effect. Option number 1 seems to match this the closest. I don't quite understand the part "...inconvenient as it doesn't really allow subsetting columns flexibly", however. Isn't it this flexibility (between fit and transform, as I understand it) which only causes problems with other transformers? I can't see a practical use case for the flexibility to have more/fewer columns between fit and transform. Reading my own ramblings (stream of consciousness indeed), I would not see a lot of harm for the end user in deprecating the ability to specify columns by name and only allowing numeric indices/slices, because:
The documentation could then refer users seeking more convenience to the sklearn-pandas package. So this is more or less a 180-degree change from my initial suggestion, but I found myself using a lot of
Thank you for your input. I agree with most of your assessment, though not entirely with your conclusion. Ideally we'd get as much consistency and convenience as possible. Adding extra columns during transform indeed would not be possible with any other transformer and would be a bad idea. However, if you think of ColumnTransformer going from "whatever was in the database / CSV" to something that's structured for scikit-learn, it makes sense to allow dropping columns from the test set that were not in the training set. Often the training set is collected in a different way than the test set, and there might be extra columns in the test set that are not relevant to the model. I'm not sure why you wouldn't allow using strings for indexing. We could allow strings for indexing and still require the order to be fixed. This might not correspond entirely to your mental model (or mine), but I also don't see any real downside to allowing that if we test for consistency and have a good error message. Generally I think it's desirable that the behavior is as consistent as possible between the different ways to specify columns, which is not what's currently the case.
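For concreteness, here is a minimal sketch of the kind of consistency check and error message being discussed: extra columns are tolerated, but the fit-time columns must all be present and in the fit-time order. This is not scikit-learn's actual implementation, and `check_column_consistency` is a hypothetical helper.

```python
import pandas as pd


def check_column_consistency(fitted_columns, X):
    """Allow name-based lookup and extra columns, but keep the fit-time order."""
    fitted = list(fitted_columns)
    current = list(X.columns)
    missing = set(fitted) - set(current)
    if missing:
        raise ValueError(f"Columns seen during fit are missing: {sorted(missing)}")
    if [c for c in current if c in set(fitted)] != fitted:
        raise ValueError(
            "Columns used during fit appear in a different order: "
            f"expected {fitted}, got {current}")


X_fit = pd.DataFrame({"a": [1], "b": [2]})
X_new = X_fit[["b", "a"]]              # same columns, different order
try:
    check_column_consistency(X_fit.columns, X_new)
except ValueError as exc:
    print(exc)
```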
Also: users will be annoyed if we forbid things that were allowed previously "just because" ;)
Good points. I did not really mean that clarity and convenience were a zero-sum trade-off to me (clarity begets convenience; the other way round... not so sure). If ColumnTransformer wants to be used by users preparing raw data for scikit-learn ("pipeline ingestion adaptor"), as well as snugly inside a pipeline ("good pipeline citizen"), maybe it tries to be different things to different people? Not saying that it should be split up or anything, but this thought somehow stuck with me after re-reading the issue(s). Maybe some sort of

Regarding not allowing strings for indexing:

Regarding taking something away from users (relevant xkcd - of course!). True, but it's still so new that they might not have used it yet... ;)
I am happy with 1 as an interim solution at least. I would also consider allowing appended columns in a DataFrame. I would also consider 3 for the future, but as you point out we would be making a new requirement that you need to transform a DataFrame if you fit a DataFrame.
I quite like option one, especially since we can first warn the user about the change in the order and say we won't accept it soon. Option 3 worries me because it's an implicit behavior whose implications the user might not understand. I understand @amueller's point on the difference between test and train sets, but I'd rather leave input validation to the user and not enter that realm in sklearn. That said, I'd still half-heartedly be okay with option 2.
I'm fine with 1) if we do a deprecation cycle. I'm not sure it's convenient for users. @adrinjalali I'm not sure I understand what you mean by input validation.
Should we try and implement this now that the release is out? @adrinjalali do you want to take this? @NicolasHug? I think clarifying this will be useful on the way to feature names.
I can take this @amueller
So this creates a silent wrong behaviour:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
ct = ColumnTransformer([('sc', sc, [-1])])

X = [[1, 2],
     [3, 4]]
ct.fit(X)
ct.transform(X)   # passes [2, 4], OK

X = [[1, 2, 5],
     [3, 4, 6]]
ct.transform(X)   # passes [5, 6], bad

I'd go simple and error if
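A sketch of the simple check suggested here, assuming the transformer remembers how many features it saw during fit; the helper below is hypothetical, not the actual ColumnTransformer code:

```python
def check_n_features(n_features_fit, X):
    # With plain integer (especially negative) indices, a changed number of
    # columns silently selects different data, so just refuse it.
    n_features = len(X[0])
    if n_features != n_features_fit:
        raise ValueError(
            f"X has {n_features} features, but this transformer was fitted "
            f"with {n_features_fit} features.")


check_n_features(2, [[1, 2], [3, 4]])            # OK
try:
    check_n_features(2, [[1, 2, 5], [3, 4, 6]])  # the silent case above
except ValueError as exc:
    print(exc)
```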
I keep running into the fact that with
The use case is either,
Filtered columns on train now getting,
It is quite annoying, because it means that the promise that a pipeline is an end-to-end solution between raw data and predictions doesn't hold. Now you need to pre-filter/order columns in a consistent way both during train and predict. Re-opening this.
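As an illustration of the manual step this currently forces on users (a sketch with made-up frames, not anything built into scikit-learn):

```python
import pandas as pd

df_train = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
# At prediction time the data arrives with an extra column and another order.
df_new = pd.DataFrame({"b": [7], "extra": [0], "a": [8], "c": [9]})

# The pre-filtering/ordering that has to happen outside the pipeline today:
df_new_aligned = df_new[df_train.columns]
print(list(df_new_aligned.columns))  # ['a', 'b', 'c']
```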
BTW, some of the complexity is due to the fact that we are supporting both positional and name-based indexing for dataframes. One could argue that ColumnTransformer should only support slices via getitem, which would correspond to positional indexing for ndarrays and name-based indexing for dataframes. Though this would certainly raise other concerns.
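To spell out the two indexing semantics mentioned above (a small illustration, not a proposal for the actual API):

```python
import numpy as np
import pandas as pd

X_arr = np.array([[1, 2], [3, 4]])
X_df = pd.DataFrame(X_arr, columns=["a", "b"])

print(X_arr[:, [0]])   # ndarray: getitem-style slicing is positional
print(X_df[["a"]])     # DataFrame: getitem-style slicing is name-based
```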
I think I definitely wouldn't want to do that in
Do you mean name-based columns only for dataframes, or handling remainder="drop" better with named columns? For the former, it was just a thought, I'm not particularly attached to it.
I mean, I'm happy with |
OK, take a simple example:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv('some_data.txt')
df_train, df_test = train_test_split(df)

enc = ColumnTransformer([
    ('numeric', StandardScaler(), ['a', 'b', 'c']),
    ('cat', OneHotEncoder(), ['e', 'f'])
], remainder='drop')

pipe = make_pipeline(enc, LogisticRegression())
pipe.fit(df_train, df_train['target'])
y_pred = pipe.predict(df_test)

So you would say: OK, I have a model that works, let's serialize it and make predictions on new data. Except no, that won't work because new data doesn't have the

Now as a user I don't understand why I would have to do something like

column_selector = ColumnSelector(select=['a', 'b', 'c', 'e', 'f'])
pipe = make_pipeline(column_selector, enc, LogisticRegression())

and keep the list of columns in sync, particularly if it means installing another package (scikit-learn-contrib). It's just a very simple use case, and in real-world situations the dataframe predictions are made on and the dataframe used for training can have slightly different columns and order (maybe there is just some other extra column used by another system). As long as access is done based on names, I don't see why the order needs to match exactly. I mean, I understand the implementation constraints due to supporting other types of slicing, but as a user I find this very unfortunate.
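For reference, the pre-selection step described above can also be written without an extra package, e.g. with a FunctionTransformer. This is only a sketch, and the column list still has to be kept in sync by hand, which is exactly the complaint being made:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

columns = ["a", "b"]  # must be kept in sync with the ColumnTransformer below
select = FunctionTransformer(lambda df: df[columns], validate=False)
enc = make_column_transformer((StandardScaler(), ["a", "b"]))

pipe = make_pipeline(select, enc)
df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0], "extra": [0, 0]})
print(pipe.fit_transform(df).shape)  # (2, 2): 'extra' is dropped up front
```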
I'm saying the user should take care of what the data exactly looks like when they feed it to an sklearn pipeline. I just don't think it's in the scope of sklearn to handle random and varying input shapes and columns, and I think doing so would result in silent bugs. I would very much rather have users write something like:

df = pd.read_csv('some_data.txt')
columns = get_persisted_relevant_columns()
df = df[columns]
...

I think of not doing so as bad practice, and I'd discourage people from having such patterns, since it makes their code fail with odd errors if the input changes.
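One concrete way to realize the `get_persisted_relevant_columns()` pattern sketched above (an illustration only; the file name and layout are made up, not an sklearn API):

```python
import json

# At training time: store the columns the pipeline was built for.
columns = ["a", "b", "c", "e", "f"]
with open("model_columns.json", "w") as fh:
    json.dump(columns, fh)

# At prediction time: reload them and subset the incoming frame
# before handing it to the pipeline.
with open("model_columns.json") as fh:
    columns = json.load(fh)
```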
I disagree. For me that's exactly what
For other estimators I agree. In my mind ColumnTransformer is the only one that should correctly handle named columns irrespective of order or unused columns. What kind of bugs do you foresee? If the column is there, it would make the calculation; if it isn't, it would fail. I get that things are getting a bit more nuanced for integer indexing or boolean mask indexing, but in that case the current situation is fine. I imagine most users still apply ColumnTransformer with named columns. Let's see what others think.
It seems like ColumnTransformer is different things to different people! I don't mind it having a bit more flexibility in input, if named columns are always used.
From a user point of view, I think having more flexibility with named columns would be nice.
Flexibility for named columns would indeed be extremely useful. I would expect an error only if I'm trying to transform X that is missing column(s) that were present during fit. For example, I would like to have several versions of a predictive model. Over time I will make the model "better" (hopefully) by adding new features in. Enforcing strict column order or size means this is not possible, but also means I need to pre-process the input myself and keep a separate mapping of fields used for each model over time?
Not when we have a
I was also bitten by this. I am also of the opinion that if a column is not used at fit time (i.e. it's not part of any selection and remainder is set to
I guess my worry is the amount of complexity we're adding to
My worry is that enforcing column order in
The problem with a possible
In my view, column names are an API:
I agree that a column selection API that requires columns to have both the right name and an exact position is overly constraining and not very usable. @glemaitre @jeremiedbb @thomasjpfan You approved #18256 at the time. Could you please comment on the above discussion? Thanks!
Discussion was resolved in https://github.com/scikit-learn/administrative/blob/master/monthly_meetings/2021-01-25.md. |
There have been some issues around ColumnTransformer input requirements that I think we might want to discuss more explicitly. Examples are an actual bug when changing columns (#14237) and the question of how to define the number of input features (#13603).
Related is also the idea of checking for column name consistency: #7242
Mainly I'm asking whether it makes sense to have the same requirements for ColumnTransformer as for other estimators.
ColumnTransformer is the only estimator that addresses columns by name, and so reordering columns doesn't impact the model and actually everything works fine. Should we still aim for checking for reordering of columns?
Right now, transform even works if we add additional columns to X, which makes sense, since it only cares about the columns it is using. Should we allow that going forward as long as remainder is not used? (If remainder is used, the result of ColumnTransformer would have a different shape and downstream estimators would break, so I think we shouldn't support it in this case.)

In either case, what's n_features_in_ for a ColumnTransformer? The number of columns in X during fit, or the number of columns actually used? Does it even make sense to define it if we allow adding other columns that are ignored?
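A small sketch of the situation the question is about; what the attribute should report is exactly what's being asked, so the code only sets the scene:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler

# X has three columns during fit, but the transformer only consumes two of them.
X = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0], "unused": [0, 1]})
ct = make_column_transformer((StandardScaler(), ["a", "b"]), remainder="drop")
ct.fit(X)
# Should n_features_in_ be 3 (columns seen during fit) or 2 (columns actually used)?
```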