RFC ColumnTransformer input validation and requirements #14251

Closed
amueller opened this issue Jul 3, 2019 · 32 comments · Fixed by #14544 or #19263

@amueller (Member) commented Jul 3, 2019

There have been some issues around ColumnTransformer input requirements that I think we might want to discuss more explicitly. Examples are an actual bug when changing columns (#14237) and how to define the number of input features (#13603).
Also related is the idea of checking for column name consistency (#7242).

Mainly I'm asking whether it makes sense to have the same requirements for ColumnTransformer as for other estimators.

ColumnTransformer is the only estimator that addresses columns by name, and so reordering columns doesn't impact the model and actually everything works fine. Should we still aim for checking for reordering of columns?

Right now, transform even works if we add additional columns to X, which makes sense, since it only cares about the columns it is using.

Should we allow that going forward, as long as remainder is not used? (If remainder is used, the result of ColumnTransformer would have a different shape and downstream estimators would break, so I think we shouldn't support it in that case; a sketch follows at the end of this comment.)

In either case, what's n_features_in_ for a ColumnTransformer? The number of columns in X during fit or the number of columns actually used?
Does it even make sense to define it if we allow adding other columns that are ignored?
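
For concreteness, a minimal sketch (a toy example of mine, not from this thread) of why remainder makes extra columns problematic:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0]})
ct = ColumnTransformer([('sc', StandardScaler(), ['a'])],
                       remainder='passthrough')
ct.fit_transform(df).shape  # (2, 2): scaled 'a' plus passed-through 'b'

# If an extra column were accepted at transform time, remainder='passthrough'
# would emit shape (2, 3), and a downstream estimator fitted on 2 features
# would break.
df_extra = df.assign(c=[5.0, 6.0])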

@amueller (Member, Author) commented Jul 3, 2019

Actually, whether adding a column works depends on how the columns were specified:

import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
  'boro': ['Manhattan', 'Queens', 'Manhattan', 'Brooklyn', 'Brooklyn', 'Bronx'],
  'salary': [103, 89, 142, 54, 63, 219],
  'vegan': ['No', 'No', 'No', 'Yes', 'Yes', 'No']})

categorical = df.dtypes == object
preprocess = make_column_transformer(
    (StandardScaler(), ~categorical),
    (OneHotEncoder(), categorical))
preprocess.fit_transform(df)
df2 = df.copy()
df2['notused'] = 1
preprocess.transform(df2)  # raises IndexingError

And reordering the columns:

preprocess.transform(df.loc[:, ::-1])

passes through the ColumnTransformer but what is passed on is wrong.

but

categorical = ['boro', 'vegan']
continuous = ['salary']
preprocess = make_column_transformer(
    (StandardScaler(), continuous),
    (OneHotEncoder(), categorical))
preprocess.fit_transform(df)
df['notused'] = 1
preprocess.transform(df)

works, and reordering columns also works:

preprocess.transform(df.loc[:, ::-1])

@amueller (Member, Author) commented Jul 3, 2019

Similarly, reordering columns should work if you use names for indexing, but not if you use boolean masks. That's ... interesting ... behavior, I would say.
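
A minimal sketch of that silent misbehavior (my own toy example, using the semantics at the time of this issue, where boolean masks resolve positionally):

import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'salary': [103, 89, 142], 'age': [30, 40, 50]})
# A boolean mask is resolved by position, not by name.
preprocess = make_column_transformer((StandardScaler(), [True, False]))
preprocess.fit(df)  # intends to scale 'salary' (position 0)

# Reordering is not caught (at the time of this issue): position 0 is now
# 'age', so the wrong column is silently scaled.
preprocess.transform(df[['age', 'salary']])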

@amueller (Member, Author) commented Jul 3, 2019

Four approaches I could think of:

  1. Be strict and require that columns match exactly (with order). That would break code that currently works, but it might also surface some errors that are currently hidden. It might be a bit inconvenient, as it doesn't really allow subsetting columns flexibly.

  2. Be strict and require that columns match exactly unless they are specified by names. That would keep currently working use-cases working, but error on things that are currently silent bugs.

  3. If the input is a pandas dataframe, always use the column names (from fit) to store the column identities, no matter how the user provided them originally, and allow reordering of columns and adding columns that are not used (sketched at the end of this comment). Downside: behavior might differ between when a numpy array is passed and when a dataframe is passed. I would need to think about this more.

  4. Keep things the way they are, i.e. silent bugs when reordering with boolean or integer indexing, errors when adding columns with boolean or integer indexing, and things working as expected with string names.

As I said above, if remainder is used we probably shouldn't allow adding extra columns at transform time (though we could also just ignore all columns not present during fit).
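
For illustration, a rough sketch of what option 3 could look like internally (a hypothetical helper, not scikit-learn code): resolve whatever spec the user passed to column names at fit time, then re-select by name at transform time.

import numpy as np
import pandas as pd

def resolve_to_names(df, columns):
    """Resolve a user column spec (names, ints, or bool mask) to names."""
    columns = np.asarray(columns)
    if columns.dtype == bool:                      # boolean mask -> positions
        return list(df.columns[columns])
    if np.issubdtype(columns.dtype, np.integer):   # integer positions
        return list(df.columns[columns])
    return list(columns)                           # already names

df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})
resolve_to_names(df, [True, False, True])  # ['a', 'c']
resolve_to_names(df, [0, 2])               # ['a', 'c']
resolve_to_names(df, ['b'])                # ['b']
# transform could then do df_new[fit_column_names]: order-insensitive lookup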

@schuderer (Contributor) commented Jul 3, 2019

From a user with an (admittedly ancient) usability background who very recently tripped over their own assumptions on named columns and their implications:

  • I have/had a strong association of by-name implying flexible-ordering (and vice-versa)
  • At the same time, I was oblivious of the fact that this implies different rules for how columns specified by name vs by index were handled, particularly with remainder
  • I have a hard time remembering what works with one type of usage of the same parameter versus another, particularly in combination with other parameters

From this very subjective list, I would distill that:

  • clarity and consistency beat convenience

My main wish would be for the whole Pipeline/Transformer/Estimator API to be as consistent as possible in the choice of which of the rules that @amueller laid out above should be in effect. Option 1 seems to match this the closest. I don't quite understand the part "...inconvenient as it doesn't really allow subsetting columns flexibly", however. Isn't this flexibility (between fit and transform, as I understand it) exactly what causes problems with other transformers? I can't see a practical use case for the flexibility to have more/fewer columns between fit and transform.

Reading my own ramblings (stream of consciousness indeed), I would not see a lot of harm for the end user in deprecating the ability to specify columns by name and only allow numeric indices/slices, because:

  • If the user starts out with column names, it's easy for them to create indices to pass to ColumnTransformer
  • Numeric indices intuitively imply fixed ordering, which I understand is a standard for other transformers anyway

The documentation could then refer users seeking more convenience to the contrib package sklearn-pandas.

So this is more or less a 180-degree change from my initial suggestion, but I found myself using a lot of ifs even when explaining the problem in the issue and pull request, which made me aware that it might be possible to reduce complexity (externally and internally) a little bit.

@amueller (Member, Author) commented Jul 3, 2019

Thank you for your input. I agree with most of your assessment, though not entirely with your conclusion. Ideally we'd get as much consistency and convenience as possible.
I also have a hard time wrapping my head around the current behavior, which is clearly not a great thing.

Adding extra columns during transform would indeed not be possible with any other transformer, and would be a bad idea there. However, if you think of ColumnTransformer as going from "whatever was in the database / CSV" to something that's structured for scikit-learn, it makes sense to allow dropping columns from the test set that were not in the training set. Often the training set is collected in a different way than the test set, and there might be extra columns in the test set that are not relevant to the model.
Clearly this could easily be fixed by dropping those before passing them into the ColumnTransformer, so it's not that big a deal.

I'm not sure why you wouldn't allow using strings for indexing. We could allow strings for indexing and still require the order to be fixed. This might not correspond entirely to your mental model (or mine), but I also don't see any real downside to allowing that if we test for consistency and have a good error message (see the sketch below).

Generally I think it's desirable that the behavior is as consistent as possible between the different ways to specify columns, which is not what's currently the case.
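
A sketch of what such a consistency check with a helpful message might look like (hypothetical code, not scikit-learn API):

import pandas as pd

def check_column_consistency(fit_columns, X):
    """Require the same columns in the same order as seen during fit."""
    seen = list(X.columns)
    if seen != list(fit_columns):
        raise ValueError(
            f"The columns of X differ from those seen during fit. "
            f"Fit columns: {list(fit_columns)}; got: {seen}. "
            f"Reorder or subset X to match the fitted columns.")

df = pd.DataFrame({'a': [1], 'b': [2]})
check_column_consistency(['a', 'b'], df)    # passes
# check_column_consistency(['b', 'a'], df)  # would raise ValueError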

@amueller (Member, Author) commented Jul 3, 2019

Also: users will be annoyed if we forbid things that were allowed previously "just because" ;)

@schuderer (Contributor) commented Jul 3, 2019

Good points. I did not really mean that clarity and convenience were a zero-sum trade-off to me (clarity begets convenience; the other way round... not so sure).

If ColumnTransformer wants to be used by users preparing raw data for scikit-learn ("pipeline ingestion adaptor"), as well as snugly inside a pipeline ("good pipeline citizen"), maybe it tries to be different things to different people? Not saying that it should be split up or anything, but this thought somehow stuck with me after re-reading the issue(s).

Maybe some sort of relax_pipeline_compatibility=False kwarg? (Yes, I know, "just make it an option", the epitome of lazy interface design -- the irony is not lost on me ;). But in this case, it would clean up a lot of those ifs, at least in its default mode, while it could still be used in "clean up this mess" mode if needed. Although my preference would be to let the user do this cleanup themselves.)

Regarding not allowing strings for indexing: I suggest this because of (at least my) pretty strong intuitive understanding that by-name equals flexible ordering, and to keep other users from the same wrong assumption (admittedly subjective, would have to ask more users). Edit: reading comprehension. :) I guess it depends on how many users have this assumption and would have to be "corrected" by the doc/errors (if we end up correcting most of them, we might be the ones in need of correcting).

Regarding taking something away from users (relevant xkcd - of course!). True, but it's still so new that they might not have used it yet... ;)

@jnothman (Member) commented Jul 3, 2019 via email

@adrinjalali (Member) commented
I quite like option 1, especially since we can first warn the user about the change in order and say we won't accept this for much longer.

Option 3 worries me because it's an implicit behavior whose implications the user might not understand.

I understand @amueller's point on the difference between test and train sets, but I'd rather leave input validation to the user and not enter that realm in sklearn. That said, I'd still half-heartedly be okay with option 2.

@amueller (Member, Author) commented
I'm fine with 1 if we do a deprecation cycle. I'm not sure it's convenient for users.
Right now we allow, in some cases, extra columns in the test set, which are ignored. Deprecating that and then reintroducing it would be a bit weird and inconvenient. So I'm not sure doing 1 first and then 3 is a good idea, because we'd send mixed signals to users. But I guess it wouldn't be the end of the world?

@adrinjalali I'm not sure I understand what you mean by input validation.
Right now we have even more implicit behavior, because option 3 is what we already do when names are passed, but not otherwise.

@amueller (Member, Author) commented
Should we try and implement this now that the release is out? @adrinjalali do you want to take this? @NicolasHug ?
I can also give it a shot.

I think clarifying this will be useful on the way to feature names

@adrinjalali (Member) commented
I can take this @amueller

@NicolasHug (Member) commented
ColumnTransformer allows for negative indexing.
With #14237 merged, we still allow transform to have more features than fit.

So this creates a silent wrong behaviour:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
ct = ColumnTransformer([('sc', sc, [-1])])

X = [[1, 2],
     [3, 4]]
ct.fit(X)
ct.transform(X)  # passes [2, 4], OK
X = [[1, 2, 5],
     [3, 4, 6]]
ct.transform(X)  # passes [5, 6], bad

I'd go simple and error if transform doesn't get the same number of features as fit (see the sketch below).
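
A minimal sketch of the proposed check (hypothetical code, not the actual patch; scikit-learn later standardized this kind of validation around an n_features_in_ attribute):

import numpy as np

def check_n_features(n_features_in, X):
    """Error if transform receives a different feature count than fit."""
    n_features = np.asarray(X).shape[1]
    if n_features != n_features_in:
        raise ValueError(
            f"X has {n_features} features, but this ColumnTransformer "
            f"was fitted with {n_features_in} features.")

check_n_features(2, [[1, 2], [3, 4]])   # OK
# check_n_features(2, [[1, 2, 5]])      # would raise ValueError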

@rth (Member) commented Jun 25, 2020

I keep running into the fact that with remainder="drop" and named columns, transform requires the same columns as fit (or more), even when some are not used (see also #15085). I think it would be worth fixing this, even if it means special-casing the all-named-columns case.

ValueError: Number of features of the input must be equal to or greater than that of the fitted transformer. Transformer n_features is 10 and input n_features is 7.

The use case is either:

  • during experimentation, when one creates temporary columns in the train set (that are not used anyway), and transform then fails;
  • in production, when the model was trained on some data and the data schema later changed slightly in columns not used by the model, making prediction fail. The only workaround is to pre-filter columns before sending them to the ColumnTransformer, which is redundant with remainder="drop" (see the sketch below).
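
For reference, that workaround looks roughly like this (a sketch with illustrative column names): the used-column list has to be maintained by hand, duplicating what the ColumnTransformer spec already knows.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

ct = ColumnTransformer([('sc', StandardScaler(), ['a', 'b'])],
                       remainder='drop')
used = ['a', 'b']  # kept in sync with the spec above, by hand

train = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0], 'tmp': [0, 1]})
ct.fit(train[used])

new = pd.DataFrame({'a': [5.0], 'other': ['x'], 'b': [6.0]})
ct.transform(new[used])  # pre-filter again before every transform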

@rth (Member) commented Jun 25, 2020

Having filtered columns at train time, I am now getting:

ValueError: Column ordering must be equal for fit and for transform when using the remainder keyword

It is quite annoying, because it means that the promise that a pipeline is an end-to-end solution between raw data and predictions doesn't hold. Now you need to pre-filter/order columns in a consistent way both at train and predict time. Re-opening this.

@rth reopened this Jun 25, 2020
@rth (Member) commented Jun 25, 2020

BTW, some of the complexity is due to the fact that we support both positional and name-based indexing for dataframes. One could argue that ColumnTransformer should only support selection via __getitem__, which would correspond to positional indexing for ndarrays and name-based indexing for dataframes. That would certainly raise other concerns, though.
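
The conflation in question (a small illustration, not from the thread):

import numpy as np
import pandas as pd

X = np.array([[1, 2], [3, 4]])
X[:, [0]]          # ndarray __getitem__: positional, column 0 whatever it holds

df = pd.DataFrame(X, columns=['a', 'b'])
df[['a']]          # DataFrame __getitem__: by name, wherever 'a' sits
df.iloc[:, [0]]    # positional selection on a DataFrame needs .iloc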

@adrinjalali (Member) commented
I think I definitely wouldn't want to do that in ColumnTransformer. I'd be happy to have a ColumnSelector which would do some loose validation, select the columns, and return the same object type as the one given to it. But since it's quite different from the other transformers we have, I'd rather have that either as an example or as something in sklearn-extra. WDYT?

@rth (Member) commented Jun 25, 2020

Do you mean name-only columns for dataframes, or handling remainder="drop" with named columns better? For the former, it was just a thought; I'm not particularly attached to it.

@adrinjalali (Member) commented
I mean, I'm happy with ColumnTransformer being very strict about the input, and with having another transformer to handle loose input.

@rth (Member) commented Jun 25, 2020

OK, take a simple example,

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv('some_data.txt')

df_train, df_test = train_test_split(df)

enc = ColumnTransformer([
    ('numeric', StandardScaler(), ['a', 'b', 'c']),
    ('cat', OneHotEncoder(), ['e', 'f'])
], remainder='drop')

pipe = make_pipeline(enc, LogisticRegression())
pipe.fit(df_train, df_train['target'])
y_pred = pipe.predict(df_test)

So you would say: OK, I have a model that works, let's serialize it and make predictions on new data. Except no, that won't work, because new data doesn't have the 'target' column, and so this pipeline would fail once deployed. Yes, it's possible to fit on df_train.drop('target', axis=1), but it's just another example illustrating the issue.

Now as a user I don't understand why I would have to do something like,

column_selector = ColumnSelector(select=['a', 'b', 'c', 'e', 'f'])
pipe = make_pipeline(column_selector, enc, LogisticRegression())

and keep that list of columns in sync, particularly if it means installing another package (scikit-learn-contrib). It's just a very simple use case, and in real-world situations the dataframe predictions are made on and the dataframe used for training can have slightly different columns and ordering (maybe there is just some extra column used by another system). As long as access is done by name, I don't see why the order needs to match exactly. I understand the implementation constraints due to supporting other types of slicing, but as a user I find this very unfortunate (see the sketch below).
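
If a selection step really were required, one could at least derive the column list from the ColumnTransformer spec instead of maintaining it separately; a sketch (assuming all transformers in the spec use name-based column lists):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

enc = ColumnTransformer([
    ('numeric', StandardScaler(), ['a', 'b', 'c']),
    ('cat', OneHotEncoder(), ['e', 'f']),
], remainder='drop')

# Flatten the (name, transformer, columns) triples into one list:
used = [col for _, _, cols in enc.transformers for col in cols]
# used == ['a', 'b', 'c', 'e', 'f'], e.g. df = df[used] before fit/predict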

@adrinjalali (Member) commented
I'm saying the user should take care of what the data looks like exactly when they feed it to an sklearn pipeline. I just don't think it's in the scope of sklearn to handle arbitrary and varying input shapes and columns, and I think doing so would result in silent bugs. I would much rather have users write something like:

df = pd.read_csv('some_data.txt')
# load the column list that was persisted at training time
columns = get_persisted_relevant_columns()
df = df[columns]
...

I think of not doing so as bad practice, and I'd discourage people from such patterns, since they make code fail with odd errors if the input changes.

@rth (Member) commented Jun 25, 2020

I'm saying the user should take care of what the data exactly looks like when they feed it to an sklearn pipeline.

I disagree. For me, that's exactly what ColumnTransformer is for: select some columns from data that can have plenty of irrelevant columns, and extract features from them. As long as the columns used by the model are there, it should work. IMO the point of a pipeline is to take raw input and return a prediction. Having an additional column ordering/selection step outside the pipeline is a problem. Having a new column selection step inside the pipeline is also annoying, since ColumnTransformer already has all the information to do it.

I think doing so would result in silent bugs.

For other estimators I agree. In my mind, ColumnTransformer is the only one that should correctly handle named columns irrespective of order or unused columns. What kind of bugs do you foresee? If the column is there, it would make the calculation; if it isn't, it would fail. I get that things are a bit more nuanced for integer or boolean mask indexing, but in those cases the current situation is fine. I imagine most users apply ColumnTransformer with named columns anyway.

Let's see what others think.

@jnothman (Member) commented Jun 25, 2020 via email

@thomasjpfan (Member) commented
From a user point of view, I think having more flexibility with named columns would be nice.

@mxdev88 commented Nov 1, 2020

Flexibility for named columns would indeed be extremely useful. I would expect an error only when transforming an X that is missing column(s) that were used during fit. For example, I would like to have several versions of a predictive model. Over time I will make the model "better" (hopefully) by adding new features. Enforcing a strict column order or count means this is not possible, but it also means I need to pre-filter the input and keep a separate mapping of the fields used by each model over time.
I understand this is an issue for positional columns, though.

@adrinjalali (Member) commented
What kind of bugs do you foresee? If the column is there, it would make the calculation; if it isn't, it would fail.

Not when we have a remainder option. I'd be okay with a SelectColumnsTransformer which would select the columns as a first step in a pipeline and output a DataFrame, but I'd rather not include that complexity in ColumnTransformer.

@hmonteiro commented Nov 13, 2020

I was also bitten by this. I am also of the opinion that if a column is not used at fit time (i.e. it's not part of any selection and remainder is set to drop), it should not fail at transform time. In fact, if the input is a dataframe and all column selections are done by column name rather than positional index, then it doesn't even matter what the value of remainder is, since we could just save which column names were used at fit time (see the sketch below).
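
A sketch of the behavior argued for here (hypothetical code, not scikit-learn API): remember the names actually used at fit time and re-select them by name at transform time, so that column order and unused extras stop mattering.

import pandas as pd

class NameReplayingSelector:
    """Toy selector: record used column names at fit, replay at transform."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X):
        self.used_columns_ = list(self.columns)  # names resolved at fit time
        return self

    def transform(self, X):
        # KeyError only if a column used at fit time is actually missing.
        return X[self.used_columns_]

df_new = pd.DataFrame({'extra': [0], 'b': [2.0], 'a': [1.0]})
NameReplayingSelector(['a', 'b']).fit(df_new).transform(df_new)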

@lorentzenchr (Member) commented
What is the status (or roadmap) of this discussion? I'm surprised to see that enforcement of strict column order was merged in #14544, although the discussion in this issue suggests there was no consensus on that. From a practical point of view, I agree 100% with every point made by @rth.

@adrinjalali (Member) commented
I guess my worry is the amount of complexity we're adding to ColumnTransformer. I would rather have choosing columns from arbitrary input data in a separate module/class.

@lorentzenchr (Member) commented Jan 13, 2021

My worry is that enforcing column order in ColumnTransformer, if columns are specified by name, makes our only official gateway for dataframe input effectively unusable. And - if I may ask - what for?

The problem with a possible ColumnSelector is that, as things currently stand, we wouldn't be able to use column names in a subsequent ColumnTransformer. It would also add, in my view, a completely unnecessary step.

In my view, column names are an API:

  • names select by, well, names - no matter the physical order of columns.
  • integers select by position - logical or physical.
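
A small illustration of these two selection semantics in pandas:

import pandas as pd

df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})
shuffled = df[['c', 'a', 'b']]

df[['a', 'b']].equals(shuffled[['a', 'b']])          # True: names ignore order
df.iloc[:, [0, 1]].equals(shuffled.iloc[:, [0, 1]])  # False: positions do not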

@rth (Member) commented Jan 14, 2021

names select by, well, names - no matter the physical order of columns.

I agree that a column selection API that requires a column to have both the right name and the exact position is overly constraining and not very usable.

@glemaitre @jeremiedbb @thomasjpfan You approved #18256 at the time. Could you please comment on the above discussion? Thanks!

@lorentzenchr (Member) commented Aug 26, 2021
