RFC ColumnTransformer input validation and requirements #14251

Closed
amueller opened this issue Jul 3, 2019 · 32 comments · Fixed by #14544 or #19263

@amueller (Member) commented Jul 3, 2019

There have been some issues around ColumnTransformer input requirements that I think we might want to discuss more explicitly. Examples are an actual bug when changing columns (#14237) and how to define the number of input features (#13603).
Also related is the idea of checking for column name consistency (#7242).

Mainly I'm asking whether it makes sense to have the same requirements for ColumnTransformer as for other estimators.

ColumnTransformer is the only estimator that addresses columns by name, and so reordering columns doesn't impact the model and actually everything works fine. Should we still aim for checking for reordering of columns?

Right now, transform even works if we add additional columns to X, which makes sense, since it only cares about the columns it is using.

Should we allow that going forward, as long as remainder is not used? (If remainder is used, the result of ColumnTransformer would have a different shape and downstream estimators would break, so I think we shouldn't support it in that case; a sketch follows at the end of this comment.)

In either case, what's n_features_in_ for a ColumnTransformer? The number of columns in X during fit or the number of columns actually used?
Does it even make sense to define it if we allow adding other columns that are ignored?
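
For concreteness, a minimal sketch (a toy example of mine, not from this thread) of why remainder makes extra columns problematic:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0]})
ct = ColumnTransformer([('sc', StandardScaler(), ['a'])],
                       remainder='passthrough')
ct.fit_transform(df).shape  # (2, 2): scaled 'a' plus passed-through 'b'

# If an extra column were accepted at transform time, remainder='passthrough'
# would emit shape (2, 3), and a downstream estimator fitted on 2 features
# would break.
df_extra = df.assign(c=[5.0, 6.0])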

@amueller (Member, Author) commented Jul 3, 2019

Actually, whether adding a column works depends on how the columns were specified:

import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
  'boro': ['Manhattan', 'Queens', 'Manhattan', 'Brooklyn', 'Brooklyn', 'Bronx'],
  'salary': [103, 89, 142, 54, 63, 219],
  'vegan': ['No', 'No', 'No', 'Yes', 'Yes', 'No']})

categorical = df.dtypes == object
preprocess = make_column_transformer(
    (StandardScaler(), ~categorical),
    (OneHotEncoder(), categorical))
preprocess.fit_transform(df)
df2 = df.copy()
df2['notused'] = 1
preprocess.transform(df2)  # raises IndexingError

And reordering the columns:

preprocess.transform(df.loc[:, ::-1])

passes through the ColumnTransformer but what is passed on is wrong.

but

categorical = ['boro', 'vegan']
continuous = ['salary']
preprocess = make_column_transformer(
    (StandardScaler(), continuous),
    (OneHotEncoder(), categorical))
preprocess.fit_transform(df)
df['notused'] = 1
preprocess.transform(df)

works, and reordering columns also works:

preprocess.transform(df.loc[:, ::-1])

@amueller (Member, Author) commented Jul 3, 2019

Similarly, reordering columns should work if you use names for indexing, but not if you use boolean masks. That's ... interesting ... behavior, I would say.
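
A minimal sketch of that silent misbehavior (my own toy example, using the semantics at the time of this issue, where boolean masks resolve positionally):

import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'salary': [103, 89, 142], 'age': [30, 40, 50]})
# A boolean mask is resolved by position, not by name.
preprocess = make_column_transformer((StandardScaler(), [True, False]))
preprocess.fit(df)  # intends to scale 'salary' (position 0)

# Reordering is not caught (at the time of this issue): position 0 is now
# 'age', so the wrong column is silently scaled.
preprocess.transform(df[['age', 'salary']])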

@amueller (Member, Author) commented Jul 3, 2019

Four approaches I could think of:

  1. Be strict and require that columns match exactly (with order). That would break code that currently works, but it might also surface some errors that are currently hidden. It might be a bit inconvenient, as it doesn't really allow subsetting columns flexibly.

  2. Be strict and require that columns match exactly unless they are specified by names. That would keep currently working use-cases working, but error on things that are currently silent bugs.

  3. If the input is a pandas dataframe, always use the column names (from fit) to store the column identities, no matter how the user provided them originally, and allow reordering of columns and adding columns that are not used (sketched at the end of this comment). Downside: behavior might differ between when a numpy array is passed and when a dataframe is passed. I would need to think about this more.

  4. Keep things the way they are, i.e. silent bugs when reordering with boolean or integer indexing, errors when adding columns with boolean or integer indexing, and things working as expected with string names.

As I said above, if remainder is used we probably shouldn't allow adding extra columns at transform time (though we could also just ignore all columns not present during fit).
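
For illustration, a rough sketch of what option 3 could look like internally (a hypothetical helper, not scikit-learn code): resolve whatever spec the user passed to column names at fit time, then re-select by name at transform time.

import numpy as np
import pandas as pd

def resolve_to_names(df, columns):
    """Resolve a user column spec (names, ints, or bool mask) to names."""
    columns = np.asarray(columns)
    if columns.dtype == bool:                      # boolean mask -> positions
        return list(df.columns[columns])
    if np.issubdtype(columns.dtype, np.integer):   # integer positions
        return list(df.columns[columns])
    return list(columns)                           # already names

df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})
resolve_to_names(df, [True, False, True])  # ['a', 'c']
resolve_to_names(df, [0, 2])               # ['a', 'c']
resolve_to_names(df, ['b'])                # ['b']
# transform could then do df_new[fit_column_names]: order-insensitive lookup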

@schuderer (Contributor) commented Jul 3, 2019

From a user with an (admittedly ancient) usability background who very recently tripped over their own assumptions on named columns and their implications:

  • I have/had a strong association of by-name implying flexible-ordering (and vice-versa)
  • At the same time, I was oblivious of the fact that this implies different rules for how columns specified by name vs by index were handled, particularly with remainder
  • I have a hard time remembering what works with one type of usage of the same parameter versus another, particularly in combination with other parameters

From this very subjective list, I would distill that:

  • clarity and consistency beat convenience

My main wish would be for the whole Pipeline/Transformer/Estimator API to be as consistent as possible in the choice of which of the rules that @amueller laid out above should be in effect. Option 1 seems to match this the closest. I don't quite understand the part "...inconvenient as it doesn't really allow subsetting columns flexibly", however. Isn't this flexibility (between fit and transform, as I understand it) exactly what causes problems with other transformers? I can't see a practical use case for the flexibility to have more/fewer columns between fit and transform.

Reading my own ramblings (stream of consciousness indeed), I would not see a lot of harm for the end user in deprecating the ability to specify columns by name and only allow numeric indices/slices, because:

  • If the user starts out with column names, it's easy for them to create indices to pass to ColumnTransformer
  • Numeric indices intuitively imply fixed ordering, which I understand is a standard for other transformers anyway

The documentation could then refer users seeking more convenience to the contrib package sklearn-pandas.

So this is more or less a 180-degree change from my initial suggestion, but I found myself using a lot of ifs even when explaining the problem in the issue and pull request, which made me aware that it might be possible to reduce complexity (externally and internally) a little bit.

@amueller (Member, Author) commented Jul 3, 2019

Thank you for your input. I agree with most of your assessment, though not entirely with your conclusion. Ideally we'd get as much consistency and convenience as possible.
I also have a hard time wrapping my head around the current behavior, which is clearly not a great thing.

Adding extra columns during transform would indeed not be possible with any other transformer, and would be a bad idea there. However, if you think of ColumnTransformer as going from "whatever was in the database / CSV" to something that's structured for scikit-learn, it makes sense to allow dropping columns from the test set that were not in the training set. Often the training set is collected in a different way than the test set, and there might be extra columns in the test set that are not relevant to the model.
Clearly this could easily be fixed by dropping those before passing them into the ColumnTransformer, so it's not that big a deal.

I'm not sure why you wouldn't allow using strings for indexing. We could allow strings for indexing and still require the order to be fixed. This might not correspond entirely to your mental model (or mine), but I also don't see any real downside to allowing that if we test for consistency and have a good error message (see the sketch below).

Generally I think it's desirable that the behavior is as consistent as possible between the different ways to specify columns, which is not what's currently the case.
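
A sketch of what such a consistency check with a helpful message might look like (hypothetical code, not scikit-learn API):

import pandas as pd

def check_column_consistency(fit_columns, X):
    """Require the same columns in the same order as seen during fit."""
    seen = list(X.columns)
    if seen != list(fit_columns):
        raise ValueError(
            f"The columns of X differ from those seen during fit. "
            f"Fit columns: {list(fit_columns)}; got: {seen}. "
            f"Reorder or subset X to match the fitted columns.")

df = pd.DataFrame({'a': [1], 'b': [2]})
check_column_consistency(['a', 'b'], df)    # passes
# check_column_consistency(['b', 'a'], df)  # would raise ValueError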

@amueller (Member, Author) commented Jul 3, 2019

Also: users will be annoyed if we forbid things that were allowed previously "just because" ;)

@schuderer (Contributor) commented Jul 3, 2019

Good points. I did not really mean that clarity and convenience were a zero-sum trade-off to me (clarity begets convenience; the other way round... not so sure).

If ColumnTransformer wants to be used by users preparing raw data for scikit-learn ("pipeline ingestion adaptor"), as well as snugly inside a pipeline ("good pipeline citizen"), maybe it tries to be different things to different people? Not saying that it should be split up or anything, but this thought somehow stuck with me after re-reading the issue(s).

Maybe some sort of relax_pipeline_compatibility=False kwarg? (Yes, I know, "just make it an option", the epitome of lazy interface design -- the irony is not lost on me ;). But in this case, it would clean up a lot of those ifs, at least in its default mode, while it could still be used in "clean up this mess" mode if needed. Although my preference would be to let the user do this cleanup themselves.)

Regarding not allowing strings for indexing: I suggest this because of (at least my) pretty strong intuitive understanding that by-name equals flexible ordering, and to keep other users from the same wrong assumption (admittedly subjective, would have to ask more users). Edit: reading comprehension. :) I guess it depends on how many users have this assumption and would have to be "corrected" by the doc/errors (if we end up correcting most of them, we might be the ones in need of correcting).

Regarding taking something away from users (relevant xkcd - of course!). True, but it's still so new that they might not have used it yet... ;)

@jnothman (Member) commented Jul 3, 2019 via email

@adrinjalali (Member) commented
I quite like option 1, especially since we can first warn the user about the change in order and say we won't accept this for much longer.

Option 3 worries me because it's an implicit behavior whose implications the user might not understand.

I understand @amueller's point on the difference between test and train sets, but I'd rather leave input validation to the user and not enter that realm in sklearn. That said, I'd still half-heartedly be okay with option 2.

@amueller (Member, Author) commented
I'm fine with 1 if we do a deprecation cycle. I'm not sure it's convenient for users.
Right now we allow, in some cases, extra columns in the test set, which are ignored. Deprecating that and then reintroducing it would be a bit weird and inconvenient. So I'm not sure doing 1 first and then 3 is a good idea, because we'd send mixed signals to users. But I guess it wouldn't be the end of the world?

@adrinjalali I'm not sure I understand what you mean by input validation.
Right now we have even more implicit behavior, because option 3 is what we already do when names are passed, but not otherwise.

@amueller (Member, Author) commented
Should we try and implement this now that the release is out? @adrinjalali do you want to take this? @NicolasHug ?
I can also give it a shot.

I think clarifying this will be useful on the way to feature names

@adrinjalali (Member) commented
I can take this @amueller

@NicolasHug (Member) commented
ColumnTransformer allows for negative indexing.
With #14237 merged, we still allow transform to have more features than fit.

So this creates a silent wrong behaviour:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
ct = ColumnTransformer([('sc', sc, [-1])])

X = [[1, 2],
     [3, 4]]
ct.fit(X)
ct.transform(X)  # passes [2, 4], OK
X = [[1, 2, 5],
     [3, 4, 6]]
ct.transform(X)  # passes [5, 6], bad

I'd go simple and error if transform doesn't get the same number of features as fit (see the sketch below).
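
A minimal sketch of the proposed check (hypothetical code, not the actual patch; scikit-learn later standardized this kind of validation around an n_features_in_ attribute):

import numpy as np

def check_n_features(n_features_in, X):
    """Error if transform receives a different feature count than fit."""
    n_features = np.asarray(X).shape[1]
    if n_features != n_features_in:
        raise ValueError(
            f"X has {n_features} features, but this ColumnTransformer "
            f"was fitted with {n_features_in} features.")

check_n_features(2, [[1, 2], [3, 4]])   # OK
# check_n_features(2, [[1, 2, 5]])      # would raise ValueError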

@rth (Member) commented Jun 25, 2020

I keep running into the fact that with remainder="drop" and named columns, transform requires the same columns as fit (or more), even when some are not used (see also #15085). I think it would be worth fixing this, even if it means special-casing the all-named-columns case.

ValueError: Number of features of the input must be equal to or greater than that of the fitted transformer. Transformer n_features is 10 and input n_features is 7.

The use case is either:

  • during experimentation, when one creates temporary columns in the train set (that are not used anyway), and transform then fails;
  • in production, when the model was trained on some data and the data schema later changed slightly in columns not used by the model, making prediction fail. The only workaround is to pre-filter columns before sending them to the ColumnTransformer, which is redundant with remainder="drop" (see the sketch below).
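
For reference, that workaround looks roughly like this (a sketch with illustrative column names): the used-column list has to be maintained by hand, duplicating what the ColumnTransformer spec already knows.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

ct = ColumnTransformer([('sc', StandardScaler(), ['a', 'b'])],
                       remainder='drop')
used = ['a', 'b']  # kept in sync with the spec above, by hand

train = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0], 'tmp': [0, 1]})
ct.fit(train[used])

new = pd.DataFrame({'a': [5.0], 'other': ['x'], 'b': [6.0]})
ct.transform(new[used])  # pre-filter again before every transform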

@rth (Member) commented Jun 25, 2020

Having filtered columns at train time, I am now getting:

ValueError: Column ordering must be equal for fit and for transform when using the remainder keyword

It is quite annoying, because it means that the promise that a pipeline is an end-to-end solution between raw data and predictions doesn't hold. Now you need to pre-filter/order columns in a consistent way both at train and predict time. Re-opening this.

@rth reopened this Jun 25, 2020
@rth (Member) commented Jun 25, 2020

BTW, some of the complexity is due to the fact that we support both positional and name-based indexing for dataframes. One could argue that ColumnTransformer should only support selection via __getitem__, which would correspond to positional indexing for ndarrays and name-based indexing for dataframes. That would certainly raise other concerns, though.
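
The conflation in question (a small illustration, not from the thread):

import numpy as np
import pandas as pd

X = np.array([[1, 2], [3, 4]])
X[:, [0]]          # ndarray __getitem__: positional, column 0 whatever it holds

df = pd.DataFrame(X, columns=['a', 'b'])
df[['a']]          # DataFrame __getitem__: by name, wherever 'a' sits
df.iloc[:, [0]]    # positional selection on a DataFrame needs .iloc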

@adrinjalali (Member) commented
I think I definitely wouldn't want to do that in ColumnTransformer. I'd be happy to have a ColumnSelector which would do some loose validation, select the columns, and return the same object type as the one given to it. But since it's quite different from the other transformers we have, I'd rather have that either as an example or as something in sklearn-extra. WDYT?

@rth (Member) commented Jun 25, 2020

Do you mean name-only columns for dataframes, or handling remainder="drop" with named columns better? For the former, it was just a thought; I'm not particularly attached to it.

@adrinjalali (Member) commented
I mean, I'm happy with ColumnTransformer being very strict about the input, and with having another transformer to handle loose input.

@rth (Member) commented Jun 25, 2020

OK, take a simple example,

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv('some_data.txt')

df_train, df_test = train_test_split(df)

enc = ColumnTransformer([
    ('numeric', StandardScaler(), ['a', 'b', 'c']),
    ('cat', OneHotEncoder(), ['e', 'f'])
], remainder='drop')

pipe = make_pipeline(enc, LogisticRegression())
pipe.fit(df_train, df_train['target'])
y_pred = pipe.predict(df_test)

So you would say: OK, I have a model that works, let's serialize it and make predictions on new data. Except no, that won't work, because new data doesn't have the 'target' column, and so this pipeline would fail once deployed. Yes, it's possible to fit on df_train.drop('target', axis=1), but it's just another example illustrating the issue.

Now as a user I don't understand why I would have to do something like,

column_selector = ColumnSelector(select=['a', 'b', 'c', 'e', 'f'])
pipe = make_pipeline(column_selector, enc, LogisticRegression())

and keep that list of columns in sync, particularly if it means installing another package (scikit-learn-contrib). It's just a very simple use case, and in real-world situations the dataframe predictions are made on and the dataframe used for training can have slightly different columns and ordering (maybe there is just some extra column used by another system). As long as access is done by name, I don't see why the order needs to match exactly. I understand the implementation constraints due to supporting other types of slicing, but as a user I find this very unfortunate (see the sketch below).
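
If a selection step really were required, one could at least derive the column list from the ColumnTransformer spec instead of maintaining it separately; a sketch (assuming all transformers in the spec use name-based column lists):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

enc = ColumnTransformer([
    ('numeric', StandardScaler(), ['a', 'b', 'c']),
    ('cat', OneHotEncoder(), ['e', 'f']),
], remainder='drop')

# Flatten the (name, transformer, columns) triples into one list:
used = [col for _, _, cols in enc.transformers for col in cols]
# used == ['a', 'b', 'c', 'e', 'f'], e.g. df = df[used] before fit/predict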

@adrinjalali (Member) commented
I'm saying the user should take care of what the data looks like exactly when they feed it to an sklearn pipeline. I just don't think it's in the scope of sklearn to handle arbitrary and varying input shapes and columns, and I think doing so would result in silent bugs. I would much rather have users write something like:

df = pd.read_csv('some_data.txt')
# load the column list that was persisted at training time
columns = get_persisted_relevant_columns()
df = df[columns]
...

I think of not doing so as bad practice, and I'd discourage people from such patterns, since they make code fail with odd errors if the input changes.

@rth (Member) commented Jun 25, 2020

I'm saying the user should take care of what the data exactly looks like when they feed it to an sklearn pipeline.

I disagree. For me, that's exactly what ColumnTransformer is for: select some columns from data that can have plenty of irrelevant columns, and extract features from them. As long as the columns used by the model are there, it should work. IMO the point of a pipeline is to take raw input and return a prediction. Having an additional column ordering/selection step outside the pipeline is a problem. Having a new column selection step inside the pipeline is also annoying, since ColumnTransformer already has all the information to do it.

I think doing so would result in silent bugs.

For other estimators I agree. In my mind, ColumnTransformer is the only one that should correctly handle named columns irrespective of order or unused columns. What kind of bugs do you foresee? If the column is there, it would make the calculation; if it isn't, it would fail. I get that things are a bit more nuanced for integer or boolean mask indexing, but in those cases the current situation is fine. I imagine most users apply ColumnTransformer with named columns anyway.

Let's see what others think.

@jnothman (Member) commented Jun 25, 2020 via email

@thomasjpfan (Member) commented
From a user point of view, I think having more flexibility with named columns would be nice.

@mxdev88 commented Nov 1, 2020

Flexibility for named columns would indeed be extremely useful. I would expect an error only when transforming an X that is missing column(s) that were used during fit. For example, I would like to have several versions of a predictive model. Over time I will make the model "better" (hopefully) by adding new features. Enforcing a strict column order or count means this is not possible, but it also means I need to pre-filter the input and keep a separate mapping of the fields used by each model over time.
I understand this is an issue for positional columns, though.

@adrinjalali (Member) commented
What kind of bugs do you foresee? If the column is there, it would make the calculation; if it isn't, it would fail.

Not when we have a remainder option. I'd be okay with a SelectColumnsTransformer which would select the columns as a first step in a pipeline and output a DataFrame, but I'd rather not include that complexity in ColumnTransformer.

@hmonteiro commented Nov 13, 2020

I was also bitten by this. I am also of the opinion that if a column is not used at fit time (i.e. it's not part of any selection and remainder is set to drop), it should not fail at transform time. In fact, if the input is a dataframe and all column selections are done by column name rather than positional index, then it doesn't even matter what the value of remainder is, since we could just save which column names were used at fit time (see the sketch below).
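
A sketch of the behavior argued for here (hypothetical code, not scikit-learn API): remember the names actually used at fit time and re-select them by name at transform time, so that column order and unused extras stop mattering.

import pandas as pd

class NameReplayingSelector:
    """Toy selector: record used column names at fit, replay at transform."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X):
        self.used_columns_ = list(self.columns)  # names resolved at fit time
        return self

    def transform(self, X):
        # KeyError only if a column used at fit time is actually missing.
        return X[self.used_columns_]

df_new = pd.DataFrame({'extra': [0], 'b': [2.0], 'a': [1.0]})
NameReplayingSelector(['a', 'b']).fit(df_new).transform(df_new)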

@lorentzenchr (Member) commented
What is the status (or roadmap) of this discussion? I'm surprised to see that enforcement of strict column order was merged in #14544, although the discussion in this issue suggests there was no consensus on that. From a practical point of view, I agree 100% with every point made by @rth.

@adrinjalali (Member) commented
I guess my worry is the amount of complexity we're adding to ColumnTransformer. I would rather have choosing columns from arbitrary input data in a separate module/class.

@lorentzenchr (Member) commented Jan 13, 2021

My worry is that enforcing column order in ColumnTransformer, if columns are specified by name, makes our only official gateway for dataframe input effectively unusable. And - if I may ask - what for?

The problem with a possible ColumnSelector is that, as things currently stand, we wouldn't be able to use column names in a subsequent ColumnTransformer. It would also add, in my view, a completely unnecessary step.

In my view, column names are an API:

  • names select by, well, names - no matter the physical order of columns.
  • integers select by position - logical or physical.
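
A small illustration of these two selection semantics in pandas:

import pandas as pd

df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})
shuffled = df[['c', 'a', 'b']]

df[['a', 'b']].equals(shuffled[['a', 'b']])          # True: names ignore order
df.iloc[:, [0, 1]].equals(shuffled.iloc[:, [0, 1]])  # False: positions do not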

@rth (Member) commented Jan 14, 2021

names select by, well, names - no matter the physical order of columns.

I agree that a column selection API that requires a column to have both the right name and the exact position is overly constraining and not very usable.

@glemaitre @jeremiedbb @thomasjpfan You approved #18256 at the time. Could you please comment on the above discussion? Thanks!

@lorentzenchr (Member) commented Aug 26, 2021
