FunctionTransformer gets numpy instead of pandas after ColumnTransformer #22297

tomateit · 2022-01-25T09:24:54Z

In a Pipeline, if I put a FunctionTransformer after a ColumnTransformer, the FunctionTransformer receives a NumPy array instead of the original DataFrame. How can I implement a per-column transformation that keeps my data as a DataFrame?

Minimal Example:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import ColumnTransformer

import numpy as np
import pandas as pd

data = pd.DataFrame({
    "label": [1,1,1,0,0,0],
    "v1":[np.array([1,2,3,4])] * 6,
    "v2":[np.array([5,6,7,8])] * 6,
    })

estimator = Pipeline([
        # ("f", ColumnTransformer(transformers=[("pass", FunctionTransformer(func=lambda x: pd.DataFrame(x)), "v1")], remainder='passthrough')),
        ("dummy", FunctionTransformer(func=lambda x: print(x))),
])

print(estimator.fit_transform(data.loc[:, ["v1", "v2"]], data.loc[:, "label"]))

If you run it as is, the DataFrame is printed. If you uncomment the ColumnTransformer step, a NumPy array is printed instead.

I do not suspect this is a bug. It seems I just cannot find the right return format for ColumnTransformer.

Answered by glemaitre

Jan 26, 2022

glemaitre · 2022-01-26T14:21:16Z

Maintainer

FunctionTransformer is quite a flexible beast: with validate=False (the default), the input data is not validated and is passed to the function as is.

Elsewhere in scikit-learn, input validation always happens, and pandas DataFrames are always converted to NumPy arrays because scikit-learn is going to perform numerical operations on them. This is why ColumnTransformer outputs a NumPy array or a sparse matrix.

In the future, we will try to improve this part of the user experience by providing feature_names at the different steps of a pipeline and, maybe at some point, by supporting types other than NumPy arrays at the intermediate stages of preprocessing.
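
For readers who want to see the two behaviours side by side, here is a minimal sketch. It is not part of the original answer, and the toy DataFrame with plain numeric columns is an assumption made only to keep the example small. With default settings, the first type printed should be a DataFrame and the second a NumPy array.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

# Hypothetical toy data (plain numeric columns, unlike the array-valued
# columns in the question), used only for illustration.
X = pd.DataFrame({"v1": [1.0, 2.0, 3.0], "v2": [4.0, 5.0, 6.0]})

# validate=False is the default, so the callable receives X as passed in.
identity = FunctionTransformer(func=lambda x: x)
out_ft = identity.fit_transform(X)
print(type(out_ft).__name__)

# ColumnTransformer stacks the per-transformer results into a single array.
ct = ColumnTransformer(
    transformers=[("pass", FunctionTransformer(func=lambda x: x), ["v1"])],
    remainder="passthrough",
)
out_ct = ct.fit_transform(X)
print(type(out_ct).__name__)
```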

2 replies

tomateit Jan 27, 2022
Author

Thanks for your answer!

So, to be clear: there is no way to make ColumnTransformer preserve the input format. Its output will always (for now) be a sparse matrix or a NumPy array, and it cannot output a pandas.DataFrame or pandas.Series.

I'm grateful for your clarification!

glemaitre Jan 27, 2022
Maintainer

> So, to be clear: there is no way to make ColumnTransformer preserve the input format. Its output will always (for now) be a sparse matrix or a NumPy array, and it cannot output a pandas.DataFrame or pandas.Series.

Yes exactly. Not currently at least :)
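
One possible workaround within that limitation, sketched here as an illustration and not suggested anywhere in the thread, is to skip ColumnTransformer and do the per-column work inside a FunctionTransformer, which hands the DataFrame to the callable untouched. The function names `times_ten_v1` and `add_total` and the toy data are hypothetical.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Hypothetical toy data, used only for illustration.
X = pd.DataFrame({"v1": [1.0, 2.0, 3.0], "v2": [4.0, 5.0, 6.0]})

def times_ten_v1(frame):
    # Hypothetical per-column step: transform "v1", leave other columns alone.
    out = frame.copy()
    out["v1"] = out["v1"] * 10
    return out

def add_total(frame):
    # Uses column labels, so it only works if a DataFrame arrives here.
    return frame.assign(total=frame["v1"] + frame["v2"])

pipe = Pipeline([
    ("per_column", FunctionTransformer(times_ten_v1)),
    ("by_name", FunctionTransformer(add_total)),
])
result = pipe.fit_transform(X)
print(result)
```

The trade-off is that the column selection and bookkeeping that ColumnTransformer normally handles has to be written by hand inside the function.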

imadtg · 2024-12-31T13:09:47Z

In case anyone stumbles upon this in the future: this can be fixed by calling .set_output(transform="pandas") on the ColumnTransformer instance, although it may come with some caveats; see https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_set_output.html

0 replies
