Custom transformer for ColumnTransformer to access multiple columns with pandas input #25201

wijayam · 2022-12-16T21:11:20Z

wijayam
Dec 16, 2022

Hello, I have been using sklearn-pandas's DataFrameMapper because it makes it easy to build a custom transformer that takes two or more columns and combines them into another column (i.e., given input columns x and y, create a new column z that is x * y).

Looking at the docs and testing it out myself, I can't seem to access multiple columns at once within the custom transformer. Example code below:

from sklearn.compose import ColumnTransformer

ct = ColumnTransformer(
   [("multiplier", Multiplier(column1='x', column2='y'), ['x', 'y'])])

Multiplier transformer:

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class Multiplier(BaseEstimator, TransformerMixin):
    def __init__(self, column1, column2):
        self.column1 = column1
        self.column2 = column2

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        out = X[self.column1] * X[self.column2]
        return out.values.reshape(-1,1)

Is this possible using ColumnTransformer? Everything else works as well as in the sklearn-pandas package, except for the case I mentioned above.

Appreciate it!

Answered by glemaitre

Dec 18, 2022


glemaitre · 2022-12-18T11:39:25Z

glemaitre
Dec 18, 2022
Maintainer

Your code should work. Here is an example showing that it does:

# %%
from sklearn.datasets import load_iris
from sklearn.utils import shuffle

iris = load_iris(as_frame=True)
X, y = iris.data, iris.target
y = iris.target_names[y]
X, y = shuffle(X, y, random_state=0)

# %%
from sklearn.base import BaseEstimator, TransformerMixin


class Multiplier(TransformerMixin, BaseEstimator):
    def __init__(self, column_name_1, column_name_2):
        self.column_name_1 = column_name_1
        self.column_name_2 = column_name_2

    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn.
        return self

    def transform(self, X):
        # Multiply the two named columns; return a single-column DataFrame.
        return (X[self.column_name_1] * X[self.column_name_2]).to_frame()


# %%
column_name_1 = "sepal length (cm)"
column_name_2 = "sepal width (cm)"
multiplier = Multiplier(column_name_1=column_name_1, column_name_2=column_name_2)
multiplier.fit_transform(X)

This outputs a DataFrame with a single column. We can easily add this multiplier to a scikit-learn pipeline.

# %%
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.tree import DecisionTreeClassifier

preprocessor = ColumnTransformer(transformers=[
    ("multiplier", multiplier, [column_name_1, column_name_2])
], remainder="passthrough")
model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", DecisionTreeClassifier(random_state=0))
])
model.fit(X, y).predict(X[:5])

This outputs the predicted targets, as expected.

However, you can run into issues if other preprocessing steps come earlier in the pipeline, because the column names are lost during the input validation done by scikit-learn. To illustrate:

# %%
from sklearn.preprocessing import StandardScaler

model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("preprocessor", preprocessor),
    ("classifier", DecisionTreeClassifier(random_state=0))
])
model

# %%
model[:-2].fit_transform(X)

The first step, a scaler, transforms the original DataFrame into a NumPy array and loses the column names. Therefore, your multiplier can no longer use the column names. Since scikit-learn 1.2, we provide a way to propagate column names using the set_output API. We can therefore configure StandardScaler to output a DataFrame that keeps the column names:

# %%
model = Pipeline(steps=[
    ("scaler", StandardScaler().set_output(transform="pandas")),
    ("preprocessor", preprocessor),
    ("classifier", DecisionTreeClassifier(random_state=0))
])
model

# %%
model[:-2].fit_transform(X)

# %%
model.fit(X, y).predict(X[:5])

# %%
model[:-1].fit_transform(X)

and in this case, everything works.

Since the transformer you are using is stateless, I would advise you to use FunctionTransformer instead. It will even require less code.

3 replies

wijayam Dec 19, 2022
Author

Thanks for the info! I will check again to make sure what you stated is working on my side (it should be).

I looked at the FunctionTransformer docs and it seems to be a new addition.

Do you mind giving an example of how to implement the Multiplier above using FunctionTransformer instead? Thanks!

glemaitre Dec 19, 2022
Maintainer

FunctionTransformer is pretty old: it dates from 0.17 ;)
The example would look something like:

# %%
from sklearn.datasets import load_iris
from sklearn.utils import shuffle

iris = load_iris(as_frame=True)
X, y = iris.data, iris.target
y = iris.target_names[y]
X, y = shuffle(X, y, random_state=0)

# %%
from sklearn.preprocessing import FunctionTransformer


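# Column names from the earlier example.
column_name_1 = "sepal length (cm)"
column_name_2 = "sepal width (cm)"


def pure_func(X, column_name_1, column_name_2):
    # Stateless: multiply the two named columns into a one-column DataFrame.
    return (X[column_name_1] * X[column_name_2]).to_frame()


multiplier = FunctionTransformer(
    pure_func,
    kw_args={"column_name_1": column_name_1, "column_name_2": column_name_2}
)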

# %%
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.tree import DecisionTreeClassifier

preprocessor = ColumnTransformer(transformers=[
    ("multiplier", multiplier, [column_name_1, column_name_2])
], remainder="passthrough")

model = Pipeline(steps=[
    ("scaler", StandardScaler().set_output(transform="pandas")),
    ("preprocessor", preprocessor),
    ("classifier", DecisionTreeClassifier(random_state=0))
])
model.fit(X, y).predict(X[:5])

wijayam Dec 19, 2022
Author

Thanks!

cnesty · 2025-02-02T23:16:26Z

cnesty
Feb 2, 2025

Hi, thanks for providing so much clarity on this. I have a quick question. Can my FunctionTransformer create a new column that is added to the DataFrame, in case later preprocessing steps need that column by name? In other words, can (or should) a FunctionTransformer ever modify the DataFrame by adding new columns, and if so, how do you pass along the names of the columns to be added?

1 reply

cnesty Feb 3, 2025

I think I figured it out. In the FunctionTransformer, you can set feature_names_out='one-to-one', and if you also set verbose_feature_names_out=False on the ColumnTransformer, you'll get the column names without the prepended transformer names. Thanks for the explanation you gave; it was pretty helpful.
