-
Hello, I have been using `ColumnTransformer`. Looking at the docs and testing it out myself, I can't seem to access multiple columns at once within the custom transformer. Example code below:

ct = ColumnTransformer(
    [("multiplier", Multiplier(column1="x", column2="y"), ["x", "y"])]
)

Multiplier transformer:

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class Multiplier(BaseEstimator, TransformerMixin):
    def __init__(self, column1, column2):
        self.column1 = column1
        self.column2 = column2

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Element-wise product of the two named columns, as a 2D array
        out = X[self.column1] * X[self.column2]
        return out.values.reshape(-1, 1)

Is this possible using `ColumnTransformer`? Appreciate it!
-
Your code should be working. I will provide an example showing that it works:

# %%
from sklearn.datasets import load_iris
from sklearn.utils import shuffle

iris = load_iris(as_frame=True)
X, y = iris.data, iris.target
y = iris.target_names[y]
X, y = shuffle(X, y, random_state=0)

# %%
from sklearn.base import BaseEstimator, TransformerMixin


class Multiplier(TransformerMixin, BaseEstimator):
    def __init__(self, column_name_1, columns_name_2):
        self.column_name_1 = column_name_1
        self.columns_name_2 = columns_name_2

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return (X[self.column_name_1] * X[self.columns_name_2]).to_frame()

# %%
column_name_1 = "sepal length (cm)"
column_name_2 = "sepal width (cm)"
multiplier = Multiplier(column_name_1=column_name_1, columns_name_2=column_name_2)
multiplier.fit_transform(X)

This outputs a dataframe with a single column. We can easily add this multiplier to a scikit-learn pipeline.

# %%
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.tree import DecisionTreeClassifier

preprocessor = ColumnTransformer(transformers=[
    ("multiplier", multiplier, [column_name_1, column_name_2])
], remainder="passthrough")
model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", DecisionTreeClassifier(random_state=0))
])
model.fit(X, y).predict(X[:5])

This outputs some targets, as expected. However, you can run into issues if the pipeline has earlier preprocessing steps, since the column names will be lost due to some input validation done by scikit-learn. To illustrate:

# %%
from sklearn.preprocessing import StandardScaler

model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("preprocessor", preprocessor),
    ("classifier", DecisionTreeClassifier(random_state=0))
])
model

# %%
model[:-2].fit_transform(X)

The first step, which is a scaler, transforms the original dataframe into a NumPy array and loses the column names. Therefore, your multiplier can no longer use the column names. Since scikit-learn 1.2, we provide a way to propagate column names using the `set_output` API:

# %%
model = Pipeline(steps=[
    ("scaler", StandardScaler().set_output(transform="pandas")),
    ("preprocessor", preprocessor),
    ("classifier", DecisionTreeClassifier(random_state=0))
])
model

# %%
model[:-2].fit_transform(X)

# %%
model.fit(X, y).predict(X[:5])

# %%
model[:-1].fit_transform(X)

In this case, everything works. Since the transformer that you are using is stateless, I would advise you to use the `FunctionTransformer`.
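
For a stateless operation like this multiplication, `FunctionTransformer` is one option that avoids writing a class. The following is only a minimal sketch on toy data: the helper name `multiply_columns` and the output column name `"product"` are made up for illustration, and the `"x"` and `"y"` columns are borrowed from the original question.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer


def multiply_columns(X):
    """Multiply the columns of X row-wise (hypothetical helper)."""
    # X is the DataFrame slice selected by the ColumnTransformer.
    return X.prod(axis=1).to_frame(name="product")


# Toy data standing in for the real dataset.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [4.0, 5.0, 6.0], "z": [0.0, 0.0, 0.0]})

# Only "x" and "y" are passed to the function; "z" is dropped by default.
ct = ColumnTransformer(
    [("multiplier", FunctionTransformer(multiply_columns), ["x", "y"])]
)
out = ct.fit_transform(df)
```

Because `FunctionTransformer` does not validate its input by default, the function receives a dataframe as long as the `ColumnTransformer` itself is given one, so the same caveat about earlier steps returning NumPy arrays applies.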
-
Hi, thanks for providing so much clarity on this. I have a quick question. Can my `FunctionTransformer` create a new column that is added to the dataframe, in case I have more preprocessing steps that need the name of that column? In other words, can (or should) my `FunctionTransformer` ever modify the dataframe by adding new columns, and if so, how would you pass along the names of the columns to be added?
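
To make the question concrete, here is a rough sketch of the kind of setup being asked about, not a definitive answer. It assumes every step receives and returns a pandas dataframe; the function names `add_product` and `halve_product` and the column names `x_times_y` and `half_product` are hypothetical.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def add_product(X):
    # Return a copy of X with one extra, named column.
    return X.assign(x_times_y=X["x"] * X["y"])


def halve_product(X):
    # A later step that refers to the new column by name.
    return X.assign(half_product=X["x_times_y"] / 2)


df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [4.0, 5.0, 6.0]})

pipe = Pipeline(steps=[
    ("add_product", FunctionTransformer(add_product)),
    ("use_product", FunctionTransformer(halve_product)),
])
result = pipe.fit_transform(df)
```

In this sketch the column name travels with the dataframe itself, since `DataFrame.assign` returns a new dataframe rather than modifying the input in place.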