Wrap tabular data in a new dataclass to simplify ML pipelines #25126

zkurtz · 2022-12-07T01:23:30Z

Describe the workflow you want to enable

In my dreams, a new InferenceData class would simplify training and prediction to look more like this:

import ... as learner

data = InferenceData(
    df=..., # a data frame 
    meta=Meta(
        y_cols=..., # name(s) of output variable
        ... # additional metadata fields
    )
)
train_data, test_data = data.split(train_fraction=0.7, ...)
learner.fit(train_data)
predictions = learner.predict(test_data.x)

Note that this introduces just two data variables train_data and test_data instead of the current standard four (X_train, X_test, y_train, y_test).
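The `split` call in the snippet above is not defined anywhere in this issue. As a rough sketch, it could be a random partition over row positions. The function below assumes only that the container exposes an `n_rows` property and an `iloc(positions)` method; `seed` is a hypothetical parameter added for reproducibility, and the rounding rule is an arbitrary choice.

```python
import numpy as np


def split(data, train_fraction=0.7, seed=None):
    """Randomly partition `data` into a (train, test) pair.

    `data` is assumed to expose `n_rows` and `iloc(positions)`;
    `seed` is a hypothetical parameter, not part of the proposal.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1.")
    rng = np.random.default_rng(seed)
    # Shuffle the row positions once, then cut the permutation in two.
    shuffled = rng.permutation(data.n_rows)
    n_train = int(round(train_fraction * data.n_rows))
    return data.iloc(shuffled[:n_train]), data.iloc(shuffled[n_train:])
```

Because both halves come from the same container, the predictors, outputs, and any weights stay aligned without the caller tracking four separate arrays.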

In addition, InferenceData could easily be extended to allow the above pipeline to handle related metadata such as row weights, replacing a step like learner.fit(train_data) by learner.fit(train_data, weights=train_data.w), for example.

Describe your proposed solution

A solution could look something like this:

from dataclasses import dataclass, field
from typing import Iterable, Optional

import numpy as np
from pandas import DataFrame


@dataclass
class Meta:
    """Metadata for an InferenceData instance."""
    y_cols: Optional[list[str]] = None
    row_weights_col: Optional[str] = None

    @property
    def y(self) -> list[str]:
        """Output variable column names."""
        if not self.y_cols:
            return []
        return self.y_cols

    @property
    def columns(self) -> set[str]:
        """All metadata columns."""
        cols = set(self.y)
        if self.row_weights_col:
            cols.add(self.row_weights_col)
        return cols


@dataclass
class InferenceData:
    """A data frame container that includes metadata relevant for machine learning and inference."""

    df: DataFrame
    # default_factory gives each container its own Meta instead of one shared default.
    meta: Meta = field(default_factory=Meta)

    def __post_init__(self) -> None:
        """Parameter validation."""
        if (self.y is not None) and (self.n_rows != len(self.y)):
            raise ValueError("Expected y to have the same number of data points as x has.")
        # TODO: also validate that all columns referenced in self.meta exist in self.df etc

    @property
    def x(self) -> DataFrame:
        """The data frame of predictor variables, excluding output variables and weights, etc."""
        non_metadata_cols = [col for col in self.df if col not in self.meta.columns]
        return self.df[non_metadata_cols]

    @property
    def y(self) -> Optional[np.ndarray]:
        """Output variable values, or None if no output columns are set."""
        if not self.meta.y_cols:
            return None
        if len(self.meta.y_cols) == 1:
            # A single output column is returned as a 1-D array.
            y_col = self.meta.y_cols[0]
            return self.df[y_col].to_numpy()
        return self.df[self.meta.y_cols].to_numpy()

    @property
    def w(self) -> Optional[np.ndarray]:
        """Row weights, or None if no weights column is set."""
        if not self.meta.row_weights_col:
            return None
        return self.df[self.meta.row_weights_col].to_numpy()

    def iloc(self, positional_indexes: Iterable) -> "InferenceData":
        """Return the subset of the data in self corresponding to the specified positional indices."""
        return InferenceData(df=self.df.iloc[positional_indexes], meta=self.meta)

    @property
    def n_rows(self) -> int:
        """Number of rows of data."""
        return self.df.shape[0]

Of course, the usefulness of this solution will depend on wrapping existing machine learning algorithms to accept an InferenceData class as input.
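To make the wrapping idea concrete, here is a minimal sketch of an adapter around any estimator that follows the usual `fit(X, y)` / `predict(X)` convention. `InferenceDataLearner` is a hypothetical name, not an existing scikit-learn class. The container is assumed to expose `x`, `y`, and a `w` attribute that is None when no weights are set, and forwarding weights as `sample_weight` assumes the wrapped estimator's `fit` accepts that keyword, which not every estimator does.

```python
from typing import Any


class InferenceDataLearner:
    """Adapter so an estimator with fit(X, y) / predict(X) accepts a data container.

    Hypothetical helper for illustration; not part of scikit-learn.
    """

    def __init__(self, estimator: Any) -> None:
        self.estimator = estimator

    def fit(self, data: Any) -> "InferenceDataLearner":
        """Unpack predictors, outputs, and optional weights from `data`."""
        if data.y is None:
            raise ValueError("Cannot fit without output columns.")
        if data.w is None:
            self.estimator.fit(data.x, data.y)
        else:
            # Assumption: the estimator's fit supports sample_weight.
            self.estimator.fit(data.x, data.y, sample_weight=data.w)
        return self

    def predict(self, x: Any) -> Any:
        """Delegate prediction to the wrapped estimator."""
        return self.estimator.predict(x)
```

With an adapter like this, the workflow at the top of the issue would read `learner = InferenceDataLearner(some_estimator)` followed by `learner.fit(train_data)` and `learner.predict(test_data.x)`, without changing the underlying estimators.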

Describe alternatives you've considered, if relevant

No response

Additional context

No response


adrinjalali · 2022-12-07T08:44:34Z

Duplicate of #13123

zkurtz added the Needs Triage and New Feature labels Dec 7, 2022

adrinjalali marked this as a duplicate of #13123 Dec 7, 2022

adrinjalali closed this as completed Dec 7, 2022
