[WIP] Add Huber loss criterion to DecisionTreeRegressor and RandomForestRegressor by jimthompson5802 · Pull Request #27932 · scikit-learn/scikit-learn · GitHub

[WIP] Add Huber loss criterion to DecisionTreeRegressor and RandomForestRegressor #27932


Closed
jimthompson5802 wants to merge 24 commits into scikit-learn:main from jimthompson5802:huber-loss

Conversation

@jimthompson5802 commented Dec 11, 2023

Reference Issues/PRs

Fixes #5368 (addresses the unfulfilled request for Huber loss)

What does this implement/fix? Explain your changes.

Adds Huber loss as a valid criterion to the cited estimators. Updated the relevant unit tests for the new criterion.
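
For clarity, here is a minimal usage sketch of the proposed API. It assumes this branch's criterion="huber" string for both estimators; the delta threshold parameter appears in the benchmark script later in this thread, and showing it only on RandomForestRegressor is an assumption about the branch.

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# "huber" is the criterion string proposed on this branch
tree = DecisionTreeRegressor(criterion="huber", random_state=0)

# The benchmark below passes the Huber threshold via delta on the forest
forest = RandomForestRegressor(criterion="huber", delta=1.0, random_state=0)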

Any other comments?

Here is the output from running the relevant CI tests after the modifications.

================================================== test session starts ===================================================
platform linux -- Python 3.10.12, pytest-7.4.0, pluggy-1.2.0
rootdir: /workspaces/scikit-learn
configfile: setup.cfg
plugins: cov-4.1.0
collected 1366 items                                                                                                     

../sklearn/tree/tests/test_export.py .............                                                                 [  0%]
../sklearn/tree/tests/test_monotonic_tree.py ..................................................................... [  6%]
.............................................................................                                      [ 11%]
../sklearn/tree/tests/test_reingold_tilford.py ..                                                                  [ 11%]
../sklearn/tree/tests/test_tree.py ............................................................................... [ 17%]
.................................................................................................................. [ 25%]
.................................................................................................................. [ 34%]
.................................................................................................................. [ 42%]
..........................................                                                                         [ 45%]
../sklearn/ensemble/tests/test_bagging.py ........................................................................ [ 50%]
..............................                                                                                     [ 53%]
../sklearn/ensemble/tests/test_base.py ..                                                                          [ 53%]
../sklearn/ensemble/tests/test_common.py ...................                                                       [ 54%]
../sklearn/ensemble/tests/test_forest.py ......................................................................... [ 60%]
.................................................................................................................. [ 68%]
.................................................................................................................. [ 76%]
............                                                                                                       [ 77%]
../sklearn/ensemble/tests/test_gradient_boosting.py ........................s..................................... [ 82%]
.................................................................                                                  [ 86%]
../sklearn/ensemble/tests/test_iforest.py .............................                                            [ 89%]
../sklearn/ensemble/tests/test_stacking.py ....................................................................... [ 94%]
                                                                                                                   [ 94%]
../sklearn/ensemble/tests/test_voting.py .............................                                             [ 96%]
../sklearn/ensemble/tests/test_weight_boosting.py ..................................................               [100%]

=============================== 1365 passed, 1 skipped, 212 warnings in 106.18s (0:01:46) ================================

github-actions bot commented Dec 11, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 2100f4a.

@jimthompson5802 changed the title from "[MRG] Add Huber loss criterion to DecisionTreeRegressor and RandomForestRegressor" to "[WIP] Add Huber loss criterion to DecisionTreeRegressor and RandomForestRegressor" on Dec 11, 2023
@jimthompson5802 marked this pull request as draft on December 11, 2023 10:48
@glemaitre (Member) commented:

I am wondering if this is actually something that we need, since we have the absolute_error criterion. Do we have a gain in terms of fitting performance (I mean the time to train)?

@jimthompson5802 (Author) commented Dec 11, 2023

@glemaitre Good question.

This came up from a couple of my colleagues. As I understand their issue, they have data with outliers. They were wondering if RF with Huber loss would generate a better-performing model than with "squared_error".

This attempts to illustrate the scenario. Test scenario:

  • Generate single-feature synthetic regression data and force a subset of the data to be outliers.
  • Do train/test split
  • Train RF model with criterion="squared_error"
  • Train RF model changing only criterion to "huber"
  • Compute MSE on test data split for both models.

At least for this example, Huber gives a lower MSE. Whether this is significant or not will probably depend on the situation.

The figure below shows the difference in the test-data metric.

[Figure: test-set MSE for the two criteria]
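
A minimal sketch of the scenario, for anyone who wants to reproduce it (not the exact script behind the figure; it assumes this branch's criterion="huber" and its delta threshold, with delta=1.0 chosen arbitrarily here):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Single-feature synthetic regression data
X, y = make_regression(n_samples=500, n_features=1, noise=5.0, random_state=0)

# Force ~5% of the targets to be outliers
rng = np.random.RandomState(0)
outliers = rng.choice(len(y), size=len(y) // 20, replace=False)
y[outliers] += 10 * y.std()

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Train one model per criterion, changing nothing else
for criterion in ("squared_error", "huber"):
    kwargs = {"delta": 1.0} if criterion == "huber" else {}
    model = RandomForestRegressor(criterion=criterion, random_state=0, **kwargs)
    model.fit(X_train, y_train)
    print(criterion, mean_squared_error(y_test, model.predict(X_test)))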

You raise a valid point re: "absolute_error". Let me extend the example and do a three-way comparison of "squared_error", "absolute_error", and "huber" to see how model performance is affected.

@jimthompson5802 (Author) commented Dec 12, 2023

@glemaitre Hopefully this will answer your question, "Do we have a gain in terms of fitting performance (I mean the time to train)?"

The answer is "Yes, depending on the delta parameter setting." The percent reduction relative to the absolute_error run time ranges from about 10% (delta=1.0) to just over 25% (delta=0.0001).

Let me know if this answered your question. Hopefully this time reduction will be viewed as a benefit to the project.
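
For context (this is the standard definition, not taken from the branch docs): the Huber loss with threshold $\delta$ on a residual $r = y - \hat{y}$ is

$$
L_\delta(r) =
\begin{cases}
\tfrac{1}{2}\,r^2 & \text{if } |r| \le \delta,\\
\delta\left(|r| - \tfrac{1}{2}\,\delta\right) & \text{otherwise.}
\end{cases}
$$

It matches squared error for small residuals and grows linearly like absolute error beyond $\delta$, which is why small delta values push the criterion toward absolute_error-like behavior.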

Test procedure

  • Generate a synthetic dataset using the make_regression function from sklearn.
  • Define parameters for the RandomForestRegressor models.
  • Split the generated dataset into training and testing sets.
  • Define three RandomForestRegressor models with different criteria: default (MSE), absolute error, and Huber.
  • Train each model on the training data and evaluate it on the testing data, collecting the results (criterion, MSE, MAE, and execution time) in a list.
  • Calculate the percent reduction in training time between the absolute error and Huber criteria.
  • Create a figure with two subplots: one for the training time of each criterion and one for the MSE and MAE of each criterion.

Test Results

delta=0.0001

[Figure: random_forest_training_time_delta_0_0001]

delta=0.001

[Figure: random_forest_training_time_delta_0_001]

delta=0.5

[Figure: random_forest_training_time_delta_0_5]

delta=1.0

[Figure: random_forest_training_time_delta_1_0]

Code to reproduce the above test

Running this code requires installing the scikit-learn build from this branch: https://github.com/jimthompson5802/scikit-learn/tree/huber-loss

import time
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(42)

# Define a function to train a model and report the results
def train_model_and_report_time(model, X_train, y_train, X_test, y_test):
    criterion = model.criterion
    print(f"Training model with {criterion} criterion")
    start_time = time.time()
    model.fit(X_train, y_train)
    end_time = time.time()
    execution_time = end_time - start_time
    mse = mean_squared_error(y_test, model.predict(X_test))
    mae = mean_absolute_error(y_test, model.predict(X_test))

    return {"criterion": criterion, "mse": mse, "mae": mae, "execution_time": execution_time}

# Random Forest parameters
N_ESTIMATORS = 200
MAX_DEPTH = 8
DELTA = 0.0001

if __name__ == "__main__":
    # Generate a synthetic dataset
    X, y = make_regression(
        n_samples=2000,
        n_features=5,
        tail_strength=0.9,
        effective_rank=1,
        n_informative=1,
        noise=3,
        bias=20,
        random_state=1,
    )

    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Define the three models to compare; the first uses the default
    # criterion ("squared_error")
    models = [
        RandomForestRegressor(
            max_depth=MAX_DEPTH, n_estimators=N_ESTIMATORS, random_state=42
        ),
        RandomForestRegressor(
            criterion="absolute_error",
            max_depth=MAX_DEPTH, n_estimators=N_ESTIMATORS, random_state=42
        ),
        RandomForestRegressor(
            criterion="huber",
            delta=DELTA,
            max_depth=MAX_DEPTH,
            n_estimators=N_ESTIMATORS,
            random_state=42,
        ),
    ]

    # Train the models and report the results
    results = []
    for model in models:
        results.append(train_model_and_report_time(model, X_train, y_train, X_test, y_test))

    # Create a DataFrame from the results
    df = pd.DataFrame(results)
    print(df)

    # Calculate the percent reduction in training time
    absolute_error_time = df.loc[df.criterion == 'absolute_error', 'execution_time'].values[0]
    huber_time = df.loc[df.criterion == 'huber', 'execution_time'].values[0]
    time_reduction = absolute_error_time - huber_time
    percent_reduction = time_reduction / absolute_error_time * 100


    # Create a figure and two subplots side-by-side
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 6))

    # Plot the training time on the first subplot
    ax1.bar(df['criterion'], df['execution_time'])
    ax1.set_xlabel('Criterion')
    ax1.set_ylabel('Elapsed Time (sec)')
    ax1.text(
        2, 
        huber_time, 
        f'delta={DELTA:0.4f}\nPct Decrease:\n{percent_reduction:.2f}%', 
        ha='center', 
        va='bottom'
        )
    ax1.set_title('Training Time for Different Criteria')

    # Reshape the DataFrame
    df_melted = df.melt(id_vars='criterion', value_vars=['mse', 'mae'], var_name='measure', value_name='value')

    # Plot the MSE and MAE on the second subplot by criterion
    sns.barplot(x='measure', y='value', hue='criterion', data=df_melted, ax=ax2)
    ax2.set_xlabel('Measure')
    ax2.set_ylabel('Value')
    ax2.set_title('MSE and MAE for Different Criteria\nTest Data Set')
    ax2.legend()

    # save plot
    plt.savefig(f'/workspaces/scikit-learn/sandbox/random_forest_training_time_delta_{DELTA}.png')

    # Show the plot
    plt.show()

Linked issue: #5368 "Request more criterion for random forest regression"