[WIP] Add Huber loss criterion to DecisionTreeRegressor and RandomForestRegressor by jimthompson5802 · Pull Request #27932 · scikit-learn/scikit-learn · GitHub

[WIP] Add Huber loss criterion to DecisionTreeRegressor and RandomForestRegressor #27932


Closed
jimthompson5802 wants to merge 24 commits into scikit-learn:main from jimthompson5802:huber-loss

Conversation

@jimthompson5802 commented Dec 11, 2023

Reference Issues/PRs

Fixes #5368 (addresses the unfulfilled request for Huber loss)

What does this implement/fix? Explain your changes.

Adds Huber loss as a valid criterion to the cited estimators. Updated the relevant unit tests for the new criterion.
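
For clarity, here is a minimal usage sketch of the proposed API. It assumes this branch's criterion="huber" string for both estimators; the delta threshold parameter appears in the benchmark script later in this thread, and showing it only on RandomForestRegressor is an assumption about the branch.

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# "huber" is the criterion string proposed on this branch
tree = DecisionTreeRegressor(criterion="huber", random_state=0)

# The benchmark below passes the Huber threshold via delta on the forest
forest = RandomForestRegressor(criterion="huber", delta=1.0, random_state=0)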

Any other comments?

Here is the output from running the relevant CI tests after the modifications.

================================================== test session starts ===================================================
platform linux -- Python 3.10.12, pytest-7.4.0, pluggy-1.2.0
rootdir: /workspaces/scikit-learn
configfile: setup.cfg
plugins: cov-4.1.0
collected 1366 items                                                                                                     

../sklearn/tree/tests/test_export.py .............                                                                 [  0%]
../sklearn/tree/tests/test_monotonic_tree.py ..................................................................... [  6%]
.............................................................................                                      [ 11%]
../sklearn/tree/tests/test_reingold_tilford.py ..                                                                  [ 11%]
../sklearn/tree/tests/test_tree.py ............................................................................... [ 17%]
.................................................................................................................. [ 25%]
.................................................................................................................. [ 34%]
.................................................................................................................. [ 42%]
..........................................                                                                         [ 45%]
../sklearn/ensemble/tests/test_bagging.py ........................................................................ [ 50%]
..............................                                                                                     [ 53%]
../sklearn/ensemble/tests/test_base.py ..                                                                          [ 53%]
../sklearn/ensemble/tests/test_common.py ...................                                                       [ 54%]
../sklearn/ensemble/tests/test_forest.py ......................................................................... [ 60%]
.................................................................................................................. [ 68%]
.................................................................................................................. [ 76%]
............                                                                                                       [ 77%]
../sklearn/ensemble/tests/test_gradient_boosting.py ........................s..................................... [ 82%]
.................................................................                                                  [ 86%]
../sklearn/ensemble/tests/test_iforest.py .............................                                            [ 89%]
../sklearn/ensemble/tests/test_stacking.py ....................................................................... [ 94%]
                                                                                                                   [ 94%]
../sklearn/ensemble/tests/test_voting.py .............................                                             [ 96%]
../sklearn/ensemble/tests/test_weight_boosting.py ..................................................               [100%]

=============================== 1365 passed, 1 skipped, 212 warnings in 106.18s (0:01:46) ================================

github-actions bot commented Dec 11, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 2100f4a.

@jimthompson5802 changed the title from "[MRG] Add Huber loss criterion to DecisionTreeRegressor and RandomForestRegressor" to "[WIP] Add Huber loss criterion to DecisionTreeRegressor and RandomForestRegressor" on Dec 11, 2023
@jimthompson5802 marked this pull request as draft on December 11, 2023 10:48
@glemaitre (Member) commented:

I am wondering if this is actually something that we need, since we have the absolute_error criterion. Do we have a gain in terms of fitting performance (I mean the time to train)?

@jimthompson5802 (Author) commented Dec 11, 2023

@glemaitre Good question.

This came up from a couple of my colleagues. As I understand their issue, they have data with outliers. They were wondering if RF with Huber loss would generate a better-performing model than with "squared_error".

This attempts to illustrate the scenario. Test scenario:

  • Generate single-feature synthetic regression data and force a subset of the data to be outliers.
  • Do train/test split
  • Train RF model with criterion="squared_error"
  • Train RF model changing only criterion to "huber"
  • Compute MSE on test data split for both models.

At least for this example, Huber gives a lower MSE. Whether this is significant or not will probably depend on the situation.

The figure below shows the difference in the test-data metric.

[Figure: test-set MSE for the two criteria]
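
A minimal sketch of the scenario, for anyone who wants to reproduce it (not the exact script behind the figure; it assumes this branch's criterion="huber" and its delta threshold, with delta=1.0 chosen arbitrarily here):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Single-feature synthetic regression data
X, y = make_regression(n_samples=500, n_features=1, noise=5.0, random_state=0)

# Force ~5% of the targets to be outliers
rng = np.random.RandomState(0)
outliers = rng.choice(len(y), size=len(y) // 20, replace=False)
y[outliers] += 10 * y.std()

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Train one model per criterion, changing nothing else
for criterion in ("squared_error", "huber"):
    kwargs = {"delta": 1.0} if criterion == "huber" else {}
    model = RandomForestRegressor(criterion=criterion, random_state=0, **kwargs)
    model.fit(X_train, y_train)
    print(criterion, mean_squared_error(y_test, model.predict(X_test)))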

You raise a valid point re: "absolute_error". Let me extend the example and do a three-way comparison of "squared_error", "absolute_error", and "huber" to see how model performance is affected.

@jimthompson5802 (Author) commented Dec 12, 2023

@glemaitre Hopefully this will answer your question, "Do we have a gain in terms of fitting performance (I mean the time to train)?"

The answer is "Yes, depending on the delta parameter setting." The percent reduction relative to the absolute_error run time ranges from about 10% (delta=1.0) to just over 25% (delta=0.0001).

Let me know if this answered your question. Hopefully this time reduction will be viewed as a benefit to the project.
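
For context (this is the standard definition, not taken from the branch docs): the Huber loss with threshold $\delta$ on a residual $r = y - \hat{y}$ is

$$
L_\delta(r) =
\begin{cases}
\tfrac{1}{2}\,r^2 & \text{if } |r| \le \delta,\\
\delta\left(|r| - \tfrac{1}{2}\,\delta\right) & \text{otherwise.}
\end{cases}
$$

It matches squared error for small residuals and grows linearly like absolute error beyond $\delta$, which is why small delta values push the criterion toward absolute_error-like behavior.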

Test procedure

  • Generate a synthetic dataset using the make_regression function from sklearn.
  • Define parameters for the RandomForestRegressor models.
  • Split the generated dataset into training and testing sets.
  • Define three RandomForestRegressor models with different criteria: default (MSE), absolute error, and Huber.
  • Train each model on the training data and evaluate it on the testing data, collecting the results (criterion, MSE, MAE, and execution time) in a list.
  • Calculate the percent reduction in training time between the absolute error and Huber criteria.
  • Create a figure with two subplots: one for the training time of each criterion and one for the MSE and MAE of each criterion.

Test Results

delta=0.0001

[Figure: random_forest_training_time_delta_0_0001]

delta=0.001

[Figure: random_forest_training_time_delta_0_001]

delta=0.5

[Figure: random_forest_training_time_delta_0_5]

delta=1.0

[Figure: random_forest_training_time_delta_1_0]

Code to reproduce the above test

Running this code requires installing the scikit-learn build from this branch: https://github.com/jimthompson5802/scikit-learn/tree/huber-loss

import time
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(42)

# Define a function to train a model and report the results
def train_model_and_report_time(model, X_train, y_train, X_test, y_test):
    criterion = model.criterion
    print(f"Training model with {criterion} criterion")
    start_time = time.time()
    model.fit(X_train, y_train)
    end_time = time.time()
    execution_time = end_time - start_time
    mse = mean_squared_error(y_test, model.predict(X_test))
    mae = mean_absolute_error(y_test, model.predict(X_test))

    return {"criterion": criterion, "mse": mse, "mae": mae, "execution_time": execution_time}

# Random Forest parameters
N_ESTIMATORS = 200
MAX_DEPTH = 8
DELTA = 0.0001

if __name__ == "__main__":
    # Generate a synthetic dataset
    X, y = make_regression(
        n_samples=2000,
        n_features=5,
        tail_strength=0.9,
        effective_rank=1,
        n_informative=1,
        noise=3,
        bias=20,
        random_state=1,
    )

    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Define the three models to compare; the first uses the default
    # criterion ("squared_error")
    models = [
        RandomForestRegressor(
            max_depth=MAX_DEPTH, n_estimators=N_ESTIMATORS, random_state=42
        ),
        RandomForestRegressor(
            criterion="absolute_error",
            max_depth=MAX_DEPTH, n_estimators=N_ESTIMATORS, random_state=42
        ),
        RandomForestRegressor(
            criterion="huber",
            delta=DELTA,
            max_depth=MAX_DEPTH,
            n_estimators=N_ESTIMATORS,
            random_state=42,
        ),
    ]

    # Train the models and report the results
    results = []
    for model in models:
        results.append(train_model_and_report_time(model, X_train, y_train, X_test, y_test))

    # Create a DataFrame from the results
    df = pd.DataFrame(results)
    print(df)

    # Calculate the percent reduction in training time
    absolute_error_time = df.loc[df.criterion == 'absolute_error', 'execution_time'].values[0]
    huber_time = df.loc[df.criterion == 'huber', 'execution_time'].values[0]
    time_reduction = absolute_error_time - huber_time
    percent_reduction = time_reduction / absolute_error_time * 100


    # Create a figure and two subplots side-by-side
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 6))

    # Plot the training time on the first subplot
    ax1.bar(df['criterion'], df['execution_time'])
    ax1.set_xlabel('Criterion')
    ax1.set_ylabel('Elapsed Time (sec)')
    ax1.text(
        2, 
        huber_time, 
        f'delta={DELTA:0.4f}\nPct Decrease:\n{percent_reduction:.2f}%', 
        ha='center', 
        va='bottom'
        )
    ax1.set_title('Training Time for Different Criteria')

    # Reshape the DataFrame
    df_melted = df.melt(id_vars='criterion', value_vars=['mse', 'mae'], var_name='measure', value_name='value')

    # Plot the MSE and MAE on the second subplot by criterion
    sns.barplot(x='measure', y='value', hue='criterion', data=df_melted, ax=ax2)
    ax2.set_xlabel('Measure')
    ax2.set_ylabel('Value')
    ax2.set_title('MSE and MAE for Different Criteria\nTest Data Set')
    ax2.legend()

    # save plot
    plt.savefig(f'/workspaces/scikit-learn/sandbox/random_forest_training_time_delta_{DELTA}.png')

    # Show the plot
    plt.show()

Linked issue: #5368 "Request more criterion for random forest regression"