Random forest regression fails when scaling data: probably a numerical error #29922
Open
@facusapienza21


Describe the bug

It is known that random forest regression (like many other decision-tree-based methods) is not affected by the scale of the data and does not require any scaling of the feature matrix or the response vector. This covers all kinds of rescaling, from standard normalization (subtract the mean, divide by the standard deviation) to simple multiplication by a constant or more general linear transformations.
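For context, here is a minimal check of that invariance at moderate scales (a sketch, not from the original report): a decision tree fit on (100·X, 100·y) picks the same splits as one fit on (X, y), so its predictions only differ by the constant factor.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] + 0.1 * rng.normal(size=200)

# Fit the same tree on the raw data and on data multiplied by 100
pred_raw = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y).predict(X)
pred_scaled = DecisionTreeRegressor(max_depth=3, random_state=0).fit(100 * X, 100 * y).predict(100 * X)

# Same splits, so predictions agree up to the factor of 100
print(np.allclose(100 * pred_raw, pred_scaled))  # True at moderate scales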

However, here is an example where the absolute scale drastically affects the performance of random forest: just multiplying the response by a small number makes the performance fall off sharply. I am pretty sure this is associated with numerical errors, but notice that the scale factor is not close to machine epsilon.
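One way to see why scales around 1e-8 are already dangerous even though they are far from the machine epsilon of the values themselves: the impurity a regression tree compares is the variance of the response, which shrinks quadratically with the scale. A quick numeric check (a sketch, not part of the original report):

import numpy as np

rng = np.random.default_rng(0)
y = 0.5 * rng.normal(size=1000) + rng.normal(size=1000)

for scale in [1.0, 1e-5, 1e-8, 1e-9]:
    # Variance scales with the square of the factor, so a factor of 1e-8
    # on the data puts the impurity around 1e-16, i.e. the order of
    # float64 machine epsilon
    print(scale, np.var(scale * y), np.finfo(np.float64).eps)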

**Note:** I actually found this example after first noticing that RF was failing drastically on my scientific data, and fixing it by rescaling the response vector to more reasonable values (a sketch of this workaround is below). That is of course a very simple fix, but I can imagine many users hitting similar problems and not finding it, given that such rescaling should not be required.
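For reference, the workaround can be written so the rescaling is applied and inverted automatically; a minimal sketch using TransformedTargetRegressor (assuming X and Y as in the reproduction below):

from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

# Standardize the response internally, fit on the rescaled values,
# and map predictions back to the original scale automatically
model = TransformedTargetRegressor(
    regressor=RandomForestRegressor(n_estimators=300, max_depth=4, random_state=616),
    transformer=StandardScaler(),
)
model.fit(X, Y)
Y_pred = model.predict(X)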

I am more than happy to help fix this bug, but I wanted to document it first and check with the developers in case there is something I am missing.

Steps/Code to Reproduce

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import root_mean_squared_error, r2_score

np.random.seed(666)
n, p = 1000, 10

# Generate some feature matrix
X = np.random.normal(size=(n,p))
# Generate some simple feature response to predict
Y = 0.5 * X[:, 0] + X[:, 1] + np.random.normal(scale=0.1, size=(n,))

# Scaling the features (X) breaks the fit at scales ~ 1e-5
response_scale_X = 1
# Scaling the response (Y) breaks the fit at scales below ~ 1e-8
response_scale_Y = 1e-8

# Multiply response and/or feature by a numerical constant
X *= response_scale_X
Y *= response_scale_Y

model_rf = RandomForestRegressor(n_estimators=300, 
                                 random_state=616, 
                                 max_depth=4, 
                                 verbose=1)

model_rf.fit(X, Y)

Y_pred = model_rf.predict(X)

# Evaluate the model
rmse = root_mean_squared_error(Y, Y_pred)
r2 = r2_score(Y, Y_pred)

print(f"RMSE: {rmse}")
print(f"R² Score: {r2}")
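To see where the breakdown sets in, a small sweep over the response scale can be run (a hypothetical extension of the script above; it reuses X and Y as generated before the in-place scaling):

for scale in [1.0, 1e-4, 1e-6, 1e-8, 1e-10]:
    m = RandomForestRegressor(n_estimators=50, max_depth=4, random_state=616)
    m.fit(X, scale * Y)
    # R² is scale-invariant for the rescaled problem, so it should stay
    # roughly constant if the fit were unaffected by the scale
    print(scale, r2_score(scale * Y, m.predict(X)))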

Expected Results

These are the results with response_scale_X = 1 and response_scale_Y = 1. You can see a very good prediction level:

> print(f"RMSE: {rmse}")
> print(f"R² Score: {r2}")
RMSE: 0.22987215286423807
R² Score: 0.9561243282262417

Actual Results

Now, when changing the scale factor, we see how the performance deteriorates. Here is an example for response_scale_X = 1e-6; for smaller scales, the prediction is just constant.

RMSE: 0.9899243441583097
R² Score: 0.23064636597588262

Versions

This is the version I am using, but I have observed the issue on other machines and with other sklearn versions as well.

System:
    python: 3.12.5 | packaged by conda-forge | (main, Aug  8 2024, 18:31:54) [Clang 16.0.6 ]
executable: /usr/local/Caskroom/miniforge/base/bin/python3.12
   machine: macOS-14.6.1-x86_64-i386-64bit

Python dependencies:
      sklearn: 1.5.2
          pip: 24.2
   setuptools: 73.0.1
        numpy: 2.0.2
        scipy: 1.14.1
       Cython: None
       pandas: 2.2.2
   matplotlib: 3.9.2
       joblib: 1.3.0
threadpoolctl: 3.5.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
    num_threads: 8
         prefix: libomp
       filepath: /Users/facundosapienza/.local/lib/python3.12/site-packages/sklearn/.dylibs/libomp.dylib
        version: None
