Description
Describe the bug
It is known that random forest regression (like many decision-tree-based methods) is not affected by the scale of the data and does not require any scaling of the feature matrix or response vector. This covers all the usual kinds of scaling, such as standardization (removing the mean and dividing by the standard deviation) as well as simple rescaling (multiplication by a constant, or more generally linear transformations).
However, here is an example where the absolute scale drastically affects the performance of random forest: just by multiplying the response by a small constant, the performance collapses. I am fairly sure this is due to numerical errors, but notice that the scale factor itself is not close to machine epsilon.
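For what it's worth, my current guess (an assumption on my side, not verified against the tree code) is that the issue comes from the fact that the MSE split criterion works with squared target values: with a response scaled by 1e-8, node impurities are on the order of 1e-16, which is comparable to double-precision machine epsilon even though the scale factor itself is not. A minimal sketch of that magnitude argument:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=1000)          # target with variance ~ 1
y_scaled = 1e-8 * y                # same target, rescaled as in this report

# The MSE impurity of the root node is essentially the variance of y.
# After rescaling, it is ~ 1e-16, i.e. of the order of float64 epsilon,
# even though the scale factor 1e-8 itself is far from epsilon.
print(np.var(y))                   # ~ 1.0
print(np.var(y_scaled))            # ~ 1e-16
print(np.finfo(np.float64).eps)    # ~ 2.2e-16
```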
**Note:** I actually found this example after noticing that RF was drastically failing on my scientific data, and fixing it by rescaling the response vector to more reasonable values. That is of course a very simple fix, but I can imagine many users running into the same problem and not finding it, given that such rescaling should not be required.
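For users hitting the same problem, here is a sketch of the rescaling workaround described above (I simply rescaled by hand; below it is expressed with `TransformedTargetRegressor` and `StandardScaler`, which should be equivalent, and the data/parameters are only illustrative):

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = 1e-8 * (0.5 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=1000))

# Standardize the response before fitting and undo the scaling at predict time,
# so the forest never sees the tiny absolute scale.
model = TransformedTargetRegressor(
    regressor=RandomForestRegressor(n_estimators=300, max_depth=4, random_state=616),
    transformer=StandardScaler(),
)
model.fit(X, y)
y_pred = model.predict(X)  # predictions come back on the original 1e-8 scale
```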
I am more than happy to help fix this bug, but I wanted to document it first and check with the developers in case there is something I am missing.
Steps/Code to Reproduce
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

np.random.seed(666)

n, p = 1000, 10

# Generate some feature matrix
X = np.random.normal(size=(n, p))
# Generate some simple feature response to predict
Y = 0.5 * X[:, 0] + X[:, 1] + np.random.normal(scale=0.1, size=(n,))

# This breaks at scales ~ 1e-5
response_scale_X = 1
# For response scale smaller than 1e-8 the prediction breaks
response_scale_Y = 1e-8

# Multiply response and/or feature by a numerical constant
X *= response_scale_X
Y *= response_scale_Y

model_rf = RandomForestRegressor(n_estimators=300,
                                 random_state=616,
                                 max_depth=4,
                                 verbose=1)
model_rf.fit(X, Y)
Y_pred = model_rf.predict(X)

# Evaluate the model
rmse = mean_squared_error(Y, Y_pred, squared=False)
r2 = r2_score(Y, Y_pred)
```
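If my numerical-error suspicion is right, the individual trees should barely split at all at the small scale. One way to check this, continuing from the fitted `model_rf` in the snippet above:

```python
# Inspect how much structure the fitted trees actually have.
depths = [est.get_depth() for est in model_rf.estimators_]
n_leaves = [est.get_n_leaves() for est in model_rf.estimators_]
print("max depth over trees:", max(depths))
print("median number of leaves:", int(np.median(n_leaves)))
```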
Expected Results
These are the results when `response_scale_X = 1` and `response_scale_Y = 1`. You can see a very good level of prediction:

```python
print(f"RMSE: {rmse}")
print(f"R² Score: {r2}")
```

```
RMSE: 0.22987215286423807
R² Score: 0.9561243282262417
```

Actual Results
Now, when changing the scale factor, we see how performance deteriorates. Here is an example for `response_scale_X = 1e-6`; for smaller scales, the predicted response is just constant.

```
RMSE: 0.9899243441583097
R² Score: 0.23064636597588262
```

Versions
These are the versions I am using, but I have observed the issue on other machines and with other scikit-learn versions too.
```
System:
    python: 3.12.5 | packaged by conda-forge | (main, Aug 8 2024, 18:31:54) [Clang 16.0.6 ]
executable: /usr/local/Caskroom/miniforge/base/bin/python3.12
   machine: macOS-14.6.1-x86_64-i386-64bit

Python dependencies:
      sklearn: 1.5.2
          pip: 24.2
   setuptools: 73.0.1
        numpy: 2.0.2
        scipy: 1.14.1
       Cython: None
       pandas: 2.2.2
   matplotlib: 3.9.2
       joblib: 1.3.0
threadpoolctl: 3.5.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
    num_threads: 8
         prefix: libomp
       filepath: /Users/facundosapienza/.local/lib/python3.12/site-packages/sklearn/.dylibs/libomp.dylib
        version: None
```