This is perhaps not a bug but an opportunity for improvement. I've noticed that scikit-learn runs considerably faster if I happen to have `import torch` before any `sklearn` imports.
```python
import torch  # The only difference

from sklearn.ensemble import HistGradientBoostingRegressor
import numpy as np

X = np.random.random(size=(50, 10000))
y = np.random.random(size=50)
estimator = HistGradientBoostingRegressor(verbose=True)
estimator.fit(X, y)
```
Here are the run times over 6 runs each on my actual code, the only difference being the import of `torch`:
I know it's confusing that I'm importing torch but not using it, so to be clear, I don't use the torch module in any way on the page. I just happened to stumble across the performance improvement at one point when I imported torch for some other purpose. It's literally just sitting there as an 'unused import' making my code run much faster.
I've tested with a few other regressors, including RandomForestRegressor and GradientBoostingRegressor and I don't see any difference.
I compared `os.environ` in both cases and they're the same. I looked at `sklearn.base.get_config()` and they're identical in both cases too. I notice that torch sets `OMP_NUM_THREADS` to 10, while without the torch import this value is set to 20 (on my machine with 20 cores). But even manually setting this to 10 doesn't bridge the gap.
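For what it's worth, `os.environ` only shows the environment variables, not what the native runtimes ended up configured with. One way to see the effective thread counts of the OpenMP/BLAS pools loaded in the process is threadpoolctl (already a scikit-learn dependency); a minimal sketch:

```python
# Sketch: inspect every native threadpool loaded in this process.
# threadpoolctl ships as a scikit-learn dependency, so it should be installed.
from threadpoolctl import threadpool_info

for pool in threadpool_info():
    # Each entry describes one threadpool (OpenMP, BLAS, ...): which shared
    # library provides it and how many threads it is configured to use.
    print(pool["user_api"], pool["prefix"], pool["num_threads"])
```

Running this with and without the `import torch` line should show which runtime's `num_threads` actually differs between the two cases.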
I don't know enough about torch or sklearn to be able to work out what else is going on, I'm guessing someone who's worked on HistGradientBoostingRegressor might know what's going on? Seems like there's a nice performance gain to be found somewhere in here.
Steps/Code to Reproduce
As above
Expected Results
Should be max fast all the time.
Actual Results
Is not max fast unless I import torch.
Also, as a general thing, it would be nice to be able to pass `n_jobs` to the constructor. Having something use all 20 cores is not always the fastest way.
I suspect an interaction between the OpenMP runtimes linked into the Python process. Based on the threadpoolctl info: section of your report, it seems that torch is shipping a version of libgomp that is configured to use only 10 threads (which I suspect is the number of physical cores of your machine) instead of 20.
You can confirm that your number of physical cores is 10 by evaluating the following in your Python env:
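(The exact snippet referenced above didn't survive the copy; something like this joblib call, which scikit-learn already depends on, performs the same check:)

```python
import joblib

# Number of physical cores, hyperthreading excluded; the
# only_physical_cores flag is available in joblib >= 1.1.
print(joblib.cpu_count(only_physical_cores=True))
```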
If my intuition is correct, this problem should have been fixed in scikit-learn 1.3.0: in this release we have changed the number of threads used by default in OpenMP/Cython code in scikit-learn to match the number of physical cores of the machine instead of the number of logical cores as is typically the default in OpenMP runtimes (see #26082 for details).
Please let us know if upgrading scikit-learn to 1.3.0 fixes the problem for you.
Your intuition is amazing! I do indeed have a mere 10 physical cores (Today I Learned...), and upgrading to 1.3.0 brought the runtime down in line with the runs that imported torch.