HistGradientBoostingRegressor is slower when torch not imported · Issue #26752 · scikit-learn/scikit-learn
HistGradientBoostingRegressor is slower when torch not imported #26752

Closed
davidgilbertson opened this issue Jul 3, 2023 · 2 comments

@davidgilbertson
Contributor

Describe the bug

This is perhaps not a bug but an opportunity for improvement. I've noticed that scikit-learn runs considerably faster if I happen to import torch before any sklearn imports.

This first block of code runs much slower:

from sklearn.ensemble import HistGradientBoostingRegressor
import numpy as np


X = np.random.random(size=(50, 10000))
y = np.random.random(size=50)

estimator = HistGradientBoostingRegressor(verbose=True)
estimator.fit(X, y)

than this second block of code:

import torch  # The only difference
from sklearn.ensemble import HistGradientBoostingRegressor
import numpy as np


X = np.random.random(size=(50, 10000))
y = np.random.random(size=50)

estimator = HistGradientBoostingRegressor(verbose=True)
estimator.fit(X, y)

Here are the run times over 6 runs each on my actual code; the only difference is the import of torch:
[Image: run-time comparison, with vs. without the torch import]

I know it's confusing that I'm importing torch but not using it, so to be clear: I don't use the torch module in any way in this code. I just happened to stumble across the performance improvement at one point when I imported torch for some other purpose. It's literally just sitting there as an 'unused import', making my code run much faster.

I've tested with a few other regressors, including RandomForestRegressor and GradientBoostingRegressor, and I don't see any difference.

I compared os.environ in both cases and they're the same. I looked at sklearn.base.get_config() and they're identical in both cases too. I notice that torch sets OMP_NUM_THREADS to 10, while without the torch import this value is set to 20 (on my machine with 20 cores). But even manually setting this to 10 doesn't bridge the gap.
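
For anyone trying to reproduce the comparison, the loaded thread pools can be dumped directly with threadpoolctl (a minimal sketch using only threadpool_info(), the same mechanism behind the threadpoolctl info section below; the torch import is the only line to toggle between runs):

from pprint import pprint

# import torch  # toggle this line between the two runs

import numpy as np  # noqa: F401
from sklearn.ensemble import HistGradientBoostingRegressor  # noqa: F401
from threadpoolctl import threadpool_info

# One dict per loaded OpenMP/BLAS runtime (prefix, filepath, num_threads, ...)
pprint(threadpool_info())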

I don't know enough about torch or sklearn to work out what else is going on; I'm guessing someone who has worked on HistGradientBoostingRegressor might know? Seems like there's a nice performance gain to be found somewhere in here.

Steps/Code to Reproduce

As above

Expected Results

Should be max fast all the time.

Actual Results

Is not max fast unless I import torch.

Also, as a general thing, it would be nice to be able to pass n_jobs to the constructor. Having something use all 20 cores is not always the fastest way.
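
For reference, until something like n_jobs exists on this estimator, a possible workaround is to cap the OpenMP thread count around an individual fit with threadpoolctl (a sketch, not a scikit-learn API; the limit of 10 is just an example value):

import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor
from threadpoolctl import threadpool_limits

X = np.random.random(size=(50, 10000))
y = np.random.random(size=50)

estimator = HistGradientBoostingRegressor(verbose=True)

# Cap the loaded OpenMP runtimes to 10 threads for this fit only
with threadpool_limits(limits=10, user_api="openmp"):
    estimator.fit(X, y)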

Versions

System:
    python: 3.10.8 (main, Oct 12 2022, 19:14:26) [GCC 9.4.0]
executable: /home/davidg/.virtualenvs/learning/bin/python
   machine: Linux-5.15.90.1-microsoft-standard-WSL2-x86_64-with-glibc2.31
Python dependencies:
      sklearn: 1.2.2
          pip: 23.1.2
   setuptools: 59.5.0
        numpy: 1.24.3
        scipy: 1.10.1
       Cython: 0.29.33
       pandas: 2.0.1
   matplotlib: 3.7.0
       joblib: 1.2.0
threadpoolctl: 3.1.0
Built with OpenMP: True
threadpoolctl info:
       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /home/davidg/.virtualenvs/learning/lib/python3.10/site-packages/numpy.libs/libopenblas64_p-r0-15028c96.3.21.so
        version: 0.3.21
threading_layer: pthreads
   architecture: Haswell
    num_threads: 20
       user_api: openmp
   internal_api: openmp
         prefix: libgomp
       filepath: /home/davidg/.virtualenvs/learning/lib/python3.10/site-packages/torch/lib/libgomp-a34b3233.so.1
        version: None
    num_threads: 10
       user_api: openmp
   internal_api: openmp
         prefix: libgomp
       filepath: /home/davidg/.virtualenvs/learning/lib/python3.10/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
        version: None
    num_threads: 20
       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /home/davidg/.virtualenvs/learning/lib/python3.10/site-packages/scipy.libs/libopenblasp-r0-41284840.3.18.so
        version: 0.3.18
threading_layer: pthreads
   architecture: Haswell
    num_threads: 20
@davidgilbertson added the Bug and Needs Triage labels on Jul 3, 2023
@ogrisel
Member
ogrisel commented Jul 3, 2023

I suspect an interaction between the OpenMP runtimes linked into the Python process. Based on the threadpoolctl info: section of your report, it seems that torch ships a version of libgomp that is configured to use only 10 threads (which I suspect is the number of physical cores on your machine) instead of 20.

You can confirm that your number of physical cores is 10 by evaluating the following in your Python env:

import joblib

# Physical core count, hyperthreads excluded
print(joblib.cpu_count(only_physical_cores=True))

If my intuition is correct, this problem should already be fixed in scikit-learn 1.3.0: in that release we changed the default number of threads used by OpenMP/Cython code in scikit-learn to match the number of physical cores of the machine, instead of the number of logical cores, which is typically the default in OpenMP runtimes (see #26082 for details).
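
For anyone stuck on an older release, a rough way to approximate the new default by hand is to cap the OpenMP pools at the physical core count (a sketch combining joblib and threadpoolctl, not the mechanism used in #26082 itself):

import joblib
from sklearn.ensemble import HistGradientBoostingRegressor  # import first so sklearn's OpenMP runtime is loaded
from threadpoolctl import threadpool_limits

# Physical core count as reported by joblib (10 on the machine in this report)
n_physical = joblib.cpu_count(only_physical_cores=True)

# Cap the already-loaded OpenMP runtimes for the rest of the process
limiter = threadpool_limits(limits=n_physical, user_api="openmp")

# ... then build and fit HistGradientBoostingRegressor as usual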

Please let us know if upgrading scikit-learn to 1.3.0 fixes the problem for you.

@davidgilbertson
Contributor Author

Your intuition is amazing! I do indeed have a mere 10 physical cores (Today I Learned...), and upgrading to 1.3.0 brought the runtime down in line with the version that had torch imported.

Thanks for the quick reply.
