Description
I observed that our OpenMP-based GBDT does not scale well beyond 4 or 8 threads.
@SmirnovEgorRu, who already contributed scalability improvements to XGBoost, sent me the following analysis by email; it is reproduced here with his permission:
I had an initial look at the histogram building functions and observed a few things:
Scikit-learn’s code spends too much time in np.zeros(), which is used to create histograms filled with zeroes. In theory this is a cheap operation, but it is sequential and does not use native code. I replaced np.zeros() with np.empty() and scaling improved; with the maximum number of threads, the performance improvement was around 30%.
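A minimal sketch of the allocation difference being described (the shapes below are illustrative, not scikit-learn's actual histogram layout): np.zeros must write zeroes into the whole buffer in single-threaded C code, while np.empty only reserves memory. Using np.empty is only safe if the histogram kernel itself initializes or overwrites every bin, e.g. inside its parallel loop.

```python
import numpy as np
from timeit import timeit

# Hypothetical histogram shape: one 256-bin histogram per feature.
n_bins, n_features = 256, 100

# np.zeros allocates AND zero-fills (sequentially, in NumPy's C code).
t_zeros = timeit(lambda: np.zeros((n_features, n_bins)), number=1000)

# np.empty only allocates; the contents are uninitialized garbage, so the
# histogram kernel must set every bin itself before reading it back.
t_empty = timeit(lambda: np.empty((n_features, n_bins)), number=1000)

print(f"np.zeros: {t_zeros:.4f}s  np.empty: {t_empty:.4f}s")
```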
For lower decision-tree nodes, the number of rows can be very small (< 1000). In that case, creating a parallel loop brings more overhead than value. A similar issue was covered in hcho3/xgboost-fast-hist-perf-lab#15; to understand the background of the issue, you can read the README in that repository. I hope you will find something interesting there.
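The idea above can be sketched as a row-count threshold below which the builder stays sequential. The threshold value, helper names, and thread-based parallelism here are all assumptions for illustration, not scikit-learn's actual implementation:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cutoff: below this many rows, spawning and synchronizing worker
# threads costs more than the histogram work itself. Tune per machine.
PARALLEL_THRESHOLD = 1000

def build_histogram(binned_col, gradients, n_bins=256):
    # Per-feature histogram: sum of gradients falling into each bin.
    hist = np.zeros(n_bins)
    np.add.at(hist, binned_col, gradients)  # unbuffered in-place accumulation
    return hist

def build_histograms(binned, gradients, n_threads=4):
    n_rows, n_features = binned.shape
    cols = [binned[:, j] for j in range(n_features)]
    if n_rows < PARALLEL_THRESHOLD or n_threads == 1:
        # Small node: sequential loop avoids the parallel-region overhead.
        return [build_histogram(c, gradients) for c in cols]
    with ThreadPoolExecutor(n_threads) as pool:
        return list(pool.map(lambda c: build_histogram(c, gradients), cols))
```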
The link above also mentions the performance impact of float <-> double conversions. I looked at the assembly of Scikit-learn’s histogram building and observed such conversions too. I haven’t tested it, but removing these conversions could help.
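As a small illustration of the mixed-precision issue (the shapes and dtypes here are assumptions, not taken from scikit-learn's code): accumulating float32 gradients into a float64 histogram forces a widening conversion on every addition, whereas keeping one dtype end to end avoids it.

```python
import numpy as np

rng = np.random.default_rng(0)
gradients = rng.standard_normal(10_000).astype(np.float32)  # float32 gradients
bins = rng.integers(0, 256, size=10_000)

# float64 histogram: every accumulated gradient is converted float32 -> float64.
hist64 = np.zeros(256)
np.add.at(hist64, bins, gradients)

# Same dtype as the gradients: the accumulation needs no conversion.
hist32 = np.zeros(256, dtype=np.float32)
np.add.at(hist32, bins, gradients)
```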
If we exclude np.zeros(), histogram building scales okay as long as we use only one processor. On multi-socket (NUMA) systems, it scales very poorly.
Here I can recommend to:

- Split the bin matrix into <# of NUMA nodes> parts by rows and allocate memory on each NUMA node separately.
- Pin threads to NUMA nodes. Each thread should use the bin matrix only from its pinned NUMA node.
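The two recommendations above can be sketched as follows. This is a toy illustration, assuming Linux (`os.sched_setaffinity`) and a hypothetical two-node machine whose CPU-to-node mapping is hard-coded; a real implementation would query the topology (e.g. via libnuma) and do this in native code:

```python
import os
import threading
import numpy as np

# Assumed topology: CPUs 0-3 on node 0, CPUs 4-7 on node 1 (hypothetical).
NODE_CPUS = {0: {0, 1, 2, 3}, 1: {4, 5, 6, 7}}

def node_worker(node, binned_chunk, grad_chunk, out):
    # Pin this thread to its NUMA node's CPUs; the first-touch policy then
    # places the per-node histogram in that node's local memory.
    try:
        os.sched_setaffinity(0, NODE_CPUS[node])
    except (AttributeError, OSError):
        pass  # not Linux, or those CPUs don't exist: run unpinned
    hist = np.zeros(256)  # first touch happens on the (hopefully local) node
    np.add.at(hist, binned_chunk, grad_chunk)
    out[node] = hist

def numa_histogram(binned_col, gradients):
    # Split rows into one chunk per node, build node-local partial histograms,
    # then reduce them into one result.
    out = {}
    chunks = np.array_split(np.arange(len(binned_col)), len(NODE_CPUS))
    threads = [
        threading.Thread(target=node_worker,
                         args=(node, binned_col[idx], gradients[idx], out))
        for node, idx in zip(NODE_CPUS, chunks)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(out.values())  # combine per-node partials
```

The reduction at the end is cheap (256 bins per node) compared to the scan over the rows, which is the part that benefits from staying node-local.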
/cc @NicolasHug @jeremiedbb @adrinjalali @glemaitre @pierreglaser