Multicore scalability of the Histogram-based GBDT · Issue #14306 · scikit-learn/scikit-learn · GitHub

Multicore scalability of the Histogram-based GBDT #14306
Open
@ogrisel

Description

I observed that our OpenMP-based GBDT does not scale well beyond 4 or 8 threads.

@SmirnovEgorRu, who has already contributed scalability improvements to xgboost, sent me the following analysis by email, reproduced here with his permission:

I had an initial look at the histogram-building functions and observed a few things:

Scikit-learn's code spends too much time in np.zeros() calls, which are used to create the histograms and fill them with zeros. In theory this is a cheap operation, but it is sequential and does not use native code. I replaced np.zeros() with np.empty() and scaling improved: at the maximum number of threads, the performance improvement is around 30%.
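To make the np.zeros() versus np.empty() point concrete, here is a minimal sketch in plain NumPy. It is not scikit-learn's actual Cython code: the function names, the 256-bin default, and the float64 histogram layout are assumptions made for illustration. The idea shown is that np.empty() only allocates, so each worker can zero the slice it owns right before accumulating into it, instead of paying for one sequential zero-fill up front.

```python
import numpy as np

N_BINS = 256  # assumed number of bins per feature (hypothetical default)


def alloc_histograms_zeros(n_features, n_bins=N_BINS):
    # Allocates and zero-fills everything in one sequential call.
    return np.zeros((n_features, n_bins), dtype=np.float64)


def alloc_histograms_empty(n_features, n_bins=N_BINS):
    # Allocates only; the contents are arbitrary until initialized.
    return np.empty((n_features, n_bins), dtype=np.float64)


def build_feature_histogram(binned_column, gradients, out_row):
    # The worker that owns this row zeroes it just before use, so the
    # initialization cost is spread across workers instead of being
    # paid once, sequentially, at allocation time.
    out_row[:] = 0.0
    # Unbuffered accumulation: repeated bin indices are all summed.
    np.add.at(out_row, binned_column, gradients)
```

Both allocation paths must yield the same histograms; the only intended difference is where and when the zeroing happens.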

For nodes low in the decision tree, the number of rows can be very small (< 1000). In this case, creating a parallel loop brings more overhead than benefit. A similar issue was covered in hcho3/xgboost-fast-hist-perf-lab#15; for background, you can read the README in that repository. I hope you will find something interesting there.
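One way to act on the small-node observation is to fall back to a serial loop below a row-count threshold. The sketch below illustrates that dispatch in Python, with a thread pool standing in for the OpenMP parallel loop. The names `build_histograms` and `MIN_SAMPLES_FOR_PARALLEL` are hypothetical, and the cutoff of 1000 is only taken from the "< 1000" figure above; a real value would have to come from benchmarking. In OpenMP itself, the `if` clause on a parallel construct can express the same condition.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cutoff; would need tuning on real hardware.
MIN_SAMPLES_FOR_PARALLEL = 1000


def _feature_histogram(binned_column, gradients, n_bins):
    # Sum of gradients per bin for a single feature.
    return np.bincount(binned_column, weights=gradients, minlength=n_bins)


def build_histograms(X_binned, gradients, sample_indices,
                     n_bins=256, n_threads=4):
    # Restrict to the samples that reached this tree node.
    X_node = X_binned[sample_indices]
    g_node = gradients[sample_indices]
    n_features = X_node.shape[1]

    def one(j):
        return _feature_histogram(X_node[:, j], g_node, n_bins)

    if sample_indices.shape[0] < MIN_SAMPLES_FOR_PARALLEL or n_threads == 1:
        # Small node: thread start-up would cost more than the work itself.
        rows = [one(j) for j in range(n_features)]
    else:
        # Large node: one task per feature.
        with ThreadPoolExecutor(max_workers=n_threads) as pool:
            rows = list(pool.map(one, range(n_features)))
    return np.vstack(rows)
```

Both branches compute the same result, so the threshold only affects speed, never the fitted model.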

The link above also discusses the impact of float <-> double conversions on performance. I looked at the assembly generated for scikit-learn's histogram building and observed such conversions there too. I haven't tested it, but removing these conversions may help.

Excluding np.zeros(), histogram building scales okay when only one processor is used. On multi-socket (NUMA) systems, it scales very poorly.

Here I can recommend:

  • Split the bin matrix by rows into <# of NUMA nodes> parts and allocate the memory for each part on a separate NUMA node.

  • Pin threads to NUMA nodes. Each thread should only use the part of the bin matrix that lives on the NUMA node it is pinned to.
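The row-wise split recommended above can be sketched as follows. This is only the data-partitioning half of the idea, in plain NumPy with hypothetical function names: it splits the binned matrix into one contiguous row block per NUMA node, builds a partial histogram per block, and sums the partial results. It does not control where memory is physically placed or where threads run. On Linux that would additionally require pinning each worker (for example with `os.sched_setaffinity` or a tool such as `numactl`) and letting the pinned worker perform the copy so that its pages are allocated locally; that part is an assumption left out of the sketch.

```python
import numpy as np


def split_rows_by_node(X_binned, gradients, n_numa_nodes):
    # One contiguous block of rows per NUMA node. Each block is copied so
    # that it is a separate allocation; in a real implementation the copy
    # would be made by a thread already pinned to the target node.
    row_chunks = np.array_split(np.arange(X_binned.shape[0]), n_numa_nodes)
    return [(X_binned[rows].copy(), gradients[rows].copy())
            for rows in row_chunks]


def partial_histograms(X_part, g_part, n_bins=256):
    # Histograms computed from one node-local block of rows only.
    return np.vstack([
        np.bincount(X_part[:, j], weights=g_part, minlength=n_bins)
        for j in range(X_part.shape[1])
    ])


def build_histograms_partitioned(X_binned, gradients, n_numa_nodes,
                                 n_bins=256):
    parts = split_rows_by_node(X_binned, gradients, n_numa_nodes)
    # Each partial histogram touches only its own block; the final
    # reduction is small (n_features x n_bins per node).
    return np.sum([partial_histograms(X, g, n_bins) for X, g in parts],
                  axis=0)
```

Because histograms are sums over rows, adding the per-block histograms gives the same result as a single pass over the whole matrix.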

/cc @NicolasHug @jeremiedbb @adrinjalali @glemaitre @pierreglaser
