High memory usage in HistGradientBoostingClassifier · Issue #18152 · scikit-learn/scikit-learn · GitHub
High memory usage in HistGradientBoostingClassifier #18152
Closed
@Shihab-Shahriar

Description

Related to #16395

On a dataset with 10,000 samples and 100 features, memory usage is 1657 MB; for 200 features it is 3400 MB, and for 400 features, 6627 MB. In comparison, LightGBM uses 95 MB, 181 MB, and 356 MB respectively.

I noticed this while trying to train on MNIST: the program got killed by the OS.

Here is the code I'm using:

# Imports needed to run the snippet (memory_usage comes from the memory_profiler package).
# The experimental import is required in scikit-learn versions where the
# estimator is still experimental (it was made stable later).
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
import lightgbm as lgb
from memory_profiler import memory_usage

X, y = make_classification(n_classes=2, n_samples=10_000, n_features=400)

hgb = HistGradientBoostingClassifier(
    max_iter=500,
    max_leaf_nodes=127,
    learning_rate=.1,
)

lg = lgb.LGBMClassifier(
    n_estimators=500,
    num_leaves=127,
    learning_rate=0.1,
    n_jobs=16
)

mems = memory_usage((hgb.fit, (X, y)))
print(f"{max(mems):.2f}, {max(mems) - min(mems):.2f} MB")  # 2nd value is reported above.

Both were running at 100% CPU on an 8-core/16-thread machine. I had similar results with version 0.23.1.
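As a dependency-free cross-check of the `memory_usage` numbers above (which sample RSS over time), the process-wide peak RSS can also be read from the standard-library `resource` module. A minimal sketch, assuming a Linux system as in the report (note: `ru_maxrss` is in kilobytes on Linux, and this reports the process high-water mark, not a per-call delta):

```python
import resource

def peak_rss_mb():
    """Return this process's peak resident set size in MB.

    ru_maxrss is reported in kilobytes on Linux (bytes on macOS).
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

# Example: compare peak RSS before and after allocating ~80 MB.
before = peak_rss_mb()
data = [0.0] * (10 * 1024 * 1024)  # list of ~10M pointers, ~80 MB
after = peak_rss_mb()
print(f"peak RSS grew by {after - before:.0f} MB")
```

Unlike `memory_profiler`, this cannot attribute memory to a specific call, but it is a quick sanity check that the fit really drove the process to the multi-GB range.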

System Info:

System:
    python: 3.7.7 (default, Mar 26 2020, 15:48:22)  [GCC 7.3.0]
executable: /home/shihab/anaconda3/bin/python
   machine: Linux-5.8.0-050800-generic-x86_64-with-debian-bullseye-sid

Python dependencies:
          pip: 20.2.1
   setuptools: 49.2.0
      sklearn: 0.24.dev0
        numpy: 1.19.1
        scipy: 1.5.0
       Cython: 0.29.21
       pandas: 1.1.0
   matplotlib: 3.2.2
       joblib: 0.16.0
threadpoolctl: 2.1.0

Built with OpenMP: True
