High memory usage in HistGradientBoostingClassifier · Issue #18152 · scikit-learn/scikit-learn · GitHub
High memory usage in HistGradientBoostingClassifier #18152
Closed
@Shihab-Shahriar

Description

Related to #16395

On a dataset with 10,000 samples and 100 features, memory usage is 1657 MB; for 200 features it is 3400 MB, and for 400 features, 6627 MB. In comparison, LightGBM uses 95 MB, 181 MB, and 356 MB respectively.

I noticed this while trying to train on MNIST: the program got killed by the OS.

Here is the code I'm using:

# Imports needed to run the snippet (memory_usage comes from the memory_profiler package).
# The experimental import is required in scikit-learn versions where the
# estimator is still experimental (it was made stable later).
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
import lightgbm as lgb
from memory_profiler import memory_usage

X, y = make_classification(n_classes=2, n_samples=10_000, n_features=400)

hgb = HistGradientBoostingClassifier(
    max_iter=500,
    max_leaf_nodes=127,
    learning_rate=.1,
)

lg = lgb.LGBMClassifier(
    n_estimators=500,
    num_leaves=127,
    learning_rate=0.1,
    n_jobs=16
)

mems = memory_usage((hgb.fit, (X, y)))
print(f"{max(mems):.2f}, {max(mems) - min(mems):.2f} MB")  # 2nd value is reported above.

Both were running at 100% CPU on an 8-core/16-thread machine. I had similar results with version 0.23.1.
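As a dependency-free cross-check of the `memory_usage` numbers above (which sample RSS over time), the process-wide peak RSS can also be read from the standard-library `resource` module. A minimal sketch, assuming a Linux system as in the report (note: `ru_maxrss` is in kilobytes on Linux, and this reports the process high-water mark, not a per-call delta):

```python
import resource

def peak_rss_mb():
    """Return this process's peak resident set size in MB.

    ru_maxrss is reported in kilobytes on Linux (bytes on macOS).
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

# Example: compare peak RSS before and after allocating ~80 MB.
before = peak_rss_mb()
data = [0.0] * (10 * 1024 * 1024)  # list of ~10M pointers, ~80 MB
after = peak_rss_mb()
print(f"peak RSS grew by {after - before:.0f} MB")
```

Unlike `memory_profiler`, this cannot attribute memory to a specific call, but it is a quick sanity check that the fit really drove the process to the multi-GB range.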

System Info:

System:
    python: 3.7.7 (default, Mar 26 2020, 15:48:22)  [GCC 7.3.0]
executable: /home/shihab/anaconda3/bin/python
   machine: Linux-5.8.0-050800-generic-x86_64-with-debian-bullseye-sid

Python dependencies:
          pip: 20.2.1
   setuptools: 49.2.0
      sklearn: 0.24.dev0
        numpy: 1.19.1
        scipy: 1.5.0
       Cython: 0.29.21
       pandas: 1.1.0
   matplotlib: 3.2.2
       joblib: 0.16.0
threadpoolctl: 2.1.0

Built with OpenMP: True
