Memory mapping causes disk space usage bloat in big models with many search parameters · Issue #19608 · scikit-learn/scikit-learn
Open
@asansal-quantico

Description


Hello,

We use GridSearchCV in a project where our training data is quite large (approx. 20,000 rows x 100 columns/features). We search a large space of hyperparameters, and with 5-10 cross-validation folds the total number of fitted models sometimes reaches 1,000.

When running with the default joblib parallelization, the search generates a lot of memory-mapped data on disk, using disk space on the scale of hundreds of gigabytes.
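
For concreteness, here is a minimal sketch of a setup at this scale; the estimator, grid, and synthetic data are assumptions chosen only to match the sizes described above, not our actual pipeline:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.rand(20_000, 100)            # ~20,000 rows x 100 features
y = rng.randint(0, 2, size=20_000)   # binary target

# 5 * 4 * 5 = 100 candidates; with cv=10 that is 1,000 model fits.
param_grid = {
    "n_estimators": [100, 200, 300, 400, 500],
    "max_depth": [None, 5, 10, 20],
    "min_samples_leaf": [1, 2, 5, 10, 20],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=10,
    # With joblib's default max_nbytes='1M', arrays larger than 1 MB passed
    # to the workers are memory-mapped to a temporary folder on disk.
    n_jobs=-1,
)
# search.fit(X, y)
```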

Solution:
Once we disable memory mapping, the search runs successfully without using the disk. The fix is to call joblib's Parallel(..., max_nbytes=None) within GridSearchCV; see the referenced code below. joblib's default for max_nbytes is '1M' (1 megabyte); passing None disables memory mapping.
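
For anyone hitting this before such a kwarg exists, a possible caller-side workaround is joblib's parallel_config context manager (available in joblib >= 1.3, which is an assumption about the environment); Parallel calls that do not explicitly set max_nbytes should pick up the value from the active context:

```python
from joblib import parallel_config

# max_nbytes=None disables joblib's memory mapping of large arrays. This
# should reach the Parallel call inside GridSearchCV as long as that call
# leaves max_nbytes at its default (an assumption about the scikit-learn
# version in use).
with parallel_config(max_nbytes=None):
    search.fit(X, y)   # `search`, `X`, `y` as in the sketch above
```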

The suggested improvement is to add a kwarg to GridSearchCV that gets passed through to Parallel's max_nbytes.

parallel = Parallel(n_jobs=self.n_jobs,
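
A hypothetical sketch of the proposed forwarding (not current scikit-learn code; the surrounding keyword arguments are shown as the existing call in sklearn/model_selection/_search.py passes them, to the best of my knowledge):

```python
# Hypothetical: `self.max_nbytes` would be a new GridSearchCV constructor
# parameter, defaulting to joblib's current default of '1M'.
parallel = Parallel(
    n_jobs=self.n_jobs,
    verbose=self.verbose,
    pre_dispatch=self.pre_dispatch,
    max_nbytes=self.max_nbytes,   # None disables memory mapping entirely
)
```

Users who leave the new parameter at its default would see no change in behaviour.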
