Memory mapping causes disk space usage bloat in big models with many search parameters · Issue #19608 · scikit-learn/scikit-learn
Open
@asansal-quantico

Description


Hello,

We use GridSearchCV in a project where our training data is quite large (approx. 20,000 rows x 100 columns/features). We search a large space of hyperparameters, and with 5-10 cross-validation folds the total number of fitted models sometimes reaches 1,000.

When running with the default joblib parallelization, the search generates a lot of memory-mapped data on disk, using disk space on the scale of hundreds of gigabytes.
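
For concreteness, here is a minimal sketch of a setup at this scale; the estimator, grid, and synthetic data are assumptions chosen only to match the sizes described above, not our actual pipeline:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.rand(20_000, 100)            # ~20,000 rows x 100 features
y = rng.randint(0, 2, size=20_000)   # binary target

# 5 * 4 * 5 = 100 candidates; with cv=10 that is 1,000 model fits.
param_grid = {
    "n_estimators": [100, 200, 300, 400, 500],
    "max_depth": [None, 5, 10, 20],
    "min_samples_leaf": [1, 2, 5, 10, 20],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=10,
    # With joblib's default max_nbytes='1M', arrays larger than 1 MB passed
    # to the workers are memory-mapped to a temporary folder on disk.
    n_jobs=-1,
)
# search.fit(X, y)
```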

Solution:
Once we disable memory mapping, the search runs successfully without using the disk. The fix is to call joblib's Parallel(..., max_nbytes=None) within GridSearchCV; see the referenced code below. joblib's default for max_nbytes is '1M' (1 megabyte); passing None disables memory mapping.
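
For anyone hitting this before such a kwarg exists, a possible caller-side workaround is joblib's parallel_config context manager (available in joblib >= 1.3, which is an assumption about the environment); Parallel calls that do not explicitly set max_nbytes should pick up the value from the active context:

```python
from joblib import parallel_config

# max_nbytes=None disables joblib's memory mapping of large arrays. This
# should reach the Parallel call inside GridSearchCV as long as that call
# leaves max_nbytes at its default (an assumption about the scikit-learn
# version in use).
with parallel_config(max_nbytes=None):
    search.fit(X, y)   # `search`, `X`, `y` as in the sketch above
```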

The suggested improvement is to add a kwarg to GridSearchCV that gets passed through to Parallel's max_nbytes.

parallel = Parallel(n_jobs=self.n_jobs,
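
A hypothetical sketch of the proposed forwarding (not current scikit-learn code; the surrounding keyword arguments are shown as the existing call in sklearn/model_selection/_search.py passes them, to the best of my knowledge):

```python
# Hypothetical: `self.max_nbytes` would be a new GridSearchCV constructor
# parameter, defaulting to joblib's current default of '1M'.
parallel = Parallel(
    n_jobs=self.n_jobs,
    verbose=self.verbose,
    pre_dispatch=self.pre_dispatch,
    max_nbytes=self.max_nbytes,   # None disables memory mapping entirely
)
```

Users who leave the new parameter at its default would see no change in behaviour.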
