Dask read_csv inside a loop leads threads and RAM to increase in every step #3670
@rpanai

Hi, the following is basically the script I'm running:

import dask.dataframe as dd

# folders, my_bucket, AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are defined earlier
for folder in folders:
    df = dd.read_csv('s3://{}/{}/*'.format(my_bucket, folder),
                     compression='gzip',
                     dtype='object',  # remember to convert to float/int
                     storage_options={'key': AWS_ACCESS_KEY_ID,
                                      'secret': AWS_SECRET_ACCESS_KEY})

    # do stuff
    # .compute()
    # save results to parquet

What I noticed is that the number of threads and the RAM usage increase step after step. This is really inconvenient, and the process eventually stops working.
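
To quantify the growth, here is a minimal way to watch both numbers per iteration (a sketch only; it assumes psutil is available, which the script above does not use):

import os
import threading

import psutil  # assumption: third-party psutil, used here only to read RSS

proc = psutil.Process(os.getpid())

for folder in folders:
    # ... same dask work as in the script above ...
    print('threads:', threading.active_count(),
          'RSS MB:', proc.memory_info().rss / 1e6)

Printing these at the end of each step is enough to see the growth described above.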

How can I close all the threads at the end of every step? I ran a similar script using only pandas and didn't experience the same problem.
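
To make the question concrete, this is the kind of per-step scoping I have in mind (a sketch assuming dask.distributed is an acceptable dependency; Client(processes=False) starts a local in-process threaded scheduler and the context manager closes it on exit, but I don't know whether this is the recommended pattern or whether it actually releases the threads):

import dask.dataframe as dd
from dask.distributed import Client

for folder in folders:
    # scope a fresh in-process scheduler to this step
    with Client(processes=False):
        df = dd.read_csv('s3://{}/{}/*'.format(my_bucket, folder),
                         compression='gzip',
                         dtype='object',
                         storage_options={'key': AWS_ACCESS_KEY_ID,
                                          'secret': AWS_SECRET_ACCESS_KEY})
        # do stuff, .compute(), save results to parquet
    # the client (and, hopefully, its threads) is shut down here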
