Dask read_csv inside a loop leads threads and RAM to increase in every step #3670
@rpanai

Hi, the following is basically the script I'm running:

import dask.dataframe as dd

# folders, my_bucket, AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are defined earlier
for folder in folders:
    df = dd.read_csv('s3://{}/{}/*'.format(my_bucket, folder),
                     compression='gzip',
                     dtype='object',  # remember to convert to float/int
                     storage_options={'key': AWS_ACCESS_KEY_ID,
                                      'secret': AWS_SECRET_ACCESS_KEY})

    # do stuff
    # .compute()
    # save results to parquet

What I noticed is that the number of threads and the RAM usage increase step after step. This is really inconvenient, and the process eventually stops working.
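
To quantify the growth, here is a minimal way to watch both numbers per iteration (a sketch only; it assumes psutil is available, which the script above does not use):

import os
import threading

import psutil  # assumption: third-party psutil, used here only to read RSS

proc = psutil.Process(os.getpid())

for folder in folders:
    # ... same dask work as in the script above ...
    print('threads:', threading.active_count(),
          'RSS MB:', proc.memory_info().rss / 1e6)

Printing these at the end of each step is enough to see the growth described above.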

How can I close all the threads at the end of every step? I ran a similar script using only pandas and didn't experience the same problem.
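
To make the question concrete, this is the kind of per-step scoping I have in mind (a sketch assuming dask.distributed is an acceptable dependency; Client(processes=False) starts a local in-process threaded scheduler and the context manager closes it on exit, but I don't know whether this is the recommended pattern or whether it actually releases the threads):

import dask.dataframe as dd
from dask.distributed import Client

for folder in folders:
    # scope a fresh in-process scheduler to this step
    with Client(processes=False):
        df = dd.read_csv('s3://{}/{}/*'.format(my_bucket, folder),
                         compression='gzip',
                         dtype='object',
                         storage_options={'key': AWS_ACCESS_KEY_ID,
                                          'secret': AWS_SECRET_ACCESS_KEY})
        # do stuff, .compute(), save results to parquet
    # the client (and, hopefully, its threads) is shut down here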
