Hi, the following is basically the script I'm running:
import dask.dataframe as dd

for folder in folders:
    my_bucket_folder = folder
    df = dd.read_csv('s3://{}/{}/*'.format(my_bucket, my_bucket_folder),
                     compression='gzip',
                     dtype='object',  # remember to convert to float/int
                     storage_options={'key': AWS_ACCESS_KEY_ID,
                                      'secret': AWS_SECRET_ACCESS_KEY})
    # do stuff
    # .compute()
    # save results to parquet
What I noticed is that the number of threads and the RAM usage keep increasing step after step. This is really inconvenient, and the process eventually stops working.
How can I close all the threads at the end of every step? I ran a similar script using only pandas and didn't experience the same problem.
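For context, this is a minimal sketch of the kind of per-step cleanup I have in mind, assuming dask.distributed is installed and that opening and closing an explicit Client around each folder would release that step's threads and memory (the processes=False setting and the placement of the with-block are just illustrative, not something I've confirmed fixes the leak):

import dask.dataframe as dd
from dask.distributed import Client

for folder in folders:
    # Open a fresh local client for this step only; closing it should
    # shut down its threads before the next folder is processed.
    with Client(processes=False) as client:
        df = dd.read_csv('s3://{}/{}/*'.format(my_bucket, folder),
                         compression='gzip',
                         dtype='object',
                         storage_options={'key': AWS_ACCESS_KEY_ID,
                                          'secret': AWS_SECRET_ACCESS_KEY})
        # ... do stuff, .compute(), save results to parquet ...
    # the client and its workers are closed here, at the end of the step

Is something along these lines the recommended way to release resources between iterations, or is there a better pattern?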