Description
Hi team,
I have been having issues with pandas memory management. Specifically, there is a (for me, at least) unavoidable memory peak when attempting to remove variables from a data set. It should be (almost) free! I am getting rid of part of the data, yet pandas still needs to allocate a large amount of memory, producing MemoryErrors.
Just to give you a little bit of context, I am working with a DataFrame of 33M rows and 500 columns (a big one!), almost all of them numeric, on a machine with 360GB of RAM. The whole data set fits in memory and I can successfully apply some transformations to the variables. The problem comes when I need to drop 10% of the columns in the table: it produces a big memory peak leading to a MemoryError. Before performing this operation, there are more than 80GB of memory available!
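To make the shape of the workflow concrete, here is a scaled-down sketch of what I mean (column names and sizes are made up for illustration; the real data set is 33M rows x 500 columns):

```python
import numpy as np
import pandas as pd

# Scaled-down stand-in for the real data set (33M rows x 500 columns).
n_rows, n_cols = 1_000_000, 500
df = pd.DataFrame(
    np.random.rand(n_rows, n_cols),
    columns=[f"col_{i}" for i in range(n_cols)],
)

# Dropping ~10% of the columns: on the real data this triggers a large
# memory peak and eventually a MemoryError, even though plenty of RAM is free.
cols_to_drop = [f"col_{i}" for i in range(50)]
df = df.drop(cols_to_drop, axis=1)
```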
I tried to use the following methods for removing the columns and all of them failed (a rough sketch of these attempts is included after the list):

- `drop()`, with or without the `inplace` parameter
- `pop()`
- `reindex()`
- `reindex_axis()`
- `del df[column]` in a loop over the columns to be removed
- `__delitem__(column)` in a loop over the columns to be removed
- `pop()` and `drop()` in a loop over the columns to be removed
- I also tried to reassign the columns, overwriting the data frame using indexing with `loc` and `iloc`, but it does not help.
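For reference, the loop-based and reassignment variants look roughly like this (again a sketch with made-up names, not my exact code; each function is a standalone alternative operating on a DataFrame like the one above):

```python
def drop_with_del(df, cols_to_drop):
    # del df[column] in a loop over the columns to be removed.
    for col in cols_to_drop:
        del df[col]
    return df

def drop_with_pop(df, cols_to_drop):
    # df.pop(column) in a loop over the columns to be removed.
    for col in cols_to_drop:
        df.pop(col)
    return df

def drop_with_loc(df, cols_to_drop):
    # Reassign the DataFrame, keeping only the surviving columns via .loc.
    to_drop = set(cols_to_drop)
    cols_to_keep = [c for c in df.columns if c not in to_drop]
    return df.loc[:, cols_to_keep]
```

All of these still show the same memory peak on the large data set.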
I found that `drop()` with `inplace=True` is the most efficient one, but it still generates a huge peak.
I would like to discuss whether there is any way of implementing (or whether there already exists, by any chance) a method for removing variables more efficiently, without generating extra memory consumption...
Thank you
Iván