Closed
Labels: Performance (memory or execution speed performance)
Description
Calling df.iterrows() on a large DataFrame takes a long time and consumes a huge amount of memory.
The name of the function implies that it returns an iterator, so starting the iteration should be cheap. However, the method uses the built-in 'zip', which can build a huge temporary list of tuples before the first row is yielded (on Python 2, zip returns a full list rather than a lazy iterator).
The code below reproduces the issue on a machine with 16 GB of memory.
import numpy as np
import pandas as pd

s1 = range(30000000)
s2 = np.random.randn(30000000)
ts = pd.date_range('20140101', freq='S', periods=30000000)
df = pd.DataFrame({'s1': s1, 's2': s2}, index=ts)
for r in df.iterrows():
    break  # expected to return immediately, yet it takes more than 2 minutes and uses 4 GB of memory
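For comparison, here is a minimal sketch of what lazy row iteration could look like. The helper name lazy_iterrows is hypothetical: it is not part of pandas and is not necessarily how the library addresses this issue. It only illustrates the expected behaviour, where each row is built on demand, so breaking after the first row does no work for the remaining rows. A small demo frame stands in for the 30-million-row one.

```python
import numpy as np
import pandas as pd


def lazy_iterrows(frame):
    """Yield (index_label, row_as_Series) pairs one at a time.

    Hypothetical helper for illustration only; not a pandas API.
    Nothing is computed for a row until the caller asks for it.
    """
    index = frame.index
    for i in range(len(frame)):
        # frame.iloc[i] builds a Series for row i only
        yield index[i], frame.iloc[i]


# Small stand-in for the large frame in the report above.
demo = pd.DataFrame(
    {"s1": np.arange(5), "s2": np.arange(5) * 0.5},
    index=pd.date_range("20140101", freq=pd.Timedelta(seconds=1), periods=5),
)

rows = lazy_iterrows(demo)
first_label, first_row = next(rows)  # only the first row is materialized
```

Building a Series per row is still slow when every row is visited, so this sketch only addresses the start-up cost and memory spike described above, not the overall speed of a full pass.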