.iterrows takes too long and generates a large memory footprint

When using df.iterrows on a large DataFrame, it takes a long time to run and consumes a huge amount of memory.

The name of the function implies that it is an iterator, so starting the iteration should not take long. ~~However, in the method it uses the builtin 'zip', which can sometimes generate a huge temporary list of tuples if optimisation is not done correctly.~~

Below is code that reproduces the issue on a box with 16 GB of memory.

import numpy as np
import pandas as pd

s1 = range(30000000)
s2 = np.random.randn(30000000)
ts = pd.date_range('20140101', freq='S', periods=30000000)
df = pd.DataFrame({'s1': s1, 's2': s2}, index=ts)
for r in df.iterrows():
    break  # expected to return immediately, yet it takes more than 2 minutes and uses 4 GB of memory
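
For comparison, here is a minimal sketch of the lazy behaviour I expected. It assumes the cost comes from work done up front for all rows before the first one is yielded, which I have not confirmed. `lazy_iterrows` is a hypothetical helper, not part of pandas, and the frame is kept small so the snippet runs quickly.

```python
import numpy as np
import pandas as pd


def lazy_iterrows(df):
    """Yield (index_label, row_as_Series) pairs one row at a time.

    Hypothetical helper, not part of pandas: each row is built only when
    it is requested, so breaking out early does no work for later rows.
    """
    for pos in range(len(df)):
        yield df.index[pos], df.iloc[pos]


# Small frame for illustration; the sizes in the report are much larger.
n = 1000
df = pd.DataFrame(
    {"s1": range(n), "s2": np.random.randn(n)},
    index=pd.date_range("20140101", freq="D", periods=n),
)

for label, row in lazy_iterrows(df):
    first_label, first_row = label, row
    break  # only the first row has been constructed at this point

# itertuples is a built-in alternative that yields namedtuples
# instead of creating a Series per row.
first_tuple = next(df.itertuples())
```

Building each row with `df.iloc[pos]` is slow if you really do iterate over the whole frame, so this is only meant to show the expected early-exit behaviour, not to replace `iterrows` in general.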
