PERF: optimize memory usage for to_hdf by jreback · Pull Request #9648 · pandas-dev/pandas · GitHub

Conversation

jreback (Contributor) commented Mar 13, 2015

from here

Reduce the memory usage needed by to_hdf:

  • astype was always copying, even when a copy was not needed
  • values were ravelled and then reshaped, allocating intermediates
  • a new chunked buffer was allocated on every write; the same buffer is now re-used (see the sketch after this list)
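As a minimal sketch of the first and third points (hypothetical helper names, not the actual pandas internals): allocate one chunk buffer up front and fill it in place, and pass copy=False to astype so no copy is made when the dtype already matches.

import numpy as np

def write_chunks(values, chunksize, write):
    # allocate the chunk buffer once and re-use it for every chunk;
    # this keeps peak memory flat no matter how many chunks are written
    buf = np.empty((chunksize,) + values.shape[1:], dtype=values.dtype)
    for start in range(0, len(values), chunksize):
        n = min(start + chunksize, len(values)) - start
        buf[:n] = values[start:start + n]   # fill in place, no new allocation
        write(buf[:n])                      # hand the writer a view

# astype copies by default; copy=False returns the input unchanged
# when it already has the requested dtype
arr = np.zeros(10, dtype='float64')
assert arr.astype('float64', copy=False) is arr

# usage: write_chunks(np.random.rand(10, 3), 4, lambda chunk: None)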
In [1]: df = pd.DataFrame(np.random.rand(1000000, 500))

In [2]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Columns: 500 entries, 0 to 499
dtypes: float64(500)
memory usage: 3.7 GB

Previously

In [3]: %memit -r 1 df.to_hdf('test.h5','df',format='table',mode='w')
peak memory: 11029.49 MiB, increment: 7130.57 MiB

With PR

In [2]: %memit -r 1 df.to_hdf('test.h5','df',format='table',mode='w')
peak memory: 4669.21 MiB, increment: 794.57 MiB
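(%memit above is the IPython magic from the memory_profiler package; assuming it is installed, the runs can be reproduced with:)

%load_ext memory_profiler
%memit -r 1 df.to_hdf('test.h5', 'df', format='table', mode='w')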

@jreback jreback added Performance Memory or execution speed performance IO HDF5 read_hdf, HDFStore labels Mar 13, 2015
@jreback jreback added this to the 0.16.0 milestone Mar 13, 2015
@jreback jreback force-pushed the pytables_memory branch 6 times, most recently from d0f8583 to 21e727d on Mar 15, 2015
jreback added a commit that referenced this pull request Mar 16, 2015
PERF: optimize memory usage for to_hdf
@jreback jreback merged commit 269af25 into pandas-dev:master Mar 16, 2015
bwillers (Contributor) commented

If you happen to find yourself in New York, I'm buying you a beer for this fix.

jreback (Contributor, Author) commented Mar 28, 2015

Hahha, I figured I broke it, I should fix it.

In NYC, so anytime!

tomanizer commented

Thanks a lot for fixing this! It helps a lot.

sagol commented Apr 22, 2019

The bug is back:

df = pd.DataFrame(np.random.rand(1000000,500))
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Columns: 500 entries, 0 to 499
dtypes: float64(500)
memory usage: 3.7 GB

%memit -r 1 df.to_hdf('test.h5','df',format='table',mode='w')
peak memory: 7934.20 MiB, increment: 3823.80 MiB

pd.__version__
'0.24.2'

With a more complex structure, everything is much worse.

data_ifa.info()
<class 'pandas.core.frame.DataFrame'>
Index: 100000 entries, b88d3b87-3432-43cc-8219-f45d97389d8f to eb705297-94e8-4ccf-a910-5f3b9734d572
Data columns (total 2 columns):
bundles        100000 non-null object
bundles_len    100000 non-null int64
dtypes: int64(1), object(1)
memory usage: 2.3+ MB

%memit -r 1 data_ifa.to_hdf(full_file_name_hd5, key='data_ifa', encoding='utf-8', complevel=9, mode='w', format='table')
peak memory: 22106.07 MiB, increment: 21324.53 MiB
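Until that is tracked down, one way to cap peak memory is to append the frame to the table in row chunks through HDFStore. This is a sketch, not a fix proposed in this thread; to_hdf_chunked and the 50,000-row chunk size are illustrative choices.

import pandas as pd

def to_hdf_chunked(df, path, key, chunksize=50000, **kwargs):
    # write row slices one at a time so only a single chunk's worth
    # of converted data is materialized at any point
    with pd.HDFStore(path, mode='w') as store:
        for start in range(0, len(df), chunksize):
            store.append(key, df.iloc[start:start + chunksize],
                         format='table', **kwargs)

# usage: to_hdf_chunked(df, 'test.h5', 'df')
# note: variable-width object columns may need min_itemsize passed
# through kwargs so later chunks fit the table's string columns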
