DatetimeIndex plot converter performance #17479

jondoesntgit · 2017-09-08T19:27:04Z

Code Sample, a copy-pastable example if possible

I have a detector with 432000 data points sampled at 10 Hz (12 hours of data). I want to plot the time trace using pandas.Series.

import numpy as np
import pandas as pd
import time

# This is how it is in my code, but you don't have it.
# data1 = load('../170907/data/1504817777.npy')

# You could just as easily do 
data1 = np.random.randn(432000)

date_index = pd.date_range(start=1504817777*1e9, periods=len(data1), freq='100 ms', tz='UTC')\
    .tz_convert('America/Los_Angeles')\
    .tz_localize(None)
voltage = pd.Series(data1, date_index)

voltage[::10000].plot() # This is only 44 points, but it takes a **long** time to render this

Problem description

I love that pandas handles a lot of the date handling automatically. However, it's not practical to spend 30 seconds waiting for this plot to render. I've tried striding over the data in order to decrease the amount of time (I don't need to see all 432000 points), but this doesn't seem to improve rendering time.

Am I missing something?

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.1
nose: None
pip: 9.0.1
setuptools: 36.2.7
Cython: None
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.6.2
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: None
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0b10
httplib2: 0.10.3
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
boto: None
pandas_datareader: None


TomAugspurger · 2017-09-08T19:40:27Z

Could you edit your example to just use the data1 = np.random.randn(432000)? We don't have your data file.

This looks like a bug in the converter we register with matplotlib: https://github.com/pandas-dev/pandas/blob/master/pandas/plotting/_converter.py. If you want to dig into it, a good first check is whether we're calling it too often.

As a workaround there's voltage[::10000].plot(x_compat=True), which disables that converter.
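To compare the two code paths on a given installation, a minimal timing sketch along these lines may help. It assumes a headless `Agg` backend and a timezone-naive index (the timezone conversion from the original example is dropped for brevity); `time_plot` is a hypothetical helper, not part of pandas. Actual timings will depend heavily on the pandas and matplotlib versions in use.

```python
import time

import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so no display is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Same shape of data as in the report: 12 hours sampled at 10 Hz.
data1 = np.random.randn(432000)
date_index = pd.date_range(
    start=pd.Timestamp(1504817777, unit="s"),
    periods=len(data1),
    freq="100ms",
)
voltage = pd.Series(data1, index=date_index)
subset = voltage[::10000]  # every 10000th sample


def time_plot(series, **plot_kwargs):
    """Hypothetical helper: seconds spent plotting and drawing one series."""
    fig, ax = plt.subplots()
    start = time.perf_counter()
    series.plot(ax=ax, **plot_kwargs)
    fig.canvas.draw()  # force the tick locators and formatters to run
    elapsed = time.perf_counter() - start
    plt.close(fig)
    return elapsed


default_seconds = time_plot(subset)                 # pandas' registered converter
compat_seconds = time_plot(subset, x_compat=True)   # converter disabled
print(f"default converter: {default_seconds:.2f} s")
print(f"x_compat=True:     {compat_seconds:.2f} s")
```

If the default path is much slower than the `x_compat=True` path, the slowdown is coming from the converter rather than from the data or the backend.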

ghost · 2017-11-07T17:42:06Z

I profiled the voltage[::10000].plot() call with cProfile, and it seems the majority of the time is spent in the pandas._libs.period.get_period_field_arr function.

In [4]: import pandas as pd
   ...: import time
   ...: import numpy
   ...: 
   ...: 
   ...: # This is how it is in my code, but you don't have it.
   ...: # data1 = load('../170907/data/1504817777.npy')
   ...: 
   ...: # You could just as easily do 
   ...: data1 = numpy.random.randn(432000)
   ...: 
   ...: date_index = pd.date_range(start=1504817777*1e9, periods=len(data1), freq='100 ms', tz='UTC')\
   ...:     .tz_convert('America/Los_Angeles')\
   ...:     .tz_localize(None)
   ...: voltage = pd.Series(data1, date_index)
   ...: 

In [5]: import cProfile, pstats

In [6]: cProfile.run('voltage[::10000].plot()', 'pandas.prof')

In [8]: p = pstats.Stats('pandas.prof')

In [12]: p.sort_stats('time').print_stats(10)
Tue Nov  7 18:36:36 2017    pandas.prof

         258887 function calls (249954 primitive calls) in 39.934 seconds

   Ordered by: internal time
   List reduced from 2650 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        8   36.270    4.534   36.270    4.534 {pandas._libs.period.get_period_field_arr}
        2    1.057    0.529    1.057    0.529 {built-in method _operator.mod}
        2    0.628    0.314    0.628    0.314 {method 'compress' of 'numpy.ndarray' objects}
        1    0.382    0.382   38.957   38.957 /home/kris/projects/pandas/pandas/plotting/_converter.py:491(_daily_finder)
        5    0.379    0.076    0.379    0.076 {method 'nonzero' of 'numpy.ndarray' objects}
        4    0.363    0.091    0.363    0.091 {built-in method _operator.sub}
        4    0.254    0.063    0.289    0.072 /home/kris/projects/pandas/pandas/core/indexes/period.py:713(shift)
        2    0.064    0.032    0.064    0.032 {built-in method _operator.eq}
       67    0.060    0.001    0.060    0.001 {built-in method numpy.core.multiarray.arange}
       96    0.054    0.001    0.054    0.001 {built-in method marshal.loads}
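To check which Python-level function issues the expensive calls seen in a profile like the one above, `pstats` can list callers. The following is a sketch under a few assumptions: a headless `Agg` backend, a simplified timezone-naive index, and the function name `get_period_field_arr` as it appears in the output above. On other pandas versions that name may differ, in which case the caller listing simply comes back empty.

```python
import cProfile
import pstats

import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so no display is needed
import numpy as np
import pandas as pd

data1 = np.random.randn(432000)
date_index = pd.date_range(
    start=pd.Timestamp(1504817777, unit="s"),
    periods=len(data1),
    freq="100ms",
)
voltage = pd.Series(data1, index=date_index)

# Profile both the plot call and the draw, since tick location happens at draw time.
profiler = cProfile.Profile()
profiler.enable()
ax = voltage[::10000].plot()
ax.figure.canvas.draw()
profiler.disable()

stats = pstats.Stats(profiler)
stats.sort_stats("cumulative")

# The string restrictions are regular expressions matched against function names.
stats.print_callers("get_period_field_arr")  # who calls the hot function?
stats.print_stats("_converter", 10)          # time spent in the converter module
```

Sorting by cumulative time rather than internal time makes it easier to see the Python frames that sit above the compiled hot spot.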

TomAugspurger · 2017-11-07T18:40:00Z

Yeah, this is a bit tricky; the calls to get_period_field_arr are from _daily_finder, but I can't easily extract a reproducible example without involving matplotlib.

@tacaswell will you be at PyData NYC? Maybe we can sketch out a plan for finally getting these converters into matplotlib.

tacaswell

I will be! This PR matplotlib/matplotlib#9779 also just went in which deals with datetime64.

mroeschke · 2024-01-27T23:09:17Z

Locally this now runs pretty fast on main, so I think the performance issue has been addressed in the meantime. Going to close.

TomAugspurger added Performance Memory or execution speed performance Visualization plotting Difficulty Intermediate labels Sep 8, 2017

TomAugspurger changed the title ~~import pandas as pd import time~~ DatetimeIndex plot converter performance Sep 8, 2017

jbrockmendel removed Effort Medium labels Oct 21, 2019

mroeschke closed this as completed Jan 27, 2024
