DatetimeIndex plot converter performance · Issue #17479 · pandas-dev/pandas · GitHub

Closed
jondoesntgit opened this issue Sep 8, 2017 · 5 comments
Labels
Performance (memory or execution speed performance), Visualization (plotting)

Comments

@jondoesntgit
jondoesntgit commented Sep 8, 2017

Code Sample, a copy-pastable example if possible

I have 432000 data points from a detector sampled at 10 Hz (12 hours of data), and I want to plot the time trace using pandas.Series.

import numpy  # needed for numpy.random.randn below
import pandas as pd
import time

# This is how it is in my code, but you don't have it.
# data1 = load('../170907/data/1504817777.npy')

# You could just as easily do 
data1 = numpy.random.randn(432000)

date_index = pd.date_range(start=1504817777*1e9, periods=len(data1), freq='100 ms', tz='UTC')\
    .tz_convert('America/Los_Angeles')\
    .tz_localize(None)
voltage = pd.Series(data1, date_index)

voltage[::10000].plot() # This is only 44 points, but it takes a **long** time to render this

Problem description

I love that pandas takes care of most of the date handling automatically. However, waiting 30 seconds for this plot to render is not practical. I tried striding over the data to cut the time (I don't need to see all 432000 points), but this doesn't seem to improve rendering time.

Am I missing something?

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.1
nose: None
pip: 9.0.1
setuptools: 36.2.7
Cython: None
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.6.2
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: None
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0b10
httplib2: 0.10.3
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
boto: None
pandas_datareader: None

@TomAugspurger
Contributor

Could you edit your example to just use the data1 = np.random.randn(432000)? We don't have your data file.

It looks like a bug in the converter we register with matplotlib (https://github.com/pandas-dev/pandas/blob/master/pandas/plotting/_converter.py), if you want to dig into that (maybe check whether we're calling it too often).

As a workaround there's voltage[::10000].plot(x_compat=True), which disables that converter.
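
A minimal sketch of that workaround, for comparing the two code paths on your own install. It assumes a headless matplotlib backend and uses a timezone-naive index as a simplified stand-in for the one in the report; `time_plot` is a hypothetical helper, not part of pandas. Timings will differ by pandas and matplotlib version, so none are shown here.

```python
import time

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the detector data: 12 hours sampled at 10 Hz.
data = np.random.randn(432000)
index = pd.date_range("2017-09-07 13:56:17", periods=len(data), freq="100ms")
voltage = pd.Series(data, index=index)
subset = voltage[::10000]


def time_plot(**plot_kwargs):
    """Hypothetical helper: time one Series.plot call, including the draw."""
    fig, ax = plt.subplots()
    start = time.perf_counter()
    subset.plot(ax=ax, **plot_kwargs)
    fig.canvas.draw()  # force tick location and formatting to actually run
    elapsed = time.perf_counter() - start
    plt.close(fig)
    return elapsed


default_seconds = time_plot()               # pandas' own date converter
compat_seconds = time_plot(x_compat=True)   # converter disabled
print(f"default: {default_seconds:.3f} s, x_compat=True: {compat_seconds:.3f} s")
```

Calling `fig.canvas.draw()` inside the timed region matters because tick finding happens at draw time, not when the line is added to the axes.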

TomAugspurger added the Performance (memory or execution speed performance), Visualization (plotting), and Difficulty Intermediate labels Sep 8, 2017
TomAugspurger changed the title from "import pandas as pd import time" to "DatetimeIndex plot converter performance" Sep 8, 2017
@ghost
ghost commented Nov 7, 2017

I profiled the voltage[::10000].plot() call with cProfile, and it seems the majority of the time is spent in the pandas._libs.period.get_period_field_arr function.

In [4]: import pandas as pd
   ...: import time
   ...: import numpy
   ...: 
   ...: 
   ...: # This is how it is in my code, but you don't have it.
   ...: # data1 = load('../170907/data/1504817777.npy')
   ...: 
   ...: # You could just as easily do 
   ...: data1 = numpy.random.randn(432000)
   ...: 
   ...: date_index = pd.date_range(start=1504817777*1e9, periods=len(data1), freq='100 ms', tz='UTC')\
   ...:     .tz_convert('America/Los_Angeles')\
   ...:     .tz_localize(None)
   ...: voltage = pd.Series(data1, date_index)
   ...: 

In [5]: import cProfile, pstats

In [6]: cProfile.run('voltage[::10000].plot()', 'pandas.prof')

In [8]: p = pstats.Stats('pandas.prof')

In [12]: p.sort_stats('time').print_stats(10)
Tue Nov  7 18:36:36 2017    pandas.prof

         258887 function calls (249954 primitive calls) in 39.934 seconds

   Ordered by: internal time
   List reduced from 2650 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        8   36.270    4.534   36.270    4.534 {pandas._libs.period.get_period_field_arr}
        2    1.057    0.529    1.057    0.529 {built-in method _operator.mod}
        2    0.628    0.314    0.628    0.314 {method 'compress' of 'numpy.ndarray' objects}
        1    0.382    0.382   38.957   38.957 /home/kris/projects/pandas/pandas/plotting/_converter.py:491(_daily_finder)
        5    0.379    0.076    0.379    0.076 {method 'nonzero' of 'numpy.ndarray' objects}
        4    0.363    0.091    0.363    0.091 {built-in method _operator.sub}
        4    0.254    0.063    0.289    0.072 /home/kris/projects/pandas/pandas/core/indexes/period.py:713(shift)
        2    0.064    0.032    0.064    0.032 {built-in method _operator.eq}
       67    0.060    0.001    0.060    0.001 {built-in method numpy.core.multiarray.arange}
       96    0.054    0.001    0.054    0.001 {built-in method marshal.loads}
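
To see which Python-level callers account for a hot function, the same kind of profile can be sorted by cumulative time and queried with `print_callers`. The sketch below is one way to do that under some assumptions: it profiles in memory rather than writing `pandas.prof`, uses a timezone-naive index for simplicity, and filters on the pattern `"period"`, which may match different functions (or none) depending on the pandas version installed.

```python
import cProfile
import io
import pstats

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Same shape of data as in the report, with a simplified naive index.
data = np.random.randn(432000)
index = pd.date_range("2017-09-07 13:56:17", periods=len(data), freq="100ms")
subset = pd.Series(data, index=index)[::10000]

# Profile the plot call together with the draw, where tick finding runs.
profiler = cProfile.Profile()
profiler.enable()
fig, ax = plt.subplots()
subset.plot(ax=ax)
fig.canvas.draw()
profiler.disable()
plt.close(fig)

report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
# Cumulative time shows which high-level functions contain the slow calls.
stats.sort_stats("cumulative").print_stats(10)
# List the callers of every profiled function whose name matches "period".
stats.print_callers("period")
print(report.getvalue())
```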

@TomAugspurger
Contributor

Yeah, this is a bit tricky; the calls to get_period_field_arr are from _daily_finder, but I can't easily extract a reproducible example without involving matplotlib.

@tacaswell will you be at PyData NYC? Maybe we can sketch out a plan for finally getting these converters into matplotlib.

@tacaswell
Contributor

I will be! PR matplotlib/matplotlib#9779, which deals with datetime64, also just went in.
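
For readers who want to bypass the pandas plotting layer entirely, here is a rough sketch that hands the index values to matplotlib as numpy datetime64. It assumes a matplotlib release recent enough to accept datetime64 input directly, and it is not meant to show what the PR itself changed.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Same shape of data as in the report, with a simplified naive index.
data = np.random.randn(432000)
index = pd.date_range("2017-09-07 13:56:17", periods=len(data), freq="100ms")
subset = pd.Series(data, index=index)[::10000]

fig, ax = plt.subplots()
# subset.index.values is a numpy datetime64[ns] array; matplotlib's own
# date handling picks the tick locations here.
(line,) = ax.plot(subset.index.values, subset.values)
fig.autofmt_xdate()  # rotate the date labels so they do not overlap
fig.canvas.draw()
```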

@mroeschke
Member

This ran pretty fast locally on main, so I think the performance issue has been addressed in the meantime. Closing.
