8000 [RFC] Speed up Axes.bar by ivirshup · Pull Request #20092 · matplotlib/matplotlib · GitHub
[go: up one dir, main page]

Skip to content

[RFC] Speed up Axes.bar #20092

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from
Closed

[RFC] Speed up Axes.bar #20092

wants to merge 3 commits into from

Conversation

ivirshup
Copy link
@ivirshup ivirshup commented Apr 27, 2021

PR Summary

Currently Axes.bar calls self.add_patch for every bar. This can be quite slow for dense histograms. This PR instead puts the patches into a collection and adds that. This seems to give a large performance increase.

Details

Evidence of a speed up

from matplotlib import pyplot as plt
import numpy as np

N = 10_000

x = np.arange(N)
y = np.random.randn(N)

%time plt.bar(x, y)

On master:

CPU times: user 4.97 s, sys: 64.2 ms, total: 5.03 s
Wall time: 5.06 s

On this branch:

CPU times: user 1.42 s, sys: 27.1 ms, total: 1.44 s
Wall time: 1.47 s

More thoughts

I don't think this counts as an API change. Of the failing tests, all the image comparison ones I've looked at so far don't look different by eye. Some tests fail because they are checking for artists which are now in a different place.

I thought it would be good to get feedback on whether this was a reasonable approach before making more changes. Any thoughts?

TODO

Based on the failing tests

  • Color patches correctly
  • Get legends working (not sure how patches get labels attached)
  • Figure out what's going on with misaligned axis in test_default_edges (bar plot is just floating off the bottom axis)
  • Fix log scaling in hist

PR Checklist

  • Has pytest style unit tests (and pytest passes).
  • Is Flake 8 compliant (run flake8 on changed files to check).
  • New features are documented, with examples if plot related.
  • Documentation is sphinx and numpydoc compliant (the docs should build without error).
  • Conforms to Matplotlib style conventions (install flake8-docstrings and run flake8 --docstring-convention=all).
  • New features have an entry in doc/users/next_whats_new/ (follow instructions in README.rst there).
  • API changes documented in doc/api/next_api_changes/ (follow instructions in README.rst there).

Copy link
@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for opening your first PR into Matplotlib!

If you have not heard from us in a while, please feel free to ping @matplotlib/developers or anyone who has commented on the PR. Most of our reviewers are volunteers and sometimes things fall through the cracks.

You can also join us on gitter for real-time discussion.

For details on testing, writing docs, and our review process, please see the developer guide

We strive to be a welcoming and open project. Please follow our Code of Conduct.

@jklymak
Copy link
Member
jklymak commented Apr 27, 2021

Thanks! I think this idea has been discussed before. I'm not sure of the technical issues, but in general it seems reasonable.

Some of your image changes are quite large according to the tests so it would be good to see before and after and decide what is wrong. The way to do that is to check in the new images as part of the PR (in tests/baseline_images) and we can see the difference.

@mwaskom
Copy link
mwaskom commented Apr 27, 2021

FWIW this will also break all of the seaborn tests that assert the correctness of bar/hist plots by checking the attributes of the patch artists.

I don't know if there is some way to keep a "dummy" reference to artists in their original place that does not trigger a duplicate draw ... could be useful?

@ivirshup
Copy link
Author

@jklymak thanks for the quick look! I've added a TODO to the top of the PR with the problems I've noticed in the test plots.

Currently looking at legends for hist. Not totally sure how to approach it, but am currently thinking of modifying this part of Axes.hist

        # If None, make all labels None (via zip_longest below); otherwise,
        # cast each element to str, but keep a single str as it.
        labels = [] if label is None else np.atleast_1d(np.asarray(label, str))
        for patch, lbl in itertools.zip_longest(patches, labels):
            if patch:
                p = patch[0]
                p.update(kwargs)
                if lbl is not None:
                    p.set_label(lbl)

And instead adding a new something (Artist I guess?) to the plot from the patches that would have been assigned labels.

example
        labels = [] if label is None else np.atleast_1d(np.asarray(label, str))
        legend_patches = []
        for patch, lbl in itertools.zip_longest(patches, labels):
            if patch:
                p = patch[0]
                p.update(kwargs)
                if lbl is not None:
                    p.set_label(lbl)
                    legend_patches.append(p)
                for p in patch[1:]:
                    p.update(kwargs)
                    p.set_label('_nolegend_')

        self.add_somehow(legend_patches)

Is there a more elegant way to do this?


@mwaskom, is it possible to expand the collections into lists of patches? Maybe a helper around this?

@tacaswell
Copy link
Member

Changing the return type is definitively an API change. Any user who is trying to do something with the returned is going to have their code broken.

You might be able to sub-class the Collection and make it look enough like the current return type, but it will be a lot of work as things like

bars = ax.bar(....)
bars[1].set_color('green')
bars[2].set_visible(False)
bars[3].set_linewidth(15)
*_, last, two = bars

all should keep working.


If your problem is primarily histograms @ivirshup you may do better either passing histtype='step' or using the new in mpl34 ax.stairs method. If you really want the bar look on histograms adding a new histtype (and accepting a bit more opt-in type instability in hist) may be easier to get done than changing the return type of all bar calls.

@jklymak
Copy link
Member
jklymak commented Apr 28, 2021

maybe we could have ax.bar(..., quick=True)? Though ax.histbar isn't a bad idea.

@ivirshup
Copy link
Author

@tacaswell, my thing was mainly that I didn't want to have to consider the size of my data and choose the keyword arguments. Particularly as a library author, having changes to the plots around arbitrary data size cutoffs is not going to be fun to maintain. Also I think people generally are expecting something that looks like bars when the bins are wide enough to be able to tell.

I think your point is totally fair. I wasn't really sure what was considered to be key parts of the API though, since arguably any changes to public attributes of the object is breaking. I wasn't planning on changing the return type, just it arguably wouldn't be a particularly useful value. Basically, my thinking was if changing the contents of ax.containers isn't breaking, is changing the behaviour of res = ax.bar(...); res[0].set_facecolor(...) breaking? But also, if the first is breaking, it would be difficult to make any improvements to this library.

TBH, if this got rejected here, my plan was to see if @mwaskom would be up for replacing the use of ax.bar in seaborn with some Collections, since initial tests make it look like this will cut plot time for large histograms by at least 50%.


From looking at the code a bit, I was actually wondering if the plan was to deprecate the containers module, since it's only used in a small subset of the library (specifically for bar and stem charts, I think). A little confusingly, some things that are annotated as Container actually expect subclasses of ToolContainerBase which is unrelated.

container : Container
`backend_bases.ToolContainerBase` object that will get the tools added.

My point here being, maybe this would be worth it for 4.0? It can be quite slow at the moment, and the machinery for making it fast is all here. Using a Collection instead of a Container would also bring this function more in line with the rest of package.

@timhoffm
Copy link
Member

Every change that can break or change the result of currently valid user code is a breaking change to us.

if the first is breaking, it would be difficult to make any improvements to this library.

Unfortunately, it is. This is the downside of a very large user base. We have to be extermely careful with these.

Using a Collection for the bars is reasonable. Probably the way forward is similar to hist(..., histtype=...): Add a parameter to make the Artist configurable. The default stays on isolated bars, using a collection is opt-in. The default can be switched later.

A clever in between way may be returning a proxy in container.patch, that will resolve the PatchCollection into individual rectangles when trying to access individual elements. If the user does not mess with individual rectangles, he'll get the speedup, otherwise he'll get the current behavior. This is also not trivial to get right, but seems more managable than trying to write a Rectangle adapter mapping to the collection.

@QuLogic QuLogic modified the milestones: v3.5.0, v3.6.0 Aug 23, 2021
@timhoffm
Copy link
Member
timhoffm commented May 1, 2022

@ivirshup are you still interested in working on this (under the constraints of backward compatibility and the paths laid out in my last comment)?

@timhoffm timhoffm modified the milestones: v3.6.0, unassigned May 1, 2022
@ivirshup
Copy link
Author
ivirshup commented May 4, 2022

@timhoffm, I don't think so. Many of the solutions suggested above got fairly involved, and I don't think I'm familiar enough with matplotlib internals to do this in a reasonable timeframe.

Feel free to close, though I think the conversation here definitely has value.

@timhoffm
Copy link
Member
timhoffm commented May 4, 2022

@ivirshup thanks for the feedback. I've created issue #22976 with a concise description of the suggested solution. Therefore, I close this thread.

@timhoffm timhoffm closed this May 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants
0