8000 Fix PDFpages bug by jklymak · Pull Request #9724 · matplotlib/matplotlib · GitHub
[go: up one dir, main page]

Skip to content

Fix PDFpages bug #9724

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Nov 10, 2017
Merged

Fix PDFpages bug #9724

merged 2 commits into from
Nov 10, 2017

Conversation

jklymak
Copy link
Member
@jklymak jklymak commented Nov 9, 2017

PR Summary

Fixes #9716, I think.

Pre 433b899 code didn't call the finalize code path if the filename was a PdfPages instance.

PR Checklist

  • Has Pytest style unit tests
  • Code is PEP 8 compliant
  • New features are documented, with examples if plot related
  • Documentation is sphinx and numpydoc compliant
  • Added an entry to doc/users/next_whats_new/ if major new feature (follow instructions in README.rst there)
  • Documented in doc/api/api_changes.rst if API changed in a backward-incompatible way

@jklymak
Copy link
Member Author
jklymak commented Nov 9, 2017

Using

import os

import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

from matplotlib.backends.backend_pdf import PdfPages

version = mpl.__version__

np.random.seed(0)
x = range(50)

size = {}
for f in [1, 10, 100]:
    fname = r"savefig_test_{} size {}_2.pdf".format(version, f)
    fig, ax = plt.subplots()
    pdf = PdfPages(fname)
    for i in range(f):
        ax.cla()
        y1 = np.random.randint(0, 100, size=50)
        y2 = np.random.randint(0, 100, size=50)
        ax.plot(x, y1, label="label_y1 (100%, 30%)")
        ax.plot(x, y2, label="label_y2 (90%, 80%)")
        ax.legend(loc='best')
        ax.set_title("chart with long title for testing: {}".format(i))
        fig.savefig(pdf, format='pdf')

    pdf.close()
    pdf = None
    size[f] = os.path.getsize(fname)/1024

for k, v in size.items():
    print("#charts: {:3} size(KB)={:8,.0f}  KB per chart={:5,.0f})"
          .format(k, v, v/k))

I get

#charts:   1 size(KB)=      14  KB per chart=   14)
#charts:  10 size(KB)=      32  KB per chart=    3)
#charts:  50 size(KB)=     111  KB per chart=    2)
#charts: 100 size(KB)=     210  KB per chart=    2)
#charts: 200 size(KB)=     409  KB per chart=    2)
#charts: 300 size(KB)=     607  KB per chart=    2)
#charts: 400 size(KB)=     804  KB per chart=    2)

PDFs open fine in Preview and Acrobat

@jklymak jklymak added this to the v2.1.1 milestone Nov 9, 2017
@jklymak jklymak requested a review from jkseppan November 9, 2017 00:03
@jklymak jklymak added backend: pdf Release critical For bugs that make the library unusable (segfaults, incorrect plots, etc) and major regressions. labels Nov 9, 2017
@afvincent
Copy link
Contributor

A bit out of my depth when it comes to the PDF backend, so I will not dare approving the PR 🐑. But FWIW, I reproduced locally @jklymak's fix and with it, the MWE that I used for bisecting in #9716 produces a PDF weighting ~8.6 kB, as 2.0.2 does (while 2.1.0 results in a ~13.7 kB file).

@jklymak
Copy link
Member Author
jklymak commented Nov 9, 2017

@afvincent I'm sure you are less out of your depth than I am! All I did was look at what 2.0.2 did versus 2.1 wrt PdfPages and made this version do the same thing.

I did't test if the context awareness worked with this change, except all tests pass, and I hope there is a test for that functionality ;-)

@afvincent
Copy link
Contributor

Quick (and maybe dumb) thought/suggestion: is there a point into transforming the snippet from above into some kind of test to avoid such a similar regression in the future? Something like saving the 50-chart case and checking that its size stays below let's say 256 kiB (or whatever other sensible arbitrary threshold).

(@jklymak I think that you are underestimating how little I know about all these I/Os and context managing questions ^^.)

@afvincent
Copy link
Contributor

About the context awareness, a quick Ctrl+F for "with PdfPages(" in test_backend_pdf return 7 matches, so I guess that it is at least partly (implicitly) tested. An example of this is:

def test_multipage_pagecount():
    with PdfPages(io.BytesIO()) as pdf:
        assert pdf.get_pagecount() == 0
        fig = plt.figure()
        ax = fig.add_subplot(111)
        ax.plot([1, 2, 3])
        fig.savefig(pdf, format="pdf")
        assert pdf.get_pagecount() == 1
        pdf.savefig()
        assert pdf.get_pagecount() == 2

Copy link
Member
@jkseppan jkseppan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change looks correct to me. It would be great to have a test that exercises this code path better, but feel free to file that as a bug and assign it to me.

@jklymak
Copy link
Member Author
jklymak commented Nov 9, 2017

OK, I'll try and add a test. Maybe I can do something intelligent with IOStream? Otherwise I'm not too sure how we write files in the test suite....

@jkseppan
Copy link
Member
jkseppan commented Nov 9, 2017

Some existing tests use tempfile.NamedTemporaryFile but pytest has a tmpdir fixture that can make it even easier to generate a name for a temporary file.

The problem here was that PdfFile.finalize was called multiple times. One way to write the test could be to ensure that finalize raises an error when called more than once. Then probably some existing test will fail without the fix.

@tacaswell
Copy link
Member

I would had the pdf pages object a BytesIO stream, write out multiple pages and the make sure that b'Kids' only shows up once in the buffer (had plans to do that last night but did not get to it).

@jklymak
Copy link
Member Author
jklymak commented Nov 9, 2017

Test added. Test fails on pre-patch code passes on this code....

ax = fig.add_subplot(111)
fig.savefig(pdf, format="pdf")
pdfio.seek(0)
assert sum(b'Kids' in line for line in pdfio) == 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PDF specification recommends arranging the page objects in a tree with Kids and Parent pointers, for fast retrieval. Currently we just make a one-level tree, but if this is ever fixed, it would break this test. How about checking for startxref instead?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to punt improving this test to be part of that work.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with merging this as is, but I think changing 'Kids' to 'startxref' here would still be correct and would be more robust against future changes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine w that, but as below, what about just checking the size is small?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I miss-understood the suggestion, thought 'startxref' was some sort of xml parsing or something (sorry not enough coffee yet).

👍 to changing the search term to 'startxref'

@jklymak
Copy link
Member Author
jklymak commented Nov 10, 2017

Hmmm, how about we go back to the original suggestion of checking the buffer size? After all, we don't really want to enforce a file structure, we just want to make sure it isn't ballooning in size unreasonably..

@tacaswell
Copy link
Member

I like checking the document structure better than checking file size.

@jklymak
Copy link
Member Author
jklymak commented Nov 10, 2017

OK, is there a reason to not do both?

Logic here is that I can imagine other reasons the PDF file might balloon in size (font caching?) and this would at least force someone to change the test if that happened.

ax = fig.add_subplot(111)
ax.set_title('This is a long title')
fig.savefig(pdf, format="pdf")
assert sum(b'startxref' in line for line in pdfio) == 1
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing both tests now

@jklymak jklymak force-pushed the fixPDFbug branch 2 times, most recently from 02cf299 to 570756c Compare November 10, 2017 18:38
@tacaswell tacaswell merged commit d8154cb into matplotlib:master Nov 10, 2017
lumberbot-app bot pushed a commit that referenced this pull request Nov 10, 2017
@tacaswell
Copy link
Member

Thanks @jklymak !

Very glad this got fixed in time for 2.1.1

dstansby added a commit that referenced this pull request Nov 11, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend: pdf Release critical For bugs that make the library unusable (segfaults, incorrect plots, etc) and major regressions.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Large size of plots saved as pdf
4 participants
0