Fix PDFpages bug #9724

jklymak · 2017-11-09T00:00:13Z

PR Summary

Fixes #9716, I think.

Pre 433b899 code didn't call the finalize code path if the filename was a PdfPages instance.

PR Checklist

Has Pytest style unit tests
Code is PEP 8 compliant
New features are documented, with examples if plot related
Documentation is sphinx and numpydoc compliant
Added an entry to doc/users/next_whats_new/ if major new feature (follow instructions in README.rst there)
Documented in doc/api/api_changes.rst if API changed in a backward-incompatible way

jklymak · 2017-11-09T00:01:20Z

Using

import os

import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

from matplotlib.backends.backend_pdf import PdfPages

version = mpl.__version__

np.random.seed(0)
x = range(50)

size = {}
for f in [1, 10, 100]:
    fname = r"savefig_test_{} size {}_2.pdf".format(version, f)
    fig, ax = plt.subplots()
    pdf = PdfPages(fname)
    for i in range(f):
        ax.cla()
        y1 = np.random.randint(0, 100, size=50)
        y2 = np.random.randint(0, 100, size=50)
        ax.plot(x, y1, label="label_y1 (100%, 30%)")
        ax.plot(x, y2, label="label_y2 (90%, 80%)")
        ax.legend(loc='best')
        ax.set_title("chart with long title for testing: {}".format(i))
        fig.savefig(pdf, format='pdf')

    pdf.close()
    pdf = None
    size[f] = os.path.getsize(fname)/1024

for k, v in size.items():
    print("#charts: {:3} size(KB)={:8,.0f}  KB per chart={:5,.0f})"
          .format(k, v, v/k))

I get

#charts:   1 size(KB)=      14  KB per chart=   14)
#charts:  10 size(KB)=      32  KB per chart=    3)
#charts:  50 size(KB)=     111  KB per chart=    2)
#charts: 100 size(KB)=     210  KB per chart=    2)
#charts: 200 size(KB)=     409  KB per chart=    2)
#charts: 300 size(KB)=     607  KB per chart=    2)
#charts: 400 size(KB)=     804  KB per chart=    2)

PDFs open fine in Preview and Acrobat

afvincent · 2017-11-09T01:44:16Z

A bit out of my depth when it comes to the PDF backend, so I will not dare approving the PR 🐑. But FWIW, I reproduced locally @jklymak's fix and with it, the MWE that I used for bisecting in #9716 produces a PDF weighting ~8.6 kB, as 2.0.2 does (while 2.1.0 results in a ~13.7 kB file).

jklymak · 2017-11-09T01:47:37Z

@afvincent I'm sure you are less out of your depth than I am! All I did was look at what 2.0.2 did versus 2.1 wrt PdfPages and made this version do the same thing.

I did't test if the context awareness worked with this change, except all tests pass, and I hope there is a test for that functionality ;-)

afvincent · 2017-11-09T05:53:34Z

Quick (and maybe dumb) thought/suggestion: is there a point into transforming the snippet from above into some kind of test to avoid such a similar regression in the future? Something like saving the 50-chart case and checking that its size stays below let's say 256 kiB (or whatever other sensible arbitrary threshold).

(@jklymak I think that you are underestimating how little I know about all these I/Os and context managing questions ^^.)

afvincent · 2017-11-09T06:03:37Z

About the context awareness, a quick Ctrl+F for "with PdfPages(" in test_backend_pdf return 7 matches, so I guess that it is at least partly (implicitly) tested. An example of this is:

def test_multipage_pagecount():
    with PdfPages(io.BytesIO()) as pdf:
        assert pdf.get_pagecount() == 0
        fig = plt.figure()
        ax = fig.add_subplot(111)
        ax.plot([1, 2, 3])
        fig.savefig(pdf, format="pdf")
        assert pdf.get_pagecount() == 1
        pdf.savefig()
        assert pdf.get_pagecount() == 2

jkseppan

The change looks correct to me. It would be great to have a test that exercises this code path better, but feel free to file that as a bug and assign it to me.

jklymak · 2017-11-09T06:59:12Z

OK, I'll try and add a test. Maybe I can do something intelligent with IOStream? Otherwise I'm not too sure how we write files in the test suite....

jkseppan · 2017-11-09T14:20:57Z

Some existing tests use tempfile.NamedTemporaryFile but pytest has a tmpdir fixture that can make it even easier to generate a name for a temporary file.

The problem here was that PdfFile.finalize was called multiple times. One way to write the test could be to ensure that finalize raises an error when called more than once. Then probably some existing test will fail without the fix.

tacaswell

I would had the pdf pages object a BytesIO stream, write out multiple pages and the make sure that b'Kids' only shows up once in the buffer (had plans to do that last night but did not get to it).

jklymak · 2017-11-09T18:47:04Z

Test added. Test fails on pre-patch code passes on this code....

jkseppan · 2017-11-10T08:30:34Z

lib/matplotlib/tests/test_backend_pdf.py

+            ax = fig.add_subplot(111)
+            fig.savefig(pdf, format="pdf")
+    pdfio.seek(0)
+    assert sum(b'Kids' in line for line in pdfio) == 1


The PDF specification recommends arranging the page objects in a tree with Kids and Parent pointers, for fast retrieval. Currently we just make a one-level tree, but if this is ever fixed, it would break this test. How about checking for startxref instead?

I'm happy to punt improving this test to be part of that work.

I'm fine with merging this as is, but I think changing 'Kids' to 'startxref' here would still be correct and would be more robust against future changes.

I'm fine w that, but as below, what about just checking the size is small?

Oh, I miss-understood the suggestion, thought 'startxref' was some sort of xml parsing or something (sorry not enough coffee yet).

👍 to changing the search term to 'startxref'

jklymak · 2017-11-10T15:50:08Z

Hmmm, how about we go back to the original suggestion of checking the buffer size? After all, we don't really want to enforce a file structure, we just want to make sure it isn't ballooning in size unreasonably..

tacaswell · 2017-11-10T16:05:29Z

I like checking the document structure better than checking file size.

jklymak · 2017-11-10T16:48:01Z

OK, is there a reason to not do both?

Logic here is that I can imagine other reasons the PDF file might balloon in size (font caching?) and this would at least force someone to change the test if that happened.

jklymak · 2017-11-10T17:02:21Z

lib/matplotlib/tests/test_backend_pdf.py

+            ax = fig.add_subplot(111)
+            ax.set_title('This is a long title')
+            fig.savefig(pdf, format="pdf")
+    assert sum(b'startxref' in line for line in pdfio) == 1


Doing both tests now

tacaswell · 2017-11-10T21:24:41Z

Thanks @jklymak !

Very glad this got fixed in time for 2.1.1

Backport PR #9724 on branch v2.1.x

jklymak added this to the v2.1.1 milestone Nov 9, 2017

jklymak requested a review from jkseppan November 9, 2017 00:03

jklymak added backend: pdf Release critical For bugs that make the library unusable (segfaults, incorrect plots, etc) and major regressions. labels Nov 9, 2017

jklymak mentioned this pull request Nov 9, 2017

Large size of plots saved as pdf #9716

Closed

jkseppan approved these changes Nov 9, 2017

View reviewed changes

jkseppan reviewed Nov 10, 2017

View reviewed changes

tacaswell approved these changes Nov 10, 2017

View reviewed changes

Fix PDFpages bug

5f654dc

jklymak force-pushed the fixPDFbug branch from 2aafc6d to 687fe34 Compare November 10, 2017 17:01

jklymak commented Nov 10, 2017

View reviewed changes

jklymak force-pushed the fixPDFbug branch 2 times, most recently from 02cf299 to 570756c Compare November 10, 2017 18:38

TST: Check just one startxref and file size not ballooning

621393b

jklymak force-pushed the fixPDFbug branch from 570756c to 621393b Compare November 10, 2017 19:00

tacaswell merged commit d8154cb into matplotlib:master Nov 10, 2017

lumberbot-app bot pushed a commit that referenced this pull request Nov 10, 2017

Backport PR #9724: Fix PDFpages bug

0ba0b2a

lumberbot-app bot mentioned this pull request Nov 10, 2017

Backport PR #9724 on branch v2.1.x #9746

Merged

dstansby added a commit that referenced this pull request Nov 11, 2017

Merge pull request #9746 from matplotlib/auto-backport-of-pr-9724

c699513

Backport PR #9724 on branch v2.1.x

jklymak mentioned this pull request Nov 20, 2017

Multi-page PDF file size jumps since 2.0.0 #9819

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Fix PDFpages bug #9724

Fix PDFpages bug #9724

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Fix PDFpages bug #9724

Fix PDFpages bug #9724

Uh oh!

Conversation

PR Summary

PR Checklist

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!