-
-
Notifications
You must be signed in to change notification settings - Fork 7.9k
Fix PDFpages bug #9724
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix PDFpages bug #9724
Conversation
Using import os
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
version = mpl.__version__
np.random.seed(0)
x = range(50)
size = {}
for f in [1, 10, 100]:
fname = r"savefig_test_{} size {}_2.pdf".format(version, f)
fig, ax = plt.subplots()
pdf = PdfPages(fname)
for i in range(f):
ax.cla()
y1 = np.random.randint(0, 100, size=50)
y2 = np.random.randint(0, 100, size=50)
ax.plot(x, y1, label="label_y1 (100%, 30%)")
ax.plot(x, y2, label="label_y2 (90%, 80%)")
ax.legend(loc='best')
ax.set_title("chart with long title for testing: {}".format(i))
fig.savefig(pdf, format='pdf')
pdf.close()
pdf = None
size[f] = os.path.getsize(fname)/1024
for k, v in size.items():
print("#charts: {:3} size(KB)={:8,.0f} KB per chart={:5,.0f})"
.format(k, v, v/k)) I get
PDFs open fine in Preview and Acrobat |
@afvincent I'm sure you are less out of your depth than I am! All I did was look at what 2.0.2 did versus 2.1 wrt I did't test if the context awareness worked with this change, except all tests pass, and I hope there is a test for that functionality ;-) |
Quick (and maybe dumb) thought/suggestion: is there a point into transforming the snippet from above into some kind of test to avoid such a similar regression in the future? Something like saving the 50-chart case and checking that its size stays below let's say 256 kiB (or whatever other sensible arbitrary threshold). (@jklymak I think that you are underestimating how little I know about all these I/Os and context managing questions ^^.) |
About the context awareness, a quick Ctrl+F for "with PdfPages(" in def test_multipage_pagecount():
with PdfPages(io.BytesIO()) as pdf:
assert pdf.get_pagecount() == 0
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot([1, 2, 3])
fig.savefig(pdf, format="pdf")
assert pdf.get_pagecount() == 1
pdf.savefig()
assert pdf.get_pagecount() == 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change looks correct to me. It would be great to have a test that exercises this code path better, but feel free to file that as a bug and assign it to me.
OK, I'll try and add a test. Maybe I can do something intelligent with |
Some existing tests use The problem here was that |
I would had the pdf pages object a BytesIO stream, write out multiple pages and the make sure that |
Test added. Test fails on pre-patch code passes on this code.... |
ax = fig.add_subplot(111) | ||
fig.savefig(pdf, format="pdf") | ||
pdfio.seek(0) | ||
assert sum(b'Kids' in line for line in pdfio) == 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PDF specification recommends arranging the page objects in a tree with Kids
and Parent
pointers, for fast retrieval. Currently we just make a one-level tree, but if this is ever fixed, it would break this test. How about checking for startxref
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy to punt improving this test to be part of that work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with merging this as is, but I think changing 'Kids'
to 'startxref'
here would still be correct and would be more robust against future changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine w that, but as below, what about just checking the size is small?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I miss-understood the suggestion, thought 'startxref'
was some sort of xml parsing or something (sorry not enough coffee yet).
👍 to changing the search term to 'startxref'
Hmmm, how about we go back to the original suggestion of checking the buffer size? After all, we don't really want to enforce a file structure, we just want to make sure it isn't ballooning in size unreasonably.. |
I like checking the document structure better than checking file size. |
OK, is there a reason to not do both? Logic here is that I can imagine other reasons the PDF file might balloon in size (font caching?) and this would at least force someone to change the test if that happened. |
ax = fig.add_subplot(111) | ||
ax.set_title('This is a long title') | ||
fig.savefig(pdf, format="pdf") | ||
assert sum(b'startxref' in line for line in pdfio) == 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doing both tests now
02cf299
to
570756c
Compare
Thanks @jklymak ! Very glad this got fixed in time for 2.1.1 |
Backport PR #9724 on branch v2.1.x
PR Summary
Fixes #9716, I think.
Pre 433b899 code didn't call the
finalize
code path if the filename was a PdfPages instance.PR Checklist