8000 Consistently use realpaths to build XObject names by sauerburger · Pull Request #15629 · matplotlib/matplotlib · GitHub
[go: up one dir, main page]

Skip to content

Consistently use realpaths to build XObject names #15629

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

sauerburger
Copy link
Member

PR Summary

This PR addresses an issue in the PDF backend which lead to invalid PDFs. The PDF backend used two inconsistent ways to build the name for XObjects required for Unicode glyphs above code point 255.

  • The first method is based on the file name of the font returned by font_manager.findfont()`. This method is used to refer to the XObject glyphs when drawing text.
  • When the XObjects are created inside writeFonts() the realpath of the font returned by cbook.get_realpath_and_stat() is used instead.

The realpath differs from the file name of the font if the font is a symbolic link. In this case, the produced PDF contains invalid references to XObjects.

This commit adds a safe-guard in PdfFile._get_xobject_symbol_name(). The returned XObject names are always based on the realpath, even if the supplied path is a link. This achieves consistent XObject names and prevents invalid PDF files.

This PR resolves #15628.

PR Checklist

  • Has Pytest style unit tests
  • Code is Flake 8 compliant
  • New features are documented, with examples if plot related
  • Documentation is sphinx and numpydoc compliant
  • Added an entry to doc/users/next_whats_new/ if major new feature (follow instructions in README.rst there)
  • Documented in doc/api/api_changes.rst if API changed in a backward-incompatible way

Before this comment, the PDF backend used two inconsistent ways to build the
name for XObjects required for Unicode glyphs above code point 255. The first
method is based on the file name of the font returned by
`font_manager.findfont()`. This method is used to refer to the XObject glyphs
when drawing text. When the XObjects are created inside `writeFonts()` the
realpath of the font returned by `cbook.get_realpath_and_stat()` is used
instead. The realpath differs from the file name of the font if the font is a
symbolic link. In this case, the produced PDF contains invalid references to
XObjects.

This commit adds a safe-guard in `PdfFile._get_xobject_symbol_name()`. The
returned XObject names are always based on the realpath, even if the supplied
path is a link.  This achieves consistent XObject names and prevents invalid PDF
files.
@QuLogic
Copy link
Member
QuLogic commented Nov 8, 2019

This does fix it, but do you know if all callers of _get_xobject_symbol_name process filenames in this way?

@sauerburger
Copy link
Member Author

The method _get_xobject_symbol_name is used in three places:

  • Within embedTTF()/embedTTFType3(): embedTTF() gets the filename from its caller, writeFonts(), which uses the realpath. So, _get_xobject_symbol_name is called with the realpath already. The PR has not effect here.

  • Within draw_text(): Here, the _get_xobject_symbol_name is called with font.fname, which could be a link. With the fix, the _get_xobject_symbol_name, would resolve the link to its realpath.

  • Within draw_mathtext(): Here, it is harder to trace. MathTextParser creates a list of glyphs and their file names. As far as I can tell this is also based on font.fname, which again could be a link. With the fix, the _get_xobject_symbol_name, would resolve the link to its realpath.

Since the _get_xobject_symbol_name doesn't modify its arguments nor returns the path, it should not affect anything else besides the XObject names.

@anntzer
Copy link
Contributor
anntzer commented Nov 8, 2019

While the fix seems reasonable, do we really need to resolve font realpaths? Can't we just use fonts under whatever path they are found by findfont(), without ever resolving them? Note that for a given request, findfont() should not ever return the same font under two different (symlinked) paths, so the only case we can end up with the same font embedded twice is if the user specifies two symlinks to the same font by explicit paths -- which should be rare enough, and just results in a file with the same font embedded under two different names, i.e. just a missed optimization on the file size.

@QuLogic
Copy link
Member
QuLogic commented Nov 8, 2019

Looking at the history, two symlinks pointing to the same file was exactly why this link-resolving code was added in the first place: c8ae4e2. Has findfont changed since then to make this redundant?

@anntzer
Copy link
Contributor
anntzer commented Nov 8, 2019
  1. findfont could well call realpath() itself on its output, saving the need for the callers to do so (but this would fail with hardlinks -- in fact the use of stat() rather than realpath() in the linked commit suggests that it intended to handle hardlinks as well).
  2. the linked commit does not explain why this is an issue.

@anntzer
Copy link
Contributor
anntzer commented Nov 8, 2019

Actually I just realized that I already implemented my suggestion as #15320; @sauerburger can you check whether that PR fixes your issue?

@tacaswell tacaswell added this to the v3.3.0 milestone Nov 8, 2019
@QuLogic
Copy link
Member
QuLogic commented Nov 11, 2019

the linked commit does not explain why this is an issue.

Well, it came in around the same time as subsetting itself, so I assume the point is to make the subset as minimal as possible

@anntzer
Copy link
Contributor
anntzer commented Nov 11, 2019

I simply don't believe the same font file being accessed from two different paths occurs frequently enough (or at all...) to make the path canonicalization worthwhile. One can for example check that luatex/xetex (for which an important selling point is indeed better font support than pdftex...) don't bother with this optimization: compiling

\documentclass{article}
\usepackage{fontspec}
\begin{document}
\setmainfont{DejaVuSans.ttf} Hello
\setmainfont{times.ttf} world!
\setmainfont{altdejavu.ttf} And hello
\setmainfont{times.ttf} again!
\end{document}

(where DejaVuSans.ttf and times.ttf are in the current directory, and altdejavu.ttf is either a hardlink or a symlink to DejaVuSans.ttf) with lualatex/xelatex and inspecting the resulting pdf using pdffonts shows that luatex embedded the two "copies" of DejaVu separately.

@sauerburger
Copy link
Member Author

I've tested the implementation of #15320. There seems to be another issue. The PDF contains errors, and now, no character is shown correctly. example.pdf, CI run.

Correct me if I'm wrong, #15320 tracks fonts by their full path (without resolving links) and uses the filename (basename) of the font to build XObject names. Doesn't this lead to problems if two fonts have the same filename, such as Arial/truetype.ttf and OpenSans/truetype.ttf?

I can only think of two truely unique identifiers for the fonts:

  • the (resolved or unresolved) full path of the font and
  • the index of the font tracked in self.fontNames.
    (The inode isn't unique if there are multiple mounted file systems.)

What do you think about using stub font names for XObjects such as font1-minus, font2-sum, ... based on the internal index of the font? If this is desired, I could prepare another PR/update this PR based on #15320.

@anntzer
Copy link
Contributor
anntzer commented Nov 11, 2019

Fair point re duplicate basenames; I guess using e.g. 8000 {index}-{basename} would work?

@sauerburger
Copy link
Member Author

Closing this PR in favor of #15686.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Invalid unicode characters in PDF when font is a symlink
5 participants
0