Consistently use realpaths to build XObject names #15629

sauerburger · 2019-11-07T18:47:21Z

PR Summary

This PR addresses an issue in the PDF backend which lead to invalid PDFs. The PDF backend used two inconsistent ways to build the name for XObjects required for Unicode glyphs above code point 255.

The first method is based on the file name of the font returned by font_manager.findfont()`. This method is used to refer to the XObject glyphs when drawing text.
When the XObjects are created inside writeFonts() the realpath of the font returned by cbook.get_realpath_and_stat() is used instead.

The realpath differs from the file name of the font if the font is a symbolic link. In this case, the produced PDF contains invalid references to XObjects.

This commit adds a safe-guard in PdfFile._get_xobject_symbol_name(). The returned XObject names are always based on the realpath, even if the supplied path is a link. This achieves consistent XObject names and prevents invalid PDF files.

This PR resolves #15628.

PR Checklist

Has Pytest style unit tests
Code is Flake 8 compliant
New features are documented, with examples if plot related
Documentation is sphinx and numpydoc compliant
Added an entry to doc/users/next_whats_new/ if major new feature (follow instructions in README.rst there)
Documented in doc/api/api_changes.rst if API changed in a backward-incompatible way

Before this comment, the PDF backend used two inconsistent ways to build the name for XObjects required for Unicode glyphs above code point 255. The first method is based on the file name of the font returned by `font_manager.findfont()`. This method is used to refer to the XObject glyphs when drawing text. When the XObjects are created inside `writeFonts()` the realpath of the font returned by `cbook.get_realpath_and_stat()` is used instead. The realpath differs from the file name of the font if the font is a symbolic link. In this case, the produced PDF contains invalid references to XObjects. This commit adds a safe-guard in `PdfFile._get_xobject_symbol_name()`. The returned XObject names are always based on the realpath, even if the supplied path is a link. This achieves consistent XObject names and prevents invalid PDF files.

QuLogic · 2019-11-08T01:49:07Z

This does fix it, but do you know if all callers of _get_xobject_symbol_name process filenames in this way?

sauerburger · 2019-11-08T08:14:02Z

The method _get_xobject_symbol_name is used in three places:

Within embedTTF()/embedTTFType3(): embedTTF() gets the filename from its caller, writeFonts(), which uses the realpath. So, _get_xobject_symbol_name is called with the realpath already. The PR has not effect here.
Within draw_text(): Here, the _get_xobject_symbol_name is called with font.fname, which could be a link. With the fix, the _get_xobject_symbol_name, would resolve the link to its realpath.
Within draw_mathtext(): Here, it is harder to trace. MathTextParser creates a list of glyphs and their file names. As far as I can tell this is also based on font.fname, which again could be a link. With the fix, the _get_xobject_symbol_name, would resolve the link to its realpath.

Since the _get_xobject_symbol_name doesn't modify its arguments nor returns the path, it should not affect anything else besides the XObject names.

anntzer · 2019-11-08T09:28:56Z

While the fix seems reasonable, do we really need to resolve font realpaths? Can't we just use fonts under whatever path they are found by findfont(), without ever resolving them? Note that for a given request, findfont() should not ever return the same font under two different (symlinked) paths, so the only case we can end up with the same font embedded twice is if the user specifies two symlinks to the same font by explicit paths -- which should be rare enough, and just results in a file with the same font embedded under two different names, i.e. just a missed optimization on the file size.

QuLogic · 2019-11-08T09:32:30Z

Looking at the history, two symlinks pointing to the same file was exactly why this link-resolving code was added in the first place: c8ae4e2. Has findfont changed since then to make this redundant?

anntzer · 2019-11-08T09:35:07Z

findfont could well call realpath() itself on its output, saving the need for the callers to do so (but this would fail with hardlinks -- in fact the use of stat() rather than realpath() in the linked commit suggests that it intended to handle hardlinks as well).
the linked commit does not explain why this is an issue.

anntzer · 2019-11-08T09:58:57Z

Actually I just realized that I already implemented my suggestion as #15320; @sauerburger can you check whether that PR fixes your issue?

QuLogic · 2019-11-11T05:18:09Z

the linked commit does not explain why this is an issue.

Well, it came in around the same time as subsetting itself, so I assume the point is to make the subset as minimal as possible

anntzer · 2019-11-11T08:39:23Z

I simply don't believe the same font file being accessed from two different paths occurs frequently enough (or at all...) to make the path canonicalization worthwhile. One can for example check that luatex/xetex (for which an important selling point is indeed better font support than pdftex...) don't bother with this optimization: compiling

\documentclass{article}
\usepackage{fontspec}
\begin{document}
\setmainfont{DejaVuSans.ttf} Hello
\setmainfont{times.ttf} world!
\setmainfont{altdejavu.ttf} And hello
\setmainfont{times.ttf} again!
\end{document}

(where DejaVuSans.ttf and times.ttf are in the current directory, and altdejavu.ttf is either a hardlink or a symlink to DejaVuSans.ttf) with lualatex/xelatex and inspecting the resulting pdf using pdffonts shows that luatex embedded the two "copies" of DejaVu separately.

sauerburger · 2019-11-11T20:15:24Z

I've tested the implementation of #15320. There seems to be another issue. The PDF contains errors, and now, no character is shown correctly. example.pdf, CI run.

Correct me if I'm wrong, #15320 tracks fonts by their full path (without resolving links) and uses the filename (basename) of the font to build XObject names. Doesn't this lead to problems if two fonts have the same filename, such as Arial/truetype.ttf and OpenSans/truetype.ttf?

I can only think of two truely unique identifiers for the fonts:

the (resolved or unresolved) full path of the font and
the index of the font tracked in self.fontNames.
(The inode isn't unique if there are multiple mounted file systems.)

What do you think about using stub font names for XObjects such as font1-minus, font2-sum, ... based on the internal index of the font? If this is desired, I could prepare another PR/update this PR based on #15320.

anntzer · 2019-11-11T20:49:48Z

Fair point re duplicate basenames; I guess using e.g. 8000 {index}-{basename} would work?

sauerburger · 2019-11-19T09:58:47Z

Closing this PR in favor of #15686.

dstansby added the backend: pdf label Nov 7, 2019

tacaswell added this to the v3.3.0 milestone Nov 8, 2019

anntzer mentioned this pull request Nov 11, 2019

Rely on findfont autofallback-to-default in pdf/ps backends. #15673

Merged

6 tasks

This was referenced Nov 12, 2019

Simplify and unify character tracking in pdf and ps backends. #15320

Closed

Simplify and unify character tracking in pdf and ps backends (with linked fonts) #15686

Merged

sauerburger closed this Nov 19, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Consistently use realpaths to build XObject names #15629

Consistently use realpaths to build XObject names #15629

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Consistently use realpaths to build XObject names #15629

Consistently use realpaths to build XObject names #15629

Conversation

PR Summary

PR Checklist

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!