Bug: binascii.a2b_uu incorrectly assumes padded bytes are always whitespace #100308


Open · ajmedeio opened this issue Dec 16, 2022 · 5 comments
Labels: extension-modules (C modules in the Modules dir)

Comments

@ajmedeio commented Dec 16, 2022

Bug Description

I was decoding some uuencoded data when I encountered a 'Trailing garbage' error from the binascii.a2b_uu function. After digging into Linux's uudecode implementation (L248) and other resources (linked below), I'm fairly certain the Python implementation is bugged.

The following is what I tried:

from binascii import a2b_uu
s = '%-@     !'
decoded = a2b_uu(s)

The expected output is:

print(decoded)  # b'6\x00\x00\x00\x00'

The actual output is:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
binascii.Error: Trailing garbage

Notice there are 5 bytes in the expected output (b'6\x00\x00\x00\x00') because the % (the first byte of the input string s) means 5 bytes of data follow (ASCII code 37 - 32 = 5). Because uuencoding works on groups of 3 bytes, the data is padded out to a multiple of 3 bytes before encoding, and the padding shows up as an extra character at the end of the encoded line; in this case it's an !.
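
To make the arithmetic concrete, here is a minimal pure-Python sketch that decodes the example line by hand. It is only an illustration of the format, not the CPython implementation.

line = '%-@     !'
n_bytes = (ord(line[0]) - 32) & 63               # '%' -> 5 payload bytes
values = [(ord(c) - 32) & 63 for c in line[1:]]  # one 6-bit value per character

out = bytearray()
for i in range(0, len(values), 4):               # every 4 characters encode 3 bytes
    a, b, c, d = (values[i:i + 4] + [0, 0, 0])[:4]
    out.append((a << 2) | (b >> 4))
    out.append(((b & 0x0F) << 4) | (c >> 2))
    out.append(((c & 0x03) << 6) | d)

print(bytes(out[:n_bytes]))                      # b'6\x00\x00\x00\x00' -- the pad byte is dropped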

The Python implementation assumes the padding is always whitespace. Different uuencoders use different characters for padding, though; I've seen three so far: the space (' '), the backtick ('`'), and '!'.
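
A quick way to see why any of these are harmless as padding: each decodes to a 6-bit value of 0 or 1, which either contributes nothing or only lands in the trailing bytes that the length byte tells the decoder to discard anyway.

for pad in (' ', '`', '!'):
    print(repr(pad), (ord(pad) - 32) & 63)   # ' ' -> 0, '`' -> 0, '!' -> 1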

The few lines of C code quoted below under "Proposed fix" (from Modules/binascii.c) are the issue.

Proposed fix

Simply remove the following lines (279 - 296). Or, if we really want to verify the padding, we could include '!' in the set of valid padding characters. (The linked Linux implementation does not verify padding at all.) Based on my research there is no well-defined padding character, so either way we risk jumping to the same potentially false conclusion we have here: believing we've accounted for all the padding characters that exist in the wild. A rough sketch of the more tolerant check follows the quoted C code.

/*
** Finally, check that if there's anything left on the line
** that it's whitespace only.
*/
while( ascii_len-- > 0 ) {
    this_ch = *ascii_data++;
    /* Extra '`' may be written as padding in some cases */
    if ( this_ch != ' ' && this_ch != ' '+64 &&
         this_ch != '\n' && this_ch != '\r' ) {
        state = get_binascii_state(module);
        if (state == NULL) {
            return NULL;
        }
        PyErr_SetString(state->Error, "Trailing garbage");
        Py_DECREF(rv);
        return NULL;
    }
}
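
As a rough sketch of the second option, in Python rather than C, and with an assumed (not standardized) set of tolerated pad characters, the more tolerant trailing check would look something like this:

ACCEPTED_TRAILING = frozenset(b" `!\r\n")   # ' ', ' '+64 (i.e. '`'), '!', CR, LF

def check_trailing(rest: bytes) -> None:
    # Sketch only: same idea as the quoted C loop, extended to tolerate '!'.
    # The accepted set is an assumption, not a documented standard.
    for byte in rest:
        if byte not in ACCEPTED_TRAILING:
            raise ValueError("Trailing garbage")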

Problematically, this bug propagated up to the uu_codec decode implementation as well (see the decode path in Lib/encodings/uu_codec.py).

A comment there indicates the caught exception and "workaround" are due to broken uuencoders. According to what I've read, it's actually the broken binascii.a2b_uu that incorrectly assumes any padding byte is ' ' or '`'.
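
From memory, the codec's decode loop does roughly the following (paraphrased, not a verbatim quote of Lib/encodings/uu_codec.py): it catches the error and retries on a slice of the line sized from the length byte, so the offending trailing characters are never inspected.

import binascii

s = b'%-@     !\n'                 # one uuencoded line, as the codec would see it
try:
    data = binascii.a2b_uu(s)
except binascii.Error:
    # "Workaround for broken uuencoders": re-decode only as many characters
    # as the declared length actually needs, so the trailing check never runs.
    nbytes = (((s[0] - 32) & 63) * 4 + 5) // 3
    data = binascii.a2b_uu(s[:nbytes])

print(data)                        # b'6\x00\x00\x00\x00'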

Here are the sources for my understanding of uu encoding:
Examples of non whitespace padding
Wikipedia uuencoding
Busybox uudecode implementation

Following is an illustration that helped me make sense of it:
[image: uuencode-bug-explanation]

[1] I couldn't find an RFC or other standards document, so I looked for the earliest implementation I could find (the 1983 Linux implementation) along with the Wikipedia entry.

In the meantime

If others encounter this issue, here is the workaround I'm using:

import binascii
from binascii import a2b_uu
from io import BytesIO

my_bytes = BytesIO()
line_bytes = b'%-@     !'
line = line_bytes.decode(encoding='ascii')
try:
    my_bytes.write(a2b_uu(line))
except binascii.Error as err:
    if 'trailing garbage' in str(err).lower():
        n_bytes = line_bytes[0] - 32
        assert n_bytes <= 45 and n_bytes <= len(line[1:])
        workaround_line = f'M{line[1:]}'  # replace first byte of UUEncoded line with max length specifier (M)
        data = a2b_uu(workaround_line)[:n_bytes]
        my_bytes.write(data)
    else:
        raise err
@ericvsmith (Member) commented

I guess the question is: are any of the padding bits allowed to be anything but zeros, or the special case (noted in the Wikipedia page) of using an accent grave?

What software did you see that used "!" as a padding character? I guess actually the padding is \x01, which gets converted to "!".
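
That guess checks out with the character arithmetic: the last character of a group carries the low six bits of the group's final byte, so a zero pad byte comes out as ' ' (or '`'), and a \x01 pad byte comes out as '!'.

print(chr(32 + (0x00 & 0x3F)))   # ' '  -- a zero pad byte
print(chr(32 + (0x01 & 0x3F)))   # '!'  -- a \x01 pad byte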

@ajmedeio (Author) commented Dec 21, 2022

@ericvsmith I don't know which program generated the uuencoding for these, but public financial reports such as the following (scroll to the very bottom of the txt file), and thousands of other examples across the SEC's EDGAR public database, have uuencoded data that uses ! (or, as you've correctly stated, \x01) as its padding character.

I also found a StackOverflow post from over a decade ago raising the same issue here.

@ericvsmith (Member) commented

Maybe we could add a strict flag, with a default of True. But since the uu module and the uu_codec are scheduled for deprecation, I'm not sure it is worth the effort. See PEP 594.

That said, it looks like uu_codec will actually decode this currently:

>>> import codecs
>>> codecs.decode(b'begin\n%-@     !\nend\n', 'uu_codec')
b'6\x00\x00\x00\x00'

@ajmedeio (Author) commented

@ericvsmith I was noticing that as well: the codecs module can decode the string. If you dig into it, it's using a workaround similar to the one I produced. I just feel the workaround shouldn't be necessary if the binascii module accepted other padding characters, as the Linux and other implementations do.

I agree, though: a strict=True flag could maintain backwards compatibility, while strict=False could be lenient about different encoders' idea of padding characters. Should we name it strict_padding for clarity, or is brevity preferred in this case?
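
As a thin illustration of what the lenient mode under discussion could do (the name strict_padding and the behaviour here are hypothetical, not an agreed API):

import binascii

def a2b_uu_lenient(line: bytes, *, strict_padding: bool = True) -> bytes:
    # Hypothetical sketch only. Lenient mode sidesteps the trailing check by
    # decoding with the maximum length specifier ('M') and then truncating
    # to the length the line actually declares.
    if strict_padding:
        return binascii.a2b_uu(line)
    n_bytes = (line[0] - 32) & 63
    return binascii.a2b_uu(b'M' + line[1:])[:n_bytes]

print(a2b_uu_lenient(b'%-@     !', strict_padding=False))   # b'6\x00\x00\x00\x00'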

@CAM-Gerlach (Member) commented

Just FYI, the uu module is deprecated as of 3.11 (with removal scheduled for 3.13) per PEP 594, though formal deprecation of the binascii functions and the uu codec in the codecs module is still pending in PR #92758.

Given that there appear to be a couple of niche legacy use cases, perhaps those would be better served by a dedicated PyPI package (perhaps based on a copy of the existing binascii and uu module code; #92758 is a pointer to that), developed and maintained by those, like yourself, with the most expertise in the format and a real-world use for it. In the standard library it has been effectively unmaintained for many years, and you'd typically need to upgrade to a whole new Python feature version to get new bug fixes, enhancements, etc., which I imagine is particularly undesirable for the legacy systems, applications and users still relying on this ancient and otherwise all-but-obsolete format.
