diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS
index a4254d43e5a..038b2b35741 100644
--- a/.github/CODEOWNERS
+++ b/.github/CODEOWNERS
@@ -663,7 +663,7 @@ peps/pep-0782.rst @vstinner
 peps/pep-0783.rst @hoodmane @ambv
 peps/pep-0784.rst @gpshead
 peps/pep-0785.rst @gpshead
-# ...
+peps/pep-0786.rst @ncoghlan
 peps/pep-0787.rst @ncoghlan
 peps/pep-0788.rst @ZeroIntensity @vstinner
 # ...
diff --git a/peps/pep-0786.rst b/peps/pep-0786.rst
new file mode 100644
index 00000000000..d36594403dd
--- /dev/null
+++ b/peps/pep-0786.rst
@@ -0,0 +1,354 @@
+PEP: 786
+Title: Precision and modulo-precision flag format specifiers for integer fields
+Author: Jay Berry
+Sponsor: Alyssa Coghlan
+Status: Draft
+Type: Standards Track
+Created: 04-Apr-2025
+Python-Version: 3.15
+Post-History: `14-Feb-2025 `__,
+
+
+Abstract
+========
+
+This PEP proposes implementing the standard format specifier ``.`` of :pep:`3101` as "precision" for integers formatted with the binary, octal, decimal, and hexadecimal presentation types, and implementing the standard format specifier ``z`` as a "modulo" flag for integers formatted with precision and the binary, octal, and hexadecimal presentation types, which first reduces the integer into ``range(base ** precision)``, resulting in a predictable two's complement style formatting.
+
+Both "precision" (``.``) and the "modulo-precision" flag (``z``) are presented in this PEP, as the alternative rejected implementations entail combinations of both.
+
+This PEP amends the clause of :pep:`3101` which states "[t]he precision is ignored for integer conversions".
+
+
+Rationale
+=========
+
+When string formatting integers in binary, octal, and hexadecimal, one often desires the resulting string to contain a guaranteed minimum number of digits. For unsigned integers of known machine-width bounds (for example, 8-bit bytes) this is often also exactly the number of digits produced. This has previously been implemented in the old-style ``%`` formatting using the ``.`` "precision" format specifier, closely related to that of the C programming language.
+
+.. code-block:: python
+
+    >>> "0x%.2x" % 15
+    '0x0f' # two hex digits, ideal for displaying an unsigned byte
+    >>> "0o%.3o" % 18
+    '0o022' # three octal digits, ideal for displaying a umask or file permissions
+
+When :pep:`3101` new-style formatting was first introduced, used in ``str.format`` and f-strings, the `format specification <formatspec_>`__ was simple enough that the behavior of "precision" could be trivially emulated with the ``width`` format specifier. Precision therefore was left unimplemented and forbidden for ``int`` fields. However, as time has progressed, new format specifiers have been added whose interactions with ``width`` cause its behavior to diverge noticeably from emulating precision, so the readmission of precision as its own format specifier, ``.``, is sufficiently warranted.
+
+The ``width`` format specifier guarantees a minimum length of the entire replacement field, not just the number of digits in a formatted integer. For example, the wonderful ``#`` specifier that prepends the prefix of the corresponding presentation type consumes from ``width``:
+
+.. code-block:: python
+
+    >>> x = 12
+    >>> f"0x{x:02x}" # manually specifying '0x' prefix
+    '0x0c' # two hex digits :)
+    >>> f"{x:#02x}" # use '#' format specifier to output '0x' automatically
+    '0xc' # only one hex digit :(
+    >>> f"{x:#08b}"
+    '0b001100' # we wanted 8 bits, not 6 :(
+
+One could attempt to argue that since the length of a prefix is known to always be 2, it can be accounted for manually by adding 2 to the desired number of digits. Consider, however, the following demonstrations of why this is a bad idea:
+
+* By correcting the second example to ``f"{x:#04x}"``, at a glance this looks like it may produce four hex digits, but it only produces two. This is bad for readability. ``4`` is thus too much of a 'magic number', and trying to counter that by being overly explicit with ``f"{x:#0{2+2}x}"`` looks ridiculous.
+* In the future it is possible that a type specifier may be added with a prefix not of length 2, meaning the programmer has to calculate the prefix length, rather than Python's internal string formatting code handling that automatically.
+* Things get more complicated when using the ``sign`` format specifier: ``f"{x: #0{1+2+2}x}"`` is required to produce ``' 0x0c'``.
+* Things get *even more* complicated when introducing a ``grouping_option``, for example formatting an integer into ``k`` 'word' segments joined by ``_``: ``x = 3735928559; k = 2; f"{x: #0{1+2+4*k+(k - 1)}_x}"`` is required to produce ``' 0xdead_beef'``. Surely this would be easier to write with precision as ``f"{x: #_.8x}"``?
+
+It is clear at this point that the reduction of complexity provided by implementing precision for ``int`` fields would be beneficial to any user. Nor is this proposal a new special-case behavior being demanded exclusively at the behest of ``int`` fields: the precision token ``.`` is already implemented as prescribed in :pep:`3101` for ``str`` data to truncate the field's length, and for ``float`` data to ensure that there are a fixed number of digits after the decimal point, e.g. ``f"{0.1+0.2: .4f}"`` producing ``' 0.3000'``. Thus no new tokens need adding to the `format specification <formatspec_>`__ because of this proposal, maintaining its modest size.
+
+For the sake of completeness, and for lack of any reasonable objection, we propose that precision shall also work in decimal, base 10. Explicitly, the integer presentation types laid out in :pep:`3101` that are permitted to implement precision are ``'b'``, ``'d'``, ``'o'``, ``'x'``, ``'X'``, ``'n'``, and ``''`` (``None``). The only presentation type not permitted is ``c`` ('character'), whose purpose is to format an integer to a single Unicode character, or an appropriate replacement for non-printable characters, for which it does not make sense to implement precision. In the event that new integer presentation types are added in the future, such as ``'B'`` and ``'O'``, which mutatis mutandis could provide the same behavior as ``'X'`` (that is, a capitalized prefix and digits), their addition should appropriately consider whether precision should be implemented or not. In the case of ``'B'`` and ``'O'`` as described here it would be correct to implement precision. A ``ValueError`` shall be raised when precision is used with an invalid integer presentation type.
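+
+To make the proposed semantics concrete, the behavior of ``.`` for integer fields can be emulated in today's Python with a small helper. This is only an illustrative sketch, not the proposed implementation, and the helper name ``format_with_precision`` is hypothetical:
+
+.. code-block:: python
+
+    def format_with_precision(x: int, precision: int, base_char: str = "x") -> str:
+        """Approximate the proposed precision formatting for integer fields."""
+        prefix = {"b": "0b", "o": "0o", "x": "0x", "X": "0X", "d": ""}[base_char]
+        sign = "-" if x < 0 else ""
+        digits = format(abs(x), base_char)  # digits only: no sign, no prefix
+        return f"{sign}{prefix}{digits.zfill(precision)}"  # pad the digits, not the whole field
+
+    assert format_with_precision(12, 2) == "0x0c"        # cf. "0x%.2x" % 12
+    assert format_with_precision(-12, 2) == "-0x0c"      # cf. "%#.2x" % -12
+    assert format_with_precision(18, 3, "o") == "0o022"  # cf. "0o%.3o" % 18
+    assert format_with_precision(3735928559, 8) == "0xdeadbeef"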
+
+
+Precision For Negative Numbers
+------------------------------
+
+So far in this PEP we have cautiously avoided talking about the formatting of negative numbers with precision, which we shall now discuss.
+
+
+Short Verdict
+'''''''''''''
+
+We desire two behaviors, which motivates the implementation of a flag ``z`` to toggle on the latter's behavior:
+
+* For precision without the ``z`` flag, a negative integer ``x`` shall be formatted with a negative sign and the digits of ``-x``'s formatting. This is the same friendly behavior as old-style ``%`` formatting.
+
+  For example ``f"{-12:#.2x}"`` shall produce ``'-0x0c'``, equivalent to ``"%#.2x" % -12``.
+
+* For precision with the ``z`` flag, ``q, r = divmod(x, base ** n)`` is first taken when formatting ``f"{x:z.{n}{base_char}}"``, and ``r`` is passed on to precision, the resulting string being equivalent to ``f"{r:.{n}{base_char}}"``. Because ``r`` is in ``range(base ** n)`` the number of digits will always be exactly ``n``, resulting in a predictable two's complement style formatting, which is useful to the end user in environments that deal with machine-width oriented integers such as :mod:`struct`.
+
+  For example in formatting ``f"{-1:z#.2x}"``, ``-1`` is reduced modulo ``256`` via ``divmod(-1, 256)``, which gives a quotient of ``-1`` and a remainder of ``255``; the resulting string is equivalent to ``f"{255:#.2x}"``, which is ``'0xff'``.
+
+  The ``z`` flag shall only be implemented for presentation types corresponding to bases that are powers of two, specifically at present binary, octal, and hexadecimal. Whilst reduction of integers modulo powers of ten is computationally possible, a 'ten's complement?' has no demand, and so the ``z`` flag is left unimplemented for decimal presentation types. The ``z`` flag shall work for all integers, not just negatives.
+
+  The syntax choice of ``z`` is again out of respect for maintaining the modest size of the `format specification <formatspec_>`__. ``z`` was introduced to the format specification in :pep:`682` as a flag for normalizing negative zero to positive zero for the ``float`` and ``Decimal`` types. It is currently unimplemented for the ``int`` type, and since integers never have a 'negative zero' situation it seems uncontroversial to repurpose ``z``, again as a flag. If one squints hard enough, the ``z`` looks like a ``2`` for two's complement!
+
+
+Long Introspection
+''''''''''''''''''
+
+We first present some observations about the binary representations of *signed* integers in two's complement. This leads us to a couple of alternative formulations of formatting negative numbers.
+
+Observe that one can always extend a signed number's binary representation by repeating the leading digit as a prefix:
+
+.. code-block:: text
+
+    45  (8-bit)   00101101
+    45  (9-bit)  000101101
+    -19 (8-bit)   11101101
+    -19 (9-bit)  111101101
+
+For non-negative numbers this is obvious. For negative numbers this is because the erstwhile leading column of an ``n``\ -bit representation goes from having a value of ``-2 ** (n-1)`` to ``+2 ** (n-1)``, with a new ``n+1``\ th column of value ``-2 ** n`` prefixed on, leaving the overall sum unaffected.
+
+This is what C's ``printf`` does, working with powers of two as the numbers of digits:
+
+.. code-block:: C
+
+    printf("%#hhb\n", -19); // 0b11101101
+    printf("%#hho\n", -19); // 0355
+    printf("%#hhx\n", -19); // 0xed
+
+    printf("%#b\n", -19);   // 0b11111111111111111111111111101101
+    printf("%#o\n", -19);   // 037777777755
+    printf("%#x\n", -19);   // 0xffffffed
+
+Conversely, it should be clear that one can losslessly truncate a signed number's binary representation to have only one leading ``0`` if it is non-negative, and one leading ``1`` if it is negative:
+
+.. code-block:: text
+
+    45  (8-bit)  00101101
+    45  (7-bit)   0101101
+    -19 (8-bit)  11101101
+    -19 (7-bit)   1101101
+
+If one were to truncate another digit off of these examples, then both would end up as ``101101``, 45 being indistinguishable from -19 when using only 6 binary digits because they are both the same modulo ``2 ** 6 = 64``. Therefore, to losslessly and unambiguously represent a signed integer ``x`` as a binary string which is rendered to the end user, we have a de facto 'minimal width' representation convention, using ``n`` digits, where ``n`` is the smallest integer such that ``x`` is in ``range(-2 ** (n-1), 2 ** (n-1))``.
+
+For rendering octal and hexadecimal strings one has to extend the definition of the 'minimal width' representation convention to be sufficiently unambiguous. 383's minimal width binary string is ``0101111111``, and -129's is ``101111111``, a suffix of the former's. A naive, incorrect implementation of hexadecimal string formatting would render both as ``'0x17f'`` by *padding* both binary representations to ``000101111111``. The method was correct to desire a number of binary digits (12) that is divisible by the number of bits in the base (4 bits in base 16), so that the binary representation can be segmented up into (hex) digits, but it was incorrect in *padding*; the method should have instead *extended* as we observed previously, 383 extending to ``000101111111`` and -129 extending to ``111101111111``, whence 383 is rendered as ``'0x17f'`` and -129 as ``'0xf7f'``.
+
+Thus the generalized definition of our 'minimal width' representation convention is: for an integer ``x`` to be rendered in base ``base``, produce ``n`` digits, where ``n`` is the smallest integer such that ``x`` is in ``range(-base ** n / 2, base ** n / 2)``.
+
+This leads on to the rejected alternatives.
+
+
+Rejected Alternatives
+=====================
+
+Behavior of ``z``
+-----------------
+
+The desired implementation of ``z``, the two's complement style formatting flag, has split opinions into two main camps, disagreeing over lossless vs. lossy presentation. The lossless camp believes that the formatted strings corresponding to integers should all be distinct from each other, uniqueness being preserved by the minimal width representation convention; precision with ``z`` enabled should still be only a *minimum* number of digits requested, as it is without ``z``. The lossy camp believes that precision with ``z`` enabled should first reduce the integer using modular arithmetic, which then produces *exactly* the number of digits requested, equivalent to left-truncating the minimal width representation string.
+
+We endeavor to conclude in the following section that the former camp, lossless formatting, has no use cases, and is thus a rejected idea, whence this PEP proposes the latter, lossy, behavior.
+
+
+Minimal Width Representation Convention
+'''''''''''''''''''''''''''''''''''''''
+
+This idea was fiercely entertained only due to its lossless behavior; however, it is an obstacle to ergonomics in every candidate use case. These arguments about the aesthetics of string rendering are not irrational or about personal taste; rather, they are crucial to how information is communicated to the end user.
+
+In a program in which signed-ness of integers is critical to communicate, any implementation of ``z`` should not be used, as the average user will be expecting to see a negative sign ``-``.
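+
+For concreteness, the rejected convention can be sketched in a few lines of Python; this is purely illustrative (it is *not* part of the proposal), and the helper name ``minimal_width_format`` is hypothetical:
+
+.. code-block:: python
+
+    def minimal_width_format(x: int, precision: int = 1, base_char: str = "x") -> str:
+        """Render x with the smallest n >= precision such that x is in range(-base**n/2, base**n/2)."""
+        base = {"b": 2, "o": 8, "x": 16}[base_char]
+        n = precision
+        while not -(base ** n) // 2 <= x < base ** n // 2:
+            n += 1
+        prefix = {"b": "0b", "o": "0o", "x": "0x"}[base_char]
+        return prefix + format(x % base ** n, base_char).zfill(n)
+
+    assert minimal_width_format(383, 2) == "0x17f"
+    assert minimal_width_format(-129, 2) == "0xf7f"
+    assert minimal_width_format(128, 2) == "0x080"   # a third digit is forced upon positive 128
+    assert minimal_width_format(-128, 2) == "0x80"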
+
+The alternative of using the minimal width representation convention requires one to be uncomfortably vigilant, looking for leading digits belonging to the upper half of the base's range whenever a negative number may be present (``1`` for binary, ``4-7`` for octal, and ``8-f`` for hex). Any end user that is not aware of this de facto convention, and even those who are but are not expecting it to be present in a program, would have a hard time:
+
+Under this convention, the formatting of 128 and -128 using ``f"{x:z#.2x}"`` would produce ``'0x080'`` and ``'0x80'`` respectively. It is the PEP author's opinion that there is a 0% chance that ``'0x80'`` is being read as *negative* 128 under normal conditions. Furthermore, the hideous rendering of positive 128 as ``'0x080'`` is useless for a program that should produce a uniformly spaced hexdump of bytes, agnostic of whether they are signed or unsigned; all bytes should be rendered in the form ``'0xNN'``. See the `examples <#modulo-precision>`__ section on how modulo-precision handles bytes in the correct sign-agnostic way.
+
+Contrapositively, ``z``'s purpose is therefore to be used in environments where signed-ness is *not* critical, and more likely than not where it is even encouraged to treat the integers with respect to the modular arithmetic that arises in two's complement hardware of fixed register sizes. In the example above 128 and -128 are the same modulo 256, and the respectable rendering is ``'0x80'``. In general the purpose of ``z`` is to treat integers that are the same modulo ``base ** precision`` as the same. So too 255 and -1 should both be rendered as ``'0xff'``, not ``'0x0ff'`` and ``'0xff'`` respectively; the truncation is not a hindrance, but the desired behavior. Formally we may say that the formatting should be a well-defined bijection between the equivalence classes of ``Z/(base ** precision)Z`` and strings with ``precision`` digits.
+
+The remaining question, as it was put, is "is there no chance to communicate this truncation to user?" [sic], a concern about the 'loss of information' arising from the effectively left-truncated strings. We reject this question's premise that there ever is such a case of unintentional loss of information, considering the two cases of hardware-aware integers and otherwise:
+
+So far we have played around with examples of bytes in ``range(-128, 256)``, the union of the signed and unsigned ranges, with respect to which the virtues of formatting ``x`` and ``x - 256`` the same are clearly established. In the hardware-aware contexts in which one expects to find ``z``, any integers corresponding to bytes that lie outside that range are likely a programming error. For example, if a library sets a pixel brightness integer to 257 and prints out ``'0x01'`` instead of ``'0x101'`` via ``f"{x:z#.2x}"``, that is not our problem or doing. String formatting shouldn't raise an exception, or even a ``SyntaxWarning`` as an invalid escape sequence ``"\y"`` would, because ``ValueError: bytes must be in range(0, 256)`` will be raised by ``bytes`` when trying to serialize that integer via ``bytes([257])``. Let the appropriate 'layer' of code raise the exception, as that is more indicative of a defect in the library than in our string formatting.
+
+In the case of non-hardware-aware integers one would have to intentionally opt in to ``z``, in which case modular arithmetic is the chosen, desired effect. It is for this reason also that we shall not raise a ``SyntaxWarning`` or ``ValueError`` for integers lying outside of ``range(-base ** precision / 2, base ** precision)``.
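+
+The intended lossy behavior is small enough to sketch directly; the helper below is illustrative only (the name ``modulo_precision_format`` is hypothetical), and relies on Python's ``%`` operator already returning a result in ``range(base ** precision)`` for any integer:
+
+.. code-block:: python
+
+    def modulo_precision_format(x: int, precision: int, base_char: str = "x") -> str:
+        """Approximate the proposed z flag: reduce modulo base ** precision, then format."""
+        base = {"b": 2, "o": 8, "x": 16, "X": 16}[base_char]
+        r = x % base ** precision  # always in range(base ** precision), even for negative x
+        prefix = {"b": "0b", "o": "0o", "x": "0x", "X": "0X"}[base_char]
+        return prefix + format(r, base_char).zfill(precision)
+
+    assert modulo_precision_format(-1, 2) == "0xff"    # the proposed f"{-1:z#.2x}"
+    assert modulo_precision_format(255, 2) == "0xff"   # -1 and 255 agree modulo 256
+    assert modulo_precision_format(128, 2) == "0x80"   # one string per equivalence class...
+    assert modulo_precision_format(-128, 2) == "0x80"  # ...of Z/(base ** precision)Z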
+
+..
+   XXX Give a good example of non-hardware aware use of modular arithmetic formatting like Minecraft buried treasure always being at 8,8 within a chunk.
+
+Thus we have defended the lossy behavior of ``z`` implemented as modulo-precision, and we have exhausted all reasonable use cases of lossless behavior.
+
+A final compromise to consider and reject is implementing ``z`` not as a flag *dependent* on ``.``, but as a flag that can be *combined* with ``.``. Specifically: ``z`` without ``.`` would turn on two's complement mode to render the minimal width representation of the formatted integer, ``.`` without ``z`` would implement precision as already explained (a minimum number of digits in the magnitude, and a sign if necessary), and ``z`` combined with ``.`` would turn on the left-truncating modulo-precision. This labyrinth of combinations does not seem useful to anyone: we have already discredited the ergonomics of the minimal width representation convention, whence ``z`` would rarely be used on its own, and having two options that individually render a *minimum* number of digits combine to render an *exact* number of digits seems counterintuitive.
+
+
+Infinite Length Indication
+''''''''''''''''''''''''''
+
+Another, less popular, rejected alternative was for ``z`` to directly acknowledge the infinite prefix of ``0``\ s or ``1``\ s that precedes a non-negative or negative number respectively. For example:
+
+.. code-block:: python
+
+    >>> f"{-1:z#.8b}"
+    '0b[...1]11111111'
+    >>> f"{300:z#.8b}"
+    '0b[...0]100101100'
+
+This is effectively the minimal width representation convention with an 'infinite' prefix attached to it.
+
+In the C programming language, the machine-width-dependent two's complement formatting of ``int`` data with precision exhibits excessively long prefixes for negative numbers, even those of small magnitude:
+
+.. code-block:: C
+
+    printf("%#.2x\n", -19); // 0xffffffed
+    printf("%#.2llx\n", (long long unsigned int)-19); // 0xffffffffffffffed
+
+This prefix could continue on indefinitely if it were not limited by a maximum machine-width!
+
+Python's ``int`` type is indeed not limited by a maximum machine-width. Thus, to avoid printing infinitely long two's complement strings, we could use a similar approach to that of the builtin ``list``'s string formatting for printing a list that contains itself:
+
+.. code-block:: python
+
+    >>> l = []
+    >>> l.append(l)
+    >>> l
+    [[...]]
+
+    >>> y = -1
+    >>> f"{y:z#.8b}"
+    '0b[...1]11111111'
+
+This may have been useful to educate beginners on how bitwise binary operations work, for example showing how ``-1 & x`` is always trivially equal to ``x``, or how the binary representation of the negation of a number can be obtained by adding one to its bitwise complement:
+
+.. code-block:: python
+
+    >>> x = 42
+    >>> f"{x:z#.8b}"
+    '0b[...0]00101010'
+    >>> f"{~x:z#.8b}"
+    '0b[...1]11010101'
+    >>> f"{x|~x:z#.8b}"
+    '0b[...1]11111111'
+    # x | ~x == -1
+    # x | ~x == x + ~x because of their disjoint bitwise representations
+    # thus x + ~x == -1
+    # thus -x == ~x + 1
+    >>> y = ~x + 1
+    >>> f"{y:z#.8b}"
+    '0b[...1]11010110'
+    >>> y == -x
+    True
+
+Its use case is just too narrow, and modulo-precision outshines it.
+
+
+General
+-------
+
+* What about ones' complement, or other binary representations?
+
+  Two's complement is so dominant that no one really considers other representations. GCC only supports two's complement, and C23 requires it for signed integer types.
+
+* Could we do nothing?
+
+  Programmers would continue to hobble along using the ``width`` format specifier with ad-hoc corrections to mimic precision. This is intolerable, and the rationale of this PEP makes conclusive arguments for the addition and implementation choices of precision.
+
+  Refusing to implement precision for integer fields using ``.`` reserves ``.`` for possible future uses. However, in the ~20 years since :pep:`3101`, no alternative has been accepted, and any alternate use of ``.`` would take it further out of sync with both old-style ``%`` formatting and the C programming language.
+
+
+Syntax
+------
+
+* ``!`` instead of ``z.`` for precision with modulo-precision, mutually exclusive with ``.``.
+
+  Pros:
+
+  - ``!`` is graphically related to ``.``, an extension if you will. Precision with the modulo-precision flag set is indeed an extension of precision.
+  - ``!`` in the English language is often used for imperative, commanding sentences. So too modulo-precision commands the *exact* number of digits to which its input shall be formatted, whereas precision is the *minimum* number of digits. This is idiomatic.
+  - ``!`` is only one symbol as opposed to ``z.``. This, coupled with ``!`` being mutually exclusive with ``.``, leaves the overall length of one's written code unaffected when switching on modulo-precision.
+  - Using a new ``!`` symbol reserves ``z`` for other future uses, whatever those may be.
+
+  Cons:
+
+  - ``z.`` also conveys a sense of extension from ``.``, a flag attached to ``.``, and reads left to right as 'modulo' (``z``) 'precision' (``.``).
+  - ``.`` and ``!`` being mutually exclusive to each other may give a beginner programmer analysis-paralysis over which to choose when looking at the `format specification <formatspec_>`__ documentation.
+  - ``!`` would be another addition to the format specification for a single purpose. It would not have any implementation for ``str``, ``float``, or any other type.
+  - There already exists a ``["!" conversion]`` "explicit conversion flag" in the `format string syntax <formatstrings_>`__ as laid out in :pep:`3101`. For example, in ``f"{s!r}"`` the ``!r`` calls ``repr`` on ``s``. This would *not* syntactically clash with a ``!`` format specifier, the format specifier ``[":" format_spec]`` being separated by a well-defined preceding colon; however, users unfamiliar with the new modulo-precision mode may glance over format strings containing ``!`` and expect different behavior.
+
+  Verdict:
+
+  - Whilst graphically attractive, ``!`` would clutter the format specification for a single purpose that can be achieved by overloading the preexisting ``z`` flag.
+
+
+Backwards Compatibility
+=======================
+
+To quote :pep:`682`:
+
+    The new formatting behavior is opt-in, so numerical formatting of existing programs will not be affected.
+
+unless someone out there is specifically relying upon ``.`` raising a ``ValueError`` for integers as it currently does, but to quote :pep:`475`:
+
+    The authors of this PEP don't think that such applications exist
+
+
+Examples And Teaching
+=====================
+
+Precision
+---------
+
+Documentation and tutorials in the Python sphere of influence should encourage the adoption of ``.``, precision, as the default format specifier for formatting ``int`` fields, as opposed to ``width``, when it is clear a minimum number of *digits* is required, not a minimum length of the *whole replacement field*.
+
+Since the concept of precision is common in other languages such as C, and was already present in Python's old-style ``%`` formatting, we don't need to go *too* overboard, but a few examples as below may demonstrate its uses.
+
+.. code-block:: python
+
+    >>> def hexdump(b: bytes) -> str:
+    ...     return " ".join(f"{c:#.2x}" for c in b)
+
+    >>> hexdump(b"GET /\r\n\r\n")
+    '0x47 0x45 0x54 0x20 0x2f 0x0d 0x0a 0x0d 0x0a'
+    # observe the CR and LF bytes padded to precision 2
+    # in this basic HTTP/0.9 request
+
+    >>> def unicode_dump(s: str) -> str:
+    ...     return " ".join(f"U+{ord(c):.4X}" for c in s)
+
+    >>> unicode_dump("USA 🦅")
+    'U+0055 U+0053 U+0041 U+0020 U+1F985'
+    # observe the last character's Unicode codepoint has 5 digits;
+    # precision is only the minimum number of digits
+
+
+Modulo-Precision
+----------------
+
+The clear area for encouraging the use of modulo-precision is when dealing with machine-width oriented integers, such as those packed and unpacked by :mod:`struct`. We give an example of the consistent, predictable two's complement formatting of signed and unsigned integers.
+
+.. code-block:: python
+
+    >>> import struct
+
+    >>> my_struct = b"\xff"
+    >>> (t,) = struct.unpack('b', my_struct) # signed char
+    >>> print(t, f"{t:#.2x}", f"{t:z#.2x}")
+    -1 -0x01 0xff
+    >>> (t,) = struct.unpack('B', my_struct) # unsigned char
+    >>> print(t, f"{t:#.2x}", f"{t:z#.2x}")
+    255 0xff 0xff
+
+    # observe in both the signed and unsigned unpacking the modulo-precision flag 'z'
+    # produces a predictable two's complement formatting
+
+
+Thanks
+======
+
+Thank you to:
+
+* Sergey B Kirpichev, for discussions and implementation code.
+* Raymond Hettinger, for the initial suggestion of the two's complement behavior.
+
+
+Copyright
+=========
+
+This document is placed in the public domain or under the
+CC0-1.0-Universal license, whichever is more permissive.
+
+
+TODO AND REMOVE BEFORE MERGE
+============================
+
+* Format all lines to ~80 characters. I've left this formatting until we're happy with the contents.
+* RFC 2119 Style Specification? After all is said and done here.
+* Give a good example of non-hardware aware use of modular arithmetic formatting, my brain has gone blank...
+
+
+Footnotes
+=========
+
+.. _formatstrings: https://docs.python.org/3/library/string.html#formatstrings
+.. _formatspec: https://docs.python.org/3/library/string.html#formatspec