10000 [WIP] introduce a distinct bytes type. by gvanrossum · Pull Request #580 · python/typeshed · GitHub
[go: up one dir, main page]

Skip to content

[WIP] introduce a distinct bytes type. #580

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed

[WIP] introduce a distinct bytes type. #580

wants to merge 1 commit into from

Conversation

gvanrossum
Copy link
Member

Copy link
Member Author
@gvanrossum gvanrossum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expect this is only the tip of the iceberg, revealed by mypy's test suite which doesn't exercise the rest of the stdlib much.

@@ -114,6 +114,8 @@ class int(SupportsInt, SupportsFloat, SupportsAbs[int]):
def __abs__(self) -> int: ...
def __hash__(self) -> int: ...

long = int
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just moved this to a more logical place.

@@ -210,7 +212,7 @@ class unicode(basestring, Sequence[unicode]):
def center(self, width: int, fillchar: unicode = u' ') -> unicode: ...
def count(self, x: unicode) -> int: ...
def decode(self, encoding: unicode = ..., errors: unicode = ...) -> unicode: ...
def encode(self, encoding: unicode = ..., errors: unicode = ...) -> str: ...
def encode(self, encoding: unicode = ..., errors: unicode = ...) -> bytes: ...
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

encode() is just about the strongest signal that we want bytes, apart from using b'' literals.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should things like encoding have the type Text or NativeStr (or even str)? Consider various cases:

  1. encoding: Text. If we have 'x' -> NativeStr then u'foo'.encode('utf8') would be a type error, since NativeStr is not compatible with Text.
  2. encoding: NativeStr. Then u'foo'.encode(u'utf8') would result in a false positive.
  3. encoding: str. Then we really don't have any type checks, any string-like object will be OK.

The idea of adding NativeStr was that in PY2 it represents cases where an implicit unicode->bytes ASCII conversion takes place. And this is the case for encoding in unicode.encode().

But having to write u'foo'.encode(NativeStr(u'utf8')) is not that convenient. (Here the ASCII types proposal would have worked better, but only when you use literals).

On the other hand, if you have your encoding name as a string parameter (as opposed to a literal, and maybe it's a more important case overall) and you make encoding in unicode.encode() to be of type NativeStr (as its run-time semantics suggests), then the cases are like that:

  1. def f(encoding: Text): u'foo'.encode(encoding). You have to use u'foo'.encode(encoding.encode('ascii') if PY2 else encoding). Here you have to make your implicit conversion explicit by saying that you believe that this text can always be ASCII-converted to a 8-bit string. This way you make your code PY2+3 compatible.
  2. def f(encoding: NativeStr): u'foo'.encode(encoding). Everything looks good here.
  3. def f(encoding: str): u'foo'.encode(encoding). Again everything is OK, but it looks like another way of saying that you are sure that you are ASCII-compatible here.

So, a possible conclusion from this is: if you are sure that your argument gets implicitly ASCII-converted to a 8-bit string, then you should use NativeStr for it. If you want to state that you don't really care about checking for implicit ASCII conversions (and PY2+3 compatibility in general), you should use str.

@@ -366,11 +368,102 @@ class str(basestring, Sequence[str]):
def __ge__(self, x: unicode) -> bool: ...
def __mod__(self, x: Any) -> str: ...

class bytes(basestring, Sequence[bytes]):
# TODO: double-check unicode, AnyStr, unions, bytearray
Copy link
Member Author
@gvanrossum gvanrossum Oct 1, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So far this is just a clone of str() above with all occurrences of str replaced by bytes.

def center(self, width: int, fillchar: bytes = ...) -> bytes: ...
def count(self, x: unicode) -> int: ...
def decode(self, encoding: unicode = ..., errors: unicode = ...) -> unicode: ...
def encode(self, encoding: unicode = ..., errors: unicode = ...) -> bytes: ...
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, bytes probably shouldn't have encode() at all.

def count(self, x: unicode) -> int: ...
def decode(self, encoding: unicode = ..., errors: unicode = ...) -> unicode: ...
def encode(self, encoding: unicode = ..., errors: unicode = ...) -> bytes: ...
def endswith(self, suffix: Union[unicode, Tuple[unicode, ...]]) -> bool: ...
Copy link
Member Author
@gvanrossum gvanrossum Oct 1, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably use Union[bytes, Tuple[bytes, ...]], and similar for most other uses of unicode in this class.

@@ -913,8 +1003,8 @@ class file(BinaryIO):
def __init__(self, file: unicode, mode: str = 'r', buffering: int = ...) -> None: ...
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, maybe we can commandeer 'file' (a PY2-only class) to be the subclass of IO[str] I wished for above.

@@ -18,30 +18,30 @@ class IOBase:
...

class BytesIO(BinaryIO):
def __init__(self, initial_bytes: str = ...) -> None: ...
def __init__(self, initial_bytes: bytes = ...) -> None: ...
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything in this file honestly looks like an improvement (I may even keep it regardless of what we do here, since bytes would just be an alias for str).

@@ -74,7 +74,7 @@ class StringIO(TextIO):

class TextIOWrapper(TextIO):
# write_through is undocumented but used by subprocess
def __init__(self, buffer: IO[str], encoding: unicode = ...,
def __init__(self, buffer: IO[bytes], encoding: unicode = ...,
Copy link
Member Author
@gvanrossum gvanrossum Oct 1, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I've got a feeling that the buffer argument here is actually more restricted than IO[] -- it probably needs to be derived from BufferedIOBase.

@@ -34,63 +34,63 @@ def compile(pattern: AnyStr, flags: int = ...) -> Pattern[AnyStr]: ...
def compile(pattern: Pattern[AnyStr], flags: int = ...) -> Pattern[AnyStr]: ...

@overload
def search(pattern: Union[str, unicode], string: AnyStr, flags: int = ...) -> Match[AnyStr]: ...
def search(pattern: Union[bytes, str, unicode], string: AnyStr, flags: int = ...) -> Match[AnyStr]: ...
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Every API in this file got amended to add bytes to the union or overload. Beware, it's the pattern that's got the unions -- the search target uses AnyStr and the return type varies with that. But (at least in PY2) the pattern can vary independently.

@@ -30,7 +30,7 @@ DefaultDict = TypeAlias(object)
Set = TypeAlias(object)

# Predefined type variables.
AnyStr = TypeVar('AnyStr', str, unicode)
AnyStr = TypeVar('AnyStr', bytes, str, unicode)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is key, but also annoying. E.g. this code breaks because of this:

def f(a): # type: (AnyStr) -> AnyStr
    if isinstance(a, str): return '.'
    else: return u'.'

You have to add a third case (elif isinstance(a, bytes): return b'.'). You can'y even collapse this as isinstance(a, (str, bytes)) because the return type (in this example) must match the argument.

@@ -210,7 +212,7 @@ class unicode(basestring, Sequence[unicode]):
def center(self, width: int, fillchar: unicode = u' ') -> unicode: ...
def count(self, x: unicode) -> int: ...
def decode(self, encoding: unicode = ..., errors: unicode = ...) -> unicode: ...
def encode(self, encoding: unicode = ..., errors: unicode = ...) -> str: ...
def encode(self, encoding: unicode = ..., errors: unicode = ...) -> bytes: ...
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should things like encoding have the type Text or NativeStr (or even str)? Consider various cases:

  1. encoding: Text. If we have 'x' -> NativeStr then u'foo'.encode('utf8') would be a type error, since NativeStr is not compatible with Text.
  2. encoding: NativeStr. Then u'foo'.encode(u'utf8') would result in a false positive.
  3. encoding: str. Then we really don't have any type checks, any string-like object will be OK.

The idea of adding NativeStr was that in PY2 it represents cases where an implicit unicode->bytes ASCII conversion takes place. And this is the case for encoding in unicode.encode().

But having to write u'foo'.encode(NativeStr(u'utf8')) is not that convenient. (Here the ASCII types proposal would have worked better, but only when you use literals).

On the other hand, if you have your encoding name as a string parameter (as opposed to a literal, and maybe it's a more important case overall) and you make encoding in unicode.encode() to be of type NativeStr (as its run-time semantics suggests), then the cases are like that:

  1. def f(encoding: Text): u'foo'.encode(encoding). You have to use u'foo'.encode(encoding.encode('ascii') if PY2 else encoding). Here you have to make your implicit conversion explicit by saying that you believe that this text can always be ASCII-converted to a 8-bit string. This way you make your code PY2+3 compatible.
  2. def f(encoding: NativeStr): u'foo'.encode(encoding). Everything looks good here.
  3. def f(encoding: str): u'foo'.encode(encoding). Again everything is OK, but it looks like another way of saying that you are sure that you are ASCII-compatible here.

So, a possible conclusion from this is: if you are sure that your argument gets implicitly ASCII-converted to a 8-bit string, then you should use NativeStr for it. If you want to state that you don't really care about checking for implicit ASCII conversions (and PY2+3 compatibility in general), you should use str.

@@ -716,11 +806,11 @@ def next(i: Iterator[_T]) -> _T: ...
def next(i: Iterator[_T], default: _T) -> _T: ...
def oct(i: int) -> str: ... # TODO __index__
@overload
def open(file: str, mode: str = 'r', buffering: int = ...) -> BinaryIO: ...
def open(file: str, mode: str = 'r', buffering: int = ...) -> IO[str]: ...
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't it return IO[bytes] in PY2? As you've mentioned, the data in PY2 is always 8-bit strings, so open() clearly behaves way too different in PY2 and PY3 except for open(..., '...b') which is bytes in PY2+3. There are more compatible functions codecs.open() and io.open() where you can return bytes or Text depending on the mode parameter for both PY2+3.

@gvanrossum
Copy link
Member Author

@vlasovskikh Thanks for the responses. I have to ponder these more, probably after making more changes to mypy to actually implement the proposal.

@gvanrossum
Copy link
Member Author

I'm going to close this without merging. I don't think I'll every complete this in a satisfactory way, and I haven't found the need for this. Maybe in the future, but I kind of doubt it.

@gvanrossum gvanrossum closed this Nov 9, 2016
@vlasovskikh
Copy link
Member

@gvanrossum So what do you think of python/typing#208 now?

@gvanrossum gvanrossum deleted the strbyt branch February 7, 2017 18:47
hswong3i pushed a commit to alvistack/python-typeshed that referenced this pull request May 25, 2025
…ython#580)

Fixes python#579 .

When performing an `issubclass` check of a type against a protocol, the `__annotations__` member of the type is accessed and assumed to be iterable. `__annotations__` is a descriptor in the case of `types.FunctionType`, so while it is iterable when accessed on a function instance it is not iterable when accessed on the type of a function. This causes the `issubclass` check to fail with an exception.

In some cases (AFAICT, non-data protocols), an `isinstance` check of an object will use, internally, a subclass check of the object's type. As a result, `isinstance` will also fail with an exception in these conditions.

The above only seemed to occur in Python 3.

This PR fixes the issue in the Python 3 implementation while adding test coverage for both Python 2 and 3 that ensures that functions (and `types.FunctionType`) can be correctly compared both against protocols that they legitimately implement as well as those that they do not implement.
hswong3i pushed a commit to alvistack/python-typeshed that referenced this pull request May 25, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants
0