[WIP] introduce a distinct bytes type. #580

gvanrossum · 2016-10-01T01:34:06Z

gvanrossum

I expect this is only the tip of the iceberg, revealed by mypy's test suite which doesn't exercise the rest of the stdlib much.

gvanrossum · 2016-10-01T01:35:01Z

stdlib/2.7/__builtin__.pyi

@@ -114,6 +114,8 @@ class int(SupportsInt, SupportsFloat, SupportsAbs[int]):
    def __abs__(self) -> int: ...
    def __hash__(self) -> int: ...

+long = int


I just moved this to a more logical place.

gvanrossum · 2016-10-01T01:37:33Z

stdlib/2.7/__builtin__.pyi

@@ -210,7 +212,7 @@ class unicode(basestring, Sequence[unicode]):
    def center(self, width: int, fillchar: unicode = u' ') -> unicode: ...
    def count(self, x: unicode) -> int: ...
    def decode(self, encoding: unicode = ..., errors: unicode = ...) -> unicode: ...
-    def encode(self, encoding: unicode = ..., errors: unicode = ...) -> str: ...
+    def encode(self, encoding: unicode = ..., errors: unicode = ...) -> bytes: ...


encode() is just about the strongest signal that we want bytes, apart from using b'' literals.

Should things like encoding have the type Text or NativeStr (or even str)? Consider various cases:

encoding: Text. If we have 'x' -> NativeStr then u'foo'.encode('utf8') would be a type error, since NativeStr is not compatible with Text.

encoding: NativeStr. Then u'foo'.encode(u'utf8') would result in a false positive.

encoding: str. Then we really don't have any type checks, any string-like object will be OK.

The idea of adding NativeStr was that in PY2 it represents cases where an implicit unicode->bytes ASCII conversion takes place. And this is the case for encoding in unicode.encode().

But having to write u'foo'.encode(NativeStr(u'utf8')) is not that convenient. (Here the ASCII types proposal would have worked better, but only when you use literals).

On the other hand, if you have your encoding name as a string parameter (as opposed to a literal, and maybe it's a more important case overall) and you make encoding in unicode.encode() to be of type NativeStr (as its run-time semantics suggests), then the cases are like that:

def f(encoding: Text): u'foo'.encode(encoding). You have to use u'foo'.encode(encoding.encode('ascii') if PY2 else encoding). Here you have to make your implicit conversion explicit by saying that you believe that this text can always be ASCII-converted to a 8-bit string. This way you make your code PY2+3 compatible.

def f(encoding: NativeStr): u'foo'.encode(encoding). Everything looks good here.

def f(encoding: str): u'foo'.encode(encoding). Again everything is OK, but it looks like another way of saying that you are sure that you are ASCII-compatible here.

So, a possible conclusion from this is: if you are sure that your argument gets implicitly ASCII-converted to a 8-bit string, then you should use NativeStr for it. If you want to state that you don't really care about checking for implicit ASCII conversions (and PY2+3 compatibility in general), you should use str.

gvanrossum · 2016-10-01T01:38:11Z

stdlib/2.7/__builtin__.pyi

@@ -366,11 +368,102 @@ class str(basestring, Sequence[str]):
    def __ge__(self, x: unicode) -> bool: ...
    def __mod__(self, x: Any) -> str: ...

+class bytes(basestring, Sequence[bytes]):
+    # TODO: double-check unicode, AnyStr, unions, bytearray


So far this is just a clone of str() above with all occurrences of str replaced by bytes.

gvanrossum · 2016-10-01T01:39:01Z

stdlib/2.7/__builtin__.pyi

+    def center(self, width: int, fillchar: bytes = ...) -> bytes: ...
+    def count(self, x: unicode) -> int: ...
+    def decode(self, encoding: unicode = ..., errors: unicode = ...) -> unicode: ...
+    def encode(self, encoding: unicode = ..., errors: unicode = ...) -> bytes: ...


Hm, bytes probably shouldn't have encode() at all.

gvanrossum · 2016-10-01T01:39:42Z

stdlib/2.7/__builtin__.pyi

+    def count(self, x: unicode) -> int: ...
+    def decode(self, encoding: unicode = ..., errors: unicode = ...) -> unicode: ...
+    def encode(self, encoding: unicode = ..., errors: unicode = ...) -> bytes: ...
+    def endswith(self, suffix: Union[unicode, Tuple[unicode, ...]]) -> bool: ...


This should probably use Union[bytes, Tuple[bytes, ...]], and similar for most other uses of unicode in this class.

gvanrossum · 2016-10-01T02:01:05Z

stdlib/2.7/__builtin__.pyi

@@ -913,8 +1003,8 @@ class file(BinaryIO):
    def __init__(self, file: unicode, mode: str = 'r', buffering: int = ...) -> None: ...


Hm, maybe we can commandeer 'file' (a PY2-only class) to be the subclass of IO[str] I wished for above.

gvanrossum · 2016-10-01T02:02:11Z

stdlib/2.7/io.pyi

@@ -18,30 +18,30 @@ class IOBase:
    ...

 class BytesIO(BinaryIO):
-    def __init__(self, initial_bytes: str = ...) -> None: ...
+    def __init__(self, initial_bytes: bytes = ...) -> None: ...


Everything in this file honestly looks like an improvement (I may even keep it regardless of what we do here, since bytes would just be an alias for str).

gvanrossum · 2016-10-01T02:04:01Z

stdlib/2.7/io.pyi

@@ -74,7 +74,7 @@ class StringIO(TextIO):

 class TextIOWrapper(TextIO):
    # write_through is undocumented but used by subprocess
-    def __init__(self, buffer: IO[str], encoding: unicode = ...,
+    def __init__(self, buffer: IO[bytes], encoding: unicode = ...,


Hm, I've got a feeling that the buffer argument here is actually more restricted than IO[] -- it probably needs to be derived from BufferedIOBase.

gvanrossum · 2016-10-01T02:06:16Z

stdlib/2.7/re.pyi

@@ -34,63 +34,63 @@ def compile(pattern: AnyStr, flags: int = ...) -> Pattern[AnyStr]: ...
 def compile(pattern: Pattern[AnyStr], flags: int = ...) -> Pattern[AnyStr]: ...

 @overload
-def search(pattern: Union[str, unicode], string: AnyStr, flags: int = ...) -> Match[AnyStr]: ...
+def search(pattern: Union[bytes, str, unicode], string: AnyStr, flags: int = ...) -> Match[AnyStr]: ...


Every API in this file got amended to add bytes to the union or overload. Beware, it's the pattern that's got the unions -- the search target uses AnyStr and the return type varies with that. But (at least in PY2) the pattern can vary independently.

gvanrossum · 2016-10-01T02:09:23Z

stdlib/2.7/typing.pyi

@@ -30,7 +30,7 @@ DefaultDict = TypeAlias(object)
 Set = TypeAlias(object)

 # Predefined type variables.
-AnyStr = TypeVar('AnyStr', str, unicode)
+AnyStr = TypeVar('AnyStr', bytes, str, unicode)


This is key, but also annoying. E.g. this code breaks because of this:

def f(a): # type: (AnyStr) -> AnyStr if isinstance(a, str): return '.' else: return u'.'

You have to add a third case (elif isinstance(a, bytes): return b'.'). You can'y even collapse this as isinstance(a, (str, bytes)) because the return type (in this example) must match the argument.

vlasovskikh · 2016-10-05T20:46:23Z

stdlib/2.7/__builtin__.pyi

@@ -210,7 +212,7 @@ class unicode(basestring, Sequence[unicode]):
    def center(self, width: int, fillchar: unicode = u' ') -> unicode: ...
    def count(self, x: unicode) -> int: ...
    def decode(self, encoding: unicode = ..., errors: unicode = ...) -> unicode: ...
-    def encode(self, encoding: unicode = ..., errors: unicode = ...) -> str: ...
+    def encode(self, encoding: unicode = ..., errors: unicode = ...) -> bytes: ...


Should things like encoding have the type Text or NativeStr (or even str)? Consider various cases:

encoding: Text. If we have 'x' -> NativeStr then u'foo'.encode('utf8') would be a type error, since NativeStr is not compatible with Text.

encoding: NativeStr. Then u'foo'.encode(u'utf8') would result in a false positive.

encoding: str. Then we really don't have any type checks, any string-like object will be OK.

The idea of adding NativeStr was that in PY2 it represents cases where an implicit unicode->bytes ASCII conversion takes place. And this is the case for encoding in unicode.encode().

But having to write u'foo'.encode(NativeStr(u'utf8')) is not that convenient. (Here the ASCII types proposal would have worked better, but only when you use literals).

On the other hand, if you have your encoding name as a string parameter (as opposed to a literal, and maybe it's a more important case overall) and you make encoding in unicode.encode() to be of type NativeStr (as its run-time semantics suggests), then the cases are like that:

def f(encoding: Text): u'foo'.encode(encoding). You have to use u'foo'.encode(encoding.encode('ascii') if PY2 else encoding). Here you have to make your implicit conversion explicit by saying that you believe that this text can always be ASCII-converted to a 8-bit string. This way you make your code PY2+3 compatible.

def f(encoding: NativeStr): u'foo'.encode(encoding). Everything looks good here.

def f(encoding: str): u'foo'.encode(encoding). Again everything is OK, but it looks like another way of saying that you are sure that you are ASCII-compatible here.

So, a possible conclusion from this is: if you are sure that your argument gets implicitly ASCII-converted to a 8-bit string, then you should use NativeStr for it. If you want to state that you don't really care about checking for implicit ASCII conversions (and PY2+3 compatibility in general), you should use str.

vlasovskikh · 2016-10-05T20:50:47Z

stdlib/2.7/__builtin__.pyi

@@ -716,11 +806,11 @@ def next(i: Iterator[_T]) -> _T: ...
 def next(i: Iterator[_T], default: _T) -> _T: ...
 def oct(i: int) -> str: ...  # TODO __index__
 @overload
-def open(file: str, mode: str = 'r', buffering: int = ...) -> BinaryIO: ...
+def open(file: str, mode: str = 'r', buffering: int = ...) -> IO[str]: ...


Shouldn't it return IO[bytes] in PY2? As you've mentioned, the data in PY2 is always 8-bit strings, so open() clearly behaves way too different in PY2 and PY3 except for open(..., '...b') which is bytes in PY2+3. There are more compatible functions codecs.open() and io.open() where you can return bytes or Text depending on the mode parameter for both PY2+3.

gvanrossum · 2016-10-05T22:23:29Z

@vlasovskikh Thanks for the responses. I have to ponder these more, probably after making more changes to mypy to actually implement the proposal.

gvanrossum · 2016-11-09T03:46:45Z

I'm going to close this without merging. I don't think I'll every complete this in a satisfactory way, and I haven't found the need for this. Maybe in the future, but I kind of doubt it.

vlasovskikh · 2016-11-09T09:46:22Z

@gvanrossum So what do you think of python/typing#208 now?

…ython#580) Fixes python#579 . When performing an `issubclass` check of a type against a protocol, the `__annotations__` member of the type is accessed and assumed to be iterable. `__annotations__` is a descriptor in the case of `types.FunctionType`, so while it is iterable when accessed on a function instance it is not iterable when accessed on the type of a function. This causes the `issubclass` check to fail with an exception. In some cases (AFAICT, non-data protocols), an `isinstance` check of an object will use, internally, a subclass check of the object's type. As a result, `isinstance` will also fail with an exception in these conditions. The above only seemed to occur in Python 3. This PR fixes the issue in the Python 3 implementation while adding test coverage for both Python 2 and 3 that ensures that functions (and `types.FunctionType`) can be correctly compared both against protocols that they legitimately implement as well as those that they do not implement.

[WIP] introduce a distinct bytes type.

455f8aa

gvanrossum mentioned this pull request Oct 1, 2016

Decide how to handle str/unicode python/typing#208

Closed

gvanrossum commented Oct 1, 2016

View reviewed changes

gvanrossum mentioned this pull request Oct 1, 2016

[WIP] bytes/str/unicode python/mypy#2203

Closed

vlasovskikh suggested changes Oct 5, 2016

View reviewed changes

gvanrossum closed this Nov 9, 2016

gvanrossum deleted the strbyt branch February 7, 2017 18:47

hswong3i pushed a commit to alvistack/python-typeshed that referenced this pull request May 25, 2025

Implement support for PEP 764 (inline typed dictionaries) (python#580)

28f08ac

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

[WIP] introduce a distinct bytes type. #580

[WIP] introduce a distinct bytes type. #580

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

		@@ -913,8 +1003,8 @@ class file(BinaryIO):
		def __init__(self, file: unicode, mode: str = 'r', buffering: int = ...) -> None: ...

Uh oh!

[WIP] introduce a distinct bytes type. #580

[WIP] introduce a distinct bytes type. #580

Uh oh!

Conversation

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!