-
-
Notifications
You must be signed in to change notification settings - Fork 1.9k
[WIP] introduce a distinct bytes type. #580
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I expect this is only the tip of the iceberg, revealed by mypy's test suite which doesn't exercise the rest of the stdlib much.
@@ -114,6 +114,8 @@ class int(SupportsInt, SupportsFloat, SupportsAbs[int]): | |||
def __abs__(self) -> int: ... | |||
def __hash__(self) -> int: ... | |||
|
|||
long = int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just moved this to a more logical place.
@@ -210,7 +212,7 @@ class unicode(basestring, Sequence[unicode]): | |||
def center(self, width: int, fillchar: unicode = u' ') -> unicode: ... | |||
def count(self, x: unicode) -> int: ... | |||
def decode(self, encoding: unicode = ..., errors: unicode = ...) -> unicode: ... | |||
def encode(self, encoding: unicode = ..., errors: unicode = ...) -> str: ... | |||
def encode(self, encoding: unicode = ..., errors: unicode = ...) -> bytes: ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
encode() is just about the strongest signal that we want bytes, apart from using b''
literals.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should things like encoding
have the type Text
or NativeStr
(or even str
)? Consider various cases:
encoding: Text
. If we have'x' -> NativeStr
thenu'foo'.encode('utf8')
would be a type error, sinceNativeStr
is not compatible withText
.encoding: NativeStr
. Thenu'foo'.encode(u'utf8')
would result in a false positive.encoding: str
. Then we really don't have any type checks, any string-like object will be OK.
The idea of adding NativeStr
was that in PY2 it represents cases where an implicit unicode->bytes ASCII conversion takes place. And this is the case for encoding
in unicode.encode()
.
But having to write u'foo'.encode(NativeStr(u'utf8'))
is not that convenient. (Here the ASCII types proposal would have worked better, but only when you use literals).
On the other hand, if you have your encoding name as a string parameter (as opposed to a literal, and maybe it's a more important case overall) and you make encoding
in unicode.encode()
to be of type NativeStr
(as its run-time semantics suggests), then the cases are like that:
def f(encoding: Text): u'foo'.encode(encoding)
. You have to useu'foo'.encode(encoding.encode('ascii') if PY2 else encoding)
. Here you have to make your implicit conversion explicit by saying that you believe that this text can always be ASCII-converted to a 8-bit string. This way you make your code PY2+3 compatible.def f(encoding: NativeStr): u'foo'.encode(encoding)
. Everything looks good here.def f(encoding: str): u'foo'.encode(encoding)
. Again everything is OK, but it looks like another way of saying that you are sure that you are ASCII-compatible here.
So, a possible conclusion from this is: if you are sure that your argument gets implicitly ASCII-converted to a 8-bit string, then you should use NativeStr
for it. If you want to state that you don't really care about checking for implicit ASCII conversions (and PY2+3 compatibility in general), you should use str
.
@@ -366,11 +368,102 @@ class str(basestring, Sequence[str]): | |||
def __ge__(self, x: unicode) -> bool: ... | |||
def __mod__(self, x: Any) -> str: ... | |||
|
|||
class bytes(basestring, Sequence[bytes]): | |||
# TODO: double-check unicode, AnyStr, unions, bytearray |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So far this is just a clone of str()
above with all occurrences of str
replaced by bytes
.
def center(self, width: int, fillchar: bytes = ...) -> bytes: ... | ||
def count(self, x: unicode) -> int: ... | ||
def decode(self, encoding: unicode = ..., errors: unicode = ...) -> unicode: ... | ||
def encode(self, encoding: unicode = ..., errors: unicode = ...) -> bytes: ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, bytes probably shouldn't have encode() at all.
def count(self, x: unicode) -> int: ... | ||
def decode(self, encoding: unicode = ..., errors: unicode = ...) -> unicode: ... | ||
def encode(self, encoding: unicode = ..., errors: unicode = ...) -> bytes: ... | ||
def endswith(self, suffix: Union[unicode, Tuple[unicode, ...]]) -> bool: ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably use Union[bytes, Tuple[bytes, ...]]
, and similar for most other uses of unicode in this class.
@@ -913,8 +1003,8 @@ class file(BinaryIO): | |||
def __init__(self, file: unicode, mode: str = 'r', buffering: int = ...) -> None: ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, maybe we can commandeer 'file' (a PY2-only class) to be the subclass of IO[str] I wished for above.
@@ -18,30 +18,30 @@ class IOBase: | |||
... | |||
|
|||
class BytesIO(BinaryIO): | |||
def __init__(self, initial_bytes: str = ...) -> None: ... | |||
def __init__(self, initial_bytes: bytes = ...) -> None: ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything in this file honestly looks like an improvement (I may even keep it regardless of what we do here, since bytes would just be an alias for str).
@@ -74,7 +74,7 @@ class StringIO(TextIO): | |||
|
|||
class TextIOWrapper(TextIO): | |||
# write_through is undocumented but used by subprocess | |||
def __init__(self, buffer: IO[str], encoding: unicode = ..., | |||
def __init__(self, buffer: IO[bytes], encoding: unicode = ..., |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, I've got a feeling that the buffer argument here is actually more restricted than IO[] -- it probably needs to be derived from BufferedIOBase.
@@ -34,63 +34,63 @@ def compile(pattern: AnyStr, flags: int = ...) -> Pattern[AnyStr]: ... | |||
def compile(pattern: Pattern[AnyStr], flags: int = ...) -> Pattern[AnyStr]: ... | |||
|
|||
@overload | |||
def search(pattern: Union[str, unicode], string: AnyStr, flags: int = ...) -> Match[AnyStr]: ... | |||
def search(pattern: Union[bytes, str, unicode], string: AnyStr, flags: int = ...) -> Match[AnyStr]: ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Every API in this file got amended to add bytes to the union or overload. Beware, it's the pattern that's got the unions -- the search target uses AnyStr and the return type varies with that. But (at least in PY2) the pattern can vary independently.
@@ -30,7 +30,7 @@ DefaultDict = TypeAlias(object) | |||
Set = TypeAlias(object) | |||
|
|||
# Predefined type variables. | |||
AnyStr = TypeVar('AnyStr', str, unicode) | |||
AnyStr = TypeVar('AnyStr', bytes, str, unicode) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is key, but also annoying. E.g. this code breaks because of this:
def f(a): # type: (AnyStr) -> AnyStr
if isinstance(a, str): return '.'
else: return u'.'
You have to add a third case (elif isinstance(a, bytes): return b'.'
). You can'y even collapse this as isinstance(a, (str, bytes))
because the return type (in this example) must match the argument.
@@ -210,7 +212,7 @@ class unicode(basestring, Sequence[unicode]): | |||
def center(self, width: int, fillchar: unicode = u' ') -> unicode: ... | |||
def count(self, x: unicode) -> int: ... | |||
def decode(self, encoding: unicode = ..., errors: unicode = ...) -> unicode: ... | |||
def encode(self, encoding: unicode = ..., errors: unicode = ...) -> str: ... | |||
def encode(self, encoding: unicode = ..., errors: unicode = ...) -> bytes: ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should things like encoding
have the type Text
or NativeStr
(or even str
)? Consider various cases:
encoding: Text
. If we have'x' -> NativeStr
thenu'foo'.encode('utf8')
would be a type error, sinceNativeStr
is not compatible withText
.encoding: NativeStr
. Thenu'foo'.encode(u'utf8')
would result in a false positive.encoding: str
. Then we really don't have any type checks, any string-like object will be OK.
The idea of adding NativeStr
was that in PY2 it represents cases where an implicit unicode->bytes ASCII conversion takes place. And this is the case for encoding
in unicode.encode()
.
But having to write u'foo'.encode(NativeStr(u'utf8'))
is not that convenient. (Here the ASCII types proposal would have worked better, but only when you use literals).
On the other hand, if you have your encoding name as a string parameter (as opposed to a literal, and maybe it's a more important case overall) and you make encoding
in unicode.encode()
to be of type NativeStr
(as its run-time semantics suggests), then the cases are like that:
def f(encoding: Text): u'foo'.encode(encoding)
. You have to useu'foo'.encode(encoding.encode('ascii') if PY2 else encoding)
. Here you have to make your implicit conversion explicit by saying that you believe that this text can always be ASCII-converted to a 8-bit string. This way you make your code PY2+3 compatible.def f(encoding: NativeStr): u'foo'.encode(encoding)
. Everything looks good here.def f(encoding: str): u'foo'.encode(encoding)
. Again everything is OK, but it looks like another way of saying that you are sure that you are ASCII-compatible here.
So, a possible conclusion from this is: if you are sure that your argument gets implicitly ASCII-converted to a 8-bit string, then you should use NativeStr
for it. If you want to state that you don't really care about checking for implicit ASCII conversions (and PY2+3 compatibility in general), you should use str
.
@@ -716,11 +806,11 @@ def next(i: Iterator[_T]) -> _T: ... | |||
def next(i: Iterator[_T], default: _T) -> _T: ... | |||
def oct(i: int) -> str: ... # TODO __index__ | |||
@overload | |||
def open(file: str, mode: str = 'r', buffering: int = ...) -> BinaryIO: ... | |||
def open(file: str, mode: str = 'r', buffering: int = ...) -> IO[str]: ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't it return IO[bytes]
in PY2? As you've mentioned, the data in PY2 is always 8-bit strings, so open()
clearly behaves way too different in PY2 and PY3 except for open(..., '...b')
which is bytes
in PY2+3. There are more compatible functions codecs.open()
and io.open()
where you can return bytes
or Text
depending on the mode parameter for both PY2+3.
@vlasovskikh Thanks for the responses. I have to ponder these more, probably after making more changes to mypy to actually implement the proposal. |
I'm going to close this without merging. I don't think I'll every complete this in a satisfactory way, and I haven't found the need for this. Maybe in the future, but I kind of doubt it. |
@gvanrossum So what do you think of python/typing#208 now? |
…ython#580) Fixes python#579 . When performing an `issubclass` check of a type against a protocol, the `__annotations__` member of the type is accessed and assumed to be iterable. `__annotations__` is a descriptor in the case of `types.FunctionType`, so while it is iterable when accessed on a function instance it is not iterable when accessed on the type of a function. This causes the `issubclass` check to fail with an exception. In some cases (AFAICT, non-data protocols), an `isinstance` check of an object will use, internally, a subclass check of the object's type. As a result, `isinstance` will also fail with an exception in these conditions. The above only seemed to occur in Python 3. This PR fixes the issue in the Python 3 implementation while adding test coverage for both Python 2 and 3 that ensures that functions (and `types.FunctionType`) can be correctly compared both against protocols that they legitimately implement as well as those that they do not implement.
See python/typing#208