Decide how to deal with str/unicode #1141
Maybe a fifth approach: write 'basestring' instead of Union[str, unicode] (in places where that's preferred over AnyStr). That's how Python 2 code is |
Let me sketch how this would work:
The main new challenge compared to the first approach is that it may be difficult to make it practical to use |
An executive summary of the Good: user code can opt for strict Bad: an extra string type makes things a little more complicated, stubs are harder to write, typeshed needs a major update, code using |
I think AnyStr is probably the best approach for most cases, especially
when the corresponding PY3 method uses it.
But Matthias expressed a desire to use Union[str, unicode] in some places
where a type variable is inappropriate. One example for this that I
discovered is samefile(a, b) -> bool; this just calls os.stat(a) and
os.stat(b), and there's no need for a and b to have the same type, even in
PY3 (despite what typeshed/stdlib/3/os/path.pyi currently says).
So I'm merely proposing basestring as an acceptable shortcut for Union[str,
unicode] on PY2. (PY3 doesn't have it at all.)
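The `samefile` case can be sketched in Python 3 terms, where `Union[str, bytes]` stands in for what `basestring` (or `Union[str, unicode]`) would express on PY2. The helper name `same_file` and the alias `PathArg` are hypothetical; the function only mirrors the two `os.stat` calls described above and is not the stdlib implementation.

```python
import os
from typing import Union

# Python 3 stand-in for PY2's basestring / Union[str, unicode] (assumption).
PathArg = Union[str, bytes]


def same_file(a: PathArg, b: PathArg) -> bool:
    """Return True if both paths refer to the same file.

    Each argument is stat'ed independently, so nothing requires
    `a` and `b` to share a string type. A type variable such as
    AnyStr would reject the mixed call below without any benefit.
    """
    return os.path.samestat(os.stat(a), os.stat(b))


# A text path and a bytes path naming the same directory.
mixed_ok = same_file(os.curdir, os.fsencode(os.curdir))
```

With a union-style annotation the mixed call is accepted by a checker, which matches what the function actually does at runtime.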
Off-topic: eventually I think it would be cool to have a way to run mypy
and pytypes in "straddling" mode, where it checks compatibility with PY2
and PY3 simultaneously. Although I suppose you could emulate this by
running it twice and merging the output.
|
Ah okay. I'm not excited about having union type as the default way of doing anything common, as they are unintuitive to use, even if there is a type alias behind them. If we are going to have a |
As a bit of data on approach 1, several of the most frequently-hit exceptions in the Dropbox server codebase are UnicodeDecodeErrors and UnicodeEncodeErrors. It's not easy to tell for all of them, but many of them seem to be because a |
Approaches 3 and 4 would give the most effective detection of Unicode related errors. Approaches 2 and 5 would give some detection but built-in methods would probably have to be permissive and there would still be room for runtime errors. However, to adopt these strict approaches existing code may need a lot of work to pass type checking. |
So while in theory the return type of .encode() and .decode() depend on basestring.encode(enc: str) -> bytes |
I'm with Greg on this one. We want to explicitly distinguish between Also, |
I suggest that we take a few steps back and think about what we are trying to achieve via type checking. Things that mypy can type check right now (using Python 3 syntax but Python 2 semantics): a) Passing
b) Adding
These would be easy to support with small stub changes: c) Decode
d) Encode
Here are some additional things that we could catch if mypy did things differently:
|
[I submitted the previous comment accidentally when I was still writing it. I edited it afterwards but you may get an email with the old contents.] As followup to the previous comment, we should decide whether we want to catch issues like 1, 2, 3 and 4. Just removing Also, as I mentioned earlier, we should look into how the changes affect type checking existing code, as there are some subtle interactions. The
Currently the function type checks fine, because
The modified version is much longer and also less efficient. I'm not sure if people would be happy to modify their code like that -- they might prefer to use Also, consider a naive version of the annotated
This looks fine on the surface, but the call to However, consider a slightly different
This might actually not generate any type check errors (depending on how we deal with 1) above), even though it can fail similarly at runtime to the original |
Another random idea: We could have an internal type called
The |
I like having a variety of string types to express the various cases that occur in real PY2 code, while still catching as many bytes/unicode errors as possible. The error cases are mostly code that mixes str and unicode instances assuming that the str instances are 7-bit while they occasionally contain 8-bit data. They will raise some UnicodeError only when given 8-bit data. The false positives for such error checks would be code that looks the same as above but in reality only gets fed 7-bit data. My intuition is that the most common case for that is when the data is always coming from 7-bit string literals. So having a separate type inferred for 7-bit literals would be very useful. I also like distinguishing between str and bytes even though they are the same class -- clarifying intention is the reason the bytes alias was introduced. (Note there's also a b'...' literal.) Maybe we can draw up a class hierarchy including all the types we care about? I think the set of types would be:
I think it's non-controversial that basestring is the base of them all, with bytes and unicode as distinct subtypes, but beyond that it's tricky. Maybe str could be another distinct subtype of basestring, with ascii_str being a subtype of all? I.e.
We would strive for most code to be annotated using mostly bytes and unicode, or str when it's "old school text processing". I'm fine with not having a way for user code to spell ascii_str, as long as it's inferred appropriately for literals.
I guess AnyStr should have all four as constraints! It would be useful for many os and os.path functions, and many things that call them. Code would be prevented from mixing bytes, str and unicode -- mixing bytes or str with unicode is the source of UnicodeErrors, while mixing bytes with str points to conceptual impurity that needs to be resolved before the code can be ported to PY3.
In order to make it easier to write straddling code, we can define typing.unicode as an alias for builtins.unicode in PY2, and for builtins.str in PY3. (I prefer this over borrowing the string types from six -- not everyone uses six in their straddling code.)
I expect that this scheme will cause a certain amount of pain when starting to annotate "wild west" (I mean "real world" :-) PY2 code, but the pain should be exactly in the places that would cause problems when porting to PY3 (or straddling).
I wonder if we could experiment with a new approach using a different command line flag? E.g. --py2-strict. |
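A rough model of the hierarchy described in this comment, written as plain Python data so the subtype relations can be checked. This is only a sketch: the names `SUPERTYPES`, `is_subtype` and `infer_literal_type` are hypothetical, and falling back to `str` for a non-ASCII literal is an assumption, since the comment only says literals should be "inferred appropriately".

```python
from typing import Dict, Set

# Direct supertypes in the hierarchy sketched above (PY2 type names).
# "ascii_str" is the inferred-only type for 7-bit literals.
SUPERTYPES: Dict[str, Set[str]] = {
    "basestring": set(),
    "bytes": {"basestring"},
    "str": {"basestring"},
    "unicode": {"basestring"},
    "ascii_str": {"bytes", "str", "unicode"},
}


def is_subtype(sub: str, sup: str) -> bool:
    """Reflexive, transitive subtype check over SUPERTYPES."""
    if sub == sup:
        return True
    return any(is_subtype(parent, sup) for parent in SUPERTYPES[sub])


def infer_literal_type(literal: bytes) -> str:
    """Pick the type a checker might infer for a PY2 '...' literal.

    Assumption: literals with any 8-bit byte fall back to plain "str".
    """
    return "ascii_str" if all(byte < 0x80 for byte in literal) else "str"
```

Under this model a 7-bit literal is accepted wherever `bytes`, `str` or `unicode` is expected, while `bytes`, `str` and `unicode` values stay mutually incompatible.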
I've thought more about
Also this:
Yet another:
We could perhaps always promote inferred
Another example:
Note that in the above example We also need to decide if we'd disallow things like Finally, what about Python 2 library functions that only accept To experiment with all this, we may have to create separate stubs (at least for builtins) that follow the new conventions. The command-line option would enable the new stubs in addition to the new type checking mode. If/when we can pull this off, we could plausibly share many stdlib stubs between Python 2 and 3, at least once mypy also supports conditional Python version checks. Because of the complications mentioned above, perhaps we should just type check in both Python 2 and Python 3 modes in parallel and merge the results somehow? Python 2 mode could be permissive, since Python 3 mode would potentially catch most issues that could affect Python 2 as well. Not sure about the latter, though. |
This issue is becoming important in the context of newly-added Python 2.7 + Python 3 compatible comment-based syntax. See python/typing#19. |
Yes, it's important, but I'm also at a loss how to solve it. :-( That's why I didn't add it to the typing 3.5.2 milestone yet. Maybe you can bring it up on python-ideas? Some very bright minds there. |
I've started a discussion at python-ideas "Type hints for text/binary data in Python 2+3 code" and posted a draft based on the ideas discussed here + some ideas based on the usage of our old type system in PyCharm. |
@JukkaL I would really like to hear your feedback on this proposal based on your approach (1). I've posted it to python-ideas as a reply to "Type hints for text/binary data in Python 2+3 code". |
At Dropbox we are working with a large Python 2.7 codebase that has no chance of being type-checked in Python 3 mode (the number of errors would be too overwhelming). Yet we would like to find bugs in code that relies on implicit str<->unicode conversions. So I'm still hoping we'll be able to do something better than allow all such implicit conversions. I also hope we're not unique -- surely many legacy codebases exist where these implicit conversions are a major block for moving to Python 3. I guess we could use a non-standard copy of typeshed where str and bytes are separate types, and/or a non-standard version of mypy that always complains about implicit str<->unicode conversions? I'll have to think more about this. It would also be nice if encode()/decode() calls were only allowed in the direction that Python 3 allows them; I've seen many people utterly confused by code that was doing clever things with them. |
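The failure mode behind these implicit conversions can be emulated in Python 3, which never converts implicitly. The sketch below assumes Python 2's default ASCII codec for the `str` to `unicode` coercion; the helper names `py2_implicit_decode` and `py2_concat` are hypothetical and exist only for illustration.

```python
def py2_implicit_decode(data: bytes) -> str:
    """Emulate PY2's implicit str -> unicode coercion (ASCII codec)."""
    return data.decode("ascii")


def py2_concat(text: str, data: bytes) -> str:
    """Emulate `unicode + str` in PY2: the 8-bit operand is decoded first."""
    return text + py2_implicit_decode(data)


# 7-bit data: the implicit conversion succeeds.
ok = py2_concat("name: ", b"guido")

# 8-bit data: the same code path raises, but only at runtime.
try:
    py2_concat("name: ", "caf\u00e9".encode("utf-8"))
    failed = False
except UnicodeDecodeError:
    failed = True
```

Both calls look identical to a permissive checker, which is why the error only surfaces when non-ASCII data reaches the code.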
If we make However, I agree with Guido that it would be even nicer to type check code in Python 2 mode and catch more unicode-related errors. The proposal with most promise in my mind is similar to @vlasovskikh's first proposal with Here are my reasons why I'd rather do better type checking in Python 2 mode:
Here are some of my thoughts about experimenting with the better Python 2 checking mode using mypy:
The above rules will still reject things like |
Another point in favor of
With an It would also be awesome if mypy could infer this narrowing on its own so I wouldn't need the explicit cast. That's probably asking too much but it would be possible to determine that after the |
In that example, does f() always return a string? In that case AnyStr would I agree that the cast should ideally not be necessary; can you file that If the cast feels too expensive, I think you could rewrite it using a new
A plain assignment is way faster than a UPDATE: Why does GitHub not accept MarkDown from email? |
Right, I don't want the I'm not too concerned about the cost of the |
If anything, `x # type: T` should mean a variable declaration without a
value. But you can't declare a variable that already has been declared.
|
The cast should be redundant in the original example. If it isn't, it's a bug -- but I can't repro. |
I have a new proposal. In Python 2.7, there's a This isn't as thorough as the idea of introducing an "ascii bytes" type (used for str literals containing no non-ASCII bytes, and treated as a subclass of An even quicker experiment has suggested that the bytes-str-unicode idea is more feasible. (However, it requires a change to typed_ast.) UPDATE: The basic idea is that unconverted Python 2 code can use |
The new proposal doesn't seem to address one of the issues we tried to solve with the alternative proposal involving special types for ASCII-only strings. Consider this code:
The precise type for the second My alternative proposal would infer type
We may want to make it possible to override these with an annotation, so these would be valid as well:
The details of all this are still a little unclear to me, and I don't really know if the ascii-only type proposal would work in practice. At least it seems to solve the |
@JukkaL do you have a way to make the type of |
My idea would be to do two things:
|
I'm still skeptical about AsciiBytes, although it would be nice if I could do the experiment over with your proposed promotions. My problem is that nobody writes Now, I haven't completely worked out the best rules to use with my proposal, and maybe the rules should not allow silent propagation from unicode to str (only from str to unicode). But I'm still pretty skeptical that AsciiBytes is going to help much for writing more portable code, and I'm also skeptical that it would catch enough pure Python 2 issues without being annoying. (But I realize that right now it's annoying for a reason we can fix.) |
However, the code is questionable, as some binary file-like objects might not accept ascii unicode objects. This may be underspecified in the Python documentation. Let's look at how the I assume that in cases like For Subtyping would no longer be transitive -- Library functions that accept
All the above calls should arguably be okay if the functions don't modify the arguments. The issue is less relevant for user code, since we can assume that users consistently use either
Of course, we could just add the definitions (named It's a little unclear what to do with
Another example:
Yet another:
|
We talked about this offline for a while. We haven't decided yet, but it appears I had poorly explained my bytes/str/text proposal and Jukka's response was not so relevant. I want to move the discussion to python/typing#208 so I'll continue there. |
Closing this issue since discussion has moved to python/typing#208 and the resolution is to keep the current behavior. |
It seems that AnyStr has the main advantage of ensuring that multiple arguments (or argument(s) and return values) have the same type (i.e. `str` and `bytes` are both okay but shouldn't be mixed). However it makes it significantly more difficult to cast to a single type. With a Union, something like:

```python
def foo(a: Union[bytes, Text]) -> Text:
    if isinstance(a, bytes):
        a = a.decode()
    return "Hello, " + a
```

works fine. With AnyStr, you have to define a new intermediate variable, something like:

```python
def foo(a: AnyStr) -> str:
    if isinstance(a, bytes):
        b = a.decode()
    else:
        b = a
    return "Hello, " + b
```

If there's no advantage to using AnyStr (since there's no other AnyStr argument or return value), I think the Union would be simpler, have no significant disadvantage, and there is [precedent for similar changes](python#1054).

> It's only the case that if there's exactly one parameter of type exactly AnyStr, and no other use of AnyStr in the signature, then Union[str, bytes] should be acceptable. - [gvanrossum](python#439 (comment))

Discussion:
- python#1054
- python/mypy#1141
- python#439
We should agree on how we expect the string types to be used in Python 2 code.
There are at least four ways we can approach this:
1. `str` is usually valid when `unicode` is expected. This is how mypy currently works, and this is similar to how PEP 484 defines `bytearray` / `bytes` compatibility. This will correspond to runtime semantics, but it's not safe as non-ascii characters in `str` objects will result in programs sometimes blowing up. A 7-bit `str` instance is almost always valid at runtime when `unicode` is expected.
2. Drop the `str -> unicode` promotion and use `Union[str, unicode]` everywhere (or create an alias for it). This is almost like approach 1, except that we have a different name for `unicode` and more complex error messages and a complex programming model due to the proliferation of union types. There is potential for some additional type safety by using just `unicode` in user code.
3. Enforce a strict `str` / `unicode` distinction in Python 2 code, similar to Python 3 (`str` would behave more or less like Python 3 `bytes`), and discourage union types. This will make it harder to annotate existing Python 2 programs which often use the two types almost interchangeably, but it will make programs safer.
4. Use three string types. `bytes` (distinct from `str`) means 8-bit `str` instances -- these aren't compatible with `unicode`. `str` means ascii `str` instances. These are compatible with `bytes` and `unicode`, but not the other way around. `unicode` means `unicode` instances and isn't special. A string literal will have implicit type `str` or `bytes` depending on whether it only has ascii characters. This approach should be pretty safe and potentially also makes it fairly easy to adapt existing code, but harder than with approach 1.

These also affect how stubs should be written and thus it would be best if every tool using typeshed could use the same approach:

1. `str`, `unicode` or `AnyStr`. This is how many stubs are written already.
2. `str`, `Union[str, unicode]` or `AnyStr` for attributes and function arguments, and return types could additionally use plain `unicode`. Return types would in general be hard to specify precisely, as it may be difficult to predict whether a function called with `str` or combination of `str` and `unicode` returns `str`, `unicode` or `Union[str, unicode]`. In approach 1 we can safely fall back to `unicode` if unsure. `AnyStr` would be less useful as we could have mixed function arguments like `(str, unicode)` easily (see the typeshed issues mentioned below for more about this).
3. `str`, `unicode` or `AnyStr`, but `unicode` wouldn't accept plain `str` objects.
4. Three string types (`bytes`, `str`, `unicode`) in addition to `AnyStr`, and these would all behave differently. Unlike the first three approaches, `AnyStr` would range over `str`, `unicode` and `bytes` in Python 2 mode.

Note that mypy currently assumes approach 1 and I don't know how well the other approaches would work in practice.
[This was adapted from a comment on #1135; see the original issue for more discussion. Also, https://github.com/python/typeshed/issues/50 is relevant.]
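To illustrate what `AnyStr` buys in the stub discussion above, here is a minimal sketch using Python 3 semantics, where text and bytes never mix (roughly the strictness approach 3 aims for in Python 2). The function `join_path` is a hypothetical example, not a typeshed signature.

```python
from typing import AnyStr


def join_path(head: AnyStr, tail: AnyStr) -> AnyStr:
    """Join two path parts; both arguments and the result share one type."""
    sep = "/" if isinstance(head, str) else b"/"
    return head + sep + tail


text_result = join_path("usr", "bin")      # all text
bytes_result = join_path(b"usr", b"bin")   # all bytes

# A checker rejects mixed arguments for an AnyStr signature, and under
# Python 3 semantics the call also fails at runtime.
try:
    join_path("usr", b"bin")
    mixed_failed = False
except TypeError:
    mixed_failed = True
```

Under approach 1 in Python 2 mode, the analogous `(str, unicode)` call would be accepted through the promotion, which is why `AnyStr` is described as less useful when mixed arguments are common.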