
Decide how to deal with str/unicode #1141


Closed
JukkaL opened this issue Jan 21, 2016 · 33 comments

@JukkaL (Collaborator) commented Jan 21, 2016

We should agree on how we expect the string types to be used in Python 2 code.

There are at least four ways we can approach this:

  1. Make str usually valid when unicode is expected. This is how mypy currently works, and this is similar to how PEP 484 defines bytearray / bytes compatibility. This will correspond to runtime semantics, but it's not safe as non-ascii characters in str objects will result in programs sometimes blowing up. A 7-bit str instance is almost always valid at runtime when unicode is expected.
  2. Get rid of the str -> unicode promotion and use Union[str, unicode] everywhere (or create an alias for it). This is almost like approach 1, except that we'd use a different name in place of unicode, error messages would be more complex, and the proliferation of union types would complicate the programming model. There is potential for some additional type safety by using just unicode in user code.
  3. Enforce explicit str / unicode distinction in Python 2 code, similar to Python 3 (str would behave more or less like Python 3 bytes), and discourage union types. This will make it harder to annotate existing Python 2 programs which often use the two types almost interchangeably, but it will make programs safer.
  4. Have three different string types: bytes (distinct from str) means 8-bit str instances -- these aren't compatible with unicode. str means ascii str instances -- these are compatible with bytes and unicode, but not the other way around. unicode means unicode instances and isn't special. A string literal will have implicit type str or bytes depending on whether it only has ascii characters. This approach should be pretty safe and potentially also makes it fairly easy to adapt existing code, but harder than with approach 1.

These also affect how stubs should be written and thus it would be best if every tool using typeshed could use the same approach:

  • For approach 1, stubs should usually use str, unicode or AnyStr. This is how many stubs are written already.
  • For approach 2, stubs should use str, Union[str, unicode] or AnyStr for attributes and function arguments, and return types could additionally use plain unicode. Return types would in general be hard to specify precisely, as it may be difficult to predict whether a function called with str or a combination of str and unicode returns str, unicode or Union[str, unicode]. In approach 1 we can safely fall back to unicode if unsure. AnyStr would be less useful, since mixed argument combinations like (str, unicode) would be common (see the typeshed issues mentioned below for more about this).
  • For approach 3, stubs would usually use either str, unicode or AnyStr, but unicode wouldn't accept plain str objects.
  • For approach 4, stubs could use three different types (bytes, str, unicode) in addition to AnyStr, and these would all behave differently. Unlike the first three approaches, AnyStr would range over str, unicode and bytes in Python 2 mode.

Note that mypy currently assumes approach 1 and I don't know how well the other approaches would work in practice.
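To make the difference concrete, here is a rough sketch of what approach 1 (the current behavior) accepts and rejects; this is illustrative only, not output from any particular mypy version:

```python
def takes_unicode(x: unicode) -> None: ...
def takes_str(x: str) -> None: ...

takes_unicode('foo')    # accepted: str is implicitly promoted to unicode
takes_str(u'foo')       # error: unicode is not compatible with str
takes_unicode('\xff')   # also accepted, but may blow up at runtime on non-ascii data
```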

[This was adapted from a comment on #1135; see the original issue for more discussion. Also, https://github.com/python/typeshed/issues/50 is relevant.]

@gvanrossum (Member) commented:

Maybe a fifth approach: write 'basestring' instead of Union[str, unicode] (in places where that's preferred over AnyStr). That's how Python 2 code is supposed to test for either, e.g. isinstance(x, basestring).

@JukkaL (Collaborator, Author) commented Jan 21, 2016

basestring would likely be quite similar to approach 1. We'd just have a different name for the type. Also, unicode would be a separate type -- that would have its pros and cons. Note that basestring doesn't actually define any methods, so we'd be lying a bit by using it as a type, as it's not a real ABC, but I doubt it matters.

Let me sketch how this would work:

  • basestring is a supertype of str and unicode. All operations on basestring return basestring objects, and basestring method arguments have basestring types. AnyStr would range over str, unicode and basestring. Stubs would generally use str, basestring or AnyStr but user code could also decide to use unicode.

The main new challenge compared to the first approach is that it may be difficult to make it practical to use unicode types everywhere in user code, as library functions returning basestring could screw things up. So it's going to be important that library functions in stubs usually don't return basestring but instead return AnyStr, for example.
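To make that concrete, a stub under this scheme might look roughly like this (names are hypothetical, and AnyStr is assumed to range over str, unicode and basestring as described above):

```python
from typing import AnyStr

# Preferred: the caller gets back the same string type it passed in.
def normalize_sep(path: AnyStr) -> AnyStr: ...

# Less useful: forces basestring on callers, so strictly-unicode user code
# would need isinstance checks or conversions after every call.
def normalize_sep_weak(path: basestring) -> basestring: ...
```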

@JukkaL (Collaborator, Author) commented Jan 21, 2016

An executive summary of the basestring approach I sketched above compared to approach 1:

Good: user code can opt for strict unicode checking, less hacky string subtyping

Bad: an extra string type makes things a little more complicated, stubs are harder to write, typeshed needs a major update, code using basestring needs extra conversions to use code that uses strict unicode

@gvanrossum (Member) commented Jan 21, 2016 via email

@JukkaL (Collaborator, Author) commented Jan 21, 2016

Ah okay. I'm not excited about having a union type as the default way of doing anything common, as unions are unintuitive to use, even if there is a type alias behind them. If we are going to have a basestring type I'd suggest making it a class (essentially an ABC). This is clearly a philosophical difference between me and pytype, as pytype tends to infer a lot of union types and I've tried to avoid them as much as possible. I believe that pytype could still replace str or unicode in their inference results with basestring even if basestring was a class.

@gnprice (Collaborator) commented Jan 21, 2016

As a bit of data on approach 1, several of the most frequently-hit exceptions in the Dropbox server codebase are UnicodeDecodeErrors and UnicodeEncodeErrors. It's not easy to tell for all of them, but many of them seem to be because a str is used where a unicode is expected or vice versa. So making the distinction sharply would actually be helpful for catching bugs.
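For reference, the typical failure mode looks like this -- the implicit coercion only fails once non-ASCII data shows up:

```python
name = u'caf\xe9'                # unicode
greeting = 'Hello, ' + name      # works: the ASCII str is implicitly decoded to unicode
data = name.encode('utf-8')      # str containing non-ASCII bytes
broken = u'Hello, ' + data       # UnicodeDecodeError: the implicit ASCII decode of data fails
```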

@JukkaL (Collaborator, Author) commented Jan 21, 2016

Approaches 3 and 4 would give the most effective detection of Unicode-related errors. Approaches 2 and 5 would give some detection, but built-in methods would probably have to be permissive and there would still be room for runtime errors.

However, to adopt these strict approaches existing code may need a lot of work to pass type checking.

@gvanrossum (Member) commented:

So while in theory the return types of .encode() and .decode() depend on the encoding name, in practice the types seem to be

basestring.encode(enc: str) -> bytes
basestring.decode(enc: str) -> unicode

@matthiaskramm (Contributor) commented:

I'm with Greg on this one. We want to explicitly distinguish between str and unicode, to shake out bugs.
One of the main reasons we're working on type checking is to help us port code to Python 3. The most difficult part of porting to Python 3 is dealing with str and unicode. If a type checker can help sort those out ahead of time, that's a win.

Also, basestring as a shortcut for Union[str, unicode] sounds useful.

@JukkaL (Collaborator, Author) commented Jan 22, 2016

I suggest that we take a few steps back and think about what we are trying to achieve via type checking.

Things that mypy can type check right now (using Python 3 syntax but Python 2 semantics):

a) Passing unicode when a function expects str

def f(x: str): ...
f(u'foo')   # Error

b) Adding unicode to a str collection (special case of a, really)

x = []  # type: List[str]
...
x[i] = u'foo'  # error

These would be easy to support with small stub changes:

c) Decode unicode

u'\u1234'.decode(...)

d) Encode str

'\xff'.encode('utf8')

Here are some additional things that we could catch if mypy did things differently:

  1. Concatenating str and unicode (and related)

'\xff' + u'foo'
u'%s' % '\xff'
u'{}'.format('\xff')

  2. Passing str to a user function when unicode is expected:

def f(x: unicode) -> None: ...
f('\xff')

  3. Calling functions with a mix of str and unicode that can fail

os.path.join(u'foo', '\xff')

  4. Adding str to a unicode collection

x = []  # type: List[unicode]
...
x[i] = '\xff'

@JukkaL (Collaborator, Author) commented Jan 22, 2016

[I submitted the previous comment accidentally when I was still writing it. I edited it afterwards but you may get an email with the old contents.]

As a follow-up to the previous comment, we should decide whether we want to catch issues like 1, 2, 3 and 4. Just removing the str -> unicode promotion (and using Union[str, unicode] or basestring instead of unicode in stubs) would help with 2, 3 and 4, but it wouldn't catch 1.

Also, as I mentioned earlier, we should look into how the changes affect type checking existing code, as there are some subtle interactions. The basestring change discussed earlier may affect type checking of functions that use AnyStr, in particular. Consider this function:

def f(a: AnyStr, b: AnyStr) -> AnyStr:
    return os.path.join('base', a, b)

Currently the function type checks fine, because 'base' is compatible with unicode. If we removed this compatibility, we'd have to modify it, for example like this:

def f(a: AnyStr, b: AnyStr) -> AnyStr:
    if isinstance(a, str):
        base = 'base'
    else:
        base = u'base'
    return os.path.join(base, a, b)

The modified version is much longer and also less efficient. I'm not sure if people would be happy to modify their code like that -- they might prefer to use Any types even if they could figure out how to change their code.

Also, consider a naive version of the annotated f:

def f(a: basestring, b: basestring) -> basestring:
    return os.path.join('base', a, b)

This looks fine on the surface, but the call to os.path.join would be rejected by a type checker because it mixes str and unicode.

However, consider a slightly different f body:

def f(a: basestring, b: basestring) -> basestring:
    return 'base' + '/' + a + '/' + b

This might actually not generate any type check errors (depending on how we deal with 1) above), even though it can fail similarly at runtime to the original f.

@JukkaL (Collaborator, Author) commented Jan 22, 2016

Another random idea: We could have an internal type called ascii_str that would be the type of a string literal that only has 7-bit characters. It would be a subtype of both str and unicode, but str would not be a subtype of unicode. Now the AnyStr example from above would work:

def f(a: AnyStr, b: AnyStr) -> AnyStr:
    return os.path.join('base', a, b)  # no need for isinstance(...) since 'base' has type 'ascii_str'

The ascii_str type wouldn't be supported in annotations and would only work locally within functions or expressions. This would be a variant of the original approach 4 -- now str would conform to bytes in approach 4 and ascii_str would correspond to str in approach 4. I'm not sure whether this would actually work in general, though (same applies to approach 4).

@gvanrossum (Member) commented:

I like having a variety of string types to express the various cases that occur in real PY2 code, while still catching as many bytes/unicode errors as possible.

The error cases are mostly code that mixes str and unicode instances assuming that the str instances are 7-bit while they occasionally contain 8-bit data. They will raise some UnicodeError only when given 8-bit data.

The false positives for such error checks would be code that looks the same as above but in reality only gets fed 7-bit data. My intuition is that the most common case for that is when the data is always coming from 7-bit string literals. So having a separate type inferred for 7-bit literals would be very useful.

I also like distinguishing between str and bytes even though they are the same class -- clarifying intention is the reason the bytes alias was introduced. (Note there's also a b'...' literal.)

Maybe we can draw up a class hierarchy including all the types we care about? I think the set of types would be:

  • basestring
  • bytes
  • str
  • unicode
  • ascii_str

I think it's non-controversial that basestring is the base of them all, with bytes and unicode as distinct subtypes, but beyond that it's tricky. Maybe str could be another distinct subtype of basestring, with ascii_str being a subtype of all? I.e.

class basestring: ...
class bytes(basestring): ...  # type of b'...'
class str(basestring): ...  # type of '...' with some 8-bit chars
class unicode(basestring): ...  # type of u'...'
class ascii_str(bytes, str, unicode): ...  # type of '...' with only 7-bit chars

We would strive for most code to be annotated using mostly bytes and unicode, or str when it's "old school text processing". I'm fine with not having a way for user code to spell ascii_str, as long as it's inferred appropriately for literals.

I guess AnyStr should have all four as constraints! It would be useful for many os and os.path functions, and many things that call them.

Code would be prevented from mixing bytes, str and unicode -- mixing bytes or str with unicode is the source of UnicodeErrors, while mixing bytes with str points to conceptual impurity that needs to be resolved before the code can be ported to PY3.

In order to make it easier to write straddling code, we can define typing.unicode as an alias for builtins.unicode in PY2, and for builtins.str in PY3. (I prefer this over borrowing the string types from six -- not everyone uses six in their straddling code.)
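A minimal sketch of how straddling code could use such an alias (shown here with typing.Text, the spelling that eventually shipped for this idea; the comment above proposes the name typing.unicode):

```python
from typing import Text  # unicode on Python 2, str on Python 3

def greet(name):
    # type: (Text) -> Text
    return u'Hello, ' + name
```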

I expect that this scheme will cause a certain amount of pain when starting to annotate "wild west" (I mean "real world" :-) PY2 code, but the pain should be exactly in the places that would cause problems when porting to PY3 (or straddling).

I wonder if we could experiment with a new approach using a different command line flag? E.g. --py2-strict.

@JukkaL (Collaborator, Author) commented Jan 26, 2016

I've thought more about ascii_str and we need to consider what to do about type inference.

def f(x: List[str]) -> None: ...
a = ['']  # infer type List[ascii_str] or List[str]? or require an explicit annotation?
f(a) # is this okay?

Also this:

a = ['']
a.append('\x80')  # ok?

Yet another:

if x:
    s = ''
else:
    s = u'foo'  # error?

We could perhaps always promote inferred ascii_str to str within generic types unless there is an explicit annotation with AnyStr. Thus r would have to be annotated here:

def f(x: AnyStr) -> List[AnyStr]:
    r = ['']  # type: List[AnyStr]
    r.append(x)
    return r

isinstance type checking likely needs to be a little different:

def f(x: AnyStr) -> AnyStr:
    if isinstance(x, str):
        return ''   # this would be taken if x is str, bytes or ascii_str?
    else:
        return u'foo'   # only taken if x is unicode

Another example:

def f(x: AnyStr) -> AnyStr:
    if isinstance(x, bytes):
        return b''   # this would be taken if x is str, bytes or ascii_str (Python 2)
    else:
        return u'foo'   # only taken if x is unicode (Python2; str in Python 3)

Note that in the above example b'' needs to be a subtype of str and ascii_str if we want the code to be Python 3 compatible -- but this is against the design :-(

We also need to decide whether we'd disallow things like basestring + basestring, i.e. does one generally have to use isinstance guards before operating on basestring objects?

Finally, what about Python 2 library functions that only accept str values? I assume we'd need an AnyStr alternative that ranges over ascii_str, str and bytes, and in Python 3 mode it would be an alias for bytes.

To experiment with all this, we may have to create separate stubs (at least for builtins) that follow the new conventions. The command-line option would enable the new stubs in addition to the new type checking mode.

If/when we can pull this off, we could plausibly share many stdlib stubs between Python 2 and 3, at least once mypy also supports conditional Python version checks.

Because of the complications mentioned above, perhaps we should just type check in both Python 2 and Python 3 modes in parallel and merge the results somehow? Python 2 mode could be permissive, since Python 3 mode would potentially catch most issues that could affect Python 2 as well. Not sure about the latter, though.

@ddfisher added the "bug" and "needs discussion" labels and removed the "bug" label Mar 1, 2016
@ddfisher ddfisher added this to the 0.4.0 milestone Mar 1, 2016
@vlasovskikh (Member) commented:

This issue is becoming important in the context of newly-added Python 2.7 + Python 3 compatible comment-based syntax. See python/typing#19.

@gvanrossum (Member) commented:

Yes, it's important, but I'm also at a loss how to solve it. :-( That's why I didn't add it to the typing 3.5.2 milestone yet. Maybe you can bring it up on python-ideas? Some very bright minds there.

@vlasovskikh (Member) commented:

I've started a discussion at python-ideas "Type hints for text/binary data in Python 2+3 code" and posted a draft based on the ideas discussed here + some ideas based on the usage of our old type system in PyCharm.

@vlasovskikh (Member) commented:

@JukkaL I would really like to hear your feedback on this proposal based on your approach (1). I've posted it to python-ideas as a reply to "Type hints for text/binary data in Python 2+3 code".

@gvanrossum (Member) commented:

At Dropbox we are working with a large Python 2.7 codebase that has no chance of being type-checked in Python 3 mode (the number of errors would be too overwhelming). Yet we would like to find bugs in code that relies on implicit str<->unicode conversions. So I'm still hoping we'll be able to do something better than allow all such implicit conversions. I also hope we're not unique -- surely many legacy codebases exist where these implicit conversions are a major block for moving to Python 3.

I guess we could use a non-standard copy of typeshed where str and bytes are separate types, and/or a non-standard version of mypy that always complains about implicit str<->unicode conversions? I'll have to think more about this. It would also be nice if encode()/decode() calls were only allowed in the direction that Python 3 allows them; I've seen many people utterly confused by code that was doing clever things with them.
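For context, this is the confusing Python 2 behavior being referred to: both methods exist on both string types, and calling them in the "wrong" direction goes through an implicit ASCII step:

```python
'caf\xc3\xa9'.decode('utf-8')   # str -> unicode: the direction Python 3 keeps (bytes.decode)
u'caf\xe9'.encode('utf-8')      # unicode -> str: the direction Python 3 keeps (str.encode)
'caf\xc3\xa9'.encode('utf-8')   # implicitly ASCII-decodes first: UnicodeDecodeError
u'caf\xe9'.decode('utf-8')      # implicitly ASCII-encodes first: UnicodeEncodeError
```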

@JukkaL (Collaborator, Author) commented Mar 26, 2016

If we make str and unicode compatible both ways in Python 2 mode, it would be almost the same as making them the same type. This would make Python 2 mode pretty weak. Although the fact that mypy doesn't treat unicode as compatible with str will cause some false positives, I haven't heard any complaints about that feature yet. (I'm not sure if our Python 2 users have tried to type check code doing tricky str/unicode manipulation, though.) I'd prefer the one-way implicit promotion currently supported by mypy over the proposed two-way promotion, unless we get some data from real programs that would indicate otherwise.

However, I agree with Guido that it would be even nicer to type check code in Python 2 mode and catch more unicode-related errors. The proposal with the most promise in my mind is similar to @vlasovskikh's first proposal, with Text and bytes and an internal type for ascii strings. Treating u'ascii' as a separate type seems to overcomplicate things, so I'd leave that out (but see below for more about this). It's hard to say whether this would work in practice, so doing a largish-scale experiment seems like the way to go here.

Here are my reasons why I'd rather do better type checking in Python 2 mode:

  • It's likely going to be harder to migrate to Python 2+3 compatible code than to just get Python 2 code to type check cleanly with more precise str/unicode checking. Complicating factors include these:
    1. Users would have to reason in two modes (Python 2 and 3) and understand their differences.
    2. All stubs would need to be Python 2 and 3 compatible. This can be awkward if dependencies include Python 2 only libraries, which still exist.
    3. Users will have to deal with many Python 2/3 incompatibilities early on, including syntax, even if they just want better type checking, and have no immediate intent to migrate to Python 3.
  • Type checking in Python 2+3 mode would likely be about half as fast compared to just Python 2 mode, assuming that we can parallelize checking in Python 2 mode. This is a pretty significant drawback.
  • As pointed out by Andrew Barnert on python-ideas, the Python 2+3 mode would still miss runtime errors that only happen in Python 2. I'm not sure how significant this would be, but Python 3 mode would not be a perfect proxy for Python 2 correctness as there is no 1:1 mapping between types in Python 2 and 3 modes (str in Python 2 can map to bytes or str in Python 3).

Here are some of my thoughts about experimenting with the better Python 2 checking mode using mypy:

  • We'd need some separate typeshed stubs, with at least new stubs for typing and builtins, and possibly others.
  • The builtins stub would need revamped stubs for str and unicode, and a new stub for _AsciiStr.
  • Python 2 str would not support encode, and unicode would not support decode.
  • typing would need Text (though that's optional for the experiment as we can use unicode) and a new definition for AnyStr that ranges over str, unicode and _AsciiStr.
  • Mypy should recognize _AsciiStr as compatible with both str and unicode.
  • Mypy should give 7-bit string literals the type _AsciiStr.
  • Mypy should not promote str to unicode.
  • Mypy type inference should be more clever about inferring types for variables that are initialized to _AsciiStr. We can use the concept of partial types for this and this might not be too hard, at least for the most common cases. For things like x = [''] mypy might have to fall back to type List[str], even though the user might have intended List[unicode].
  • Add a command-line option that enables the new rules for Python 2 and uses the experimental stubs.
  • Once all the above are done, we can experiment with real Python 2 code, including things like the implementation of os.path, to evaluate how well things work.

The above rules will still reject things like getattr(x, u'foo'), but I think that this is not too serious. Perhaps we can handle some of the cases like getattr with local ad-hoc type checking rules such as "if an ascii unicode literal is used in an argument context that requires str, don't complain".

@bdarnell (Contributor) commented:

Another point in favor of Union[str, bytes] instead of the TypeVar AnyStr: A union can be "narrowed" with a cast. It's common in my py2-compatible code to accept both bytes and unicode, then immediately coerce to one or the other:

def f(x: Union[str, bytes]):
    if isinstance(x, bytes):
        x = x.decode('utf8')
    x = cast(str, x)
    # From here on x is a str instead of a union.

With an AnyStr parameter, assigning the result of the cast back to the parameter fails. This is easy to work around (just use different names), but it's a more natural fit for my existing code to accept a Union and narrow it instead of making two separate variables. (I can't rename the parameter because of the possibility that it is passed by keyword, so I have to rename the local variable instead, probably giving it a more awkward name even though it is referenced more frequently than the parameter.)

It would also be awesome if mypy could infer this narrowing on its own so I wouldn't need the explicit cast. That's probably asking too much but it would be possible to determine that after the if, it's no longer possible for x to be bytes.

@gvanrossum (Member) commented Apr 24, 2016

In that example, does f() always return a string? In that case AnyStr would be completely out of place; its intended use is when the arg type(s) and return type vary together (like os.listdir()).

I agree that the cast should ideally not be necessary; can you file that separately? I'm not sure how easy that would be to implement -- it feels like a fair amount of special-casing of unions and isinstance and branching would be required to get this right.

If the cast feels too expensive, I think you could rewrite it using a new variable, e.g.

if isinstance(x, bytes):
    y = x.decode('utf8')
else:
    y = x
return y

A plain assignment is way faster than a cast() call (until we teach CPython about cast, anyway).

UPDATE: Why does GitHub not accept MarkDown from email?

@bdarnell (Contributor) commented:

Right, I don't want the AnyStr behavior of linking two types together, but often there is only one string argument (and a return value of some fixed type), so I discovered this when I used AnyStr out of laziness. I found myself wishing that AnyStr had been defined as Union[str, bytes] and the TypeVar version had a different name that reflected its more specialized use.

I'm not too concerned about the cost of the cast() calls at this point, but I'll probably use the two-variable technique instead if I need any in performance-critical code. (Is it too ugly to suggest that x # type: T be recognized as equivalent to x = cast(T, x)?)

@gvanrossum (Member) commented Apr 24, 2016 via email

@JukkaL (Collaborator, Author) commented Apr 24, 2016

The cast should be redundant in the original example. If it isn't, it's a bug -- but I can't repro.

@gvanrossum (Member) commented Aug 10, 2016

I have a new proposal. In Python 2.7, there's a bytes alias for str. I propose to distinguish between these two in such a way that bytes and str are compatible with each other, and also str and unicode are compatible with each other, but bytes and unicode are not compatible with each other.

This isn't as thorough as the idea of introducing an "ascii bytes" type (used for str literals containing no non-ASCII bytes, and treated as a subclass of str), but a quick experiment with that idea suggests that it is infeasible. The main problem is that e.g. ['a', 'b', 'c'] would have type List[AsciiBytes], which is not a subtype of List[str] even though AsciiBytes is a subtype of str. This leads to a large number of errors on existing code bases.

An even quicker experiment has suggested that the bytes-str-unicode idea is more feasible. (However, it requires a change to typed_ast.)

UPDATE: The basic idea is that unconverted Python 2 code can use str/unicode and mix them in potentially unhealthy ways -- there's just no way that we're going to get a handle on that in mypy. However, users can start using bytes and typing.Text to signal that their code is intended to be straddling (i.e. compatible with Python 2+3) and then run mypy twice, once in Python 2 mode and once in Python 3 more, to ensure that it remains compatible with both. (Note that typing.Text remains a pure alias for unicode.)
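A small sketch of what those compatibility rules would accept and reject (function names are illustrative only):

```python
def wants_text(x: unicode) -> None: ...
def wants_bytes(x: bytes) -> None: ...

wants_text('abc')    # ok: str is compatible with unicode
wants_bytes('abc')   # ok: str is compatible with bytes
wants_text(b'abc')   # error: bytes is not compatible with unicode
wants_bytes(u'abc')  # error: unicode is not compatible with bytes
```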

@JukkaL (Collaborator, Author) commented Aug 10, 2016

The new proposal doesn't seem to address one of the issues we tried to solve with the alternative proposal involving special types for ASCII-only strings. Consider this code:

getattr(o, u'foo')

The precise type for the second getattr argument (and apparently for many other functions implemented in C) would be Union[str, ASCIIUnicode]. Union[bytes, unicode] would also work, but it would be too general, as the function doesn't accept unicode strings with non-ascii characters. We probably don't want to make ascii-only unicode literals generally subtypes of str, since they can't be concatenated to non-ascii str objects, or generally combined with arbitrary str (or bytes) objects in operations.

My alternative proposal would infer type List[str] for x in code like the below -- AsciiBytes (or AsciiStr) types would be promoted to str implicitly when inferring variable types:

x = ['a', 'b', 'c']

We may want to make it possible to override these with an annotation, so these would be valid as well:

y = ['a', 'b', 'c']  # type: List[AsciiStr]
z = ['a', 'b', 'c']  # type: List[unicode]

The details of all this are still a little unclear to me, and I don't really know if the ascii-only type proposal would work in practice. At least it seems to solve the getattr type checking issue. It could be a major problem for code using unicode_literals.

@gvanrossum (Member) commented:

@JukkaL do you have a way to make the type of [''] be List[str] instead of List[AsciiBytes]? (And ditto for dicts and sets.) Then I could experiment some more with that approach.

@JukkaL (Collaborator, Author) commented Aug 11, 2016

My idea would be to do two things:

  1. If the inferred type of a variable is AsciiBytes, promote it to str. This handles cases like x = ''.
  2. If we infer constraints like object <: T <: AsciiBytes for a type variable when solving constraints, promote it to str instead of picking AsciiBytes since str is included in the range. If we infer constraints like AsciiBytes <: T <: AsciiBytes then we can't promote since the range is too narrow. This happens in mypy/solve.py and would likely solve x = [''].

@gvanrossum (Member) commented:

I'm still skeptical about AsciiBytes, although it would be nice if I could do the experiment over with your proposed promotions.

My problem is that nobody writes getattr(x, u"\u1234") (or getattr(x, "τ")). People have some code that computes a string into a variable, say s, and then call getattr(x, s). This totally works when s has type str, in both Python 2 and Python 3. If s has type unicode in Python 2, it may or may not work, and this is a problem. But I'm not sure any more if this is actually the problem we're trying to solve. From a tiny bit of user research I've done, people want to write straddling code, and for that, "solving" the unicode problem in Python 2 (e.g. by using s.encode('utf8')) is actually counter-productive (since bytes are invalid in Python 3 for getattr()).

Now, I haven't completely worked out the best rules to use with my proposal, and maybe the rules should not allow silent propagation from unicode to str (only from str to unicode). But I'm still pretty skeptical that AsciiBytes is going to help much for writing more portable code, and I'm also skeptical that it would catch enough pure Python 2 issues without being annoying. (But I realize that right now it's annoying for a reason we can fix.)

@JukkaL (Collaborator, Author) commented Aug 12, 2016

getattr(x, 'literal', default) is pretty common, and it will have the same issue when using unicode_literals. The same issue also affects other C functions, and focusing on getattr may be counterproductive. For example, this works but is rejected by mypy:

from __future__ import unicode_literals
from typing import IO

def f(f: IO[bytes]) -> None:
    f.write('foo')  # this works for file objects, but mypy doesn't accept it

However, the code is questionable, as some binary file-like objects might not accept ascii unicode objects. This may be underspecified in the Python documentation.

Let's look at how the str / unicode / bytes proposal by @gvanrossum would work in various contexts, and what other implications it would have.

I assume that in cases like x = 'foo' the inferred type for x would be str. To get type bytes, we'd use x = b'foo' or x = 'foo' # type: bytes. Similarly, x = [''] would infer type List[str] for x.

For getattr to work with ascii unicode literal arguments, we could annotate the second argument as str, and it would allow any string-like objects to be used as the attribute name, including implicit unicode literals. This wouldn't generate false positives but it would also miss some errors. In general, all library functions that accept ascii unicode objects could be annotated with str instead of bytes (as it's simpler than Union[bytes, unicode]). This would make unicode_literals more practical at the cost of missing some type errors in Python 2 mode.
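For example, a stub following that convention might look like this (a sketch, not the actual typeshed signature):

```python
from typing import Any

def getattr(o: object, name: str, default: Any = ...) -> Any: ...
# Under the proposed rules, both 'foo' and an implicit unicode literal
# (under unicode_literals) are accepted for name, since unicode is
# compatible with str; a non-ascii unicode name would still slip through.
```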

Subtyping would no longer be transitive -- bytes is a subtype of str and str is a subtype of unicode, but bytes is not a subtype of unicode. As type checking will usually use an is-compatible-with relation where things involving Any aren't transitive anyway, this probably isn't a blocker, but we need to be careful about it. When calculating joins and meets, we don't want to use a two-way compatibility relation, as otherwise join(str, unicode) would be ambiguous -- the result should be unicode. The three-way join of str, bytes and unicode should be object (or perhaps Sized).
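A small illustration of what those join rules would mean for inference (assuming some boolean cond):

```python
cond = True  # some runtime condition

x = 'a' if cond else u'b'   # join(str, unicode) -> x inferred as unicode
y = [b'a', 'b', u'c']       # join(bytes, str, unicode) -> element type object (or Sized)
```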

Library functions that accept List[str] or List[unicode] (or other similar invariant collections) currently would perhaps have to be written in terms of a type variable, ranging over either (str, bytes), (str, unicode) or (str, bytes, unicode), at least if the list won't be modified by the function. This example illustrates why:

def f1(x: List[str]) -> None: ...   # library function in a stub
def f2(x: List[bytes]) -> None: ...  # ditto

x = ['']
f1(x)  # ok
f2(x)  # error
y = [b'']
f1(y)  # error
f2(y)  # ok

All the above calls should arguably be okay if the functions don't modify the arguments. The issue is less relevant for user code, since we can assume that users consistently use either List[str] or List[bytes] most of the time, depending on whether the code is written as 'unicode-clean' with respect to type checking. However, when gradually migrating from unsafely annotated code that uses str a lot to safer code that uses bytes and unicode, the same issue may arise at safe/unsafe boundaries. If there are no type variables for either (str, bytes) or (str, unicode) in typing, users would have to define them themselves. For Python 2/3 compatible code it could look like this:

from typing import TypeVar
import sys

if sys.version_info[0] > 2:
    AnyText = str
else:
    AnyText = TypeVar('AnyText', str, unicode)

def f(x: List[AnyText]) -> None: ...

Of course, we could just add the definitions (named AnyText and AnyBytes, perhaps) to typing. Maybe a type variable that ranges over (str, bytes, unicode) in Python 2 but is aliased to just str in Python 3 would also be useful.

It's a little unclear what to do with isinstance if the argument is a str object. For example:

def f(x: Union[str, int]) -> int:
    if isinstance(x, str):
        return 0  # x is str here? or bytes?
    else:
        return 1  # x is int here? likely not Union[unicode, int]

Another example:

def f(x: Union[str, int]) -> int:
    if isinstance(x, bytes):
        return 0  # x is bytes here?
    else:
        return 1  # x is int here? or Union[unicode, int]?

Yet another:

def f(x: Union[str, int]) -> int:
    if isinstance(x, unicode):
        return 0  # x is unicode here? or is this considered unreachable?
    else:
        return 1  # x is Union[str, int] here? or Union[bytes, int]?

AnyStr would range over str, bytes and unicode. It looks like this wouldn't cause many problems.

str would support both encode and decode, and the return types would be bytes and unicode, respectively. unicode would only support encode and bytes would only support decode.
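In stub form, that would look roughly like this (abbreviated to the relevant methods only):

```python
class str:
    def encode(self, encoding: str = ...) -> bytes: ...
    def decode(self, encoding: str = ...) -> unicode: ...

class unicode:
    def encode(self, encoding: str = ...) -> bytes: ...   # no decode

class bytes:
    def decode(self, encoding: str = ...) -> unicode: ...  # no encode
```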

@gvanrossum (Member) commented:

We talked about this offline for a while. We haven't decided yet, but it appears I had poorly explained my bytes/str/text proposal and Jukka's response was not so relevant. I want to move the discussion to python/typing#208 so I'll continue there.

@JukkaL (Collaborator, Author) commented Aug 9, 2017

Closing this issue since discussion has moved to python/typing#208 and the resolution is to keep the current behavior.

@JukkaL JukkaL closed this as completed Aug 9, 2017
n8henrie added a commit to n8henrie/typeshed that referenced this issue Jan 9, 2018. The commit message:
It seems that AnyStr has the main advantage of ensuring that multiple arguments (or argument(s) and return values) have the same type (i.e. `str` and `bytes` are both okay but shouldn't be mixed). However it makes it significantly more difficult to cast to a single type.

With a Union, something like:

```python
def foo(a: Union[bytes, Text]) -> Text:
    if isinstance(a, bytes):
        a = a.decode()
    return "Hello, " + a
```

works fine. With AnyStr, you have to define a new intermediate variable, something like:

```python
def foo(a: AnyStr) -> str:
    if isinstance(a, bytes):
        b = a.decode()
    else:
        b = a
    return "Hello, " + b
```

If there's no advantage to using AnyStr (since there's no other AnyStr argument or return value), I think the Union would be simpler, have no significant disadvantage, and there is [precedent for similar changes](python#1054).

> It's only the case that if there's exactly one parameter of type exactly AnyStr, and no other use of AnyStr in the signature, then Union[str, bytes] should be acceptable.

-  [gvanrossum](python#439 (comment))

Discussion: 

- python#1054
- python/mypy#1141
- python#439