-
Notifications
You must be signed in to change notification settings - Fork 4.4k
typehint fixes to DoOutputsTuple #10494
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
R: @robertwb could you take a look at this?
|
R: @udim -- had an idea about why Optional[X] to Union[X, None] might be working. (Because Union is not imported and potentially ignored by pytype.) |
I have a WIP PR (not yet pushed) that runs pytype on Beam. I finally got to pubsub.py and it indeed says |
Reverted the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
95% because I'm not 100% sure that only PCollections should be in there. |
Thank you. I will wait for them to comment. DoOutputsTuple bundles outputs of a ParDo, and I beleive that should be alway of PCollection type. |
Yeah, looks like it. (I just took another look at the ciode.)
…On Wed, Jan 8, 2020 at 2:52 PM Ahmet Altay ***@***.***> wrote:
This 95% LGTM. I also scanned the lint logs for what might be related
errors.
CC: @robertwb <https://github.com/robertwb> @chadrik
<https://github.com/chadrik> who might have more background/reasoning for
or against this change.
Thank you. I will wait for them to comment. DoOutputsTuple bundles outputs
of a ParDo, and I beleive that should be alway of PCollection type.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#10494?email_source=notifications&email_token=AAA7FT54KV3NIBDNSAENOX3Q4ZKJVA5CNFSM4KCIQNU2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEIOI24Q#issuecomment-572296562>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAA7FTY3RI3WP6PAA5665N3Q4ZKJVANCNFSM4KCIQNUQ>
.
|
Hey folks, I'm really glad to see the interest in type hints, but this kind of change is going to make my life a lot more difficult. I have gotten the Beam source fully passing in mypy in my other PR. This took me at least half a year to complete, and I've been patiently waiting and following up for the last 6 months (#9056 was created on July 12th) to get all of that reviewed and merged. AFAICT this is just going to break that. I would prefer that we get my other PRs completed before we merge this. I realize that my PRs are pretty big, but there's a good reason for that: typing is a balancing act and it is very difficult to judge the correctness of a few isolated type hints -- you have to reach a critical mass of reinforcing hints across the code base. So, since I did manage to achieve that critical mass in my PR, if the changes in this PR are different from my final solution (I can't recall if this code gets touched by the next typing PR) there's a good chance that this is not correct, at least not for mypy. The current PR is here: #10367 The master PR is here: #9056 |
@chadrik - I do not wish to set back your progress. However, there are downstream Beam users that are running into issues due to the typehint changes. Specific example here is that, users that have typechecks enabled in their code that accepts results of DoOutputsTuple and checking that the types are the more accurate PCollection rather than the PValue. If this PR is proposing an invalid change and outputs could actually be PValue we can drop this change and I can work with downstream consumers to fix their code. However, if this is a valid change I would rather go forward with it and support our users. IMO we need to prioritize our users. Unfortunately this might be the cost of making this change in master and available to end users in phases. I, @udim, and @robertwb will be happy to continue supporting you. In addition to reviews, if it makes we can also take some of the burden of making the actual changes. |
Are the Beam users using mypy or pytype? IIUC, Beam type hints should not get picked up by any type checker that properly supports PEP 561 because we have not added a From the PEP:
Maybe I'm misunderstanding this, but I thought that preventing this type of problem was part of the motivation behind PEP 561.
In order to know if this change is valid, I -- or someone else -- will need to look through the remaining changes in my PRs to see how I addressed this. Unfortunately, I'm pretty busy at the moment. I certainly understand the need to support users, and I accept that this change may need to go in ASAP, but I would really really love some more support getting my PRs through. |
I am aware of users using pytype. I do not know about the whole exact setup. Presumably they may have their own py.typed files and linking with Beam at source level (speculating, not sure.)
That is fair. @robertwb and @udim can review for this.
Thank you. What would be particularly helpful? Faster reviews? Splitting some work? |
sdks/python/apache_beam/pvalue.py
Outdated
@@ -308,7 +308,7 @@ def __getitem__(self, tag): | |||
self.producer.add_output(pcoll, tag) | |||
else: | |||
# Main output is output of inner ParDo. | |||
pcoll = self.producer.parts[0].outputs[None] | |||
pcoll = self.producer.parts[0].outputs[None] # type: PCollection |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't seem safe to me. We should not be promoting the type from PValue
to a PCollection
unless we can actually guarantee that it will always be a PCollection
here. As far as I can tell, it may actually be PValue
, but correct me if I'm wrong.
Here's how I came to that conclusion.
The code where AppliedPTransform.outputs
is assigned values seems to indicate it should be a mapping of PValue
:
def add_output(self,
output, # type: Union[pvalue.DoOutputsTuple, pvalue.PValue]
tag=None # type: Union[str, int, None]
):
# type: (...) -> None
if isinstance(output, pvalue.DoOutputsTuple):
self.add_output(output[output._main_tag])
elif isinstance(output, pvalue.PValue):
# TODO(BEAM-1833): Require tags when calling this method.
if tag is None and None in self.outputs:
tag = len(self.outputs)
assert tag not in self.outputs
self.outputs[tag] = output
elif isinstance(output, dict):
for output_tag, out in output.items():
self.add_output(out, tag=output_tag)
else:
raise TypeError("Unexpected output type: %s" % output)
As a result, I have typed AppliedPTransform.outputs
as Dict[Union[str, int, None], pvalue.PValue]
. If it's the case that AppliedPTransform.outputs
only holds PCollections
then we should enforce that in the isinstance
check above (and elsewhere) and in the typing for AppliedPTransform
.
Also note that the trailing # type: PCollection
used here has no effect in mypy, and in fact it generates an error. In mypy, you can only use this the first time that a variable is defined (e.g. for pcoll
that's line 300 above this), and you can only use it to broaden a type (as I did originally, going from PCollection
to PValue
). Narrowing the type is basically disregarding/overriding the inspected type, so that requires using typing.cast()
.
This is why I'm really wary of supporting pytype.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If everyone is in agreement that AppliedPTransform.outputs
can only hold PCollections
then the typing for AppliedPTransform
should be fixed. Doing it this way -- simply overriding the type -- is a superficial fix that leaves the door open for this problem to happen elsewhere in the code where AppliedPTransform
is used (and as I mentioned, it doesn't actually work for mypy).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC, DoOutputsTuple could only follow a ParDo and its outputs should be PCollection's. However this contract is not really captured in the code. You are right that AppliedPTransform.outputs could have types of PValue, but this should not be the case for DoOutputsTuple.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
understood. If we're confident about that contract I think the proper way to solve this would be something like this:
pval = self.producer.parts[0].outputs[None]
assert isinstance(pval, PCollection), "DoOutputsTuple should follow a ParDo"
pcoll = pval
If you want to be a bit more formal you could raise a TypeError
or the like.
That will naturally narrow the type of pcoll
. The reason for the temporary variable pval
is that you cannot assign of type PValue
to a variable already defined as PCollection
(above in the if tag is not None
block).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The types here are correct, self._pcolls
should be a map to PCollections. This PR looks like an improvement to me. Would you be OK with this going in (with the assertion to do the type restriction)?
(FWIW, it'd be nice to clean up the whole DoOutputsTuple thing, and how inputs and outputs are represented in general, but that's another issue.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, looks good to me.
py.typed files are created by maintainers and distributed as part of the package to indicate to type checkers whether the package should be included in the type analysis and if so, in what way. It appears that pytype does not support this yet: Can we add |
Faster reviews would be nice, but I understand that this is easier said than done, as everyone has their own tasks they're working on. I'm happy to adjust how I'm presenting the PRs for reviews, such as including fewer changes per PR. Feedback welcome! |
@boyuanzz added this comment to all py files for now. So this PR is not very urgent. We do not need to merge it immediately. |
I think making PRs smaller and spreading the load will help. @udim @robertwb @pabloem @tvalentyn myself, we could all be reviewers on PRs. Especially now we are over the initial phase and if we can get smaller PRs to review. |
@chadrik On the note of your typing changes in, yes, I'd like to work with you (and others) to finally get them in. The second pull request should be good to go, it might be worth following up on the list what the status is and what remains (as I, for one, have lost context over the holidays). |
@robertwb Thanks. Like I said, I completely understand that there's a lot on everyone's plate and that these PRs are both quite large and also non-urgent. Thanks again for your effort getting these through. Now that we've done the broad strokes I can start submitting multiple small PRs simultaneously. I'll post the links to those PRs on the list once they are ready. |
Updated this with @chadrik suggestion to assert that pval is of type PCollection |
Changed the output type for
DoOutputsTuple
from PValue to PCollection.Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
R: @username
).[BEAM-XXX] Fixes bug in ApproximateQuantiles
, where you replaceBEAM-XXX
with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.See the Contributor Guide for more tips on how to make review process smoother.
Post-Commit Tests Status (on master branch)
Pre-Commit Tests Status (on master branch)
See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.