8000 typehint fixes to DoOutputsTuple by aaltay · Pull Request #10494 · apache/beam · GitHub
[go: up one dir, main page]

Skip to content

typehint fixes to DoOutputsTuple #10494

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Jan 22, 2020
Merged

typehint fixes to DoOutputsTuple #10494

merged 5 commits into from
Jan 22, 2020

Conversation

aaltay
Copy link
Member
@aaltay aaltay commented Jan 3, 2020

Changed the output type for DoOutputsTuple from PValue to PCollection.


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

Post-Commit Tests Status (on master branch)

Lang SDK Apex Dataflow Flink Gearpump Samza Spark
Go Build Status --- --- Build Status --- --- Build Status
Java Build Status Build Status Build Status Build Status
Build Status
Build Status
Build Status Build Status Build Status
Build Status
Build Status
Python Build Status
Build Status
Build Status
Build Status
--- Build Status
Build Status
Build Status
Build Status
--- --- Build Status
XLang --- --- --- Build Status --- --- ---

Pre-Commit Tests Status (on master branch)

--- Java Python Go Website
Non-portable Build Status Build Status
Build Status
Build Status Build Status
Portable --- Build Status --- ---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

@aaltay
Copy link
Member Author
aaltay commented Jan 3, 2020

R: @robertwb could you take a look at this?

  • Change to DoOutputsTuple make sense because the bundled outputs of a Flatmap or ParDo should be a PCollection not a PValue.
  • Other changes from Optional[X] to Uniuon[X, None] are a bit puzzling. They are supposed to be equivalents but, I noticed some downstream type checkers work with the latter but fail with the former. Still investigating this.

@aaltay
Copy link
Member Author
aaltay commented Jan 7, 2020

R: @udim -- had an idea about why Optional[X] to Union[X, None] might be working. (Because Union is not imported and potentially ignored by pytype.)

8000

@udim
Copy link
Member
udim commented Jan 8, 2020

R: @udim -- had an idea about why Optional[X] to Union[X, None] might be working. (Because Union is not imported and potentially ignored by pytype.)

I have a WIP PR (not yet pushed) that runs pytype on Beam. I finally got to pubsub.py and it indeed says Name 'Union' is not defined on the changes you made on line 146.

@aaltay
Copy link
Member Author
aaltay commented Jan 8, 2020

Reverted the Optional changes. Could you review the remaining changes?

@aaltay aaltay changed the title [WIP] typehint fixes typehint fixes to DoOutputsTuple Jan 8, 2020
@aaltay aaltay requested a review from udim January 8, 2020 19:15
Copy link
Member
@udim udim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This 95% LGTM. I also scanned the lint logs for what might be related errors.

CC: @robertwb @chadrik who might have more background/reasoning for or against this change.

@udim
Copy link
Member
udim commented Jan 8, 2020

95% because I'm not 100% sure that only PCollections should be in there.

@aaltay
Copy link
Member Author
aaltay commented Jan 8, 2020

This 95% LGTM. I also scanned the lint logs for what might be related errors.

CC: @robertwb @chadrik who might have more background/reasoning for or against this change.

Thank you. I will wait for them to comment. DoOutputsTuple bundles outputs of a ParDo, and I beleive that should be alway of PCollection type.

@udim
Copy link
Member
udim commented Jan 8, 2020 via email

@chadrik
Copy link
Contributor
chadrik commented Jan 8, 2020

Hey folks, I'm really glad to see the interest in type hints, but this kind of change is going to make my life a lot more difficult. I have gotten the Beam source fully passing in mypy in my other PR. This took me at least half a year to complete, and I've been patiently waiting and following up for the last 6 months (#9056 was created on July 12th) to get all of that reviewed and merged. AFAICT this is just going to break that.

I would prefer that we get my other PRs completed before we merge this. I realize that my PRs are pretty big, but there's a good reason for that: typing is a balancing act and it is very difficult to judge the correctness of a few isolated type hints -- you have to reach a critical mass of reinforcing hints across the code base. So, since I did manage to achieve that critical mass in my PR, if the changes in this PR are different from my final solution (I can't recall if this code gets touched by the next typing PR) there's a good chance that this is not correct, at least not for mypy.

The current PR is here: #10367

The master PR is here: #9056

@aaltay
Copy link
Member Author
aaltay commented Jan 8, 2020

@chadrik - I do not wish to set back your progress. However, there are downstream Beam users that are running into issues due to the typehint changes. Specific example here is that, users that have typechecks enabled in their code that accepts results of DoOutputsTuple and checking that the types are the more accurate PCollection rather than the PValue.

If this PR is proposing an invalid change and outputs could actually be PValue we can drop this change and I can work with downstream consumers to fix their code. However, if this is a valid change I would rather go forward with it and support our users.

IMO we need to prioritize our users. Unfortunately this might be the cost of making this change in master and available to end users in phases. I, @udim, and @robertwb will be happy to continue supporting you. In addition to reviews, if it makes we can also take some of the burden of making the actual changes.

@chadrik
Copy link
Contributor
chadrik commented Jan 8, 2020

I do not wish to set back your progress. However, there are downstream Beam users that are running into issues due to the typehint changes. Specific example here is that, users that have typechecks enabled in their code that accepts results of DoOutputsTuple and checking that the types are the more accurate PCollection rather than the PValue.

Are the Beam users using mypy or pytype?

IIUC, Beam type hints should not get picked up by any type checker that properly supports PEP 561 because we have not added a py.typed to the source.

From the PEP:

Package maintainers who wish to support type checking of their code MUST add a marker file named py.typed to their package supporting typing.

Maybe I'm misunderstanding this, but I thought that preventing this type of problem was part of the motivation behind PEP 561.

If this PR is proposing an invalid change and outputs could actually be PValue we can drop this change and I can work with downstream consumers to fix their code. However, if this is a valid change I would rather go forward with it and support our users.

In order to know if this change is valid, I -- or someone else -- will need to look through the remaining changes in my PRs to see how I addressed this. Unfortunately, I'm pretty busy at the moment.

I certainly understand the need to support users, and I accept that this change may need to go in ASAP, but I would really really love some more support getting my PRs through.

@aaltay
Copy link
Member Author
aaltay commented Jan 9, 2020

I do not wish to set back your progress. However, there are downstream Beam users that are running into issues due to the typehint changes. Specific example here is that, users that have typechecks enabled in their code that accepts results of DoOutputsTuple and checking that the types are the more accurate PCollection rather than the PValue.

Are the Beam users using mypy or pytype?

I am aware of users using pytype. I do not know about the whole exact setup. Presumably they may have their own py.typed files and linking with Beam at source level (speculating, not sure.)

IIUC, Beam type hints should not get picked up by any type checker that properly supports PEP 561 because we have not added a py.typed to the source.

From the PEP:

Package maintainers who wish to support type checking of their code MUST add a marker file named py.typed to their package supporting typing.

Maybe I'm misunderstanding this, but I thought that preventing this type of problem was part of the motivation behind PEP 561.

If this PR is proposing an invalid change and outputs could actually be PValue we can drop this change and I can work with downstream consumers to fix their code. However, if this is a valid change I would rather go forward with it and support our users.

In order to know if this change is valid, I -- or someone else -- will need to look through the remaining changes in my PRs to see how I addressed this. Unfortunately, I'm pretty busy at the moment.

That is fair. @robertwb and @udim can review for this.

I certainly understand the need to support users, and I accept that this change may need to go in ASAP, but I would really really love some more support getting my PRs through.

Thank you. What would be particularly helpful? Faster reviews? Splitting some work?

@@ -308,7 +308,7 @@ def __getitem__(self, tag):
self.producer.add_output(pcoll, tag)
else:
# Main output is output of inner ParDo.
pcoll = self.producer.parts[0].outputs[None]
pcoll = self.producer.parts[0].outputs[None] # type: PCollection
Copy link
Contributor
@chadrik chadrik Jan 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem safe to me. We should not be promoting the type from PValue to a PCollection unless we can actually guarantee that it will always be a PCollection here. As far as I can tell, it may actually be PValue, but correct me if I'm wrong.

Here's how I came to that conclusion.

The code where AppliedPTransform.outputs is assigned values seems to indicate it should be a mapping of PValue:

  def add_output(self,
                 output,  # type: Union[pvalue.DoOutputsTuple, pvalue.PValue]
                 tag=None  # type: Union[str, int, None]
                ):
    # type: (...) -> None
    if isinstance(output, pvalue.DoOutputsTuple):
      self.add_output(output[output._main_tag])
    elif isinstance(output, pvalue.PValue):
      # TODO(BEAM-1833): Require tags when calling this method.
      if tag is None and None in self.outputs:
        tag = len(self.outputs)
      assert tag not in self.outputs
      self.outputs[tag] = output
    elif isinstance(output, dict):
      for output_tag, out in output.items():
        self.add_output(out, tag=output_tag)
    else:
      raise TypeError("Unexpected output type: %s" % output)

As a result, I have typed AppliedPTransform.outputs as Dict[Union[str, int, None], pvalue.PValue]. If it's the case that AppliedPTransform.outputs only holds PCollections then we should enforce that in the isinstance check above (and elsewhere) and in the typing for AppliedPTransform.


Also note that the trailing # type: PCollection used here has no effect in mypy, and in fact it generates an error. In mypy, you can only use this the first time that a variable is defined (e.g. for pcoll that's line 300 above this), and you can only use it to broaden a type (as I did originally, going from PCollection to PValue). Narrowing the type is basically disregarding/overriding the inspected type, so that requires using typing.cast().

This is why I'm really wary of supporting pytype.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If everyone is in agreement that AppliedPTransform.outputs can only hold PCollections then the typing for AppliedPTransform should be fixed. Doing it this way -- simply overriding the type -- is a superficial fix that leaves the door open for this problem to happen elsewhere in the code where AppliedPTransform is used (and as I mentioned, it doesn't actually work for mypy).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, DoOutputsTuple could only follow a ParDo and its outputs should be PCollection's. However this contract is not really captured in the code. You are right that AppliedPTransform.outputs could have types of PValue, but this should not be the case for DoOutputsTuple.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

understood. If we're confident about that contract I think the proper way to solve this would be something like this:

      pval = self.producer.parts[0].outputs[None]
      assert isinstance(pval, PCollection), "DoOutputsTuple should follow a ParDo"
      pcoll = pval

If you want to be a bit more formal you could raise a TypeError or the like.

That will naturally narrow the type of pcoll. The reason for the temporary variable pval is that you cannot assign of type PValue to a variable already defined as PCollection (above in the if tag is not None block).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The types here are correct, self._pcolls should be a map to PCollections. This PR looks like an improvement to me. Would you be OK with this going in (with the assertion to do the type restriction)?

(FWIW, it'd be nice to clean up the whole DoOutputsTuple thing, and how inputs and outputs are represented in general, but that's another issue.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, looks good to me.

@chadrik
Copy link
Contributor
chadrik commented Jan 9, 2020

I am aware of users using pytype. I do not know about the whole exact setup. Presumably they may have their own py.typed files and linking with Beam at source level (speculating, not sure.)

py.typed files are created by maintainers and distributed as part of the package to indicate to type checkers whether the package should be included in the type analysis and if so, in what way.

It appears that pytype does not support this yet:

google/pytype#151

Can we add # pytype: skip-file to this file?


67E6
@chadrik
Copy link
Contributor
chadrik commented Jan 9, 2020

Thank you. What would be particularly helpful? Faster reviews? Splitting some work?

Faster reviews would be nice, but I understand that this is easier said than done, as everyone has their own tasks they're working on. I'm happy to adjust how I'm presenting the PRs for reviews, such as including fewer changes per PR. Feedback welcome!

@aaltay
Copy link
Member Author
aaltay commented Jan 10, 2020

I am aware of users using pytype. I do not know about the whole exact setup. Presumably they may have their own py.typed files and linking with Beam at source level (speculating, not sure.)

py.typed files are created by maintainers and distributed as part of the package to indicate to type checkers whether the package should be included in the type analysis and if so, in what way.

It appears that pytype does not support this yet:

google/pytype#151

Can we add # pytype: skip-file to this file?

@boyuanzz added this comment to all py files for now. So this PR is not very urgent. We do not need to merge it immediately.

@aaltay
Copy link
Member Author
aaltay commented Jan 10, 2020

Thank you. What would be particularly helpful? Faster reviews? Splitting some work?

Faster reviews would be nice, but I understand that this is easier said than done, as everyone has their own tasks they're working on. I'm happy to adjust how I'm presenting the PRs for reviews, such as including fewer changes per PR. Feedback welcome!

I think making PRs smaller and spreading the load will help. @udim @robertwb @pabloem @tvalentyn myself, we could all be reviewers on PRs. Especially now we are over the initial phase and if we can get smaller PRs to review.

@robertwb
Copy link
Contributor

@chadrik On the note of your typing changes in, yes, I'd like to work with you (and others) to finally get them in. The second pull request should be good to go, it might be worth following up on the list what the status is and what remains (as I, for one, have lost context over the holidays).

@chadrik
Copy link
Contributor
chadrik commented Jan 11, 2020

@robertwb Thanks. Like I said, I completely understand that there's a lot on everyone's plate and that these PRs are both quite large and also non-urgent. Thanks again for your effort getting these through.

Now that we've done the broad strokes I can start submitting multiple small PRs simultaneously. I'll post the links to those PRs on the list once they are ready.

@aaltay
Copy link
Member Author
aaltay commented Jan 18, 2020

Updated this with @chadrik suggestion to assert that pval is of type PCollection

@aaltay aaltay merged commit 5656c0d into apache:master Jan 22, 2020
@aaltay aaltay deleted the typechanges branch December 20, 2022 16:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants
0