[DataPipe] Enforcing single valid iterator for IterDataPipes with single DataPipe as output by NivekT · Pull Request #70479 · pytorch/pytorch · GitHub

Closed
wants to merge 19 commits into from

Conversation

NivekT
Contributor
@NivekT NivekT commented Dec 29, 2021

Stack from ghstack:

This PR introduces the single iterator constraint for IterDataPipe, focusing on IterDataPipes that produce one single IterDataPipe as output. This constraint is necessary because having multiple iterators referencing the same IterDataPipe can lead to incoherent internal state (e.g. a shuffler may not produce sensible outputs), which prevents us from saving a snapshot of the DataPipe or implementing determinism mechanisms.
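To see why multiple iterators over shared state are problematic, consider this toy sketch (plain Python; `ToyPipe` is hypothetical and not the actual DataPipe implementation):

```python
class ToyPipe:
    """Toy stand-in for a pipe whose internal iterator state is shared."""

    def __init__(self, data):
        self._source = iter(data)  # a single shared iterator for all callers

    def __iter__(self):
        return self._source  # every iter() hands back the same shared state


pipe = ToyPipe(range(10))
it1 = iter(pipe)
it2 = iter(pipe)
a = next(it1)  # returns 0
b = next(it2)  # returns 1 -- it2 continues from it1's position instead of restarting
```

Neither iterator sees a coherent pass over the data, which is exactly the kind of state that cannot be snapshotted or replayed deterministically.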

Fixes part of pytorch/data#45

The expected behavior of a single iterator per DataPipe is described in detail within the linked issue. Please review the changes in torch/utils/data/datapipes/_typing.py to see if the implementation is sensible. The new test TestIterDataPipeSingletonConstraint in test_datapipe.py illustrates the behaviors that the constraint should impose. Please comment there if any behavior seems unexpected or unclear.

BC-breaking Note:

  • By imposing a single iterator constraint on IterDataPipe, each IterDataPipe can only have one active iterator at a time. The most recently created iterator is considered to be the active one; all previous ones are invalid. Attempting to use an invalidated iterator will result in an error. The code example below illustrates this behavior:

1.11.0 or before:

source_dp = IterableWrapper(range(10))
it1 = iter(source_dp)
list(it1)  # [0, 1, ..., 9]
it1 = iter(source_dp)
next(it1)  # 0
it2 = iter(source_dp)  
next(it2)  # returns 0
next(it1)  # returns 1

This PR:

source_dp = IterableWrapper(range(10))
it1 = iter(source_dp)
list(it1)  # [0, 1, ..., 9]
it1 = iter(source_dp)  # This doesn't raise any warning or error
next(it1)  # 0, works because it is a new iterator
it2 = iter(source_dp)
next(it2) # returns 0, invalidates `it1`
next(it1)  # This raises an error
  • A consequence of this change is that multiple downstream references to the same IterDataPipe are no longer allowed.

1.11.0 or before:

source_dp = IterableWrapper(range(10))
zip_dp = source_dp.zip(source_dp)
list(zip_dp)  # [(0, 0), ..., (9, 9)]

This PR:

source_dp = IterableWrapper(range(10))
zip_dp = source_dp.zip(source_dp)
list(zip_dp)  # This raises an error because there are multiple references to `source_dp`
  • The recommended workaround is to use .fork().
source_dp = IterableWrapper(range(10))
dp1, dp2 = source_dp.fork(2)
zip_dp = dp1.zip(dp2)
list(zip_dp)  # [(0, 0), ..., (9, 9)]
  • Note that this does NOT impact MapDataPipe, Dataset, or IterableDataset.
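One way such a constraint can be enforced is by stamping each iterator with an ID and invalidating older IDs whenever a new iterator is created. The sketch below is a hypothetical simplification (class and attribute names are illustrative); the actual logic lives in `torch/utils/data/datapipes/_typing.py`:

```python
class SingleIterPipe:
    """Sketch: each iter() bumps an ID; stale iterators detect the mismatch."""

    def __init__(self, data):
        self._data = data
        self._valid_iterator_id = 0

    def __iter__(self):
        self._valid_iterator_id += 1  # any previously created iterator is now stale
        return self._gen(self._valid_iterator_id)

    def _gen(self, my_id):
        for x in self._data:
            if my_id != self._valid_iterator_id:
                raise RuntimeError("This iterator has been invalidated")
            yield x


dp = SingleIterPipe(range(10))
it1 = iter(dp)
assert next(it1) == 0
it2 = iter(dp)           # creating it2 invalidates it1
assert next(it2) == 0    # the most recent iterator works and starts fresh
try:
    next(it1)            # the stale iterator raises
    error_seen = ""
except RuntimeError as e:
    error_seen = str(e)
```

This matches the BC-breaking behavior above: the newest iterator is the only valid one, and older iterators fail on their next `next()` call.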

Differential Revision: D33344609

@pytorch-probot
pytorch-probot bot commented Dec 29, 2021
CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/pytorch/pytorch/blob/57b5c09abf5d64222b3d9b3706720ad4d636ca30/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default

Workflows Labels (bold enabled) Status
Triggered Workflows
linux-bionic-py3.7-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/trunk ✅ triggered
linux-docs ciflow/all, ciflow/cpu, ciflow/default, ciflow/docs, ciflow/linux, ciflow/trunk ✅ triggered
linux-vulkan-bionic-py3.7-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk, ciflow/vulkan ✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7-bazel-test ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3-clang5-mobile-build ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3-clang5-mobile-custom-build-static ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3.7-clang7-asan ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers, ciflow/trunk ✅ triggered
linux-xenial-py3.7-clang7-onnx ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc7 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc7-no-ops ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
win-vs2019-cpu-py3 ciflow/all, ciflow/cpu, ciflow/default, ciflow/trunk, ciflow/win ✅ triggered
win-vs2019-cuda11.3-py3 ciflow/all, ciflow/cuda, ciflow/default, ciflow/trunk, ciflow/win ✅ triggered
Skipped Workflows
caffe2-linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
docker-builds ciflow/all, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64 ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64-coreml ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64-custom-ops ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64-full-jit ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64-metal ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-x86-64 ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-x86-64-coreml ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-x86-64-full-jit ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
libtorch-linux-xenial-cuda10.2-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk 🚫 skipped
libtorch-linux-xenial-cuda11.3-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk 🚫 skipped
linux-binary-conda ciflow/binaries, ciflow/binaries/conda 🚫 skipped
linux-binary-libtorch-cxx11-abi ciflow/binaries, ciflow/binaries/libtorch 🚫 skipped
linux-binary-libtorch-pre-cxx11 ciflow/binaries, ciflow/binaries/libtorch 🚫 skipped
linux-binary-manywheel ciflow/binaries, ciflow/binaries/wheel 🚫 skipped
linux-bionic-cuda10.2-py3.9-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow, ciflow/trunk 🚫 skipped
linux-bionic-py3.6-clang9 ciflow/xla 🚫 skipped
linux-docs-push ciflow/all, ciflow/cpu, ciflow/linux, ciflow/scheduled 🚫 skipped
linux-xenial-cuda11.3-py3.7-gcc7-no-ops ciflow/all, ciflow/cuda, ciflow/linux, ciflow/trunk 🚫 skipped
macos-10-15-py3-arm64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
macos-10-15-py3-lite-interpreter-x86-64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
macos-11-py3-x86-64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
parallelnative-linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-libtorch-linux-xenial-cuda11.1-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-bionic-cuda11.5-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck 🚫 skipped
periodic-linux-xenial-cuda11.1-py3.7-gcc7-debug ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-win-vs2019-cuda11.1-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win 🚫 skipped
periodic-win-vs2019-cuda11.5-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win 🚫 skipped
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build ciflow/all, ciflow/android, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped

You can add a comment to the PR and tag @pytorchbot with the following commands:
# ciflow rerun, "ciflow/default" will always be added automatically
@pytorchbot ciflow rerun

# ciflow rerun with additional labels "-l <ciflow/label_name>", which is equivalent to adding these labels manually and trigger the rerun
@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow

For more information, please take a look at the CI Flow Wiki.

@facebook-github-bot
Contributor
facebook-github-bot commented Dec 29, 2021

🔗 Helpful links

✅ No Failures (0 Pending)

As of commit db9862f (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

@NivekT NivekT added the module: data torch.utils.data label Dec 29, 2021
@NivekT
Contributor Author
NivekT commented Dec 29, 2021

@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

…erDataPipe"


Fixes part of pytorch/data#45

The expected behavior of single iterator per DataPipe is described in that linked issue. Please review the changes in `torch/utils/data/datapipes/_typing.py` to see if the implementation is sensible.

Differential Revision: [D33344609](https://our.internmc.facebook.com/intern/diff/D33344609)

[ghstack-poisoned]
@NivekT NivekT changed the title [DataPipe] Enforcing single iterator per IterableWrapperIterDataPipe [DataPipe] Enforcing single valid iterator for IterDataPipes with single DataPipe as output Apr 20, 2022
NivekT added 2 commits April 21, 2022 18:13
@NivekT NivekT marked this pull request as ready for review April 28, 2022 00:09
@NivekT
Contributor Author
NivekT commented Apr 28, 2022

@VitalyFedyunin @ejguan This PR and the next one in the stack are ready for you to have a look.

I expect some CI breakage since we probably re-used some DataPipes somewhere in our tests. I am in the process of going through and fixing those. I am also going through the domain use cases to see if anything breaks.

Comment on lines 170 to 175
it_dp2 = iter(dp2)  # This creates a 2nd iterator, hence invalidating `it_dp1` and `dp1`
self.assertEqual(5, it_dp2.get_value(5))  # type: ignore[attr-defined]
with self.assertRaisesRegex(RuntimeError, "A separate iterator has been created"):
    next(it_dp2)
with self.assertRaisesRegex(RuntimeError, "A separate iterator has been created"):
    self.assertEqual(list(range(10)), list(it_dp2))
Contributor Author

The necessary change to this test case here is worth having a closer look.

The basic issue was that, within our constraint logic, we incremented the counter even when `__iter__` returns `self`. This invalidates the source DataPipe `dp1`, such that calling `next` triggers an exception.

Given an instance of a DataPipe and an iterator ID, check if the IDs match, and if not, raise an exception.
"""
if hasattr(datapipe, "_is_child_datapipe") and datapipe._is_child_datapipe is True:
    pass  # TODO: Add logic for multiple ChildDataPipe
Contributor Author

This is added in the next PR within this stack

Comment on lines +492 to +493
# Note that if `__next__` and `__iter__` do something completely unrelated, it may cause issues, but
# the user would be violating the iterator protocol
Contributor Author

If `__next__` and `__iter__` do something completely unrelated, that would violate the iterator protocol and the behavior here may be unexpected. But as long as the user sticks with the iterator protocol or returns an iterator, everything should be fine.

Comment on lines 2203 to 2208
# Functional Test: Check that every `__iter__` call returns the same object
source_dp = _CustomIterDP_Self(range(10))
it1 = iter(source_dp)
self.assertEqual(0, next(it1))
self.assertEqual(1, next(source_dp))
it2 = iter(source_dp)
Contributor Author

As discussed offline, we are applying the single iterator constraint even when `__iter__` returns `self`. This means users can create at most one iterator from the object.
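That stricter rule can be sketched as follows (hypothetical `SelfIterDP`, standing in for the test's `_CustomIterDP_Self`; the real constraint is implemented in `_typing.py`):

```python
class SelfIterDP:
    """Sketch: __iter__ returns self, and only one iterator creation is allowed."""

    def __init__(self, data):
        self._it = iter(data)
        self._iter_created = False

    def __iter__(self):
        if self._iter_created:
            raise RuntimeError("A separate iterator has been created")
        self._iter_created = True
        return self

    def __next__(self):
        return next(self._it)


dp = SelfIterDP(range(5))
it1 = iter(dp)
assert next(it1) == 0
try:
    iter(dp)  # a second creation is rejected under this sketch
    second_allowed = True
except RuntimeError:
    second_allowed = False
```

Since `__iter__` returns `self`, the first iterator and the DataPipe share all state; the sketch simply refuses to hand out a second handle.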

@NivekT
Contributor Author
NivekT commented May 9, 2022

@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ejguan ejguan added the module: bc-breaking Related to a BC-breaking change label May 9, 2022
Contributor
@ejguan ejguan left a comment

Since this PR introduces BC-breaking changes, could you please add a detailed description of the behavior changes to the PR summary? It will save time when we write the release notes.

def __iter__(self):
    return self

def __next__(self):
Contributor

We also need a buggy case where `__iter__` returns a new object but the class also has `__next__`.

Contributor Author

I am adding a buggy case at the bottom. Let me know if it matches what you have in mind.

it2 = iter(source_dp)
with self.assertRaisesRegex(RuntimeError, "This iterator has been invalidated"):
    next(it1)
with self.assertRaisesRegex(RuntimeError, "This iterator has been invalidated"):
Contributor

Wait, why is it invalid?

Contributor

do we have at least one valid iterator after this?

Contributor Author
@NivekT NivekT May 13, 2022

In my original proposal, users are allowed to call `__iter__` as many times as they want. But after our offline discussion, we decided to apply the constraint on `__iter__` even when it may be returning `self`, because having multiple active iterators referencing the same underlying object can lead to confusion and bugs. Plus, the use case isn't really necessary when you can just reference `dp`. The code snippet below illustrates what the behavior would have been like if it were allowed:

dp = CustomDP(range(10))
next(dp)  # returns 0
it1 = iter(dp)  # returns `self`
next(it1)  # returns 1
it2 = iter(dp)  # returns `self`
next(it2)  # returns 2
next(it1)  # returns 3
next(dp)  # returns 4

In the current implementation, calling `next(dp)` raises an error if there is more than one iterator created from that DataPipe. This is because the ID that tracks whether the iterator is valid (i.e. `dp._valid_iterator_id`) is tied to the DataPipe (and because it returns `self`, the ID is tied to the iterator as well). I think it has to be this way; otherwise, when `next(dp)` is called, it has no idea which ID to check against.

source_dp = _CustomIterDP_Self(range(10))
it1 = iter(source_dp)
self.assertEqual(0, next(it1))
self.assertEqual(1, next(source_dp))
it2 = iter(source_dp)
with self.assertRaisesRegex(RuntimeError, "This iterator has been invalidated"):
    next(it1)
with self.assertRaisesRegex(RuntimeError, "This iterator has been invalidated"):
    next(it2)
with self.assertRaisesRegex(RuntimeError, "This iterator has been invalidated"):
    next(source_dp)
# In this test case, there is no valid iterator at the end, because `iter(it1)` delegates to `iter(dp)` anyway.

The simplest workaround for users is to re-define `dp` via `dp = CustomDP()`, or to avoid creating more than one iterator from `dp` by continuing to use `dp` as the variable rather than calling `iter(dp)`.
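The two workaround patterns can be sketched as follows (with a hypothetical `CustomDP` whose `__iter__` returns `self`, standing in for the DataPipe discussed above):

```python
class CustomDP:
    """Hypothetical pipe whose __iter__ returns self."""

    def __init__(self, data):
        self._source = data
        self._it = iter(data)

    def __iter__(self):
        self._it = iter(self._source)  # reset state and hand back self
        return self

    def __next__(self):
        return next(self._it)


# Pattern 1: re-define the pipe instead of calling iter() a second time
dp = CustomDP(range(3))
first = list(dp)
dp = CustomDP(range(3))  # fresh object, fresh iterator
second = list(dp)

# Pattern 2: keep calling next() on `dp` itself rather than creating iterators
dp = CustomDP(range(3))
x0 = next(dp)
x1 = next(dp)
```

Both patterns avoid ever holding two iterator handles to the same object, so the single-iterator constraint is never tripped.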

Contributor
@ejguan ejguan May 16, 2022

Here is the proposal you have:

  • For DataPipe returning self in __iter__: We disallow users to call iter multiple times.
  • For DataPipe using a generator function as __iter__: We are making sure a new iterator is created and previous iterator is invalidated whenever iter is called.

As we are making the iterator of a DataPipe a singleton (the same behavior), we are introducing a new difference in iterator behavior. I think this is not preferable.

I think, as we have a `reset` function to help a DataPipe reset its iterator, why don't we rely on `reset` to clean up all buffers, etc., for DataPipes returning `self` in `__iter__`? The `IteratorDecorator` and the datapipe itself should be able to track the `iterator_id` to invalidate the previous iterator.

Contributor Author

I can see how that can be confusing.

I think this proposal will be best. Given a DataPipe `dp` with `__next__`, and iterators `it1` and `it2`:

  1. Allow unlimited creation of iterators, but only the latest one is valid/active.
  2. Place no restriction on `next(dp)`, but check for invalidation when `next(it1)` is called.
    • We can't really place a restriction on `next(dp)`, because any call to `next(it2)` gets delegated to `next(dp)` within the `IteratorDecorator`.
source_dp = _CustomIterDP_Self(range(10))
it1 = iter(source_dp)
self.assertEqual(0, next(it1))
self.assertEqual(1, next(source_dp))
# Only invalidates `it1`, not `source_dp`, since methods of `it2` depend on `source_dp` remaining valid
it2 = iter(source_dp)
with self.assertRaisesRegex(RuntimeError, "This iterator has been invalidated"):
    next(it1)
self.assertEqual(0, next(it2))
self.assertEqual(1, next(source_dp))

Please examine this test for the full details:

pytorch/test/test_datapipe.py

Lines 2337 to 2390 in c77df3e

def test_iterdatapipe_singleton_self_next(self):
    r"""
    Testing for the case where IterDataPipe's `__iter__` returns `self` and there is a `__next__` method.
    Note that the following DataPipe is singleton by default (because `__iter__` returns `self`).
    """
    class _CustomIterDP_Self(IterDataPipe):
        def __init__(self, iterable):
            self.source = iterable
            self.iterable = iter(iterable)

        def __iter__(self):
            self.reset()
            return self

        def __next__(self):
            return next(self.iterable)

        def reset(self):
            self.iterable = iter(self.source)

    # Functional Test: Check that every `__iter__` call returns the same object
    source_dp = _CustomIterDP_Self(range(10))
    res = list(source_dp)
    it = iter(source_dp)
    self.assertEqual(res, list(it))

    # Functional Test: Check if invalidation logic is correct
    source_dp = _CustomIterDP_Self(range(10))
    it1 = iter(source_dp)
    self.assertEqual(0, next(it1))
    self.assertEqual(1, next(source_dp))
    # Only invalidates `it1`, not `source_dp`, since methods of `it2` depend on `source_dp` remaining valid
    it2 = iter(source_dp)
    with self.assertRaisesRegex(RuntimeError, "This iterator has been invalidated"):
        next(it1)
    self.assertEqual(0, next(it2))
    self.assertEqual(1, next(source_dp))

    # Functional Test: extend the test to a pipeline
    source_dp = _CustomIterDP_Self(dp.iter.IterableWrapper(range(10)).map(_fake_fn).filter(_fake_filter_fn))
    it1 = iter(source_dp)
    self.assertEqual(0, next(it1))
    self.assertEqual(1, next(source_dp))
    # Only invalidates `it1`, not `source_dp`, since methods of `it2` depend on `source_dp` remaining valid
    it2 = iter(source_dp)
    with self.assertRaisesRegex(RuntimeError, "This iterator has been invalidated"):
        next(it1)
    self.assertEqual(0, next(it2))
    self.assertEqual(1, next(source_dp))

    # Functional Test: multiple simultaneous references to the same DataPipe fails
    with self.assertRaisesRegex(RuntimeError, "This iterator has been invalidated"):
        for _ in zip(source_dp, source_dp):
            pass

@ejguan
Contributor
ejguan commented May 13, 2022

I suddenly realize there may be another potential problem. When we invalidate the previous iterator, we don't destroy it. That means the iterator object is never released from memory, except in the case where the DataPipe itself is the iterator.

@NivekT
Contributor Author
NivekT commented May 13, 2022

I suddenly realize there may be another potential problem. When we invalidate the previous iterator, we don't destroy it. That means the iterator object is never released from memory, except in the case where the DataPipe itself is the iterator.

The iterator object should get garbage collected if there are no longer any references to it. Correct?

dp = IterableWrapper(range(10))
it1 = iter(dp)
it2 = iter(dp)  # invalidates `it1`
# `it1` can now be gc'ed because there isn't any reference to it

@ejguan
Contributor
ejguan commented May 13, 2022

The iterator object should get garbage collected if there are no longer any references to it. Correct?

dp = IterableWrapper(range(10))
it1 = iter(dp)
it2 = iter(dp)  # invalidates `it1`
# `it1` can now be gc'ed because there isn't any reference to it

`it1` is a reference to the iterator. Until the user explicitly runs `del it1` or the process ends, the iterator object won't be gc'ed. It makes no sense for us to keep such a reference once we invalidate it.

If possible, we could use the wrapper class. Whenever the iterator is invalidated, we should remove the internal self.iterator to release the memory.
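The reference-counting point can be checked directly in CPython: once the last reference to an invalidated iterator is dropped, the object is freed immediately. A sketch with a hypothetical iterator class (custom classes support weak references, unlike many built-in iterators):

```python
import weakref


class MyIter:
    """Hypothetical iterator wrapper used only to observe collection."""

    def __init__(self, data):
        self._it = iter(data)

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._it)


it1 = MyIter(range(3))
ref = weakref.ref(it1)
assert ref() is not None   # the object is alive while `it1` refers to it
it1 = MyIter(range(3))     # rebinding drops the last reference to the old iterator
assert ref() is None       # CPython's refcounting collects it immediately
```

This illustrates both sides of the discussion: the object is collected as soon as the user's binding goes away, but if the framework itself keeps an internal reference (e.g. inside a wrapper), the memory is held until that reference is cleared.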

@NivekT
Contributor Author
NivekT commented May 13, 2022

@VitalyFedyunin @ejguan I have addressed the comments and added more test cases.

Let's discuss these two related points if they remain unclear/problematic:

#70479 (comment)
#70479 (comment)

NivekT added 2 commits May 13, 2022 19:16
Contributor
@ejguan ejguan left a comment

The behavior based on the test cases looks good to me. Could you please also update the docstring for IterDataPipe to indicate this singleton behavior?

And could you please import this PR to Phabricator? We don't need to land from internal, but it's good to validate that this won't break any internal logic.

@NivekT
Contributor Author
NivekT commented May 17, 2022

The behavior based on the test cases looks good to me. Could you please also update the docstring for IterDataPipe to indicate this singleton behavior?

And could you please import this PR to Phabricator? We don't need to land from internal, but it's good to validate that this won't break any internal logic.

I can definitely update the docstring. I tried importing yesterday and it didn't work, but I can try again.

@NivekT
Contributor Author
NivekT commented May 17, 2022

@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@b0noI
Contributor
b0noI commented May 19, 2022

@pytorchbot revert this, broke internal tests, escalated to Kevin Tse [ktse@fb.com]

@pytorchmergebot
Collaborator

Reverting PR 70479 failed due to Command git -C /home/runner/work/pytorch/pytorch revert --no-edit 7c52f204e0311437a57706cdb1358c26226a0b02 returned non-zero exit code 1

Auto-merging test/test_datapipe.py
CONFLICT (content): Merge conflict in test/test_datapipe.py
Auto-merging torch/utils/data/datapipes/_typing.py
CONFLICT (content): Merge conflict in torch/utils/data/datapipes/_typing.py
Auto-merging torch/utils/data/datapipes/datapipe.py
error: could not revert 7c52f204e0... [DataPipe] Enforcing single valid iterator for IterDataPipes without multiple outputs
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git revert --continue".
hint: You can instead skip this commit with "git revert --skip".
hint: To abort and get back to the state before "git revert",
hint: run "git revert --abort".

Raised by https://github.com/pytorch/pytorch/actions/runs/2353257470

@ejguan
Contributor
ejguan commented May 19, 2022

@b0noI If this PR is reverted, we might need to revert the whole stack.

If the PR only breaks internal TorchData tests, could we still land it?
cc: @NivekT

@b0noI
Contributor
b0noI commented May 19, 2022

Landing it internally; we will fix forward.

@NivekT
Contributor Author
NivekT commented May 19, 2022

Thank you!

facebook-github-bot pushed a commit that referenced this pull request May 19, 2022
…70479)

Summary: Pull Request resolved: #70479

Test Plan: Imported from OSS

Reviewed By: b0noI, cpuhrsch

Differential Revision: D33344609

fbshipit-source-id: c3eaee4b684890fc5dd1f0a2c6d04e718c236a7b
@facebook-github-bot facebook-github-bot deleted the gh/nivekt/42/head branch May 21, 2022 14:17
Comment on lines +384 to +385
msg = ("This iterator has been invalidated because another iterator has been created"
       f"from the same IterDataPipe: {_generate_iterdatapipe_msg(datapipe)}\n"
Contributor

nit: There is a missing space between created and from in the Error message. cc: @NivekT

Labels
cla signed · module: bc-breaking (Related to a BC-breaking change) · module: data (torch.utils.data) · release notes: dataloader · topic: new features
7 participants