[DataPipe] Enforcing single valid iterator for IterDataPipes with single DataPipe as output #70479
Conversation
@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
…erDataPipe" Differential Revision: [D33344609](https://our.internmc.facebook.com/intern/diff/D33344609) [ghstack-poisoned]
…erDataPipe" Differential Revision: [D33344609](https://our.internmc.facebook.com/intern/diff/D33344609) [ghstack-poisoned]
…erDataPipe" Differential Revision: [D33344609](https://our.internmc.facebook.com/intern/diff/D33344609) [ghstack-poisoned]
…erDataPipe" Differential Revision: [D33344609](https://our.internmc.facebook.com/intern/diff/D33344609) [ghstack-poisoned]
…erDataPipe" Fixes part of pytorch/data#45 The expected behavior of single iterator per DataPipe is described in that linked issue. Please review the changes in `torch/utils/data/datapipes/_typing.py` to see if the implementation is sensible. Differential Revision: [D33344609](https://our.internmc.facebook.com/intern/diff/D33344609) [ghstack-poisoned]
@VitalyFedyunin @ejguan This PR and the next one in the stack are ready for you to have a look. I expect some CI breakage since we probably re-used some DataPipes somewhere in our tests. I am in the process of going through and fixing those. I am also going through the domain use cases to see if anything breaks.
test/test_profiler.py
Outdated

```python
it_dp2 = iter(dp2)  # This creates a 2nd iterator, hence invalidating `it_dp1` and `dp1`
self.assertEqual(5, it_dp2.get_value(5))  # type: ignore[attr-defined]
with self.assertRaisesRegex(RuntimeError, "A separate iterator has been created"):
    next(it_dp2)
with self.assertRaisesRegex(RuntimeError, "A separate iterator has been created"):
    self.assertEqual(list(range(10)), list(it_dp2))
```
The change to this test case is worth a closer look. The basic issue is that, within our constraint logic, we decided to increment the counter even when `__iter__` returns `self`, and this invalidates the source DataPipe `dp1`, such that calling `next` will trigger an exception.
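That invalidation mechanism can be sketched roughly as follows. This is a simplified stand-in, not the actual `torch/utils/data` code; `_SketchDataPipe` and `_SketchIterator` are hypothetical names, while `_valid_iterator_id` is the attribute name used elsewhere in this PR's discussion:

```python
# Sketch: every call to `iter()` bumps a counter on the DataPipe, and
# each iterator remembers the counter value it was created with.
class _SketchDataPipe:
    def __init__(self, iterable):
        self.source = iterable
        self._valid_iterator_id = None  # attribute name per the PR discussion

    def __iter__(self):
        # Incremented even when conceptually "returning self": this is
        # what invalidates any previously created iterator.
        self._valid_iterator_id = (self._valid_iterator_id or 0) + 1
        return _SketchIterator(self, self._valid_iterator_id, iter(self.source))

class _SketchIterator:
    def __init__(self, datapipe, iterator_id, inner):
        self.datapipe = datapipe
        self.iterator_id = iterator_id
        self.inner = inner

    def __iter__(self):
        return self

    def __next__(self):
        # Stale ID means another iterator was created after this one.
        if self.iterator_id != self.datapipe._valid_iterator_id:
            raise RuntimeError("This iterator has been invalidated")
        return next(self.inner)

dp1 = _SketchDataPipe(range(3))
it_a = iter(dp1)
assert next(it_a) == 0
it_b = iter(dp1)  # bumps the counter, invalidating `it_a`
raised = False
try:
    next(it_a)
except RuntimeError:
    raised = True
assert raised
assert next(it_b) == 0  # the newest iterator remains usable
```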
```python
    Given an instance of a DataPipe and an iterator ID, check if the IDs match, and if not, raise an exception.
    """
    if hasattr(datapipe, "_is_child_datapipe") and datapipe._is_child_datapipe is True:
        pass  # TODO: Add logic for multiple ChildDataPipe
```
This is added in the next PR within this stack
```python
# Note that if `__next__` and `__iter__` do something completely unrelated, it may cause issues,
# but the user would be violating the iterator protocol
```
If `__next__` and `__iter__` do something completely unrelated, that would violate the iterator protocol and the behavior here may be unexpected. But as long as the user sticks with the iterator protocol or returns an iterator, everything should be fine.
test/test_datapipe.py
Outdated

```python
# Functional Test: Check that every `__iter__` call returns the same object
source_dp = _CustomIterDP_Self(range(10))
it1 = iter(source_dp)
self.assertEqual(0, next(it1))
self.assertEqual(1, next(source_dp))
it2 = iter(source_dp)
```
As discussed offline, we are applying the single iterator constraint even when `__iter__` returns `self`. This means users can create at most one iterator from the object.
Since this PR introduces BC-breaking changes, could you please add a detailed description of the behavior changes to the PR summary? It will save time when we write the release notes.
…es with single DataPipe as output" This PR introduces the single iterator constraint for `IterDataPipe`, focusing on `IterDataPipe` that produces one single `IterDataPipe` as output. This constraint is necessary because having multiple iterators referencing the same `IterDataPipe` can lead to incoherent internal state (e.g. `shuffler` may not produce sensible outputs), such that we cannot save a snapshot of the DataPipe or have determinism mechanisms. Fixes part of pytorch/data#45. The expected behavior of a single iterator per DataPipe is described in detail within the linked issue. Please review the changes in `torch/utils/data/datapipes/_typing.py` to see if the implementation is sensible. The new test `TestIterDataPipeSingletonConstraint` in `test_datapipe.py` illustrates the behaviors that the constraint should impose. Please comment there if any behavior seems unexpected/unclear. Differential Revision: [D33344609](https://our.internmc.facebook.com/intern/diff/D33344609)
```python
def __iter__(self):
    return self

def __next__(self):
```
We also need a buggy case where `__iter__` returns a new object, but the class also has `__next__`.
I am adding a buggy case at the bottom. Let me know if it matches what you have in mind.
```python
it2 = iter(source_dp)
with self.assertRaisesRegex(RuntimeError, "This iterator has been invalidated"):
    next(it1)
with self.assertRaisesRegex(RuntimeError, "This iterator has been invalidated"):
```
Wait, why is it invalid?
Do we have at least one valid iterator after this?
In my original proposal, users are allowed to call `__iter__` as many times as they want. But after our offline discussion, we decided to apply the constraint on `__iter__` even when it may be returning `self`. This is because having multiple active iterators referencing the same underlying object can lead to confusion and bugs. Plus, the use case isn't really necessary when you can just reference `dp`. The code snippet below illustrates what the behavior would have been like if it were allowed:
```python
dp = CustomDP(range(10))
next(dp)        # returns 0
it1 = iter(dp)  # returns `self`
next(it1)       # returns 1
it2 = iter(dp)  # returns `self`
next(it2)       # returns 2
next(it1)       # returns 3
next(dp)        # returns 4
```
In the current implementation, calling `next(dp)` raises an error if more than one iterator has been created from that DataPipe. This is because the ID that tracks whether the iterator is valid (i.e. `dp._valid_iterator_id`) is tied to the DataPipe (and because `__iter__` returns `self`, the ID is tied to the iterator as well). I think it has to be this way; otherwise, when `next(dp)` is called, it has no idea which ID to check against.
```python
source_dp = _CustomIterDP_Self(range(10))
it1 = iter(source_dp)
self.assertEqual(0, next(it1))
self.assertEqual(1, next(source_dp))
it2 = iter(source_dp)
with self.assertRaisesRegex(RuntimeError, "This iterator has been invalidated"):
    next(it1)
with self.assertRaisesRegex(RuntimeError, "This iterator has been invalidated"):
    next(it2)
with self.assertRaisesRegex(RuntimeError, "This iterator has been invalidated"):
    next(source_dp)
# In this test case, there is no valid iterator at the end, because `iter(it1)` delegates to `iter(dp)` anyway.
```
The simplest workaround for users is to re-define `dp` via `dp = CustomDP()`, or to avoid creating more than one iterator from `dp` by keeping `dp` as the variable rather than calling `iter(dp)`.
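The re-creation workaround can be sketched like this, with a hypothetical `CustomDP` (a plain class standing in for a real DataPipe; in a real pipeline, re-creating the object gives you a fresh instance and therefore a fresh single-iterator allowance):

```python
# Hypothetical iterable-wrapping DataPipe used only for illustration.
class CustomDP:
    def __init__(self, iterable=range(10)):
        self.source = iterable

    def __iter__(self):
        return iter(self.source)

dp = CustomDP()
first = list(dp)
# Workaround: rebind the name to a brand-new object instead of calling
# `iter(dp)` a second time on the old one.
dp = CustomDP()
second = list(dp)
assert first == second == list(range(10))
```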
Here are the proposals you have:

- For a `DataPipe` returning `self` in `__iter__`: we disallow users from calling `iter` multiple times.
- For a `DataPipe` using a generator function as `__iter__`: we make sure a new iterator is created and the previous iterator is invalidated whenever `iter` is called.
As we are making the iterator of `DataPipe` a singleton (the same behavior), we are introducing a new difference in iterator behavior. I think this is not preferable. Since we have a `reset` function to help a `DataPipe` reset its iterator, why don't we rely on `reset` to clean up all buffers, etc., for a `DataPipe` returning `self` in `__iter__`? Then the `IteratorDecorator` and the `datapipe` itself should be able to track the `iterator_id` to invalidate the previous iterator.
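A rough sketch of this `reset`-based alternative (`_ResetPipe` and `_Decorator` are invented names standing in for the real DataPipe machinery and `IteratorDecorator`; only the idea is taken from the comment above):

```python
# Every `iter()` call resets buffers via `reset()` and bumps an id,
# so the previous iterator is invalidated while the new one starts fresh.
class _ResetPipe:
    def __init__(self, iterable):
        self.source = iterable
        self._valid_iterator_id = 0

    def reset(self):
        # Clean up buffers / internal state on re-iteration.
        self._inner = iter(self.source)

    def __iter__(self):
        self.reset()
        self._valid_iterator_id += 1
        return _Decorator(self, self._valid_iterator_id)

class _Decorator:  # stands in for IteratorDecorator
    def __init__(self, dp, iterator_id):
        self._dp, self._id = dp, iterator_id

    def __iter__(self):
        return self

    def __next__(self):
        if self._id != self._dp._valid_iterator_id:
            raise RuntimeError("This iterator has been invalidated")
        return next(self._dp._inner)

dp = _ResetPipe(range(4))
it1 = iter(dp)
assert next(it1) == 0
it2 = iter(dp)       # resets state and invalidates `it1`
assert next(it2) == 0  # it2 starts from the beginning
raised = False
try:
    next(it1)
except RuntimeError:
    raised = True
assert raised
```

Note that with this scheme, self-returning and generator-based DataPipes end up with the same observable singleton behavior.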
I can see how that can be confusing.

I think this proposal will be best, given a DataPipe `dp` with `__next__` and iterators `it1` and `it2`:

- Allow unlimited creation of iterators, but only the latest one is valid/active.
- Don't place a restriction on `next(dp)`, but check for invalidation when `next(it1)` is called.
  - We can't really place a restriction on `next(dp)`, because any call to `next(it2)` gets delegated to `next(dp)` within the `IteratorDecorator`.
```python
source_dp = _CustomIterDP_Self(range(10))
it1 = iter(source_dp)
self.assertEqual(0, next(it1))
self.assertEqual(1, next(source_dp))
# Only invalidates `it1`, and not `source_dp`, since methods of `it2` depend on `source_dp` remaining valid
it2 = iter(source_dp)
with self.assertRaisesRegex(RuntimeError, "This iterator has been invalidated"):
    next(it1)
self.assertEqual(0, next(it2))
self.assertEqual(1, next(source_dp))
```
Please examine this test for the full details:
Lines 2337 to 2390 in c77df3e
```python
def test_iterdatapipe_singleton_self_next(self):
    r"""
    Testing for the case where IterDataPipe's `__iter__` returns `self` and there is a `__next__` method.
    Note that the following DataPipe is singleton by default (because `__iter__` returns `self`).
    """
    class _CustomIterDP_Self(IterDataPipe):
        def __init__(self, iterable):
            self.source = iterable
            self.iterable = iter(iterable)

        def __iter__(self):
            self.reset()
            return self

        def __next__(self):
            return next(self.iterable)

        def reset(self):
            self.iterable = iter(self.source)

    # Functional Test: Check that every `__iter__` call returns the same object
    source_dp = _CustomIterDP_Self(range(10))
    res = list(source_dp)
    it = iter(source_dp)
    self.assertEqual(res, list(it))

    # Functional Test: Check if invalidation logic is correct
    source_dp = _CustomIterDP_Self(range(10))
    it1 = iter(source_dp)
    self.assertEqual(0, next(it1))
    self.assertEqual(1, next(source_dp))
    # Only invalidates `it1`, and not `source_dp`, since methods of `it2` depend on `source_dp` remaining valid
    it2 = iter(source_dp)
    with self.assertRaisesRegex(RuntimeError, "This iterator has been invalidated"):
        next(it1)
    self.assertEqual(0, next(it2))
    self.assertEqual(1, next(source_dp))

    # Functional Test: extend the test to a pipeline
    source_dp = _CustomIterDP_Self(dp.iter.IterableWrapper(range(10)).map(_fake_fn).filter(_fake_filter_fn))
    it1 = iter(source_dp)
    self.assertEqual(0, next(it1))
    self.assertEqual(1, next(source_dp))
    # Only invalidates `it1`, and not `source_dp`, since methods of `it2` depend on `source_dp` remaining valid
    it2 = iter(source_dp)
    with self.assertRaisesRegex(RuntimeError, "This iterator has been invalidated"):
        next(it1)
    self.assertEqual(0, next(it2))
    self.assertEqual(1, next(source_dp))

    # Functional Test: multiple simultaneous references to the same DataPipe fails
    with self.assertRaisesRegex(RuntimeError, "This iterator has been invalidated"):
        for _ in zip(source_dp, source_dp):
            pass
```
I suddenly realize there may potentially be another problem. When we invalidate the previous iterator, we don't destroy it. It means the iterator object is never released from memory except in the case where the …
The iterator object should get garbage collected if there are no longer any references to it, correct?

```python
dp = IterableWrapper(range(10))
it1 = iter(dp)
it2 = iter(dp)  # invalidates `it1`
# `it1` can now be gc'ed because there isn't any reference to it
```
If possible, we could use the …
@VitalyFedyunin @ejguan I have addressed the comments and added more test cases. Let's discuss these two related points if they remain unclear/problematic:
The behavior based on the test cases looks good to me. Could you please also update the docstring for `IterDataPipe` to indicate this singleton behavior?

And could you please import this PR to Phabricator? We don't need to land from internal, but it's good to validate that this won't break any internal logic.
I can definitely update the docstring. I tried importing yesterday and it didn't work, but I can try again.
@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@pytorchbot revert this, broke internal tests, escalated to Kevin Tse [ktse@fb.com]
Reverting PR 70479 failed due to a command error.
Raised by https://github.com/pytorch/pytorch/actions/runs/2353257470
Landing it internally and we will do a fix forward.

Thank you!
```python
msg = ("This iterator has been invalidated because another iterator has been created"
       f"from the same IterDataPipe: {_generate_iterdatapipe_msg(datapipe)}\n"
```
nit: There is a missing space between `created` and `from` in the error message. cc: @NivekT
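For context, the bug comes from Python's implicit concatenation of adjacent string literals: without a trailing space, the two pieces fuse together.

```python
# Adjacent string literals are concatenated with nothing in between.
msg = ("This iterator has been invalidated because another iterator has been created"
       "from the same IterDataPipe")
assert "createdfrom" in msg  # the missing-space bug

# Adding a trailing space to the first literal fixes the message.
fixed = ("This iterator has been invalidated because another iterator has been created "
         "from the same IterDataPipe")
assert "created from" in fixed
```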
Stack from ghstack:

- `_IterDataPipeMeta` #76776

This PR introduces the single iterator constraint for `IterDataPipe`, focusing on `IterDataPipe` that produces one single `IterDataPipe` as output. This constraint is necessary because having multiple iterators referencing the same `IterDataPipe` can lead to incoherent internal state (e.g. `shuffler` may not produce sensible outputs), such that we cannot save a snapshot of the DataPipe or have determinism mechanisms.

Fixes part of pytorch/data#45

The expected behavior of a single iterator per DataPipe is described in detail within the linked issue. Please review the changes in `torch/utils/data/datapipes/_typing.py` to see if the implementation is sensible. The new test `TestIterDataPipeSingletonConstraint` in `test_datapipe.py` illustrates the behaviors that the constraint should impose. Please comment there if any behavior seems unexpected/unclear.

BC-breaking Note:

1. For `IterDataPipe`, each `IterDataPipe` can only have one active iterator at a time. The most recently created iterator is considered to be the active one; all previous ones are invalid. Attempting to use an invalid DataPipe will result in an error. The code example below illustrates this behavior:

   1.11.0 or before:

   This PR:

2. Keeping multiple simultaneous references to the same `IterDataPipe` is not allowed.

   1.11.0 or before:

   This PR:

   To iterate over the same `IterDataPipe` more than once, use `.fork()`.

3. This change does not impact `MapDataPipe`, `Dataset`, or `IterableDataset`.

Differential Revision: D33344609
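For readers unfamiliar with `.fork()`, `itertools.tee` gives a rough picture of its semantics: several child iterators over one buffered source. This is only an analogy under that assumption, not the torchdata implementation:

```python
from itertools import tee

def fork(iterable, n=2):
    # `tee` returns n independent iterators over one underlying source,
    # buffering items as needed -- conceptually similar to `.fork()`,
    # which lets a pipeline consume the same IterDataPipe twice without
    # creating a second iterator on it.
    return tee(iterable, n)

dp1, dp2 = fork(range(5))
assert list(zip(dp1, dp2)) == [(i, i) for i in range(5)]
```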