Closed
Description
🚀 The feature
For reference, here is the state transition proposal from the PR:
- Initial state: `NotStarted`
  - `iter` -> Call `reset` and change the state to `Iterating`
  - [Discuss] serialize/deserialize -> Keep the state as `NotStarted` (will `reset` if `iter` is called afterwards)
- Initial state: `Iterating`
  - `iter` -> Call `reset` and keep the state as `Iterating`
  - [Discuss] serialize/deserialize -> Change the state to `Restored`
- Initial state: `Restored`
  - `iter` -> Only change the state to `Iterating`
  - serialize/deserialize -> Not allowed
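To make the table above easier to reason about, here is a minimal sketch of the proposed transitions on a toy class. It is not the torchdata implementation: `State`, `ToyPipe`, `reset_calls`, and `roundtrip` are hypothetical names introduced only for illustration, and the two "[Discuss]" rows are implemented as proposed, not as agreed.

```python
from enum import Enum, auto


class State(Enum):
    """Hypothetical enum mirroring the three states named in the proposal."""
    NotStarted = auto()
    Iterating = auto()
    Restored = auto()


class ToyPipe:
    """Toy stand-in for a DataPipe; only the state handling is modelled."""

    def __init__(self, data):
        self.data = list(data)
        self.state = State.NotStarted
        self.reset_calls = 0  # counts how often reset() ran, for inspection

    def reset(self):
        self.reset_calls += 1

    def __iter__(self):
        # NotStarted and Iterating both call reset; Restored skips it.
        if self.state is not State.Restored:
            self.reset()
        self.state = State.Iterating
        return iter(self.data)

    def __getstate__(self):
        # Serializing from Restored is "Not allowed" in the proposal.
        if self.state is State.Restored:
            raise RuntimeError("cannot serialize a DataPipe in the Restored state")
        return self.__dict__.copy()

    def __setstate__(self, state):
        self.__dict__.update(state)
        # [Discuss] rows: Iterating becomes Restored, NotStarted is kept.
        if self.state is State.Iterating:
            self.state = State.Restored


def roundtrip(pipe):
    """Serialize then deserialize via the two hooks, without using pickle."""
    clone = type(pipe).__new__(type(pipe))
    clone.__setstate__(pipe.__getstate__())
    return clone
```

With this sketch, a pipe that has been iterated once and then passed through `roundtrip` comes back as `Restored`, and the next `iter` on the copy does not call `reset`. A pipe that was never iterated stays `NotStarted` after `roundtrip` and resets on its first `iter`.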
There are a few things I want to discuss:
- Does serialization/deserialization using `__getstate__` and `__setstate__` count as a snapshot, as shown above?
  - I personally think so, since fast-forwarding would be the fallback way to restore a snapshot. It would be better to use `__getstate__` to consolidate the state of the DataPipe and restore it without wasting time fast-forwarding.
  - And, if so, should we use a metaclass to automatically change the state when `__setstate__` is invoked?
- And, should a DataPipe convert its state from `Iterating` to `NotStarted` after every iteration ends?
  I know I discussed this with Kevin in the first place, but I completely forgot how we reached an agreement on keeping it as `Iterating` at the end of iteration. Now, I just realized there might be a corner case that this decision would break, especially when writing test cases.
from torch.utils.data import DataLoader
from torchdata.datapipes.iter import IterableWrapper

dp = IterableWrapper(list(range(1000)))
dl = DataLoader(dp, num_workers=0)
_ = list(dl)  # Running some tests
# After the above line, the DataPipe state becomes `Iterating`, since single-process loading iterates `dp` directly
# Now I create a DataLoader with multiple workers
dl = DataLoader(dp, num_workers=2)
_ = list(dl)
# The DataPipe state transitions from `Iterating` to `Restored` due to multiprocessing, so `reset` is not invoked
Technically speaking, this problem only happens with `DataLoader` but not `DataLoader2`, since DLv2 should always use a copy of the DataPipe graph (even though that's not true for now). It would be great if we had a clear boundary for each state.
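On the earlier question of whether a metaclass could change the state automatically when `__setstate__` is invoked, one possible shape is sketched below. This is only an illustration under stated assumptions: `StateTrackingMeta`, `BasePipe`, `CustomPipe`, and the string constants are hypothetical and do not reflect how torchdata's own metaclass works. The idea is that the metaclass wraps any `__setstate__` defined in a class body, so a subclass author who overrides `__setstate__` does not need to remember the state flag.

```python
import functools

# Hypothetical state labels, named after the states in the proposal.
NOT_STARTED, ITERATING, RESTORED = "NotStarted", "Iterating", "Restored"


class StateTrackingMeta(type):
    """Hypothetical metaclass: wraps a __setstate__ defined in a class body."""

    def __new__(mcls, name, bases, namespace):
        original = namespace.get("__setstate__")
        if original is not None:

            @functools.wraps(original)
            def setstate_and_mark(self, state):
                original(self, state)
                # Proposed rule: Iterating -> Restored, NotStarted is kept.
                if getattr(self, "state", None) == ITERATING:
                    self.state = RESTORED

            namespace["__setstate__"] = setstate_and_mark
        return super().__new__(mcls, name, bases, namespace)


class BasePipe(metaclass=StateTrackingMeta):
    def __init__(self):
        self.state = NOT_STARTED

    def __getstate__(self):
        return self.__dict__.copy()

    def __setstate__(self, state):
        self.__dict__.update(state)


class CustomPipe(BasePipe):
    # Overrides __setstate__ without touching the state flag;
    # the metaclass still applies the transition afterwards.
    def __setstate__(self, state):
        self.__dict__.update(state)
        self.extra = "rebuilt"
```

The wrapper is idempotent, so a subclass that also calls `super().__setstate__` would apply the transition twice without changing the result. Whether this implicit behaviour is preferable to an explicit call in each `__setstate__` is exactly the open design question.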
Motivation, pitch
We need to clarify the state transition for both users and developers.
Alternatives
No response
Additional context
No response