8000 [Discussion] Snapshot state for DataPipe · Issue #741 · pytorch/data · GitHub
[go: up one dir, main page]

Skip to content
[Discussion] Snapshot state for DataPipe #741
@ejguan

Description

@ejguan

🚀 The feature

List the reference of state transition proposal in the PR:

  • Initial state: NotStarted
    • iter -> Call reset and change the state to Iterating
    • [Discuss] serialize/deserialize -> Keep the state as NotStarted (will reset if iter is called afterwards)
  • Initial state: Iterating
    • iter -> Call reset and keep the state to Iterating
    • [Discuss] serialize/deserialize -> Change the state as Restored
  • Initial state: Restored
    • iter -> Only change the state to Iterating
    • serialize/deserialize -> Not allowed

There are a few things I want to discuss:

  • Does serialization/deserialization using __getstate__ and __setstate__ counts as snapshot as shown above?
    • I personally think so as fast forward would be the backup way to run snapshot. It would be better to use __getstate__ to consolidate the state of DataPipe and restore it without wasting time fast forwarding.
    • And, if so, should we make metaclass to automatically change the state when __setstate__ is invoked?
    • And, should DataPipe converts its state from Iterating to NotStarted after every iteration ends?
      I know I have discussed it with Kevin the first place, but I completely forgot how we reached an agreement on keeping it as Iterating at the end of iteration. Now, I just realized there might be a corner case that this decision would screw up, especially when I wrote test cases.
dp = IterableWrapper(list(range(1000))
dl = DataLoader(dp, num_workers=0)
_ = list(dl)  # Running some tests
# After the above line, the DataPipe state becomes `Iterating` since single-process
# Now I create DataLoader with multiple workers
dl = DataLoader(dp, num_workers=2)
_ = list(dl)
# The DataPipe state transits from `Iterating` to `Restored` due to multiprocessing, then `reset` is not invoked

Technical speaking, this problem only happens with DataLoader but not DataLoader2 since DLv2 should always use a copy of DataPipe graph (even though it's not true for now). It would be great if we have a clear boundary for each state.

cc: @NivekT @VitalyFedyunin

Motivation, pitch

We need to clarify the state transition for both users and developers.

Alternatives

No response

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions

      0