Fix an import loop by cxxxs · Pull Request #119820 · pytorch/pytorch · GitHub

Fix an import loop #119820


Closed
wants to merge 1 commit into from


cxxxs
Contributor
@cxxxs cxxxs commented Feb 13, 2024

Summary:
We ran into the following import loop when testing aps:

Traceback (most recent call last):
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/runtime/lib/python3.10/multiprocessing/forkserver.py", line 274, in main
    code = _serve_one(child_r, fds,
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/runtime/lib/python3.10/multiprocessing/forkserver.py", line 313, in _serve_one
    code = spawn._main(child_r, parent_sentinel)
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/runtime/lib/python3.10/multiprocessing/spawn.py", line 125, in _main
    prepare(preparation_data)
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/runtime/lib/python3.10/multiprocessing/spawn.py", line 234, in prepare
    _fixup_main_from_name(data['init_main_from_name'])
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/runtime/lib/python3.10/multiprocessing/spawn.py", line 258, in _fixup_main_from_name
    main_content = runpy.run_module(mod_name,
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/runtime/lib/python3.10/runpy.py", line 224, in run_module
    return _run_module_code(code, init_globals, run_name, mod_spec)
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/runtime/lib/python3.10/runpy.py", line 96, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/runtime/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/aps_models/ads/icvr/icvr_launcher.py", line 29, in <module>
    class ICVRConfig(AdsComboLauncherConfig):
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/aps_models/ads/common/ads_launcher.py", line 249, in <module>
    class AdsComboLauncherConfig(AdsConfig):
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/aps_models/ads/common/app_config.py", line 16, in <module>
    class AdsConfig(RecTrainAppConfig):
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/apf/rec/config_def.py", line 47, in <module>
    class EmbeddingKernelConfig:
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/apf/rec/config_def.py", line 52, in EmbeddingKernelConfig
    cache_algorithm: CacheAlgorithm = CacheAlgorithm.LRU
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/torchrec/distributed/types.py", line 501, in <module>
    class ParameterSharding:
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/torchrec/distributed/types.py", line 527, in ParameterSharding
    sharding_spec: Optional[ShardingSpec] = None
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/torch/distributed/_shard/sharding_spec/api.py", line 48, in <module>
    class ShardingSpec(ABC):
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/torch/distributed/_shard/sharding_spec/api.py", line 55, in ShardingSpec
    tensor_properties: sharded_tensor_meta.TensorProperties,
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/torch/distributed/_shard/sharded_tensor/__init__.py", line 21, in <module>
    def empty(sharding_spec: shard_spec.ShardingSpec,
ImportError: cannot import name 'ShardingSpec' from partially initialized module 'torch.distributed._shard.sharding_spec.api' (most likely due to a circular import) (/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/torch/distributed/_shard/sharding_spec/api.py)
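
For readers unfamiliar with this failure mode, here is a minimal sketch that reproduces the same kind of `ImportError` with two throwaway modules. The module names (`loop_demo_spec`, `loop_demo_tensor`) and their contents are hypothetical stand-ins, not the real torch files; they only mirror the shape of the cycle in the traceback above.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical modules that import each other at module scope.
MODULES = {
    "loop_demo_spec": (
        "from loop_demo_tensor import empty\n"  # runs before ShardingSpec is defined
        "class ShardingSpec:\n"
        "    pass\n"
    ),
    "loop_demo_tensor": (
        "from loop_demo_spec import ShardingSpec\n"  # loop_demo_spec is only half-initialized here
        "def empty(sharding_spec: ShardingSpec):\n"
        "    return sharding_spec\n"
    ),
}


def import_with_cycle():
    """Write the two modules to a temp dir, import one, and return the ImportError (or None)."""
    with tempfile.TemporaryDirectory() as tmp:
        for name, source in MODULES.items():
            Path(tmp, f"{name}.py").write_text(source)
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            importlib.import_module("loop_demo_spec")
        except ImportError as exc:
            return exc
        finally:
            # Leave the interpreter state as we found it.
            sys.path.remove(tmp)
            for name in MODULES:
                sys.modules.pop(name, None)
    return None


error = import_with_cycle()
print(error)
```

The second module asks for `ShardingSpec` while the first module is still executing its own import line, so the name does not exist yet.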

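Using `from __future__ import annotations` to mitigate: annotations are then stored as strings instead of being evaluated when the module is imported.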
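
A rough sketch of why the future import helps, assuming nothing about the actual diff beyond the summary above. It compiles the same function definition twice, once with and once without the future import, in a namespace where `shard_spec` is deliberately undefined. The helper `define_empty` and the trimmed-down `empty` signature are made up for illustration.

```python
import textwrap

# A definition whose annotation names something that is not importable yet.
SOURCE = textwrap.dedent(
    """
    def empty(sharding_spec: shard_spec.ShardingSpec):
        return sharding_spec
    """
)


def define_empty(use_future_annotations: bool):
    """Compile SOURCE in a fresh namespace and return the resulting function."""
    prefix = "from __future__ import annotations\n" if use_future_annotations else ""
    namespace = {}
    code = compile(prefix + SOURCE, "<sketch>", "exec", dont_inherit=True)
    exec(code, namespace)
    return namespace["empty"]


# Without the future import, interpreters that evaluate annotations eagerly
# (such as the Python 3.10 in the traceback) fail at definition time.
# Newer interpreters may defer evaluation on their own, so both outcomes are handled.
try:
    define_empty(use_future_annotations=False)
    eager_outcome = "defined without error"
except NameError as exc:
    eager_outcome = f"NameError: {exc}"

# With the future import, the annotation is kept as a plain string.
lazy_empty = define_empty(use_future_annotations=True)

print(eager_outcome)
print(lazy_empty.__annotations__)
```

Because the annotation is never looked up at definition time, the module that provides it no longer has to be fully initialized when the `def` statement runs.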

Test Plan:

hg update 1b1b3154616b70fd3325c467db1f7e0f70182a74
CUDA_VISIBLE_DEVICES=1,2 buck2 run @//mode/opt //aps_models/ads/icvr:icvr_launcher -- mode=local_ctr_cvr_rep

Differential Revision: D53685582

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225

@pytorch-bot pytorch-bot bot added the release notes: distributed (sharded) release notes category label Feb 13, 2024
pytorch-bot bot commented Feb 13, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/119820

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit d37a032 with merge base 78f53a3:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D53685582

@github-actions github-actions bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Feb 13, 2024
@Skylion007
Collaborator

Can we use a TYPE_CHECKING block instead? That specific import behavior is deprecated anyway in Python 3.13+.
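
For context, a minimal sketch of the `TYPE_CHECKING` pattern being suggested. The function `describe_spec` is a hypothetical stand-in, not part of torch, and the import path inside the guard is only illustrative; it is never executed at runtime, so the snippet runs without torch installed.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only static type checkers follow this import. At runtime TYPE_CHECKING
    # is False, so the import is skipped and cannot take part in a cycle.
    from torch.distributed._shard.sharding_spec import ShardingSpec


def describe_spec(sharding_spec: "ShardingSpec") -> str:
    """Hypothetical helper; the quoted annotation is not resolved at definition time."""
    return type(sharding_spec).__name__


print(TYPE_CHECKING)
print(describe_spec.__annotations__["sharding_spec"])
```

The trade-off is that every annotation using the guarded name has to be quoted (or the module has to defer annotation evaluation some other way), since the name does not exist at runtime.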

@fegin
Contributor
fegin commented Feb 15, 2024

@Skylion007

It's still there according to https://docs.python.org/3.13/library/__future__.html. The docs show that the original plan was to make it mandatory in 3.11, but the decision was postponed indefinitely. No deprecation is mentioned. This is also used in other components of torch, so it looks safe imo.
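
One way to check the status from the interpreter itself rather than from the docs page, as a small sketch: the `__future__` module exposes each feature as an object with `getOptionalRelease()` and `getMandatoryRelease()`. The value returned for the mandatory release differs between Python versions, so it is only printed here.

```python
import __future__

feature = __future__.annotations

# First release in which `from __future__ import annotations` was accepted.
optional_in = feature.getOptionalRelease()

# Release in which the behavior becomes the default; this varies by interpreter
# version and may be None when no such release is planned or decided.
mandatory_in = feature.getMandatoryRelease()

print("optional since:", optional_in)
print("mandatory in:", mandatory_in)
```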

@Skylion007
Collaborator
Skylion007 commented Feb 16, 2024

See #117449. We have an issue to remove it.


@fegin
Contributor
fegin commented Feb 16, 2024

@Skylion007 Thanks for the information.

Contributor
@fegin fegin left a comment


Thanks for the fix. Please ensure all CIs are green before landing the diff/PR.

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Feb 16, 2024
Reviewed By: fegin

Differential Revision: D53685582
@facebook-github-bot
Contributor

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


@pytorchmergebot
Collaborator

Merge failed

Reason: 3 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

@kit1980
Contributor
kit1980 commented Feb 28, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


Labels
ciflow/trunk Trigger trunk jobs on your pull request fb-exported Merged oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (sharded) release notes category
6 participants