☂️ Many ecosystem libraries started to fail with `std::bad_alloc` after Nov 1st · Issue #140590 · pytorch/pytorch · GitHub
Closed
malfet opened this issue Nov 13, 2024 · 10 comments
Labels
high priority module: crash Problem manifests as a hard crash, as opposed to a RuntimeError module: regression It used to work, and now it doesn't triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module

Comments

@malfet malfet added high priority module: crash Problem manifests as a hard crash, as opposed to a RuntimeError module: regression It used to work, and now it doesn't labels Nov 13, 2024
@atalman
Contributor
atalman commented Nov 18, 2024

@malfet
Contributor Author
malfet commented Nov 18, 2024

If it's limited to the JIT script, one needs to check that parts of libstdc++ runtime are not linked separately for libtorch_cpu and libtorch_python
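
One way to start on that check from inside Python on Linux is to look at which C++ runtime files are mapped into the process after `import torch`. The sketch below only parses text in `/proc/<pid>/maps` format; the function name `mapped_libraries` and the sample lines are made up for illustration. A statically linked copy of libstdc++ would not show up as a separate mapping, so a single entry does not rule the problem out.

```python
import re

def mapped_libraries(maps_text, pattern=r"libstdc\+\+|libgcc_s|libtorch"):
    """Return the sorted set of mapped file paths whose name matches `pattern`.

    `maps_text` is the content of /proc/<pid>/maps (Linux only).
    """
    paths = set()
    for line in maps_text.splitlines():
        # Fields: address perms offset dev inode [path]
        fields = line.split(None, 5)
        if len(fields) == 6 and re.search(pattern, fields[5]):
            paths.add(fields[5].strip())
    return sorted(paths)

# Hypothetical sample in /proc/self/maps format, for illustration only.
SAMPLE = """\
7f0000000000-7f0000100000 r-xp 00000000 08:01 101 /env/lib/libstdc++.so.6.0.33
7f0000100000-7f0000200000 r--p 00100000 08:01 101 /env/lib/libstdc++.so.6.0.33
7f0000300000-7f0000400000 r-xp 00000000 08:01 102 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30
7f0000500000-7f0000600000 r-xp 00000000 08:01 103 /env/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so
7f0000700000-7f0000800000 rw-p 00000000 00:00 0
"""

libs = mapped_libraries(SAMPLE)
cxx_runtimes = [p for p in libs if "libstdc++" in p]
```

In a live process, the input would be `open("/proc/self/maps").read()` after importing torch; more than one distinct libstdc++ path in the result would be worth investigating.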

@malfet malfet added triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module and removed triage review labels Nov 18, 2024
@atalman
Contributor
atalman commented Nov 18, 2024

This is the error:

=================================== FAILURES ===================================
___________________________ DenseArchTest.test_basic ___________________________

self = <torchrec.models.tests.test_deepfm.DenseArchTest testMethod=test_basic>

    def test_basic(self) -> None:
        torch.manual_seed(0)
    
        B = 20
        D = 3
        in_features = 10
        dense_arch = DenseArch(
            in_features=in_features, hidden_layer_size=10, embedding_dim=D
        )
    
        dense_arch_input = torch.rand((B, in_features))
        dense_embedded = dense_arch(dense_arch_input)
        self.assertEqual(dense_embedded.size(), (B, D))
    
        # check tracer compatibility
        gm = torch.fx.GraphModule(dense_arch, Tracer().trace(dense_arch))
>       script = torch.jit.script(gm)

torchrec/models/tests/test_deepfm.py:38: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/opt/conda/envs/build_binary/lib/python3.12/site-packages/torch/jit/_script.py:1429: in script
    ret = _script_impl(
/opt/conda/envs/build_binary/lib/python3.12/site-packages/torch/jit/_script.py:1147: in _script_impl
    return torch.jit._recursive.create_script_module(
/opt/conda/envs/build_binary/lib/python3.12/site-packages/torch/jit/_recursive.py:557: in create_script_module
    return create_script_module_impl(nn_module, concrete_type, stubs_fn)
/opt/conda/envs/build_binary/lib/python3.12/site-packages/torch/jit/_recursive.py:626: in create_script_module_impl
    script_module = torch.jit.RecursiveScriptModule._construct(cpp_module, init_fn)
/opt/conda/envs/build_binary/lib/python3.12/site-packages/torch/jit/_script.py:650: in _construct
    init_fn(script_module)
/opt/conda/envs/build_binary/lib/python3.12/site-packages/torch/jit/_recursive.py:602: in init_fn
    scripted = create_script_module_impl(
/opt/conda/envs/build_binary/lib/python3.12/site-packages/torch/jit/_recursive.py:626: in create_script_module_impl
    script_module = torch.jit.RecursiveScriptModule._construct(cpp_module, init_fn)
/opt/conda/envs/build_binary/lib/python3.12/site-packages/torch/jit/_script.py:650: in _construct
    init_fn(script_module)
/opt/conda/envs/build_binary/lib/python3.12/site-packages/torch/jit/_recursive.py:602: in init_fn
    scripted = create_script_module_impl(
/opt/conda/envs/build_binary/lib/python3.12/site-packages/torch/jit/_recursive.py:630: in create_script_module_impl
    create_methods_and_properties_from_stubs(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

concrete_type = <torch.ConcreteModuleType object at 0x7fd799d0e070>
method_stubs = [ScriptMethodStub(resolution_callback=<function createResolutionCallbackFromEnv.<locals>.<lambda> at 0x7fd799c005e0>, ... 0x7fd799b943f0>, original_method=<bound method Linear.forward of Linear(in_features=10, out_features=10, bias=True)>)]
property_stubs = []

    def create_methods_and_properties_from_stubs(
        concrete_type, method_stubs, property_stubs
    ):
        method_defs = [m.def_ for m in method_stubs]
        method_rcbs = [m.resolution_callback for m in method_stubs]
        method_defaults = [get_default_args(m.original_method) for m in method_stubs]
    
        property_defs = [p.def_ for p in property_stubs]
        property_rcbs = [p.resolution_callback for p in property_stubs]
    
>       concrete_type._create_methods_and_properties(
            property_defs, property_rcbs, method_defs, method_rcbs, method_defaults
        )
E       MemoryError: std::bad_alloc

/opt/conda/envs/build_binary/lib/python3.12/site-packages/torch/jit/_recursive.py:466: MemoryError
______________ TestEvictionPolicy.test_fx_jit_script_not_training ______________

self = <torchrec.modules.tests.test_mc_modules.TestEvictionPolicy testMethod=test_fx_jit_script_not_training>

    def test_fx_jit_script_not_training(self) -> None:
        model = MCHManagedCollisionModule(
            zch_size=5,
            device=torch.device("cpu"),
            eviction_policy=LFU_EvictionPolicy(),
            eviction_interval=1,
            input_hash_size=100,
        )
    
        model.train(False)
        gm = torch.fx.symbolic_trace(model)
>       torch.jit.script(gm)

torchrec/modules/tests/test_mc_modules.py:361: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/opt/conda/envs/build_binary/lib/python3.12/site-packages/torch/jit/_script.py:1429: in script
    ret = _script_impl(
/opt/conda/envs/build_binary/lib/python3.12/site-packages/torch/jit/_script.py:1147: in _script_impl
    return torch.jit._recursive.create_script_module(
/opt/conda/envs/build_binary/lib/python3.12/site-packages/torch/jit/_recursive.py:557: in create_script_module
    return create_script_module_impl(nn_module, concrete_type, stubs_fn)
/opt/conda/envs/build_binary/lib/python3.12/site-packages/torch/jit/_recursive.py:630: in create_script_module_impl
    create_methods_and_properties_from_stubs(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

concrete_type = <torch.ConcreteModuleType object at 0x7fd7982f29f0>
method_stubs = [ScriptMethodStub(resolution_callback=<function createResolutionCallbackFromEnv.<locals>.<lambda> at 0x7fd79823ec00>, ...._jit_tree_views.Def object at 0x7fd7982889f0>, original_method=<bound method forward of MCHManagedCollisionModule()>)]
property_stubs = [PropertyStub(resolution_callback=<function createResolutionCallbackFromEnv.<locals>.<lambda> at 0x7fd79823e520>, def_=<torch._C._jit_tree_views.Property object at 0x7fd798231230>)]

    def create_methods_and_properties_from_stubs(
        concrete_type, method_stubs, property_stubs
    ):
        method_defs = [m.def_ for m in method_stubs]
        method_rcbs = [m.resolution_callback for m in method_stubs]
        method_defaults = [get_default_args(m.original_method) for m in method_stubs]
    
        property_defs = [p.def_ for p in property_stubs]
        property_rcbs = [p.resolution_callback for p in property_stubs]
    
>       concrete_type._create_methods_and_properties(
            property_defs, property_rcbs, method_defs, method_rcbs, method_defaults
        )
E       MemoryError: std::bad_alloc

/opt/conda/envs/build_binary/lib/python3.12/site-packages/torch/jit/_recursive.py:466: MemoryError
__________________________ TestMLP.test_fx_script_MLP __________________________

self = <torchrec.modules.tests.test_mlp.TestMLP testMethod=test_fx_script_MLP>

    def test_fx_script_MLP(self) -> None:
        in_features = 3
        layer_sizes = [16, 8, 4]
        m = MLP(in_features, layer_sizes)
    
        gm = symbolic_trace(m)
>       torch.jit.script(gm)

torchrec/modules/tests/test_mlp.py:111: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/opt/conda/envs/build_binary/lib/python3.12/site-packages/torch/jit/_script.py:1429: in script
    ret = _script_impl(
/opt/conda/envs/build_binary/lib/python3.12/site-packages/torch/jit/_script.py:1147: in _script_impl
    return torch.jit._recursive.create_script_module(
/opt/conda/envs/build_binary/lib/python3.12/site-packages/torch/jit/_recursive.py:557: in create_script_module
    return create_script_module_impl(nn_module, concrete_type, stubs_fn)
/opt/conda/envs/build_binary/lib/python3.12/site-packages/torch/jit/_recursive.py:626: in create_script_module_impl
    script_module = torch.jit.RecursiveScriptModule._construct(cpp_module, init_fn)
/opt/conda/envs/build_binary/lib/python3.12/site-packages/torch/jit/_script.py:650: in _construct
    init_fn(script_module)
/opt/conda/envs/build_binary/lib/python3.12/site-packages/torch/jit/_recursive.py:602: in init_fn
    scripted = create_script_module_impl(
/opt/conda/envs/build_binary/lib/python3.12/site-packages/torch/jit/_recursive.py:626: in create_script_module_impl
    script_module = torch.jit.RecursiveScriptModule._construct(cpp_module, init_fn)
/opt/conda/envs/build_binary/lib/python3.12/site-packages/torch/jit/_script.py:650: in _construct
    init_fn(script_module)
/opt/conda/envs/build_binary/lib/python3.12/site-packages/torch/jit/_recursive.py:602: in init_fn
    scripted = create_script_module_impl(
/opt/conda/envs/build_binary/lib/python3.12/site-packages/torch/jit/_recursive.py:626: in create_script_module_impl
    script_module = torch.jit.RecursiveScriptModule._construct(cpp_module, init_fn)
/opt/conda/envs/build_binary/lib/python3.12/site-packages/torch/jit/_script.py:650: in _construct
    init_fn(script_module)
/opt/conda/envs/build_binary/lib/python3.12/site-packages/torch/jit/_recursive.py:602: in init_fn
    scripted = create_script_module_impl(
/opt/conda/envs/build_binary/lib/python3.12/site-packages/torch/jit/_recursive.py:630: in create_script_module_impl
    create_methods_and_properties_from_stubs(
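
All three failing tests above follow the same pattern: trace a module with `torch.fx`, then pass the resulting `GraphModule` to `torch.jit.script`. A minimal sketch of that pattern follows, using a small hypothetical module (`TinyMLP`) in place of the torchrec ones. Whether a case this small actually triggers the failure on an affected build is not established in this thread; the function only reports what happened.

```python
def try_fx_then_script():
    """Trace a tiny module with torch.fx, then script it.

    Returns "ok", "bad_alloc", or "skipped" (if torch is not installed).
    """
    try:
        import torch
        import torch.fx
    except ImportError:
        return "skipped"

    # Hypothetical stand-in for the torchrec modules in the failing tests.
    class TinyMLP(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.fc = torch.nn.Linear(10, 3)

        def forward(self, x):
            return torch.relu(self.fc(x))

    gm = torch.fx.symbolic_trace(TinyMLP())
    try:
        torch.jit.script(gm)  # the call that raised in the failing tests
    except MemoryError:
        # pytest reported the C++ std::bad_alloc as a Python MemoryError
        return "bad_alloc"
    return "ok"

status = try_fx_then_script()
```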

@malfet
Contributor Author
malfet commented Nov 20, 2024

Per @davidberard98, this issue is probably related: #140423

@FindHao
Member
FindHao commented Nov 27, 2024

https://github.com/pytorch-labs/tritonbench/actions/runs/12055925374/job/33617344431
The tritonbench CI with the stable Triton version 3.1.0 also hits this issue.

@atalman
Contributor
atalman commented Nov 29, 2024

> If it's limited to the JIT script, one needs to check that parts of libstdc++ runtime are not linked separately for libtorch_cpu and libtorch_python

@malfet could you please provide more details on what exactly we should check? The ldd command will probably help here.
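
A possible starting point, sketched under assumptions: run `ldd` on both libraries and compare how each resolves the C++ runtime. The helpers below (`parse_ldd`, `cxx_runtime_of`) are hypothetical names, and the sample text is not taken from a real build. If one library resolves `libstdc++.so.6` to a different path than the other, or does not list it at all (which can indicate static linking), that would be worth a closer look.

```python
import subprocess

def parse_ldd(output):
    """Map each soname in `ldd` output to its resolved path (None if not found)."""
    deps = {}
    for line in output.splitlines():
        line = line.strip()
        if "=>" not in line:
            continue  # skip vdso / loader lines that have no "=>"
        name, _, rest = line.partition("=>")
        target = rest.strip().split(" (")[0].strip()
        deps[name.strip()] = None if target in ("", "not found") else target
    return deps

def cxx_runtime_of(library_path):
    """Run `ldd` on a shared library and return its libstdc++ path, if any."""
    out = subprocess.run(["ldd", library_path], capture_output=True,
                         text=True, check=True).stdout
    return parse_ldd(out).get("libstdc++.so.6")

# Hypothetical ldd output, for illustration only.
SAMPLE_LDD = """\
\tlinux-vdso.so.1 (0x00007ffc0b5f2000)
\tlibtorch_cpu.so => /env/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so (0x00007f1000000000)
\tlibstdc++.so.6 => /env/lib/libstdc++.so.6 (0x00007f0fffc00000)
\tlibmissing.so => not found
\t/lib64/ld-linux-x86-64.so.2 (0x00007f1010000000)
"""

deps = parse_ldd(SAMPLE_LDD)
```

On a real install, `cxx_runtime_of` would be called once for `libtorch_cpu.so` and once for `libtorch_python.so` in the installed torch package, and the two results compared.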

@atalman
Contributor
atalman commented Nov 29, 2024

Could this PR have caused the failure: #127936? Here is the PR where I reverted some of the changes that landed in the Nov 2 nightly: #141782. It looks like after reverting #127936 I can no longer reproduce the std::bad_alloc with the repro I use, which runs this workflow: https://github.com/pytorch/vision/blob/main/.github/workflows/docs.yml

@jerryzh168
Contributor

It looks like the issue has come back: https://github.com/pytorch/ao/actions/runs/12123456248/job/33798989902 after we migrated to linux_job_v2: pytorch/ao#1302

@atalman
Contributor
atalman commented Dec 3, 2024

Revert landed in Dec 3 nightly, closing this issue.

@leslie-fang-intel
Collaborator

Hi @atalman, since this failure was only caught by CI in the Torch libraries, do you have any suggestions for adding acceptance tests to PyTorch Core CI to gate this kind of issue?
