8000 Fix static `py::object` dangling pointer with `py::gil_safe_call_once_and_store` by XuehaiPan · Pull Request #130341 · pytorch/pytorch · GitHub
[go: up one dir, main page]

Skip to content

Fix static py::object dangling pointer with py::gil_safe_call_once_and_store #130341

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

XuehaiPan
Copy link
Collaborator
@XuehaiPan XuehaiPan commented Jul 9, 2024

Fix static py::objects with py::gil_safe_call_once_and_store.

The following code will leak a py::object which will call its destructor when shutdown the program. The destructor will call Py_DECREF(obj.m_ptr) which may raise a segmentation fault.

void func() {
    static py::object obj = py::module_::import("foo").attr("bar");

    ...
}

The correct code is to use raw pointers rather than the instance.

void func() {
    static py::object* obj_ptr = new py::object{py::module_::import("foo").attr("bar")};
    py::object obj = *obj_ptr;

    ...
}

This PR uses the py::gil_safe_call_once_and_store function from pybind11, which can run arbitrary initialization code only once under the Python GIL thread safely.

void func() {
    PYBIND11_CONSTINIT static py::gil_safe_call_once_and_store<py::object> storage;
    py::object obj = storage
                         .call_once_and_store_result(
                             []() -> py::object {
                                 return py::module_::import("foo").attr("bar");
                             }
                         )
                         .get_stored();

    ...
}

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang

@XuehaiPan XuehaiPan added the better-engineering Relatively self-contained tasks for better engineering contributors label Jul 9, 2024
Copy link
pytorch-bot bot commented Jul 9, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/130341

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 9572f70 with merge base b1a00b7 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added module: dynamo release notes: jit release notes category labels Jul 9, 2024
@XuehaiPan XuehaiPan requested review from ezyang and Skylion007 July 9, 2024 16:48
@XuehaiPan XuehaiPan changed the title Fix static py::object memory leak with py::gil_safe_call_once_and_store Fix static py::object dangling pointer with py::gil_safe_call_once_and_store Jul 9, 2024
@XuehaiPan XuehaiPan force-pushed the gil_safe_call_once_and_store branch 3 times, most recently from 199a841 to 49d2f73 Compare July 9, 2024 18:39
@ezyang ezyang requested a review from williamwen42 July 9, 2024 22:48
@williamwen42
Copy link
Member

Yep, I encountered this issue while working on dynamo 3.13 enablement. Is this behavior documented anywhere in pybind? It would be great if pybind could support code like

static py::object obj = py::module_::import("foo").attr("bar");

@ezyang ezyang added the topic: not user facing topic category label Jul 10, 2024
@ezyang
Copy link
Contributor
ezyang commented Jul 10, 2024

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jul 10, 2024
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / libtorch-linux-focal-cuda12.4-py3.7-gcc9-debug / build

Details for Dev Infra team Raised by workflow job

@XuehaiPan
Copy link
Collaborator Author

@pytorchbot merge -f "ci hanging for rocm runners"

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@zou3519
Copy link
Contributor
zou3519 commented Jul 11, 2024

internal needs a third-party/pybind update (#130064 got reverted for a similar reason)

@XuehaiPan XuehaiPan force-pushed the gil_safe_call_once_and_store branch from 375caef to 24abc7c Compare July 27, 2024 16:32
@XuehaiPan XuehaiPan force-pushed the gil_safe_call_once_and_store branch from 24abc7c to 43c7a4f Compare August 1, 2024 18:35
@XuehaiPan XuehaiPan closed this Aug 3, 2024
@XuehaiPan XuehaiPan deleted the gil_safe_call_once_and_store branch August 3, 2024 05:17
@XuehaiPan XuehaiPan restored the gil_safe_call_once_and_store branch August 3, 2024 05:18
@XuehaiPan XuehaiPan reopened this Aug 3, 2024
@XuehaiPan XuehaiPan force-pushed the gil_safe_call_once_and_store branch 2 times, most recently from d176d9d to f00660c Compare August 3, 2024 05:19
albanD pushed a commit to albanD/pytorch that referenced this pull request Aug 6, 2024
…_and_store` (pytorch#130341)

Fix static `py::object`s with `py::gil_safe_call_once_and_store`.

The following code will leak a `py::object` which will call its destructor when shutdown the program. The destructor will call `Py_DECREF(obj.m_ptr)` which may raise a segmentation fault.

```c++
void func() {
    static py::object obj = py::module_::import("foo").attr("bar");

    ...
}
```

The correct code is to use raw pointers rather than the instance.

```c++
void func() {
    static py::object* obj_ptr = new py::object{py::module_::import("foo").attr("bar")};
    py::object obj = *obj_ptr;

    ...
}
```

This PR uses the `py::gil_safe_call_once_and_store` function from `pybind11`, which can run arbitrary initialization code only once under the Python GIL thread safely.

```c++
void func() {
    PYBIND11_CONSTINIT static py::gil_safe_call_once_and_store<py::object> storage;
    py::object obj = storage
                         .call_once_and_store_result(
                             []() -> py::object {
                                 return py::module_::import("foo").attr("bar");
                             }
                         )
                         .get_stored();

    ...
}
```

Pull Request resolved: pytorch#130341
Approved by: https://github.com/ezyang
@albanD
Copy link
Collaborator
albanD commented Aug 6, 2024

Sorry for the delay on this one. The internal pybind upgrade necessary to merge this has been a slog :/

@XuehaiPan XuehaiPan force-pushed the gil_safe_call_once_and_store branch 3 times, most recently from 3dbb655 to 80cd4cd Compare September 1, 2024 17:29
@XuehaiPan XuehaiPan force-pushed the gil_safe_call_once_and_store branch from 80cd4cd to 9572f70 Compare September 1, 2024 17:29
pytorch-bot bot pushed a commit that referenced this pull request Sep 18, 2024
…_and_store` (#130341)

Fix static `py::object`s with `py::gil_safe_call_once_and_store`.

The following code will leak a `py::object` which will call its destructor when shutdown the program. The destructor will call `Py_DECREF(obj.m_ptr)` which may raise a segmentation fault.

```c++
void func() {
    static py::object obj = py::module_::import("foo").attr("bar");

    ...
}
```

The correct code is to use raw pointers rather than the instance.

```c++
void func() {
    static py::object* obj_ptr = new py::object{py::module_::import("foo").attr("bar")};
    py::object obj = *obj_ptr;

    ...
}
```

This PR uses the `py::gil_safe_call_once_and_store` function from `pybind11`, which can run arbitrary initialization code only once under the Python GIL thread safely.

```c++
void func() {
    PYBIND11_CONSTINIT static py::gil_safe_call_once_and_store<py::object> storage;
    py::object obj = storage
                         .call_once_and_store_result(
                             []() -> py::object {
                                 return py::module_::import("foo").attr("bar");
                             }
                         )
                         .get_stored();

    ...
}
```

Pull Request resolved: #130341
Approved by: https://github.com/ezyang
@albanD
Copy link
Collaborator
albanD commented Sep 20, 2024

FYI #136370 is a rework of this PR to avoid the internal pybind issue

pytorchmergebot pushed a commit that referenced this pull request Sep 20, 2024
This is a modified version of #130341 that preserve support for older pybind version.

Pull Request resolved: #136370
Approved by: https://github.com/Skylion007, https://github.com/malfet
@XuehaiPan XuehaiPan closed this Sep 21, 2024
albanD pushed a commit to albanD/pytorch that referenced this pull request Sep 21, 2024
…_and_store` (p
8000
ytorch#130341)

Fix static `py::object`s with `py::gil_safe_call_once_and_store`.

The following code will leak a `py::object` which will call its destructor when shutdown the program. The destructor will call `Py_DECREF(obj.m_ptr)` which may raise a segmentation fault.

```c++
void func() {
    static py::object obj = py::module_::import("foo").attr("bar");

    ...
}
```

The correct code is to use raw pointers rather than the instance.

```c++
void func() {
    static py::object* obj_ptr = new py::object{py::module_::import("foo").attr("bar")};
    py::object obj = *obj_ptr;

    ...
}
```

This PR uses the `py::gil_safe_call_once_and_store` function from `pybind11`, which can run arbitrary initialization code only once under the Python GIL thread safely.

```c++
void func() {
    PYBIND11_CONSTINIT static py::gil_safe_call_once_and_store<py::object> storage;
    py::object obj = storage
                         .call_once_and_store_result(
                             []() -> py::object {
                                 return py::module_::import("foo").attr("bar");
                             }
                         )
                         .get_stored();

    ...
}
```

Pull Request resolved: pytorch#130341
Approved by: https://github.com/ezyang
@XuehaiPan XuehaiPan deleted the gil_safe_call_once_and_store branch October 11, 2024 11:38
pytorchmergebot pushed a commit that referenced this pull request Dec 6, 2024
…atches (#141085)

Reland - #139560

As mentioned in #130341, using `static py::object` can lead to segfaults. I suspect this is the reason for the import system error seen internally (https://www.internalfb.com/sevmanager/view/469592). In this PR, I am removing the `static` part. This is fine and also the right thing to do because this will catch if user changes the flag in the same process for compiling two different functions.

Unfortunately, there is no easy way to trigger this segfault, so I can't write a test.

Pull Request resolved: #141085
Approved by: https://github.com/jansel

Co-authored-by: William Wen <williamwen@meta.com>
pytorch-bot bot pushed a commit that referenced this pull request Dec 9, 2024
…atches (#141085)

Reland - #139560

As mentioned in #130341, using `static py::object` can lead to segfaults. I suspect this is the reason for the import system error seen internally (https://www.internalfb.com/sevmanager/view/469592). In this PR, I am removing the `static` part. This is fine and also the right thing to do because this will catch if user changes the flag in the same process for compiling two different functions.

Unfortunately, there is no easy way to trigger this segfault, so I can't write a test.

Pull Request resolved: #141085
Approved by: https://github.com/jansel

Co-authored-by: William Wen <williamwen@meta.com>
pytorchmergebot pushed a commit that referenced this pull request Dec 16, 2024
…atches (#141085)

Reland - #139560

As mentioned in #130341, using `static py::object` can lead to segfaults. I suspect this is the reason for the import system error seen internally (https://www.internalfb.com/sevmanager/view/469592). In this PR, I am removing the `static` part. This is fine and also the right thing to do because this will catch if user changes the flag in the same process for compiling two different functions.

Unfortunately, there is no easy way to trigger this segfault, so I can't write a test.

Pull Request resolved: #141085
Approved by: https://github.com/jansel

Co-authored-by: William Wen <williamwen@meta.com>
aditew01 pushed a commit to aditew01/pytorch that referenced this pull request Dec 18, 2024
…atches (pytorch#141085)

Reland - pytorch#139560

As mentioned in pytorch#130341, using `static py::object` can lead to segfaults. I suspect this is the reason for the import system error seen internally (https://www.internalfb.com/sevmanager/view/469592). In this PR, I am removing the `static` part. This is fine and also the right thing to do because this will catch if user changes the flag in the same process for compiling two different functions.

Unfortunately, there is no easy way to trigger this segfault, so I can't write a test.

Pull Request resolved: pytorch#141085
Approved by: https://github.com/jansel

Co-authored-by: William Wen <williamwen@meta.com>
pytorchmergebot pushed a commit that referenced this pull request Dec 19, 2024
…atches (#141085)

Reland - #139560

As mentioned in #130341, using `static py::object` can lead to segfaults. I suspect this is the reason for the import system error seen internally (https://www.internalfb.com/sevmanager/view/469592). In this PR, I am removing the `static` part. This is fine and also the right thing to do because this will catch if user changes the flag in the same process for compiling two different functions.

Unfortunately, there is no easy way to trigger this segfault, so I can't write a test.

Pull Request resolved: #141085
Approved by: https://github.com/jansel

Co-authored-by: William Wen <williamwen@meta.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
better-engineering Relatively self-contained tasks for better engineering contributors ciflow/inductor ciflow/trunk Trigger trunk jobs on your pull request Merged module: dynamo module: inductor open source release notes: jit release notes category Reverted topic: not user facing topic category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants
0