8000 Fix GPU tests for size>=2 by leofang · Pull Request #60 · mpi4py/mpi4py · GitHub
[go: up one dir, main page]

Skip to content

Fix GPU tests for size>=2 #60

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jul 19, 2021
Merged

Fix GPU tests for size>=2 #60

merged 2 commits into from
Jul 19, 2021

Conversation

leofang
Copy link
Member
@leofang leofang commented Jul 2, 2021

Work in progress.

This PR aims to fix the known test errors when there are more than 1 processes. Now multi-process tests can run fine on the same GPU (verified up tompirun -n 4). A typical symptom of such errors is random outcomes, which happen because the input arrays are not yet fully ready before MPI accesses their buffers via CUDA Array Interface.

@dalcinl
Copy link
Member
dalcinl commented Jul 3, 2021

You should make the required sync call in the as_raw() method of the backed array type.
If the method is not there, just implement it to override the one in the base class.

In short, I believe the following 8000 patch should be very close to all what you need:

diff --git a/test/arrayimpl.py b/test/arrayimpl.py
index 5df8cd9..e6b0dbe 100644
--- a/test/arrayimpl.py
+++ b/test/arrayimpl.py
@@ -327,6 +327,10 @@ if cupy is not None:
             else:
                 self.array[:] = cupy.asarray(arg, typecode)
 
+        def as_raw(self):
+            cupy.cuda.get_current_stream().synchronize()
+            return self.array
+
         @property
         def address(self):
             return self.array.__cuda_array_interface__['data'][0]

@dalcinl
Copy link
Member
dalcinl commented Jul 3, 2021

Beware of #61, you should rebase your branch and update this PR.

@dalcinl
Copy link
Member
dalcinl commented Jul 8, 2021

@leofang Did you have a chance to try my suggestion?

leofang and others added 2 commits July 19, 2021 03:41
Co-authored-by: Lisandro Dalcin <dalcinl@gmail.com>
@leofang leofang changed the title [WIP] Fix GPU tests for size>=2 Fix GPU tests for size>=2 Jul 19, 2021
@leofang leofang marked this pull request as ready for review July 19, 2021 07:42
@leofang
Copy link
Member Author
leofang commented Jul 19, 2021

You should make the required sync call in the as_raw() method of the backed array type.
If the method is not there, just implement it to override the one in the base class.

Sorry for late reply @dalcinl, I believe you're absolutely right here. I overlooked that as_raw() is always called and agree your patch is the simplest fix. I've just force-pushed and confirmed locally with --cupy and/or --numba set.

@dalcinl
Copy link
Member
dalcinl commented Jul 19, 2021

I guess these Open MPI test failures are regressions, right? I mean, they used to work before, right? What I do not like about using a plain "openmpi" without any version spec is that these tests will be disabled forever. On the other side, reporting regressions at patch releases, over and over again, is really tiring. Thoughts?

@leofang
Copy link
Member Author
leofang commented Jul 19, 2021

Frankly I don't know if they are regressions or me not doing the job right in the very beginning...😅 Look, we just added the synchronization to ensure the correctness, so perhaps previously we simply did it wrong but got lucky?

My take would be to merge as is, and revisit it later if we want to adopt cirun.io to run serious GPU tests on the CI. Currently the new skips are only in effect when we run the GPU tests locally.

@dalcinl dalcinl merged commit 533acdb into mpi4py:master Jul 19, 2021
@leofang leofang deleted the fix_gpu_test branch July 19, 2021 16:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants
0