ENH: Add nvidia warp GPU grow cut #8767
Conversation
This adds a Warp GPU implementation of the underlying GrowCut algorithm as an option when support is available (Warp with CUDA). It also offers a CPU version based on Warp, but this may be slower than the vtkITK version. In informal testing the Warp/CUDA version appears to be much faster than the vtkITK version, and the results appear to be identical. Performance will vary based on the GPU used.

This code was ported from an earlier implementation based on OpenCL:
https://github.com/pieper/SlicerCL/blob/3d04661ef7f225edf00292a92088db902787a016/GrowCutCL/GrowCutCL.cl.in

The Warp version and the changes to the GrowFromSeeds effect were implemented with the help of Google Gemini.
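For readers who haven't used Warp: the core of a port like this is a kernel that applies one GrowCut "attack" step to every voxel, plus a host loop that ping-pongs label/strength buffers between iterations. The sketch below is illustrative rather than the code in this PR; the 6-connected neighborhood, the fixed iteration count, and the input names `image_np`/`seeds_np` are simplifying assumptions (the real effect iterates until the labels converge).

```python
import numpy as np
import warp as wp

wp.init()

@wp.kernel
def grow_cut_step(
    image: wp.array3d(dtype=wp.float32),
    label_in: wp.array3d(dtype=wp.int32),
    strength_in: wp.array3d(dtype=wp.float32),
    label_out: wp.array3d(dtype=wp.int32),
    strength_out: wp.array3d(dtype=wp.float32),
    inv_max_diff: float,  # 1 / (intensity range), precomputed on the host
):
    i, j, k = wp.tid()
    value = image[i, j, k]
    label = label_in[i, j, k]
    strength = strength_in[i, j, k]
    # Each 6-connected neighbor "attacks" this voxel: a neighbor wins when its
    # strength, attenuated by intensity similarity, exceeds the current strength.
    for n in range(6):
        ni = i
        nj = j
        nk = k
        if n == 0:
            ni = i - 1
        elif n == 1:
            ni = i + 1
        elif n == 2:
            nj = j - 1
        elif n == 3:
            nj = j + 1
        elif n == 4:
            nk = k - 1
        else:
            nk = k + 1
        if ni >= 0 and ni < image.shape[0] and nj >= 0 and nj < image.shape[1] and nk >= 0 and nk < image.shape[2]:
            g = 1.0 - wp.abs(image[ni, nj, nk] - value) * inv_max_diff
            attack = g * strength_in[ni, nj, nk]
            if attack > strength:
                strength = attack
                label = label_in[ni, nj, nk]
    label_out[i, j, k] = label
    strength_out[i, j, k] = strength


# Host side: image_np is a float32 volume, seeds_np holds int32 seed labels (0 = unlabeled).
device = "cuda" if wp.is_cuda_available() else "cpu"
image_wp = wp.array(image_np, dtype=wp.float32, device=device)
label_a = wp.array(seeds_np, dtype=wp.int32, device=device)
label_b = wp.zeros_like(label_a)
strength_a = wp.array((seeds_np != 0).astype(np.float32), device=device)
strength_b = wp.zeros_like(strength_a)
inv_max_diff = 1.0 / max(float(image_np.max() - image_np.min()), 1e-6)

for _ in range(200):  # illustrative fixed count; real code would check for convergence
    wp.launch(grow_cut_step, dim=image_wp.shape, device=device,
              inputs=[image_wp, label_a, strength_a, label_b, strength_b, inv_max_diff])
    label_a, label_b = label_b, label_a
    strength_a, strength_b = strength_b, strength_a

result = label_a.numpy()  # final labels back on the host
```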
@lassoan can you see if this works for you on Windows? I tested on Mac (CPU only) and Linux (GPU) and it seems to work well. The editor integration in the GUI makes sense to me, but let me know if you have suggestions. This could wait until after the 5.10 release for more testing.
It worked well on my computer on Windows, too. Thank you @pieper for working on this. It is really interesting and impactful for two reasons:
Due to the CUDA dependency, this cannot be the ultimate solution for parallelized GPU algorithm development in Python, but it should be usable in many cases. A few notes:
Unfortunately, the Warp CPU fallback is only a theoretical possibility (at least for this algorithm). On my computer, a segmentation that took less than 1 minute using Warp on the GPU and 2 minutes using the classic CPU implementation took more than 1 hour using the Warp CPU implementation (potentially much more; I stopped it at 1 hour).
Thanks for the comments @lassoan. I wouldn't have started this PR on its own, but I'm planning to do some other experiments with Warp, so I tried this as a test case. Since it pretty much worked after honestly just a couple of hours, I figured we should consider how close it is to working and how much benefit it would bring. The suggestions you made sound pretty straightforward, so I can see if I can find a little more time for this. Or maybe someone else, and someone else's AI, wants to have a go.

One thing I'd like you to retest is running the CUDA version a second time: the code is compiled and cached, so it should be much faster on the second run. I didn't run exact timings, but for me on a large volume the difference between vtkITK and Warp was substantial, and it is probably nonlinear with respect to the number of voxels and some features of the segmentation task. I was running on a 5090 GPU, so that would make a difference too. It should be possible to rig up some performance tests, conditionally shown only in developer mode, so we could try different hardware easily.

Regarding the CUDA issue: yes, it is a requirement right now for speed, and the CPU fallback is slow. The main reason is that the GPU implementation is brute force, compared to the "ShortCut" optimization used in the vtkITK implementation, so I wouldn't expect this to be a good benchmark for other algorithms. Developing and implementing the ShortCut optimizations was a huge effort compared to implementing this GPU version; but of course, if the ShortCut method were implemented in Warp it could be even faster. Also, for whatever reason the Warp C++ code ended up running on a single core, which may be something we can fix. I really left it in only for comparison, and we should probably hide it in non-developer mode.

Beyond that, until around March of this year Warp was under a CUDA-style license restricting use on non-NVIDIA hardware. Now that it's Apache 2.0, there's more incentive for other people to contribute non-CUDA back ends (perhaps via HIPify). Of course, with nnInteractive we probably have less motivation to speed up GrowCut.
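To make that retest concrete: one way to separate compile time from run time is to time two consecutive launches, since Warp compiles a kernel on its first launch and caches the binary on disk afterwards. A rough sketch, assuming the hypothetical `grow_cut_step` kernel and a `dims`/`args` launch configuration like the one sketched above:

```python
import time
import warp as wp

wp.init()
device = "cuda" if wp.is_cuda_available() else "cpu"

def timed_launch(tag, dims, args):
    t0 = time.perf_counter()
    wp.launch(grow_cut_step, dim=dims, inputs=args, device=device)
    wp.synchronize()  # launches are asynchronous; wait before reading the clock
    print(f"{tag}: {time.perf_counter() - t0:.3f} s")

timed_launch("first launch (includes kernel compilation)", dims, args)
timed_launch("second launch (cached kernel)", dims, args)
```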
Thanks for the PR! I was wondering if you had looked into alternatives. Namely, dpctl combined with numba-dpex seems to provide hardware abstraction for this type of processing.
I've only tested Warp so far, but these alternatives sound nice. It's great to see all of these mature and become genuinely open source. GrowCut is simple enough that it should be easy to add different back ends, plus performance tests to see how they behave on real data.
I've tested again. The speed is about the same for the first run and subsequent runs. There is a difference between the classic and Warp versions in how much the cropped image is expanded beyond the seed regions, and that makes a big difference: the Warp version does not have the code that adds some extra margin to the cropped image. Therefore, if you use Warp right after starting Slicer it runs in 16 s; but if you run the classic version first, which sets the expansion to a higher value, then Warp takes about 47 s.

The performance could potentially be improved by running the expansion not from every voxel but from a set of active voxels; the active voxels are probably well below 1% of all voxels. If the algorithm ran faster we could also add some kind of smoothness constraint to reduce noise on the boundary. A multi-resolution scheme could be useful too (similarly to nnInteractive: first segment at lower resolution, then refine the boundary at higher resolution). An unmet need is also segmentation of large (4-12 GB) images, which take forever to process with classic GrowCut. However, as you noted, this algorithm may not be that relevant anymore due to nnInteractive.
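The active-voxel idea could be prototyped with a per-voxel "changed" flag that is ping-ponged alongside the label and strength buffers, so the kernel does almost no work for voxels with no recently changed 6-neighbor. A rough, hypothetical sketch building on the dense kernel above (not code from this PR):

```python
@wp.kernel
def grow_cut_active_step(
    image: wp.array3d(dtype=wp.float32),
    label_in: wp.array3d(dtype=wp.int32),
    strength_in: wp.array3d(dtype=wp.float32),
    label_out: wp.array3d(dtype=wp.int32),
    strength_out: wp.array3d(dtype=wp.float32),
    changed_in: wp.array3d(dtype=wp.int32),   # 1 where the previous pass changed a voxel
    changed_out: wp.array3d(dtype=wp.int32),
    inv_max_diff: float,
):
    i, j, k = wp.tid()
    # Default: copy state through unchanged.
    label = label_in[i, j, k]
    strength = strength_in[i, j, k]
    changed = 0

    # A voxel can only flip if it or one of its 6-neighbors changed last pass,
    # so most voxels do only these reads once the propagation front has passed.
    near = changed_in[i, j, k]
    if i > 0:
        near = near + changed_in[i - 1, j, k]
    if i < image.shape[0] - 1:
        near = near + changed_in[i + 1, j, k]
    if j > 0:
        near = near + changed_in[i, j - 1, k]
    if j < image.shape[1] - 1:
        near = near + changed_in[i, j + 1, k]
    if k > 0:
        near = near + changed_in[i, j, k - 1]
    if k < image.shape[2] - 1:
        near = near + changed_in[i, j, k + 1]

    if near > 0:
        value = image[i, j, k]
        # Same neighbor-attack update as the dense kernel above.
        for n in range(6):
            ni = i
            nj = j
            nk = k
            if n == 0:
                ni = i - 1
            elif n == 1:
                ni = i + 1
            elif n == 2:
                nj = j - 1
            elif n == 3:
                nj = j + 1
            elif n == 4:
                nk = k - 1
            else:
                nk = k + 1
            if ni >= 0 and ni < image.shape[0] and nj >= 0 and nj < image.shape[1] and nk >= 0 and nk < image.shape[2]:
                g = 1.0 - wp.abs(image[ni, nj, nk] - value) * inv_max_diff
                attack = g * strength_in[ni, nj, nk]
                if attack > strength:
                    strength = attack
                    label = label_in[ni, nj, nk]
                    changed = 1

    label_out[i, j, k] = label
    strength_out[i, j, k] = strength
    changed_out[i, j, k] = changed
```

On the host, `changed_in` would start as all ones (or ones at the seed voxels), and the loop can stop once `changed_out.numpy().sum()` reaches zero; a further step would be compacting the active set into an index list so that only those voxels are launched at all.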