ENH: Add nvidia warp GPU grow cut by pieper · Pull Request #8767 · Slicer/Slicer

Conversation

@pieper
Member
@pieper pieper commented Oct 5, 2025

This adds a Warp GPU implementation of the underlying GrowCut algorithm as an option if support is available (Warp with CUDA). It also offers a CPU version based on Warp, but this may be slower than the vtkITK version.

In informal testing the Warp/CUDA version appears to be much faster than the vtkITK version. Results appear to be identical. Performance will vary based on the GPU used.
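For readers new to Warp, here is a minimal sketch of the pattern this backend option relies on (illustrative only, not code from this PR): the same @wp.kernel runs on CUDA when a compatible GPU is present and falls back to Warp's CPU backend otherwise.

```python
import warp as wp

wp.init()

# Use the CUDA backend when a compatible GPU is present,
# otherwise fall back to Warp's CPU backend.
device = "cuda" if wp.is_cuda_available() else "cpu"

@wp.kernel
def saxpy(x: wp.array(dtype=float), y: wp.array(dtype=float), a: float):
    i = wp.tid()            # one thread per element
    y[i] = a * x[i] + y[i]

n = 1 << 20
x = wp.ones(n, dtype=float, device=device)
y = wp.zeros(n, dtype=float, device=device)
wp.launch(saxpy, dim=n, inputs=[x, y, 2.0], device=device)
```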

This code was ported from an earlier implementation based on OpenCL:

https://github.com/pieper/SlicerCL/blob/3d04661ef7f225edf00292a92088db902787a016/GrowCutCL/GrowCutCL.cl.in

The Warp version and changes to the GrowFromSeeds effect were implemented with the help of Google Gemini.

@pieper pieper requested a review from lassoan October 5, 2025 16:14
@pieper
Member Author
pieper commented Oct 5, 2025

@lassoan can you see if this works for you on Windows? I tested on Mac (CPU only) and Linux (GPU) and it seems to work well. The editor integration in the GUI makes sense to me, but let me know if you have suggestions. This could wait until after the 5.10 release for more testing.

@lassoan
Contributor
lassoan commented Oct 5, 2025

It worked well on my computer on Windows, too.

Thank you @pieper for working on this. It is really interesting and impactful for two reasons:

  • Python developers can now implement fast per-pixel processing without needing a C++ compiler or relying on cppyy, a niche one-man project.
  • Algorithms can run on the GPU (while still working on just the CPU) without learning unfamiliar syntax or strange language constructs (see the sketch below).

Due to the CUDA dependency this cannot be the ultimate solution for parallelized algorithm development on GPU in Python, but it should be usable in many cases.
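As an illustration of the per-pixel point above, a minimal Warp sketch (hypothetical, not from this PR) that thresholds a volume with one GPU thread per voxel:

```python
import numpy as np
import warp as wp

wp.init()
device = "cuda" if wp.is_cuda_available() else "cpu"

@wp.kernel
def threshold(volume: wp.array3d(dtype=float),
              labels: wp.array3d(dtype=wp.int32),
              level: float):
    i, j, k = wp.tid()           # one thread per voxel
    if volume[i, j, k] > level:
        labels[i, j, k] = 1
    else:
        labels[i, j, k] = 0

vol_np = np.random.rand(128, 128, 128).astype(np.float32)
volume = wp.array(vol_np, dtype=float, device=device)
labels = wp.zeros(vol_np.shape, dtype=wp.int32, device=device)
wp.launch(threshold, dim=volume.shape, inputs=[volume, labels, 0.5], device=device)
```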

A few notes:

  1. The speed improvement was significant, but also somewhat disappointing. Segmenting the liver on the CTLiver data set with the classic method on a single CPU thread took 2 minutes to initialize, while with Warp on an "NVIDIA GeForce RTX 3090" (24 GiB, sm_86, mempool enabled) it took 46 seconds. I would have expected a 10-100x speed improvement on a powerful GPU compared to a single CPU thread. The disappointing speed-up is probably because this particular algorithm is very hard to parallelize; the difference could be massive for other algorithms.

  2. I could not test the full functionality (and the speed-up of updates) because updates did not work with warp. After I painted more seeds, the displayed preview image did not change, although some processing was done, as these messages were logged:

Warp Grow-cut converged after 1 iterations.
Warp Grow-cut on volume of 512x512x481 voxels was completed in 0.0 seconds.
  3. The warp implementation does not support masking, which is an essential feature for many people. It would be quite confusing for users if the warp backend ignored masking settings. So, it would be nice to add this in this PR.

  4. A few small issues:

  • The backend name should not be VTK, as the algorithm is implemented in ITK and the name is not very specific anyway. We could use vtkITKGrowCut instead.
  • It would be nice to avoid installing all those unnecessary packages (usd-core, matplotlib, pyglet) and their dependencies that warp-lang[extras] brings in. Do we really need [extras]? Couldn't we just install warp-lang?
  • Probably we should disable backend changes after initialization is completed (to avoid the need to implement logic for handling backend changes after initialization).
  • It would be nice to add support for a distance penalty. Without it, it is not possible to segment near-homogeneous regions, because homogeneous regions get randomly occupied by whichever seed gets there first. This does not have to be implemented in this PR, but it would be very simple to do: you just add a constant DistancePenalty*StepDistance to sampleDiff (a sketch follows this list).
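A hedged sketch of how the penalty could slot into one brute-force grow step in Warp (hypothetical names such as grow_step, strength_in, and distance_penalty; the PR's actual kernel may differ). It would be launched with wp.launch(grow_step, dim=volume.shape, inputs=[...]):

```python
import warp as wp

wp.init()

@wp.kernel
def grow_step(volume: wp.array3d(dtype=float),
              labels_in: wp.array3d(dtype=wp.int32),
              strength_in: wp.array3d(dtype=float),
              labels_out: wp.array3d(dtype=wp.int32),
              strength_out: wp.array3d(dtype=float),
              distance_penalty: float):
    i, j, k = wp.tid()
    best_label = labels_in[i, j, k]
    best_strength = strength_in[i, j, k]
    here = volume[i, j, k]
    for di in range(-1, 2):
        for dj in range(-1, 2):
            for dk in range(-1, 2):
                ni = i + di
                nj = j + dj
                nk = k + dk
                if (ni >= 0 and nj >= 0 and nk >= 0
                        and ni < volume.shape[0]
                        and nj < volume.shape[1]
                        and nk < volume.shape[2]):
                    step_distance = wp.sqrt(float(di * di + dj * dj + dk * dk))
                    sample_diff = wp.abs(here - volume[ni, nj, nk])
                    # the suggested penalty: a constant cost per unit of travel,
                    # so growth through homogeneous regions is no longer free
                    sample_diff = sample_diff + distance_penalty * step_distance
                    attack = strength_in[ni, nj, nk] - sample_diff
                    if attack > best_strength:
                        best_strength = attack
                        best_label = labels_in[ni, nj, nk]
    labels_out[i, j, k] = best_label
    strength_out[i, j, k] = best_strength
```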

@lassoan
Contributor
lassoan commented Oct 5, 2025

Unfortunately, the warp CPU fallback is just a theoretical possibility (at least for this algorithm). On my computer, the same segmentation that took less than 1 minute using warp on the GPU and 2 minutes using the classic CPU method took more than 1 hour with the warp CPU implementation (potentially much more than that; I just stopped it at 1 hour).

@pieper
Member Author
pieper commented Oct 5, 2025

Thanks for the comments @lassoan. I would not have started the PR on its own, but I'm planning to do some other experiments with Warp, so I tried this as a test case. Since this pretty much worked after honestly just a couple of hours, I figured we should consider how close it is to working and how much benefit it would bring. The suggestions you made sound pretty straightforward, so I can see if I can find a little more time for this. Or maybe someone else, and someone else's AI, wants to have a go.

One thing I'd like you to retest is using the CUDA version a second time: the code is compiled and cached, so it should be much faster the second time.
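A small sketch of how one could check this (a hypothetical timing harness, not code from this PR): the first launch pays Warp's module compilation cost, later launches hit the kernel cache.

```python
import time
import warp as wp

wp.init()
device = "cuda" if wp.is_cuda_available() else "cpu"

@wp.kernel
def double(a: wp.array(dtype=float)):
    i = wp.tid()
    a[i] = 2.0 * a[i]

a = wp.ones(1 << 20, dtype=float, device=device)
for run in range(3):
    t0 = time.perf_counter()
    wp.launch(double, dim=a.shape[0], inputs=[a], device=device)
    wp.synchronize()             # wait for the GPU before reading the clock
    print(f"run {run}: {time.perf_counter() - t0:.4f} s")
# run 0 includes compilation (or loading the on-disk cache); later runs do not
```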

I didn't run exact timings, but for me on a large volume the difference between vtkITK and Warp was substantial, and it is probably nonlinear with respect to the number of voxels and some features of the segmentation task. I was running on a 5090 GPU, so that would make a difference too. It should be possible to rig up some performance tests, shown conditionally only in developer mode, so we could try on different hardware easily.

Regarding the CUDA issue: yes, it is a requirement right now for speed, and the CPU fallback is slow. The main reason is that the GPU implementation is brute force, compared to the "ShortCut" optimization used in the vtkITK implementation, so I wouldn't expect this to be a good benchmark for other algorithms. Developing and implementing the ShortCut optimizations was a huge effort compared to implementing this GPU version. But of course, if the ShortCut method were implemented in Warp, it could be even faster.

Also, for whatever reason the warp C++ code ended up running on a single core, which may be something we can fix. I really left this in only for comparison, but we should probably hide it in non-developer mode.

Even though I generally prefer non-CUDA implementations, when there's a solid practical benefit I wouldn't hold back.

Beyond that, until around March of this year Warp was under a CUDA-style license restricting use on non-NVIDIA hardware. Now that it's Apache 2.0, there's more incentive for other people to contribute non-CUDA back ends (perhaps via HIPIFY).

Of course, with nnInteractive we probably have less motivation to speed up GrowCut.

@Thibault-Pelletier
Contributor

Thanks for the PR!
The implementation looks very nice; too bad it's NVIDIA-bound...

I was wondering if you had looked into alternatives. Namely, dpctl combined with numba-dpex seems to provide hardware abstraction for this type of processing.

@pieper
Member Author
pieper commented Oct 6, 2025

I was wondering if you had looked into alternatives. Namely, dpctl combined with numba-dpex seems to provide hardware abstraction for this type of processing.

I've only tested Warp so far, but these alternatives sound nice. It's great to see these all maturing and being truly open source. GrowCut is simple enough that it should be easy to add different back ends and add performance tests to see how they behave on real data.
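One possible shape for that, as a hedged sketch (hypothetical names such as GROW_CUT_BACKENDS and register_backend, not the PR's actual structure): register each backend behind a common call signature and time them all on the same input.

```python
import time

# hypothetical registry mapping a backend name to a callable
# with the signature grow_cut(volume, seeds) -> labels
GROW_CUT_BACKENDS = {}

def register_backend(name):
    def decorator(fn):
        GROW_CUT_BACKENDS[name] = fn
        return fn
    return decorator

@register_backend("vtkITKGrowCut")
def grow_cut_vtkitk(volume, seeds):
    raise NotImplementedError  # placeholder for the classic implementation

@register_backend("warp")
def grow_cut_warp(volume, seeds):
    raise NotImplementedError  # placeholder for the Warp implementation

def benchmark(volume, seeds):
    # run every registered backend on identical input and report wall time
    for name, fn in GROW_CUT_BACKENDS.items():
        t0 = time.perf_counter()
        try:
            fn(volume, seeds)
            print(f"{name}: {time.perf_counter() - t0:.2f} s")
        except NotImplementedError:
            print(f"{name}: not available")
```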

@lassoan
Contributor
lassoan commented Oct 6, 2025

One thing I'd like you to retest is using the cuda version a second time since the code is compiled and cached so it should be much faster the second time

I've tested again. The speed is about the same for the first run and subsequent runs. There is a difference between classic and warp in how much the cropped image is expanded beyond the seed regions, which makes a big difference: the warp version does not have the code that adds some extra margin to the cropped image. Therefore, running warp right after starting Slicer takes 16 s. However, if you run the classic version first, which sets the expansion to a higher value, then warp runs in about 47 s.

The performance could potentially be improved by running the expansion not from every voxel but from a set of active voxels; the active voxels are probably well below 1% of all voxels (see the sketch below). If the algorithm ran faster, we could also add some kind of smoothness constraint to reduce noise on the boundary. A multi-resolution scheme could be useful, too (similarly to nnInteractive: first segment at lower resolutions, then refine the boundary at higher resolution). An unmet need is also segmentation of large (4-12 GB) images, which take forever to process with classic growcut. However, as you noted, this algorithm may not be that relevant anymore due to nnInteractive.
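A simplified, mask-based sketch of the active-voxel idea (hypothetical; a true compacted worklist would go further): dilate the previous iteration's "changed" mask by one voxel, then have the grow kernel return early for voxels outside it.

```python
import warp as wp

wp.init()

@wp.kernel
def dilate_changed(changed: wp.array3d(dtype=wp.int32),
                   active: wp.array3d(dtype=wp.int32)):
    # a voxel is active next iteration if anything in its 3x3x3
    # neighborhood changed during this iteration
    i, j, k = wp.tid()
    flag = int(0)
    for di in range(-1, 2):
        for dj in range(-1, 2):
            for dk in range(-1, 2):
                ni = i + di
                nj = j + dj
                nk = k + dk
                if (ni >= 0 and nj >= 0 and nk >= 0
                        and ni < changed.shape[0]
                        and nj < changed.shape[1]
                        and nk < changed.shape[2]):
                    if changed[ni, nj, nk] != 0:
                        flag = 1
    active[i, j, k] = flag
```

The grow kernel would then begin with `if active[i, j, k] == 0: return`, so almost all threads exit immediately once the front has moved on. This still launches one thread per voxel; building a compacted index list of active voxels would be the natural next step.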

@pieper pieper marked this pull request as draft October 6, 2025 14:16