Gaussian splatting viewer written in Vulkan.
The main goal of this project is to maximize rendering speed.
Now that I have achieved satisfactory performance with the Vulkan-based viewer, I would like to catch my breath before the next steps, or stop further development and start a new side project: compression, large-scale scenes, training, etc.
The viewer takes pre-trained vanilla 3DGS models as input.
- Fast rendering speed
  - 350+ FPS at 1600x900 on a high-end GPU (NVIDIA GeForce RTX 4090)
  - 50+ FPS at 1600x900 on a high-end macOS laptop (Apple M2 Pro)
  - 1-1.5x the speed of the SIBR viewer; the difference grows when the scene is zoomed out,
    - because the number of tiles increases, and
    - more splats overlap in a single tile, so the sequential blending operation takes more time
- Using the graphics pipeline
  - Draws Gaussian splats over other opaque objects, interacting with the depth buffer
- 100% GPU tasks
  - No CPU-GPU synchronization within a single frame: while the GPU is working on frame i, the CPU prepares and submits the command buffer for frame i+1. No synchronization on frame i is needed to get the number of visible splats.
  - Indirect sort & draw: sorting and rendering only visible points
  - My Vulkan radix sort implementation
VulkanSDK>=1.2
- Download the latest version from https://vulkan.lunarg.com/ and follow the install instructions.
- 1.3 is recommended, but 1.2 should also work.
cmake>=3.15
- submodules
$ git submodule update --init --recursive
- VulkanMemoryAllocator
- glm
- glfw
- imgui
- argparse
- vulkan_radix_sort: my Vulkan/GLSL implementation of Onesweep and Reduce-then-scan.
$ cmake . -B build
$ cmake --build build --config Release -j
$ ./build/vkgs_viewer # or ./build/Release/vkgs_viewer
$ ./build/vkgs_viewer -i <ply_filepath>
Drag and drop a pre-trained .ply file from the official Gaussian Splatting release, Pre-trained Models (14 GB).
Left drag to rotate.
Right drag to translate.
Left+Right drag to zoom in/out.
WASD, Space to move.
Wheel to zoom in/out.
Ctrl+wheel to change FOV.
- Added SH F16 storage feature: ~20% speed boost on NVIDIA GeForce RTX 4090, ~10% speed boost on MacBook.
- Tested geometry shader: 0-3% speed decrease with geometry shader.
- FPS may vary depending on splat scale, splat distribution, etc.
- NVIDIA GeForce RTX 4090, Windows
  - bicycle.ply (total 6,131,954 points)

    | View | Visible splats | Screen | MSAA | FPS |
    |---|---|---|---|---|
    | view 1 | 1M | 1280x720 | NO | 620 |
    | view 1 | 1M | 1280x720 | 2x | 480 |
    | view 1 | 1M | 1280x720 | 4x | 390 |
    | view 1 | 1M | 1600x900 | NO | 560 |
    | view 1 | 1M | 1600x900 | 2x | 430 |
    | view 1 | 1M | 1600x900 | 4x | 340 |
    | view 2 | 2M | 1280x720 | NO | 470 |
    | view 2 | 2M | 1280x720 | 2x | 400 |
    | view 2 | 2M | 1280x720 | 4x | 330 |
    | view 2 | 2M | 1600x900 | NO | 460 |
    | view 2 | 2M | 1600x900 | 2x | 400 |
    | view 2 | 2M | 1600x900 | 4x | 330 |

  - garden.ply (total 5,834,734 points)

    | View | Visible splats | Screen | MSAA | FPS |
    |---|---|---|---|---|
    | view 1 | 1.5M | 1280x720 | NO | 530 |
    | view 1 | 1.5M | 1600x900 | NO | 500 |
    | view 2 | 2M | 1280x720 | NO | 470 |
    | view 2 | 2M | 1600x900 | NO | 430 |
    | view 3 | 3M | 1280x720 | NO | 370 |
    | view 3 | 3M | 1600x900 | NO | 340 |

  - Disabling MSAA gives a huge FPS boost without any quality loss. MSAA only affects opaque objects other than splats, such as axes and grid.
  - View numbers do not correspond to camera indices in the model JSON; I posed the camera arbitrarily.
  - Small models such as bonsai.ply: 800~1000 FPS.
  - Rendering quads is slightly (0-3%) faster than rendering with a geometry shader.
- Apple M2 Pro
  - macOS is not my main target environment, but to give a rough idea of rendering speed:
  - bicycle.ply (total 6,131,954 points)

    | View | Visible splats | Screen | MSAA | FPS |
    |---|---|---|---|---|
    | view 1 | 1M | 1280x720 | NO | 84 |
    | view 1 | 1M | 1600x900 | NO | 76 |
    | view 1 | 1M | 3200x1800 | NO | 51 |
    | view 2 | 2M | 1280x720 | NO | 54 |
    | view 2 | 2M | 1600x900 | NO | 52 |
    | view 2 | 2M | 3200x1800 | NO | 40 |

    - About 2x the performance reported by UnityGaussianSplatting, 46 FPS at 1200x800 on an Apple M1 Max (note that my laptop is an M2 Pro).
  - garden.ply (total 5,834,734 points)

    | View | Visible splats | Screen | MSAA | FPS |
    |---|---|---|---|---|
    | view 1 | 1.5M | 1280x720 | NO | 78 |
    | view 1 | 1.5M | 1600x900 | NO | 73 |
    | view 1 | 1.5M | 3200x1800 | NO | 48 |
    | view 2 | 2M | 1280x720 | NO | 62 |
    | view 2 | 2M | 1600x900 | NO | 59 |
    | view 2 | 2M | 3200x1800 | NO | 43 |
    | view 3 | 3M | 1280x720 | NO | 44 |
    | view 3 | 3M | 1600x900 | NO | 43 |
    | view 3 | 3M | 3200x1800 | NO | 34 |

  - bonsai.ply: 120 FPS at 1600x900, 100 FPS at 3200x1800.
  - Geometry shader is not available (VkPhysicalDeviceFeatures::geometryShader = false).
Like other web-based viewers, it uses the traditional graphics pipeline, drawing splats projected into 2D screen space.
One benefit of using the graphics pipeline rather than a compute pipeline is that splats can be drawn together with other objects and can use graphics pipeline features such as MSAA.
- (COMPUTE) rank
  - Cull splats outside the view frustum and create key-value pairs to sort, based on view-space depth.
- (COMPUTE) sort
  - Perform a 32-bit key-value radix sort.
  - Indirect dispatch, sorting only visible points. This is a minor gain, since sort time is negligible compared to the projection and rendering steps.
- (COMPUTE) inverse
  - Create an inverse index map from the sorted indices.
  - This gives a sequential memory access pattern in the next step.
- (COMPUTE) projection
  - Calculate the 3D-to-2D Gaussian splat projection, and the color using spherical harmonics.
  - Using F16 spherical harmonics increased rendering speed.
- (GRAPHICS) rendering
  - Simply draw 2D Gaussian quads.
  - Speed up with indirect rendering, issuing only visible splats to the draw command, which reduces the number of shader invocations.
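
The rank, sort and inverse steps can be illustrated with a small CPU-side sketch. This is not the viewer's shader code: every function name here is hypothetical, a plain near/far depth test stands in for full frustum culling, Python's `sorted` stands in for the GPU radix sort, and the ascending sort direction and the exact meaning of the inverse map (splat index to sorted position) are assumptions.

```python
# CPU-side sketch of rank -> sort -> inverse. Illustrative only; all names
# are hypothetical and the real work happens in GPU compute shaders.

def rank(depths, near, far):
    """Cull by depth and emit (key, splat_index) pairs with 32-bit integer keys."""
    pairs = []
    for index, depth in enumerate(depths):
        if near <= depth <= far:  # assumption: stands in for frustum culling
            t = (depth - near) / (far - near)
            pairs.append((int(t * 0xFFFFFFFF), index))
    return pairs


def sort_pairs(pairs):
    """Stand-in for the key-value radix sort (ascending key, stable)."""
    return sorted(pairs, key=lambda pair: pair[0])


def inverse(sorted_pairs, splat_count):
    """Map each splat index to its position in sorted order (-1 if culled)."""
    inverse_map = [-1] * splat_count
    for position, (_, index) in enumerate(sorted_pairs):
        inverse_map[index] = position
    return inverse_map


depths = [5.0, 0.05, 2.0, 120.0, 3.5]  # view-space depth per splat
visible = sort_pairs(rank(depths, near=0.1, far=100.0))
inverse_map = inverse(visible, len(depths))
print(len(visible), inverse_map)  # prints: 3 [2, -1, 0, -1, 1]
```

With such a map, a later pass can walk the splats in their original storage order and write each result to its sorted slot, which is one way to read the "sequential memory access" remark above.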
The projection and rendering steps are the bottlenecks.
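
For reference, the core of the projection step in vanilla 3DGS is the covariance projection Σ' = J W Σ Wᵀ Jᵀ, where W is the view rotation and J is the Jacobian of the perspective projection. The sketch below is a minimal pure-Python version of that formula, not this viewer's shader. The function names are hypothetical, the camera is assumed to look along +z with focal lengths `fx` and `fy` in pixels, and the screen-space low-pass filter used by the reference implementation is omitted.

```python
# Minimal sketch of the 3D-to-2D covariance projection (EWA splatting form).
# Assumptions: camera looks along +z; fx, fy are focal lengths in pixels.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(m):
    return [list(row) for row in zip(*m)]


def project_covariance(cov3d, view_rotation, position_view, fx, fy):
    """Return the 2x2 screen-space covariance of a 3D Gaussian.

    cov3d:          3x3 world-space covariance
    view_rotation:  3x3 rotation part of the world-to-view transform (W)
    position_view:  splat center in view space (x, y, z), z > 0
    """
    x, y, z = position_view
    # Jacobian of the perspective projection (u, v) = (fx * x / z, fy * y / z).
    jacobian = [
        [fx / z, 0.0, -fx * x / (z * z)],
        [0.0, fy / z, -fy * y / (z * z)],
    ]
    t = matmul(jacobian, view_rotation)            # 2x3
    return matmul(matmul(t, cov3d), transpose(t))  # 2x2


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
isotropic = [[0.01, 0.0, 0.0], [0.0, 0.01, 0.0], [0.0, 0.0, 0.01]]
cov2d = project_covariance(isotropic, identity, (0.0, 0.0, 10.0), 1000.0, 1000.0)
print(cov2d)
```

For an isotropic Gaussian on the optical axis, the result is diagonal with variance (fx / z)² σ², so a splat shrinks on screen with the square of its distance in variance terms.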
The current Onesweep radix sort implementation doesn't seem to work on macOS. See:
https://raphlinus.github.io/gpu/2021/11/17/prefix-sum-portable.html
So I've implemented a reduce-then-scan radix sort. There is no big performance difference, even on NVIDIA GPUs.
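
The idea behind reduce-then-scan can be shown with a sequential sketch of the prefix sum at its core. Onesweep computes the scan in a single pass in which workgroups wait on partial results from earlier workgroups. Reduce-then-scan instead uses separate passes, so no workgroup has to wait on another within a pass. The function below is a hypothetical CPU illustration over fixed-size blocks, not the vulkan_radix_sort code; in a radix sort the scanned values would be per-digit histogram counts.

```python
# Sequential sketch of a reduce-then-scan exclusive prefix sum.
# Each "block" plays the role of one GPU workgroup.

def reduce_then_scan(values, block_size):
    """Exclusive prefix sum computed in three passes over fixed-size blocks."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]

    # Pass 1 (reduce): one partial sum per block; blocks are independent.
    block_sums = [sum(block) for block in blocks]

    # Pass 2 (scan of partials): exclusive scan over the per-block sums.
    block_offsets = []
    running = 0
    for block_sum in block_sums:
        block_offsets.append(running)
        running += block_sum

    # Pass 3 (scan): each block scans locally, starting from its own offset.
    result = []
    for block, offset in zip(blocks, block_offsets):
        running = offset
        for value in block:
            result.append(running)
            running += value
    return result


print(reduce_then_scan([3, 1, 4, 1, 5, 9, 2, 6], block_size=3))
# prints: [0, 3, 4, 8, 9, 14, 23, 25]
```

The cost of this layout is that the input is read twice (once to reduce, once to scan), which is the usual trade-off against a single-pass scan.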
- https://github.com/aras-p/UnityGaussianSplatting : Performance report, probably similar rendering pipeline
- https://github.com/shg8/3DGS.cpp : Vulkan viewer, but tile-based rendering with compute shader.
- https://github.com/dendenxu/fast-gaussian-rasterization : Very similar rendering approach. They used geometry shader, while I used storage buffer in vertex shader to save memory.
- Order Independent Transparency (OIT) doesn't work. I've tried Weighted Blended OIT (WBOIT). Many nearly opaque splats overlap in a pixel, so colors are blended in an unsatisfactory manner. More importantly, OIT is slow.
- Rendering Gaussian splats with 4x MSAA is slow. Turning MSAA off makes rendering about 2x faster.
- I've tried 4x MSAA and depth resolve for opaque objects in the first subpass and Gaussian splat rendering with no MSAA in the second subpass, where 4x MSAA color/depth images are resolved to 1x MSAA images. Multisample colors are blended with the background color into a pixel.
- Writing directly to Vulkan-CUDA mapped memory from a kernel is slower than memcpy (3.2 ms vs. 1 ms for a 1600x900 rgba32 image). Regardless, it is better to manipulate the swapchain image only in Vulkan; 1 ms of copy cost is too much.
- Rendering a triangle list is 0-3% faster than using a geometry shader. Also, geometry shaders are not available on macOS Metal/MoltenVK. Rendering a triangle list is the better choice.
- Using SH F16 storage increases speed by 20% on NVIDIA GeForce RTX 4090, 10% on Apple M2 Pro.
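
To give a feel for what F16 storage changes, here is a small sketch that packs one splat's spherical harmonics coefficients as 32-bit and 16-bit floats. It assumes the vanilla 3DGS layout of 48 coefficients per splat (degree 3, 16 per color channel); the function names and the synthetic coefficient values are hypothetical, and the viewer's actual buffer layout may differ. It shows only the storage size and round-trip error, not why the renderer gets faster.

```python
import struct

# Sketch: storage cost and precision of SH coefficients in F32 vs. F16.
# Assumption: 48 coefficients per splat (degree-3 SH, 3 color channels).

def pack_sh_f32(coefficients):
    """Pack coefficients as little-endian 32-bit floats."""
    return struct.pack(f"<{len(coefficients)}f", *coefficients)


def pack_sh_f16(coefficients):
    """Pack coefficients as little-endian IEEE 754 half-precision floats."""
    return struct.pack(f"<{len(coefficients)}e", *coefficients)


def unpack_sh_f16(buffer):
    """Decode a half-precision buffer back to Python floats."""
    return list(struct.unpack(f"<{len(buffer) // 2}e", buffer))


# Synthetic coefficients in [-1, 1); real values come from the .ply file.
coefficients = [((i * 37) % 100 - 50) / 50.0 for i in range(48)]

f32_bytes = pack_sh_f32(coefficients)
f16_bytes = pack_sh_f16(coefficients)
restored = unpack_sh_f16(f16_bytes)
max_error = max(abs(a - b) for a, b in zip(coefficients, restored))

print(len(f32_bytes), len(f16_bytes))  # prints: 192 96
print(max_error < 1e-3)                # prints: True
```

Half precision halves the bytes read per splat, at the cost of roughly three decimal digits of precision for values of this magnitude.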
The GUI is created in a separate (non-main) thread. According to the GLFW documentation, windows should be created on the main thread. However, managing windows off the main thread seems to work on Windows and Linux.
Unfortunately, Apple doesn't allow this: Apple's UI frameworks can only be called from the main thread. Here's a related thread by Apple staff.
- Windows or Linux (does not work on macOS)
- conda: cmake, pybind11, cuda-toolkit (CUDA is WIP, not necessary yet)
$ conda create -n pygs python=3.10
$ conda activate pygs
$ conda install conda-forge::cmake
$ conda install conda-forge::pybind11
$ conda install nvidia/label/cuda-12.2.2::cuda-toolkit # or any other version
The Python package dynamically links to the C++ shared library.
So, build the shared library first, then install the Python package.
$ cmake . -B build
$ cmake --build build --config Release -j
$ pip install -e binding/python
$ python
>>> import pygs
>>> pygs.show()
>>> pygs.load("./models/bicycle_30000.ply") # asynchronously load model to viewer
>>> pygs.load("./models/garden_30000.ply")
>>> pygs.close()