Current Generation Parallelism
In Games
Jon Olick
id Software
Beyond Programmable Shading: In Action
Brief History of Parallelism
• 1 Processor
– The good old days.
– Why parallelize? Just wait a little and your programs
will get faster.
Beyond Programmable Shading: In Action
Brief History of Parallelism
• 2 to 3 Processors
– Logical splitting of game process into pipelined
pieces.
• Game
• Rendering
• Sound
• Loading/Decompression
Beyond Programmable Shading: In Action
Brief History of Parallelism
• About 6 to 8 Processors
– The transition to a job scheduling type architecture
– 1st order parallelism
• Game
• Rendering
• Sound
• Physics
• Collision
• Loading/Decompression
• Etc…
Beyond Programmable Shading: In Action
Brief History of Parallelism
• About 8 to 16 Processors
– End of CPU history.
– Enter 1998 in GPU history.
• Approx # of processors as average parallel scalar
operations.
– 2nd order parallelism
– Jobs which create and manage the resources of
other jobs.
• GPU Command Processor (DMA engine)
Beyond Programmable Shading: In Action
Brief History of Parallelism
• About 16+ processors
– 3rd order parallelism
– GPU Vertex Processors
– Jobs which create and manage the resources of
other jobs which create and manage the resources of
other jobs
Beyond Programmable Shading: In Action
Brief History of Parallelism
Riva 128 Riva TNT GeForce 256
NVidia
300
250
200
Approx # of CPUs
150
Nvidia
100
50
0
1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008
Beyond Programmable Shading: In Action
Current State of Parallelism
• Desktop Processors
– Intel Core 2 Quad, 4 processors, 3.2 ghz, 102 Gflops
• Soon to be 8 core?
• Multimedia Processors
– Cell Processor - 8 processors - 3.2 ghz - 192.0 Gflops
• 1 main, 7 co-processors
• Graphics Accelerators
– GTX 280 - 1.296 ghz – 0.933 Tflops
• 240 stream processors
Beyond Programmable Shading: In Action
CELL BROADBAND ENGINE™
Beyond Programmable Shading: In Action
PLAYSTATION®3 Processor Overview
• Game
• Animation
• Geometry Processing
• Post Processing
• Occlusion Rasterization
• Sorting
• Collision Detection
• Fourier Transform
• (De)Compression
• Not going to cover all of these…
Beyond Programmable Shading: In Action
PLAYSTATION®3 Processor Overview
• Parallelize ordinarily sequential CPU
processing
• Assist in what is typically considered GPU
processing
Beyond Programmable Shading: In Action
Primary Programming Challenges
• Fitting code and data in the 256k local co-
processor memory
– Best solutions are ones that don't treat the 256k
local store as a typical on demand caching
architecture
• Scattered reads/writes bad, sequential reads/writes
good
• Software Pipelining
• Only 16 byte aligned reads/writes
• Synchronization
Beyond Programmable Shading: In Action
MD6 ANIMATION PROCESSING
Beyond Programmable Shading: In Action
MD6 Animation Processing
Game
Logic
Beyond Programmable Shading: In Action
MD6 Animation Processing
Game
Logic
Blending Tree
Generation
Beyond Programmable Shading: In Action
MD6 Animation Processing
Game
Logic
Blending Tree
Generation
Low Level Operation
List Generation
Beyond Programmable Shading: In Action
MD6 Animation Processing
Game
Logic
Blending Tree
Generation
Low Level Operation
List Generation
Low Level Operation Execution
Beyond Programmable Shading: In Action
MD6 Animation Processing
Serial
Game
Logic
Blending Tree
Generation
Low Level Operation
List Generation
Parallel
Low Level Operation Execution
Beyond Programmable Shading: In Action
MD6 Animation Webs
• Separates Thinking from Representation
– Game Object says what it wants to look like.
– Animation Webs take care of the rest.
• Unstructured graph
– Each node has a blend tree
• Designed with simplicity in mind
– Animators should animate, not fiddle with nodes.
– Extract as much information as possible directly
from the animation data.
Beyond Programmable Shading: In Action
MD6 Animation Processing
• Additive Blending
• Subtractive Blending
• Animation Algebra
– Blend Equations
• Animation blending trees in the form of an equation.
• Example equation:
– (animA + animB) – animC
Beyond Programmable Shading: In Action
Partial Animation Blending
• Generalized play an animation only on the
face, torso, etc…
• One weight per joint per animation
• Compute alpha for slerp via following
equation:
– For each joint
• Let w0 = weight of joint in animation A
• Let w1 = weight of joint in animation B
• If(w1 > w0)
– Let alpha = (alpha * w1) / w0
• Else
– Let alpha = ((w1 – w0) + alpha * w0) / w1
Beyond Programmable Shading: In Action
Varying parameter treatment
Beyond Programmable Shading: In Action
Varying parameter treatment
Beyond Programmable Shading: In Action
Varying parameter treatment
16k 16k 16k
Beyond Programmable Shading: In Action
GEOMETRY PROCESSING
Beyond Programmable Shading: In Action
Two modes of usage
• Primary mode
– Use offline tools
– Partition into vertex sets
– Use indexed triangles
– All features of pipeline can be used
SPU
Beyond Programmable Shading: In Action
Two modes of usage (cont)
• Secondary mode
– Data generated by other tools
– Formats other than indexed triangles
– Non-partitioned objects
– Subset of pipeline features can be used
SPU
Beyond Programmable Shading: In Action
SPU Geometry Pipeline Stages
SPU Pipeline
Vertex Decompress
Index Decompress
Blend Shapes
Skinning
Progressive Mesh
Triangle Culling
Compression
Output
Beyond Programmable Shading: In Action
Vertex Decompression
SPU Pipeline
Vertex Decompress
Index Decompress
Blend Shapes
Skinning
Progressive Mesh
Triangle Culling
Compression
Output
Beyond Programmable Shading: In Action
Vertex Attributes
Unique Vertex
Array 0
Instance Vertex
Array 1
Beyond Programmable Shading: In Action
Vertex Decompression
Unique Vertex Float Tables
Array 0
Instance Vertex
Array 1
Beyond Programmable Shading: In Action
24bit Unit Vector
• Smallest 2 compression
– Two smallest components with 10 bits each
• Encoded from –sqrt(2)/2 to +sqrt(2)/2
– Largest component reconstructed via
• Largest = sqrt(1 – smallestA2 – smallestB2)
• One additional bit for sign of largest component.
Beyond Programmable Shading: In Action
24bit Unit Vector
• Smallest 2 compression
– Two smallest components with 10 bits each
• Encoded from –sqrt(2)/2 to +sqrt(2)/2
– Largest component reconstructed via
• Largest = sqrt(1 – smallestA2 – smallestB2)
• One additional bit for sign of largest component.
• One more bit to represent W as +1 or -1
– For constructing bi-normal from normal and tangent.
Beyond Programmable Shading: In Action
N-bit Fixed Point with integer offsets
• Simple n.x fixed point values
– Per-segment integer offset
• Bit count may vary from attribute to attribute
Beyond Programmable Shading: In Action
Index Decompression
SPU Pipeline
Vertex Decompress
Index Decompress
Blend Shapes
Skinning
Progressive Mesh
Triangle Culling
Compression
Output
Beyond Programmable Shading: In Action
Index Table Construction
• Index table is created by a vertex cache
optimizer
– Based on K-cache algorithm
• Number of vertex program outputs affects
Vertex Cache size.
• Four vertex mini cache most important
optimization factor
Beyond Programmable Shading: In Action
Index Cache Optimizer
• Our vertex cache optimizer produces very
regular index data
0 1 2
0 2 3
3 2 4
3 4 5
6 7 8
9 6 8
10 9 11
9 8 11
Beyond Programmable Shading: In Action
Index Decompression
• Our vertex cache optimizer produces very
regular index data
0 1 2
0 2 3
3 2 4
3 4 5
6 7 8
9 6 8
10 9 11
9 8 11
Beyond Programmable Shading: In Action
Index Decompression
2
1
Triangle Indexes
0 1 2
Beyond Programmable Shading: In Action
Index Decompression
1
0
Triangle Indexes
2 0 1
Beyond Programmable Shading: In Action
Index Decompression
• Before Rotation
0 1 2
0 2 3
3 2 4
3 4 5
6 7 8
9 6 8
10 9 11
9 8 11
Beyond Programmable Shading: In Action
Index Decompression
• After Rotation
0 1 2
0 2 3
3 2 4
3 4 5
6 7 8
6 8 9
9 11 10
11 9 8
Beyond Programmable Shading: In Action
Index Decompression
Beyond Programmable Shading: In Action
Index Decompression
85% compression
6.5 : 1
Beyond Programmable Shading: In Action
Blend Shapes
SPU Pipeline
Vertex Decompress
Index Decompress
Blend Shapes
Skinning
Progressive Mesh
Triangle Culling
Compression
Output
Beyond Programmable Shading: In Action
Blend Shapes in MLB® 08 The Show
Beyond Programmable Shading: In Action
Skinning
SPU Pipeline
Vertex Decompress
Index Decompress
Blend Shapes
Skinning
Progressive Mesh
Triangle Culling
Compression
Output
Beyond Programmable Shading: In Action
Skinning on SPUs
void SkinVs(float4 inPosition : ATTR0, float4 weights : ATTR3,
float4 matrixIndex : ATTR4,
out float4 position : POSITION,
uniform float4 joints[72], uniform float4x4 modelViewProj)
{
position = 0;
for (int i = 0; i < 4; i++)
{
float idx = matrixIndex[i];
float3x4 joint = float3x4(joints[idx+0], joints[idx+1],
joints[idx+2]);
position += weights[i] * mul(joint, inPosition);
}
position = mul(modelViewProj, position);
}
Beyond Programmable Shading: In Action
Skinning on SPUs
30% Performance Improvement
Beyond Programmable Shading: In Action
Skinning on SPUs
30% Performance Improvement
Shadow map generation.... 70%!
Beyond Programmable Shading: In Action
Discrete Progressive Mesh
SPU Pipeline • Smoothly reduces the
Vertex Decompress triangle count as a model
Index Decompress moves into the distance
Blend Shapes • With discrete progressive
Skinning mesh, the LOD calculation is
done once for an entire
Progressive Mesh
object
Triangle Culling
Compression
Output
Beyond Programmable Shading: In Action
At an LOD there are two types of
vertexes
Parent Vertex
Child Vertex
LOD = 0.0
Beyond Programmable Shading: In Action
As the LOD level decreases, the
children “slide” towards their parents
Parent Vertex
Child Vertex
LOD = 0.2
Beyond Programmable Shading: In Action
The children continue to move towards
their parents
Parent Vertex
Child Vertex
LOD = 0.7
Beyond Programmable Shading: In Action
At the next integral LOD, all child
vertexes disappear as do the triangles
Parent Vertex
Child Vertex
LOD = 1.0
Beyond Programmable Shading: In Action
Continuous Progressive Mesh
SPU Pipeline • Like discrete progressive
Vertex Decompress mesh, child vertexes move
Index Decompress smoothly toward their
Blend Shapes parents
Skinning • However, the LOD is
calculated for each vertex
Progressive Mesh
instead of just once for the
Triangle Culling object
Compression
Output
Beyond Programmable Shading: In Action
Vertex set about to undergo continuous
progressive mesh
Parent Vertex
Child Vertex, LOD 1
Child Vertex, LOD 0
Beyond Programmable Shading: In Action
A single vertex set can straddle several
LOD ranges
Parent Vertex
Child Vertex, LOD 1
Child Vertex, LOD 0
LOD = 1.0
LOD = 0.0
Beyond Programmable Shading: In Action
Vertexes move depending on their
distance
Parent Vertex
Child Vertex, LOD 1
Child Vertex, LOD 0
LOD = 1.0
LOD = 0.0
Beyond Programmable Shading: In Action
Triangle Culling
SPU Pipeline
Vertex Decompress
Index Decompress
Blend Shapes
Skinning
Progressive Mesh
Triangle Culling
Compression
Output
Beyond Programmable Shading: In Action
Up to 70% of triangles do not contribute to
final image.
Beyond Programmable Shading: In Action
Off Screen Triangles
Beyond Programmable Shading: In Action
Back Facing Triangles
Beyond Programmable Shading: In Action
Zero Area Triangles
Beyond Programmable Shading: In Action
Zero Area Triangles
Beyond Programmable Shading: In Action
No Pixel Triangles
Beyond Programmable Shading: In Action
Triangle Culling
Beyond Programmable Shading: In Action
Multisampling adds some complications…
Beyond Programmable Shading: In Action
Culled
Beyond Programmable Shading: In Action
Triangle Culling
10% to 20%
Performance Improvement
Beyond Programmable Shading: In Action
Compression for Output
SPU Pipeline
Vertex Decompress
Index Decompress
Blend Shapes
Skinning
Progressive Mesh
Triangle Culling
Compression
Output
Beyond Programmable Shading: In Action
Float Tables
Beyond Programmable Shading: In Action
When done, the vertex attributes are
compressed into one output stream
Float Tables
Output
Vertex Array
Beyond Programmable Shading: In Action
Output Buffering Schemes
SPU Pipeline
Vertex Decompress
Index Decompress
Blend Shapes
Skinning
Progressive Mesh
Triangle Culling
Compression
Output
Beyond Programmable Shading: In Action
Double Buffer
• Each buffer stores vertex
Vertex Vertex and index data for an entire
and and frame
Index Index
Data Data • SPUs atomically access a
for for mutex which is used to
Frame 0 Frame 1
allocate memory from a
buffer
• Uses lots of memory
Beyond Programmable Shading: In Action
It is possible to completely fill a buffer
Vertex
and
Index
Data
Data SPU
Beyond Programmable Shading: In Action
Double buffering adds a frame of lag
Standard Pipeline
Build Jobs Render on
Scan Out
on PPU RSX™
Build Jobs Render on
Scan Out
on PPU RSX™
Build Jobs Render on
Scan Out
on PPU RSX™
Build Jobs Process Jobs Render on
Scan Out
on PPU on SPU RSX™
Build Jobs Process Jobs Render on
Scan Out
on PPU on SPU RSX™
Build Jobs Process Jobs Render on
Scan Out
on PPU on SPU RSX™
Pipeline with Double Buffered SPU Processing
Beyond Programmable Shading: In Action
Single Buffering
• Uses only half the memory!
Vertex • Still possible to completely
and
Index fill the buffer
Data
for
Single
Frame
Beyond Programmable Shading: In Action
Single Buffering has a shorter pipeline
Build Jobs on SPU Processing/
Scan Out
PPU RSX™ Rendering
Build Jobs on SPU Processing/
Scan Out
PPU RSX™ Rendering
Build Jobs on SPU Processing/
Scan Out
PPU RSX™ Rendering
• Vertex and index data is created just-in-time
for the RSX™
• Draw commands are inserted into the
command buffer while the RSX™ is rendering
• Requires tight SPU↔RSX™ synchronization
Beyond Programmable Shading: In Action
SPU↔RSX™ Synchronization
Using Local Stalls
Command Place local stalls in the command buffer
Buffer where necessary
Draw 17 RSX™ will stop processing at a local stall
Local Stall until it is overwritten by new commands
Local Stall SPUs will generally stay ahead of the
RSX™, so stalls rarely occur
Local Stall
Local Stall
Local Stall
Local Stall
Local Stall
Other
Beyond Programmable Shading: In Action
SPU will overwrite local stalls when it
outputs a set of new commands
Command
Buffer
Draw 17
SPU
Local Stall
Local Stall
Local Stall
New
Commands
Local Stall
Local Stall
Local Stall
Local Stall
Put
Other
Pointer
Beyond Programmable Shading: In Action
Ring Buffers
Data
Start of • Small memory footprint
Free Area
• Will not run out of memory
Vertex End of
and Free Area
Index
Beyond Programmable Shading: In Action
RSX™ writes a semaphore once a
chunk of data has been consumed
• A command to write a semaphore needs to be
added to the command buffer after all
commands that use the data
– The value of the semaphore to be written is the new
end of free area pointer
Start of
Data 19 Free Area
Draw 5
Current
Semaphore New End
RSX™ of Free
Draw 6 Data 6
Execution Area
Semaphore
Draw 7 Data 14
Semaphore
Beyond Programmable Shading: In Action
Each SPU has its own buffer
Buffer 0 Buffer 1
Data 22
SPU 0
Data 8
Buffer 3
Buffer 2 Data 11 SPU 1 Data 13
Data 9 Data 17
Data 12 Data 16 Data 18
SPU 2 Data 23
Data 21
Data 19 SPU 3
Data 10
Data 7
SPU 4
Data 6 Data 15
SPU 5
Data 14 Data 20
Buffer 4 Buffer 5
Beyond Programmable Shading: In Action
Geometry Performance
Beyond Programmable Shading: In Action
Software Pipelined C with SPU Intrinsics
do
{
m1 = in1;
in1 = si_lqx(pIn1, offset);
m2 = in2;
in2 = si_lqx(pIn2, offset);
m3 = in3;
in3 = si_lqx(pIn3, offset);
temp2 = si_selb(m3, m1, mask_0X00);
si_stqx(out1, pOut1, offset);
temp3 = si_selb(m2, m1, mask_00X0);
si_stqx(out2, pOut2, offset);
temp1 = si_selb(m1, m2, mask_0X00);
si_stqx(out3, pOut3, offset);
offset = si_ai(offset, 0x30);
out2 = si_shufb(m2, temp2, qs_bCaD);
out1 = si_selb(temp1, m3, mask_00X0);
out3 = si_shufb(m3, temp3, qs_caBD);
} while(si_to_int(offset) != 0);
Beyond Programmable Shading: In Action
Software Pipelined C with SPU Intrinsics
do
{
m1 = in1;
in1 = si_lqx(pIn1, offset);
m2 = in2;
in2 = si_lqx(pIn2, offset);
m3 = in3;
in3 = si_lqx(pIn3, offset);
Up to 20x faster
than naive C/C++
temp2 = si_selb(m3, m1, mask_0X00);
si_stqx(out1, pOut1, offset);
temp3 = si_selb(m2, m1, mask_00X0);
si_stqx(out2, pOut2, offset);
temp1 = si_selb(m1, m2, mask_0X00);
si_stqx(out3, pOut3, offset);
offset = si_ai(offset, 0x30);
out2 = si_shufb(m2, temp2, qs_bCaD);
out1 = si_selb(temp1, m3, mask_00X0);
out3 = si_shufb(m3, temp3, qs_caBD);
} while(si_to_int(offset) != 0);
Beyond Programmable Shading: In Action
1 SPU
Beyond Programmable Shading: In Action
1 SPU
800,000+
Triangles Per Frame
at 60 Frames per Second
Beyond Programmable Shading: In Action
1 SPU
800,000+
Triangles Per Frame
at 60 Frames per Second
60% of which are culled!
Beyond Programmable Shading: In Action
Next Generation Parallelism
In Games
Jon Olick
id Software
Beyond Programmable Shading: In Action
GAME ENTITY PROCESSING
Beyond Programmable Shading: In Action
Game Entity Processing
• Current Generation
– Serial Processing of entities in a giant for loop.
for(int i = 0; i < numEntities; ++i) {
entity[i]->Think();
}
Beyond Programmable Shading: In Action
Game Entity Processing
• Current Generation
– Serial Processing of entities in a giant for loop.
• Next Generation
– Parallelism via Double Buffering
– Every entity runs in parallel with each other with no
dependency stalls.
– Each entity can only read from previous frame’s
results
– Each entity can only write to itself
Beyond Programmable Shading: In Action
Game Entity Processing
• Record the progress of the game and replay
to debug.
• Single thread and randomize processing of
entities to help find bugs.
• Can protect memory so that bad accesses
cause exceptions to enforce double
buffering rules.
Beyond Programmable Shading: In Action
Game Entity Processing
• What about entities which have dependant
entities?
• Bucketing and Synchronization Points
Beyond Programmable Shading: In Action 96
RAY
CASTING
THE NEXT GENERATION
Beyond Programmable Shading: In Action
Why Ray Casting?
Beyond Programmable Shading: In Action
Why Ray Casting?
• A good question…
Beyond Programmable Shading: In Action
Beyond Programmable Shading: In Action
• Back in Quake 1
– If you had to make a decision between an additional
CPU and a Graphics Card which would you choose?
Beyond Programmable Shading: In Action
• Back in Quake 1
– If you had to make a decision between an additional
CPU and a Graphics Card which would you choose?
– Why is this any different today?
Beyond Programmable Shading: In Action
• Back in Quake 1
– If you had to make a decision between an additional
CPU and a Graphics Card which would you choose?
– Why is this any different today?
– Its not any different.
Beyond Programmable Shading: In Action
Why Ray Casting?
• What value does it provide to developers?
Beyond Programmable Shading: In Action
Why Ray Casting?
• What value does it provide to developers?
– Shorter & Cheaper Development
– Higher Quality Games
Beyond Programmable Shading: In Action
Why Ray Casting?
• What value does it provide to end users?
Beyond Programmable Shading: In Action
Screenshot From E3 Rage Video
Beyond Programmable Shading: In Action
Screenshot From E3 Rage Video
Beyond Programmable Shading: In Action
Why Ray Casting?
Beyond Programmable Shading: In Action
Current State of Rasterization
Command Buffer
Vertex
Processing
Vertex Processing
Triangle Setup
Fragment Processing
Fragment
Processing
Beyond Programmable Shading: In Action
Future of Rasterization
Beyond Programmable Shading: In Action
Future of Rasterization
Beyond Programmable Shading: In Action
Future of Rasterization
Beyond Programmable Shading: In Action
Future of Rasterization
Beyond Programmable Shading: In Action
Future of Rasterization
Beyond Programmable Shading: In Action
Future of Rasterization
Beyond Programmable Shading: In Action
Future of Rasterization
Beyond Programmable Shading: In Action
Future of Rasterization
Multiple Cores
Command Buffer
Vertex Processing
Vertex Processing
Triangle Setup
Fragment Processing
Fragment Processing
Beyond Programmable Shading: In Action
Future of Rasterization
28 31 29 28 32 29 28 33 34 35 36 37 38 39 37 38 40 37 38 41
GPU Triangle Setup
Beyond Programmable Shading: In Action
Future of Rasterization
Command Buffer
Vertex
Vertex Processing
Processing
Triangle Sorting
Triangle Setup
Fragment
Processing
Fragment Processing
Beyond Programmable Shading: In Action
Future of Rasterization x2
0 1 2
0 2 3
3 2 4
3 4 5
6 7 8
9 6 8
10 9 11 GPU Triangle Setup
9 8 11
9 8 12
13 14 15
15 14 16
16 14 17
16 14 18
16 14 19
Beyond Programmable Shading: In Action 121
Future of Rasterization x2
0 1 2
0 2 3
3 2 4
3 4 5
6 7 8
9 6 8
10 9 11 GPU Triangle Setup
9 8 11
9 8 12
13 14 15
15 14 16
16 14 17
16 14 18
16 14 19
Beyond Programmable Shading: In Action 122
Future of Rasterization x2
0 1 2
0 2 3
3 2 4
3 4 5
6 7 8
9 6 8
10 9 11 GPU Triangle Setup
9 8 11
9 8 12
13 14 15
15 14 16
16 14 17
16 14 18
16 14 19
Beyond Programmable Shading: In Action 123
Future of Rasterization x2
0 1 2
0 2 3
3 2 4
3 4 5
6 7 8
9 6 8
10 9 11 GPU Triangle Setup
9 8 11
9 8 12
13 14 15
15 14 16
16 14 17
16 14 18
16 14 19
Beyond Programmable Shading: In Action 124
Future of Rasterization x2
0 1 2
0 2 3
3 2 4
3 4 5
6 7 8
9 6 8
10 9 11 GPU Triangle Setup
9 8 11
9 8 12
13 14 15
15 14 16
16 14 17
16 14 18
16 14 19
Beyond Programmable Shading: In Action 125
Why voxels, and not triangles?
Beyond Programmable Shading: In Action
Why voxels, and not triangles?
• Unique Texturing is possible with
rasterization
– Rage – idTech 5
Beyond Programmable Shading: In Action
Why voxels, and not triangles?
• Unique Texturing is possible with
rasterization
– Rage – idTech 5
• Unique Geometry is possible with
rasterization
– Progressive Mesh
Beyond Programmable Shading: In Action
Why voxels, and not triangles?
• Unique Texturing is possible with
rasterization
– Rage – idTech 5
• Unique Geometry is possible with
rasterization
– Progressive Mesh
• SVO Solves Two Problems in One
– Unique Texturing & Unique Geometry
Beyond Programmable Shading: In Action
Why is the control flow efficient?
Beyond Programmable Shading: In Action
Why is the control flow efficient?
Beyond Programmable Shading: In Action
Why is the control flow efficient?
Beyond Programmable Shading: In Action
Why is the control flow efficient?
Beyond Programmable Shading: In Action
Voxel Mip Mapping – Thin Walls
Beyond Programmable Shading: In Action
Voxel Mip Mapping – Thin Walls
Beyond Programmable Shading: In Action
Voxel Mip Mapping – Thin Walls
Beyond Programmable Shading: In Action
Voxel Mip Mapping – Thin Walls
Beyond Programmable Shading: In Action
Caveats of Ray-Tracing?
• “Primary rays cache, secondary rays thrash”™
– Importance sampling to the rescue!
• Ray Tracing != Ray Casting
Beyond Programmable Shading: In Action
Sparse Voxel Oct-trees
• Oct-trees as collection of maximal blocks.
– Related to run-length encoding.
– Variable splitting planes
Beyond Programmable Shading: In Action
Data Structure
• Disk Caching with Virtual and Physical Pages
Beyond Programmable Shading: In Action
Is Disk Caching a Valid Lever?
Beyond Programmable Shading: In Action 141
Hot Data Structure
• Disk Caching with Virtual and Physical Pages
Beyond Programmable Shading: In Action
Hot Data Structure
• Disk Caching with Virtual and Physical Pages
– Start out with a single virtual page.
Beyond Programmable Shading: In Action
Hot Data Structure
• Disk Caching with Virtual and Physical Pages
– Start out with a single virtual page.
– Render some voxels into the tree until page capacity
is reached.
Beyond Programmable Shading: In Action
Hot Data Structure
• Disk Caching with Virtual and Physical Pages
– Start out with a single virtual page.
– Render some voxels into the tree until page capacity
is reached.
– Split page into 8 sub-pages and attempt to add the
overflow voxel again.
Beyond Programmable Shading: In Action
Hot Data Structure
• Disk Caching with Virtual and Physical Pages
– Start out with a single virtual page.
– Render some voxels into the tree until page capacity
is reached.
– Split page into 8 sub-pages and attempt to add the
overflow voxel again.
– Store out virtual pages to disk.
– Load/Unload each page’s levels as necessary at
runtime.
Beyond Programmable Shading: In Action
Hot Data Structure
• Page capacity can be based on...
– CUDA's shared memory size
– SPU local store size
– Optimum disk streaming performance
– Minimum physical page memory
Beyond Programmable Shading: In Action
Virtual Page Fragmentation
• Traverse indexing oct-tree
– Write out pages according to optimal layout (breadth
first, depth first, etc...)
Beyond Programmable Shading: In Action
Physical Page Fragmentation
• Constantly loading / unloading data
fragments memory over time
• Bucket memory into sections and assign
each page to a section.
Beyond Programmable Shading: In Action 149
Page Optimization
• Execution time proportional to number of
blocks.
Beyond Programmable Shading: In Action
Page Optimization
• Execution time proportional to number of
blocks.
• Number of blocks can be reduced through
translation.
Beyond Programmable Shading: In Action
Page Optimization
• Execution time proportional to number of
blocks.
• Number of blocks can be reduced through
translation.
• Translating by 2n doesn’t affect any oct-tree
level smaller than 2n
Beyond Programmable Shading: In Action
Page Optimization
• Create scratch page with enlarged region
– 2n+1 x 2n+1 x 2n+1
• Apply successive translations of magnitude
power of 2 in the x, y, & z directions and
keep track of the number of nodes.
• Store off total translation for ray casting
adjustment.
• O(n * 22n)
– n is the number of levels in the oct-tree
Beyond Programmable Shading: In Action
Page Optimization
Beyond Programmable Shading: In Action 154
Page Optimization
Beyond Programmable Shading: In Action 155
Page Optimization
• Minimize outside nodes for faster casting
Beyond Programmable Shading: In Action
Page Optimization
Beyond Programmable Shading: In Action 157
Page Optimization
Beyond Programmable Shading: In Action 158
Page Optimization
Beyond Programmable Shading: In Action 159
Data Structure
• Different structures for editing, runtime and
storage.
Beyond Programmable Shading: In Action
Runtime Data Structure
• child offsets : 32
• diffuse rgb : 3
• specular scale/power : 1
• planes : 12
• normal xyz : 3
• pad : 1
• total : 52 bytes per node
Beyond Programmable Shading: In Action
Storage Data Structure
• children bit mask : 1
• diffuse rgb : 3
• specular scale/power : 1
• normal xyz : 3
• total : 8 bytes per node
Beyond Programmable Shading: In Action
Data Compression
• Compressing child bits
• Compressing Colors
Beyond Programmable Shading: In Action
Compressing Child Bits
0.06
0.05
0.04
0.03
0.02
0.01
0
0 50 100 150 200 250
Beyond Programmable Shading: In Action 164
Compressing Child Bits
0.06 0.08
0.06
0.04
0.04
0.02 Level 0 Level 4
0.02
0 0
0 50 100 150 200 250 0 50 100 150 200 250
0.05 0.08
0.04 0.06
0.03
0.04
0.02 Level 1 Level 5
0.01 0.02
0 0
0 50 100 150 200 250 0 50 100 150 200 250
0.06 0.05
0.05 0.04
0.04 0.03
0.03
Level 2 0.02 Level 6
0.02
0.01
0.01
0 0
1 51 101 151 201 251 0 50 100 150 200 250
0.05 0.08
0.07
0.04 0.06
0.05
0.03
0.04
0.02 Level 3 0.03 Level 7
0.02
0.01 0.01
0 0
0 50 100 150 200 250 0 50 100 150 200 250
Beyond Programmable Shading: In Action
Compressing Child Bits
• Split by oct-tree level.
• Entropy Encoding
Beyond Programmable Shading: In Action
Compressing Color Data
46
15 35 90 63 29
10 20 35 87 98 81 95 61 65 29 35 26 30 25
Beyond Programmable Shading: In Action
Compressing Color Data
46
-31 -11 44 17 -17
-5 +5 0 -3 8 -9 5 -2 2 0 6 3 1 -4
Beyond Programmable Shading: In Action
Compressing Color Data
• Split by oct-tree level.
• Quantization
• Entropy Encoding
• 8:1 expected compression ratio
Beyond Programmable Shading: In Action
Data Storage Size
• 1.15 bits of positional data per voxel
• Cost savings improves as triangle size
decreases.
• 160 bits per triangle in traditional format
– x,z,y,s,t all 32-bits
• 80 bits per triangle in compressed format
– x,y,z,s,t all 16-bits
• 72 bits equivalent per triangle in oct-tree
– (for next generation)
Beyond Programmable Shading: In Action
Generating the Data
• Every surface can enumerate into voxels.
– Triangles
• 3D Scan Conversion, Volume Projection, Subdivision
Beyond Programmable Shading: In Action
3D Scan Conversion
Z=0
Z=1
= Z=2
Z=3
Z=4
Beyond Programmable Shading: In Action
Volume Projection
Beyond Programmable Shading: In Action
Volume Projection
Beyond Programmable Shading: In Action
Subdivision
Beyond Programmable Shading: In Action
Subdivision
Beyond Programmable Shading: In Action
Subdivision
Beyond Programmable Shading: In Action
Generating the Data
• Every surface can enumerate into voxels.
– 3D Scan Conversion, Volume Projection, Subdivision
• Thick surfaces are unnecessary
Beyond Programmable Shading: In Action
Generating the Data
• Every surface can enumerate into voxels.
– 3D Scan Conversion, Volume Projection, Subdivision
• Thick surfaces are unnecessary
– Flood fill world and remove unnecessary voxels.
Beyond Programmable Shading: In Action
Generating the Data
• Every surface can enumerate into voxels.
– 3D Scan Conversion, Volume Projection, Subdivision
• Thick surfaces are unnecessary
– Flood fill world and remove unnecessary voxels.
• Generate geometry mip-maps
Beyond Programmable Shading: In Action
Generating the Data
• Every surface can enumerate into voxels.
– 3D Scan Conversion, Volume Projection, Subdivision
• Thick surfaces are unnecessary
– Flood fill world and remove unnecessary voxels.
• Generate geometry mip-maps
• Perform ray-tracing to light the voxels.
Beyond Programmable Shading: In Action
Using the Data
• For each pixel on the screen
– Shoot out a ray into the oct-tree and write out the
node number (and depth)
Beyond Programmable Shading: In Action
Oct-tree Ray Traversal
• Similar to KD-tree traversal. Clip the line
with the mid-planes only.
• Tree traversal with two lookup tables.
– One to find which nodes to intersect with a given ray
direction in a worst-case scenario.
– The other to determine the order of intersection.
• Faster than most stackless traversal
methods for CUDA.
Beyond Programmable Shading: In Action
LOD
• How to handle oct-tree mip-mapping and
when is it necessary to load additional detail
levels?
Beyond Programmable Shading: In Action 184
LOD - Stop Depth
• Stop Depth based on pixel and voxel size
[Wald07]
Beyond Programmable Shading: In Action 185
LOD - Stop Depth
• Stop Depth based on pixel and voxel size
[Wald07]
– Oblique surfaces have unnecessary extra detail
• Hurts casting performance by traversing detail that
you won’t see
• Hurts streaming performance by loading unnecessary
data
Beyond Programmable Shading: In Action 186
LOD - Post Process
• Ray casting outputs node indexes
Beyond Programmable Shading: In Action 187
LOD - Post Process
• Ray casting outputs node indexes
• A post process which looks at ratios of nodes
to pixels.
– Small feedback buffer (320x180) contains list of
pages which require additional detail.
Beyond Programmable Shading: In Action 188
LOD - Post Process
• Ray casting outputs node indexes
• A post process which looks at ratios of nodes
to pixels.
– Small feedback buffer (320x180) contains list of
pages which require additional detail.
• Up to 20% performance improvement
Beyond Programmable Shading: In Action 189
Post Process Blur
• Fixes the “Jam your head into a wall”
scenario.
Beyond Programmable Shading: In Action 190
Post Process Blur
• Fixes the “Jam your head into a wall”
scenario.
• Width of blur kernel related to size of voxel
on screen.
Beyond Programmable Shading: In Action 191
Rendering Dynamic Geometry
• With voxels
– Option 1
• Ray cast or rasterize a triangle mesh
• Transform to base pose
• Trace with local oct-tree
• Allows instancing of geometry
Beyond Programmable Shading: In Action
Rendering Dynamic Geometry
• With voxels
– Option 1
• Ray cast or rasterize a triangle mesh
• Transform to base pose
• Trace with local oct-tree
• Allows instancing of geometry
– Option 2
• Have two different oct-trees: static & dynamic
• Render both and merge results together with depth information
Beyond Programmable Shading: In Action
Rendering Dynamic Geometry
• With voxels
– Option 1
• Ray cast or rasterize a triangle mesh
• Transform to base pose
• Trace with local oct-tree
• Allows instancing of geometry
– Option 2
• Have two different oct-trees: static & dynamic
• Render both and merge results together with depth information
• With triangles
– Hybrid rendering via rasterization and deferred shading.
Beyond Programmable Shading: In Action
Depth Advance Optimization
• Render a coarse hull of the geometry into a
depth-buffer.
– Automatically calculate from voxel geometry.
• Start the ray casting at the depth-buffer
values.
Beyond Programmable Shading: In Action
Depth Advance Optimization
• Render a coarse hull of the geometry into a
depth-buffer.
– Automatically calculate from voxel geometry.
• Start the ray casting at the depth-buffer
values.
• Skips most of the traversal process.
– Up to 2x speed improvement
Beyond Programmable Shading: In Action
Depth Advance Optimization
• Render a coarse hull of the geometry into a
depth-buffer.
– Automatically calculate from voxel geometry.
• Start the ray casting at the depth-buffer
values.
• Skips most of the traversal process.
– Up to 2x speed improvement
– Less sensitive to scene complexity
Beyond Programmable Shading: In Action
Adaptive Sub-Sampling
• After rendering the scene, perform a Sobel
edge filter over the frame buffer to figure
out where additional rays would improve
the quality of the image.
• Cast additional rays.
• Repeat until 16 ms.
Beyond Programmable Shading: In Action
Adaptive Sub-Sampling Problems
• Inherently always sampling the most
divergent parts of the scene
• Can manage performance hit by sampling
highly aliased to less aliased in chunks
Beyond Programmable Shading: In Action
Infinite Surface Detail
• Oct-tree node's recursively point back in on
themselves to create an infinite amount of
detail
• Create detail octree sub-segments to
simulate rough, smooth, porous, sharp
edges, etc..
• Programatically simulate virtual detail
levels.
Beyond Programmable Shading: In Action
How much time to innovate?
• 1 year tools
• 3 months runtime
Beyond Programmable Shading: In Action
Expected Runtime Performance
• 33% of the time rendering characters / etc
• 66% of the time rendering world
• Ray-casting the world must complete in
~20ms for 30 FPS
• Theoretically possible on today's technology
at 720p and 30 fps (GeForce 8800 Series)
Beyond Programmable Shading: In Action
Expected Runtime Performance
• 33% of the time rendering characters / etc
• 66% of the time rendering world
• Ray-casting the world must complete in
~20ms for 30 FPS
• Theoretically possible on today's technology
at 720p and 30 fps (GeForce 8800 Series)
– Theory falls a little short of reality.
Beyond Programmable Shading: In Action
How would this affect a platform
launch?
• Generational skip in geometric complexity
• Next gen platforms 4 times better at least
• 60 FPS at 1080p with Anti-aliasing
Beyond Programmable Shading: In Action
Special Thanks
• Paul Debevec
– Light probes used with permission
• Dimitry Parkin
– www.parkparkin.com
• John Carmack
• Cass Everitt • Sony
• Mark Harris • Intel
• Nathaniel Duca • Nvidia
• Aaron Lefohn
• Mike Houston
• Tom Forsyth
Beyond Programmable Shading: In Action 205
Beyond Programmable Shading: In Action 206
Questions
Jon Olick (jon.olick@gmail.com)
id Software
Beyond Programmable Shading: In Action
References
– http://www.sci.utah.edu/~wald/Publications/2007/
MROct/download/mroct.pdf
– [Wald07]
Beyond Programmable Shading: In Action 208