tmpvar.com wrench logo

2023 summary

My graphics programming focus this year was mostly on SDF evaluation and splat rendering but I did have some fun side projects!

January

At the start of the year I was working on the sdf-editor, and not a ton of context was saved because I didn't have this website and I wanted the results to be a surprise. This was a huge mistake; I need to get better at showing progress externally.

sdf-editor group based cut bug
sdf-editor group based cut bug fixed
sdf-editor group register machine thinking
more register based musings

I suffered through a simple change (adding a couple of properties) that ballooned into a larger refactor. The problem was that there was no easy way to find all of the places a property needed to be referenced or copied. This refactor moved me towards using non-optional switch() statements that force every enum value to be handled. This adds more code, but at least adding a property that needs to be plumbed through the system will cause compile-time errors instead of having to find all of the issues at runtime.
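
Here's the gist as a tiny sketch (the enum and names are made up, not the editor's actual code). Compiled with switch-coverage warnings treated as errors (e.g. -Werror=switch on gcc/clang), a newly added enum value refuses to build until every switch handles it:

enum class NodeProperty { Radius, Blend, Rotation };

// No `default:` on purpose: with -Werror=switch a missing case is a compile
// error, so adding a new NodeProperty flags every switch that needs updating
// instead of silently falling through at runtime.
const char *PropertyName(NodeProperty prop) {
  switch (prop) {
    case NodeProperty::Radius:   return "radius";
    case NodeProperty::Blend:    return "blend";
    case NodeProperty::Rotation: return "rotation";
  }
  return "unknown"; // unreachable, silences -Wreturn-type
}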


the issue and the resulting madness

February

I needed a distraction and happened to watch a Pezza's work video Writing a Physics Engine from scratch - collision detection optimization where they show a very large number of particles interacting. I got nerd sniped!


~14ms for 1300 particles, ew!

~20ms for 10,000 particles by reducing the grid size
~10ms for 40,000 particles by building the grid once per frame and using AVX2
    on a single thread
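
For reference, the broad phase behind those numbers looks roughly like this (a sketch with made-up names, not the actual sim): rebuild the grid once per frame, then only test each particle against the 3x3 neighborhood of cells around it. The AVX2 part (testing several candidates per instruction) is left out here.

#include <algorithm>
#include <cstdint>
#include <vector>

struct Particle { float x, y; };

struct UniformGrid {
  float cellSize;
  int   w, h;
  std::vector<std::vector<uint32_t>> cells; // particle indices per cell

  UniformGrid(float cellSize, int w, int h)
    : cellSize(cellSize), w(w), h(h), cells(size_t(w) * size_t(h)) {}

  int CellX(float x) const { return std::clamp(int(x / cellSize), 0, w - 1); }
  int CellY(float y) const { return std::clamp(int(y / cellSize), 0, h - 1); }

  // Rebuild once per frame instead of maintaining the grid incrementally.
  void Build(const std::vector<Particle> &particles) {
    for (auto &cell : cells) cell.clear();
    for (uint32_t i = 0; i < particles.size(); i++)
      cells[size_t(CellY(particles[i].y)) * w + CellX(particles[i].x)].push_back(i);
  }

  // Visit every particle index in the 3x3 neighborhood around (x, y).
  template <typename Fn>
  void ForEachNeighbor(float x, float y, Fn fn) const {
    int cx = CellX(x), cy = CellY(y);
    for (int gy = std::max(cy - 1, 0); gy <= std::min(cy + 1, h - 1); gy++)
      for (int gx = std::max(cx - 1, 0); gx <= std::min(cx + 1, w - 1); gx++)
        for (uint32_t index : cells[size_t(gy) * w + gx]) fn(index);
  }
};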

And then it was back to the sdf-editor, which I was working on in parallel with the particle sim.

March


making the panel handle multi-select

pre-alpha user testing
  takeaway: this thing is pretty hard to use

Some things were wearing on me with the sdf editor:

  1. Objects were limited to the size of the voxel box they were contained in. I believe this was 512³
  2. In my user testing people found it hard to manipulate objects
  3. After ~200 objects everything became laggy
  4. The elephant in the room: composite brushes made it much more complex to optimize the evaluator, reinforcing the 200 object limit.
  5. Controls / Panel were somewhere in between direct manipulation and the kind of precise control you'd see in CAD, but it did neither very well. I think there is room in an editor to do both, but I missed the mark on this one.

So, I scrapped the whole thing and started designing a new system that wouldn't be forced to live inside of a voxel grid. My first thought was of panic because living on a grid is comfy. You don't have to worry about isosurface extraction and all of the degeneracies (e.g., topological issues, manifold issues) or the optimization of connecting graphs together to save on vertex bandwidth.

So after taking another glance at the current state of the art, which seemed to be Digital Surface Regularization With Guarantees, I noticed that their paper has missing triangles!

Anyway, I watched Alex Evans' Learning From Failure talk again and thought I'd give max-norm a try. One thing that I thought would be a showstopper was if I couldn't figure out how to evaluate objects that had been rotated under the max-norm distance metric.

According to Simon Brown the distance metric for a rotated object is the same as finding the largest aligned cube within a unit rotated cube (see: the Max norm ellipsoid shadertoy).
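
For the unrotated case the max-norm distance to a box is tiny to write down; this is just the well-known axis-aligned form (a sketch), and the rotated case is where Simon Brown's largest-aligned-cube scale factor comes in:

#include <algorithm>
#include <cmath>

// Signed distance from p to an axis-aligned box with half-extents b, measured
// under the max norm (L-infinity): negative inside, positive outside.
float BoxDistanceMaxNorm(const float p[3], const float b[3]) {
  float dx = std::fabs(p[0]) - b[0];
  float dy = std::fabs(p[1]) - b[1];
  float dz = std::fabs(p[2]) - b[2];
  return std::max(dx, std::max(dy, dz));
}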

So I gave that a shot and came up with these two things:


A plane of rotated boxes
  Euclidean single thread 1223ms boxes(561) steps(14794498) leaves(4056562)

A plane of rotated boxes
  MaxNorm single thread 1037ms boxes(561) steps(14090824) leaves(3812840)

Things were pretty speedy under a release build
scene(plane of boxes and spheres)
  Euclidean/Scalar single thread 186ms primitives(561) steps(22517064) leaves(6307840)
  Euclidean/IntervalArithmetic single thread 386ms primitives(561) steps(20804936) leaves(5726208)
  MaxNorm single thread 291ms boxes(561) steps(20804936) leaves(5726208)
  
scene(plane of rotated boxes)
  Euclidean/Scalar single thread 211ms primitives(561) steps(14793252) leaves(4055356)
  Euclidean/IntervalArithmetic single thread 1053ms primitives(561) steps(19905185) leaves(5364673)
  MaxNorm single thread 167ms boxes(561) steps(14089012) leaves(3811412)

April

After a brief stint writing a GPU evaluator, I turned my focus back to CPU evaluation because it is more general purpose: it is easier to run headless and on older machines. At this point the major downside of the evaluator was that it still operated over a fixed-size grid, building an octree. However, the multithreaded octree building approach that I was considering for the GPU scared me into avoiding it on the CPU.

64k random boxes
64k random boxes - heatmap for primitive overlaps (1..8)

I came up with the notion of having each thread run over an implicit grid, filtering primitives based on the grid cell; if there were overlaps, the thread subdivides the cell recursively to find the leaves. On the CPU this makes a ton of sense because we have large caches and stacks are basically free. By limiting the size of the chunks we also limit the depth of the traversal, which makes tuning this with a single knob possible.
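
Roughly the shape of it (a sketch with illustrative names, not the actual Chunker code):

#include <cstdint>
#include <vector>

struct Aabb { float min[3], max[3]; };

static bool Overlaps(const Aabb &a, const Aabb &b) {
  for (int i = 0; i < 3; i++)
    if (a.max[i] < b.min[i] || b.max[i] < a.min[i]) return false;
  return true;
}

// Each worker thread takes one chunk-sized cell, gets the primitive list
// filtered against the chunk bounds, and recursively subdivides. Leaves come
// out at a fixed depth, so chunk size is the single knob that bounds both
// recursion depth and per-chunk work.
static void Subdivide(const Aabb &cell,
                      const std::vector<Aabb> &primBounds,
                      const std::vector<uint32_t> &candidates,
                      int depth, int maxDepth,
                      std::vector<Aabb> &leaves) {
  std::vector<uint32_t> filtered;
  for (uint32_t prim : candidates)
    if (Overlaps(cell, primBounds[prim])) filtered.push_back(prim);
  if (filtered.empty()) return;   // nothing touches this cell, prune it
  if (depth == maxDepth) {        // leaf cell: hand it off for evaluation
    leaves.push_back(cell);
    return;
  }
  float mid[3] = { (cell.min[0] + cell.max[0]) * 0.5f,
                   (cell.min[1] + cell.max[1]) * 0.5f,
                   (cell.min[2] + cell.max[2]) * 0.5f };
  for (int octant = 0; octant < 8; octant++) {
    Aabb child;
    for (int axis = 0; axis < 3; axis++) {
      bool hi = (octant >> axis) & 1;
      child.min[axis] = hi ? mid[axis] : cell.min[axis];
      child.max[axis] = hi ? cell.max[axis] : mid[axis];
    }
    Subdivide(child, primBounds, filtered, depth + 1, maxDepth, leaves);
  }
}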

The Chunker™
identifying the chunks associated with regions of the sphere
box of rotated boxes
finally I've broken free of world size limits

At this point things were working pretty well, but my meshing strategy was a mess. Each chunk was being meshed, and since we don't know what's going on in the neighborhood (without dependency tracking or some other complexity) the mesher would mesh both the inside and the outside of the isosurface. This effectively means that we're paying double the cost for meshing and double the cost for rendering.

To improve on this I thought back to one of many invaluable tips that Sebastian Aaltonen mentioned on twitter about generating cube vertices in a vertex shader. By implementing this I was able to reduce the meshing time by ~80% (354ms -> 81ms) and the rendering time by ~60% (40ms -> 16ms) for 15 million leaves!
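
The core of the trick as I understand it, shown as a CPU-side sketch rather than the actual shader (table and names are mine): every cube is 36 triangle-list vertices, and both the cube index and the corner position fall straight out of the flat vertex index, so no vertex buffer is needed.

#include <cstdint>

// 12 triangles, CCW when viewed from outside; each entry picks one of the 8
// corners of a unit cube, where corner bit 0 = x, bit 1 = y, bit 2 = z.
static const uint32_t kCubeCorner[36] = {
  0, 4, 6,  0, 6, 2,   // -X
  1, 7, 5,  1, 3, 7,   // +X
  0, 1, 5,  0, 5, 4,   // -Y
  2, 7, 3,  2, 6, 7,   // +Y
  0, 2, 3,  0, 3, 1,   // -Z
  4, 5, 7,  4, 7, 6,   // +Z
};

struct CubeVertex { uint32_t cube; float x, y, z; };

// On the GPU this math runs in the vertex shader on the builtin vertex index;
// `cube` would be used to fetch the per-leaf position/size from a buffer.
CubeVertex DecodeCubeVertex(uint32_t vertexIndex) {
  uint32_t corner = kCubeCorner[vertexIndex % 36];
  return { vertexIndex / 36,
           float(corner & 1u),
           float((corner >> 1u) & 1u),
           float((corner >> 2u) & 1u) };
}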

May

// tweak the svo sampler to pull out some of the lattice
// (temporarily ignores the SVO contents: out-of-bounds samples read as empty,
// everything inside the 64x64x64 chunk reads as solid)
static f32
ChunkSVOSample(const ChunkSVO *svo, i32 x, i32 y, i32 z) {
  if (x < 0 || y < 0 || z < 0 || x >= 64 || y >= 64 || z >= 64) {
    return 1.0f;
  } else {
    return 0.0f;
  }
}

Now, the next thing bugging me was my lack of understanding of how the evaluator was working within the octree. It still wasn't super fast, but I didn't have any data to inform how to optimize. So I decided to shine some light on what the evaluator was doing at each level of the octree by aggregating the data into a histogram.

Chunker histogram

At level 0 the primitives-per-cell ratio is the highest, but it quickly drops and we end up switching from performing many primitive evaluations per cell to performing many cell evaluations against very few primitives.

4x faster with very basic avx512 usage
  Prefiltering primitive bounding boxes against the chunks

Initially I wrote the visualization side of the evaluator as just a way to debug what it was doing. As things progressed and I added more and more functionality to it, I realized that this thing was actually fast enough to use interactively.

added g-buffer support
added leaf id to the g-buffer
deferred image based lighting
deferred image based lighting
bug with normals for added primitives
fixed bug with normals for added primitives

Now that I had a g-buffer, it was time to turn once again to Sebbbi's twitter, this time for Hi-Z culling. The basic idea is that you use the previous frame's depth buffer to cull objects in the current frame that would be behind last frame's objects.
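
A sketch of the test itself (layout and names are mine; in practice this runs on the GPU against a depth pyramid built from last frame's depth buffer):

#include <algorithm>
#include <cmath>
#include <vector>

// A mip chain built from last frame's depth buffer where each texel stores the
// farthest depth in its footprint (assuming the usual 0 = near, 1 = far).
struct HiZPyramid {
  int width, height;
  std::vector<std::vector<float>> mips; // mips[0] is full resolution

  float FarthestDepth(int mip, int x, int y) const {
    int w = std::max(1, width >> mip), h = std::max(1, height >> mip);
    x = std::clamp(x, 0, w - 1);
    y = std::clamp(y, 0, h - 1);
    return mips[mip][size_t(y) * w + x];
  }
};

// True when a screen rect (in pixels) whose nearest point has depth `minDepth`
// is entirely behind what was drawn last frame, i.e. it can be culled.
bool HiZOccluded(const HiZPyramid &hzb,
                 float x0, float y0, float x1, float y1, float minDepth) {
  // Pick the mip where the rect spans at most a couple of texels.
  float sizePx = std::max(x1 - x0, y1 - y0);
  int mip = std::clamp(int(std::ceil(std::log2(std::max(sizePx, 1.0f)))),
                       0, int(hzb.mips.size()) - 1);
  int tx0 = int(x0) >> mip, ty0 = int(y0) >> mip;
  int tx1 = int(x1) >> mip, ty1 = int(y1) >> mip;
  float farthest = 0.0f;
  for (int ty = ty0; ty <= ty1; ty++)
    for (int tx = tx0; tx <= tx1; tx++)
      farthest = std::max(farthest, hzb.FarthestDepth(mip, tx, ty));
  return minDepth > farthest; // every covered texel is nearer than the object
}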

Results:

before hi-z culling: 16ms per frame
after hi-z culling: 3.6ms per frame
brush preview: additive sphere
brush preview: additive cube

So at this point, the brush preview was functional and the brush itself was locked to the surface under the mouse. So it was time to try and actually use the thing to make something other than squiggles.

my first sculpt with this system

The main issue with the system at this point was finding the surface under the mouse. I was doing a readback from the GPU, which was causing a noticeable lag. So I spent some time trying to optimize that readback, but was unable to get it to feel right. I'll probably have to revisit this at some point.

.las file (LIDAR) exporter and houdini import
video by @rh2 - drawn in my puny sdf editor, simulated in houdini, and rendered in blender
check out their awesome SBSAR (substance designer) materials on swizzle.fun

At this point I'm really happy with the performance and how fast it is to spam new primitives. The chunker made evaluating dirty/new regions super convenient. So I started adding support for materials, starting with global materials just to prove that I had the pipeline set up and that things looked reasonably close to what I'd expect. My assumption is that this is a rough pass that I'll have to keep iterating on.

gold-like material
plastic-like material

With the global materials working, it's time to attach a material to every leaf based on the associated primitive!

graph coloring of primitive id for material color
editable materials

And because we now have user-selectable color, the .las exporter gets color support.

cuts take the cutter material

Learning from the user testing of the earlier sdf editor, I opted to start working on a menu that can house all of the actions you can take. Hotkeys can be added later. The idea was a multi-level radial menu with a pixel art aesthetic.

radial menu

Well that worked really well, how about tackling the brush manipulation problem?


global vs local transform icons

June

janky max-norm lines based on Segment - distance L-inf
toadstool life (aka I added cones!)

I was having some trouble making things that looked really good by hand, and I knew that I wanted some trees, so I decided to look into how to procgen some trees.

a generated tree
a generated tree with leaves (spheres) @ 30k total sdf primitives
leaf filter + better colors
inside looking out

And of course we can't leave the .las exporter out. The above tree exports to ~6M points @ 170MB (download the 16MB zip here). This really makes houdini struggle; maybe they should invest in some Hi-Z culling haha.

houdini hates this pointcloud

The tree generator generates some really gnarly primitive lists which make for a wonderful benchmarking tool.

initial tree evaluator timings
bounds filtering at chunk octree level 0 - huge (~4x) improvement!

The next performance problem: when modifying the tree we need to filter out all of the leaves that may have been affected. At this point it was a scan over every leaf, which was quite slow!

leaf filtering takes ~230x longer than the eval!

The solve for this was filtering at the chunk level instead of globally filtering every leaf... seems obvious looking back!
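
The change looks roughly like this (a sketch with hypothetical names): reject whole chunks with a single bounds test, then only walk the leaves of the chunks that survive.

#include <vector>

struct Aabb { float min[3], max[3]; };

static bool Overlaps(const Aabb &a, const Aabb &b) {
  for (int i = 0; i < 3; i++)
    if (a.max[i] < b.min[i] || b.max[i] < a.min[i]) return false;
  return true;
}

struct Leaf  { Aabb bounds; };
struct Chunk { Aabb bounds; std::vector<Leaf> leaves; };

// Collect the leaves touched by a dirty region. Whole chunks are rejected by
// a single AABB test, so untouched parts of the scene cost almost nothing.
std::vector<const Leaf *> CollectDirtyLeaves(const std::vector<Chunk> &chunks,
                                             const Aabb &dirty) {
  std::vector<const Leaf *> out;
  for (const Chunk &chunk : chunks) {
    if (!Overlaps(chunk.bounds, dirty)) continue;
    for (const Leaf &leaf : chunk.leaves)
      if (Overlaps(leaf.bounds, dirty)) out.push_back(&leaf);
  }
  return out;
}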

chunk based leaf filtering - effectively free
add multi-threaded worker queue - 4x improvement @ 16 threads, not quite ideal.
rendering perf 2-4x improvement by compressing the render list instead of bitmask + degenerate triangles

With that, it was time for some user testing...

chicken head

Things were going pretty well, so I kept pushing. What is better than one monolithic object in the scene? Many instances of many objects. So I started thinking about how the indirections needed to work.

object instancing design

With a loose goal in mind, I took the first step: add instances, which are effectively ids that reference the head of a leaf cluster linked list.
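
Sketched out, the indirection looks something like this (field names are made up):

#include <cstdint>
#include <vector>

constexpr uint32_t kInvalidCluster = ~0u;

struct LeafCluster {
  uint32_t firstLeaf;    // offset into the shared leaf buffer
  uint32_t leafCount;
  uint32_t nextCluster;  // next cluster in this object's list, or kInvalidCluster
};

struct Instance {
  float    transform[16]; // object-to-world
  uint32_t headCluster;   // head of the leaf cluster linked list
};

// Walk every cluster belonging to an instance.
template <typename Fn>
void ForEachCluster(const Instance &inst,
                    const std::vector<LeafCluster> &clusters, Fn fn) {
  for (uint32_t id = inst.headCluster; id != kInvalidCluster;
       id = clusters[id].nextCluster)
    fn(clusters[id]);
}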

first attempt at object instancing (no culling eeeeek!)
object instancing with leaf cluster hi-z culling
30% SOL on PES+VPC throughput - EW!

Turns out, instancing large fields of points means that MANY triangles are being rendered at sub-pixel sizes. To prove this I made boxes that are 1px or less output a constant material - I do better when I can see what is going on.

you don't have to go very far to be 1px (silver shaded in the background) 

github/m-shuetz/compute_rasterizer can splat 100M points in ~4ms on my machine - I think this approach will do nicely for rendering distant objects. This project is based on the paper: Software Rasterization of 2 Billion Points in Real Time

The following picture is 10 copies of the tree, overlapped in space, which is pretty much the worst case for this technique, as the atomicMin is under higher contention.

The .las exporter comes in clutch again as compute_rasterizer is built to load LIDAR files, yay!

splatting in compute is hella fast!

So, obviously I need to write my own compute-based point splatter.
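
The core of the compute_rasterizer-style approach, as far as I understand it: pack depth into the high bits and a payload into the low bits of a 64-bit word, then atomicMin per pixel so the closest splat wins. Here's a CPU-side sketch (names are mine; on the GPU this is a single 64-bit atomicMin in the compute shader):

#include <atomic>
#include <cstdint>
#include <vector>

// One 64-bit word per pixel: depth in the high 32 bits, payload (color or leaf
// id) in the low bits, so the numeric minimum is also the closest splat.
struct SplatTarget {
  int width, height;
  std::vector<std::atomic<uint64_t>> pixels;

  SplatTarget(int w, int h) : width(w), height(h), pixels(size_t(w) * h) {
    for (auto &p : pixels) p.store(~0ull, std::memory_order_relaxed);
  }

  void Splat(int x, int y, float depth01, uint32_t payload) {
    if (x < 0 || y < 0 || x >= width || y >= height) return;
    uint64_t packed =
        (uint64_t(uint32_t(depth01 * 4294967295.0)) << 32) | payload;
    std::atomic<uint64_t> &px = pixels[size_t(y) * width + x];
    // Emulate atomicMin with a CAS loop; on the GPU this is one atomic op.
    uint64_t prev = px.load(std::memory_order_relaxed);
    while (packed < prev &&
           !px.compare_exchange_weak(prev, packed, std::memory_order_relaxed)) {
    }
  }
};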

splatting in compute (0.37ms)
rasterizing cubes (5.85ms)

Clearly there is a tradeoff between rasterization and splatting here.

splats: better when the size of the leaf is <= 1px

raster: better up close

So let's make a hybrid renderer that plays to each rendering technique's strengths!
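
The split I mean is roughly this (sketch; thresholds and names are mine): estimate the projected size of a leaf and send a cluster down the splat path once that lands at about a pixel or less.

#include <cmath>

enum class RenderPath { Splat, Raster };

// How many pixels a leaf of worldSize meters covers at `distance` meters from
// a camera with vertical field of view fovY (radians) at screenHeight pixels.
float ProjectedSizePx(float worldSize, float distance,
                      float fovY, float screenHeight) {
  return worldSize * screenHeight / (2.0f * distance * std::tan(fovY * 0.5f));
}

RenderPath ChoosePath(float leafWorldSize, float clusterDistance,
                      float fovY, float screenHeight) {
  const float kSplatThresholdPx = 1.0f; // <= ~1px: splat; bigger: raster cubes
  return ProjectedSizePx(leafWorldSize, clusterDistance, fovY, screenHeight)
                 <= kSplatThresholdPx
             ? RenderPath::Splat
             : RenderPath::Raster;
}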

raster (1.39ms) compute (0.20ms)

Wait a second, that is like really really fast!

how fast??

100 instances: raster (2.31ms) compute (2.19ms) or ~4.5ms total
1000 instances: raster (6.3ms) compute (6.3ms) or ~12.6ms total

Note: this is without instance based Hi-Z culling!

At this point I'm so stoked I threw 1000 trees into a scene.

160 instances: raster (4.6ms) compute (988.3ms) or ~992ms total

Ok good, we have a great baseline to improve on. Hi-Z culling at the object level should help quite a bit.

However, there is still more work to be done before jumping into optimizations. At this point I could only instance a single object at a time and I wanted to move to a viz-buffer to reduce the texture bandwidth.

instancing multiple objects into a viz buffer
instancing multiple objects into a viz buffer with normals

After reading through the Nanite paper, one of the takeaways for me was that cluster lods are hugely important for reducing the number of primitives that need to be rendered. So the first step was to get it working, which I did using a simple roulette-style sampling.

40k wiffle cube instances at fixed lod=3 all splats
moving too close results in a cloud of dust
first round automatic lod picking

and the associated nsight snapshot

After adding the normals back I got really excited about how many instances I could render!

100k undulating wiffle cube instances @ ~2ms. Clearly there are issues with lod selection!
better hiz culling (single sample from HZB) and better splat size
a view from a different perspective with the hzb viewport locked

Unfortunately there are still a ton of holes in the splat renderer. I spent quite a while trying to patch these holes.

when splats are too small it creates holes
perfect splat sizes at the cost of 130ms per frame!

Why is the optimal splat size so inefficient?

nsight capture says: you're memory bound.

Here's the nsight capture if you are curious: fixing-splat-holes-perfect-at-130ms.zip.

So instead of getting sucked down an optimization rabbit hole, I decided to try and fix the original problem with holes and quickly realized that the leaf clusters were also a source of holes.

cluster shaped holes
cluster shaped holes (cluster coloring)
avoid enqueueing empty octree nodes while building clusters!

After this there were a few more cluster generation related bugs, some tuning of splat size and raster cut-off, and of course some more Hi-Z tweaks.

1000 trees to debug Hi-Z occlusion culling

At this point things are working pretty well, but instead of adding materials back I decided to go on a tangent of actually using this system to do some procedural generation.

Starting with a leaf whose form was designed using this Ovate Leaf Form desmos calculator

a single ovate leaf built by spamming cylinders

Since I've been working so hard on implementing instancing, I made a little bush of these leaves.

a bunch of ovate leaf instances

You may have noticed that there were some artifacts in the above image, and I bet you couldn't have guessed that they were cluster lod related. So that got fixed next.

At this point, things appear to be working pretty well. On larger scenes the thing that computes which clusters get rendered at which lod (e.g., the viz: generate cluster instances profiler line) dominates the time spent, which prompted me to at least consider a different approach.

Back to procgen, I decided to spend a bit of time exploring continuous noises.

collecting / implementing noises

Bah, I found some issues with my Hi-Z implementation, so I added a seeing tool in the form of a top-down view. The idea is pretty simple: for every potentially rendered cluster, draw a rect at a certain location on the screen, which acts like an orthographic top-down view. The trick is getting the scale / bounds correct, but it is a very low effort way of doing inline debugging. This happens in the filter-cluster-instances compute shader, which has been passed a debug texture that gets rendered over the scene as a final pass.

left: cluster culling info
  yellow = frustum culled
  red = depth culled
  green = passing

middle: jet colored linear depth
right: random colored hiz mip level
Hi-Z debug view
another example of Hi-Z debug view

This debug texture came in clutch so many times; I also used it as X-ray vision to see things that were being culled by Hi-Z. It also gave me an easy way to overlay the Hi-Z texture and all of its mips.

drawing rectangles around culled instances
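
The mapping behind those rects is simple enough to sketch (names are mine; in practice this lives in the filter-cluster-instances shader and writes into the debug texture):

struct DebugRect { int x, y, w, h; };

// Fit world-space XZ into the debug texture: a poor man's orthographic
// top-down projection. Getting sceneMin/sceneMax right is the only fiddly bit.
DebugRect TopDownRect(const float worldPos[3], float worldRadius,
                      const float sceneMin[3], const float sceneMax[3],
                      int texWidth, int texHeight) {
  float sx = texWidth  / (sceneMax[0] - sceneMin[0]);
  float sz = texHeight / (sceneMax[2] - sceneMin[2]);
  DebugRect r;
  r.x = int((worldPos[0] - worldRadius - sceneMin[0]) * sx);
  r.y = int((worldPos[2] - worldRadius - sceneMin[2]) * sz);
  r.w = int(worldRadius * 2.0f * sx);
  r.h = int(worldRadius * 2.0f * sz);
  return r;
}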

July

Inspired by the Modular Ruins C Unreal Engine asset pack, I decided to start procedurally generating some stone/brick tilesets.

building some re-usable tiles

Now with all of these tile sets I can start building some primitive things, by manually instancing objects and positioning them in the scene.

an attempt at ramparts

The staircase in the back right of the scene was quite a pain to lay out in code, so I started to consider the possibility of procedurally placing tiles. Since I was just working on a staircase, why not make a spiral?
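
The spiral placement is just an angle and a height stepped per instance (sketch; parameter names are mine); the "local transforms" version additionally yaws each tile to follow the curve.

#include <cmath>
#include <vector>

struct Placement {
  float position[3];
  float yaw;  // rotation around the up axis, radians
};

// Place `count` tiles along a spiral: each step advances the angle and the
// height; yaw aligns the tile with the spiral instead of world orientation.
std::vector<Placement> SpiralPlacements(int count, float radius,
                                        float anglePerStep, float risePerStep) {
  std::vector<Placement> out;
  out.reserve(count);
  for (int i = 0; i < count; i++) {
    float angle = anglePerStep * i;
    Placement p;
    p.position[0] = std::cos(angle) * radius;
    p.position[1] = risePerStep * i;
    p.position[2] = std::sin(angle) * radius;
    // Yaw by the spiral angle (sign depends on your yaw convention).
    p.yaw = -angle;
    out.push_back(p);
  }
  return out;
}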

a simple spiral - no local orientation changes
a spiral of trees
add in local transforms

Why not linear instancing?

linear instancing with a twist

Procedural generation at the instance level is actually a ton of fun! I got to thinking about the tree and how it had no leaves. The original spheres-as-leaves approach was not great, and I'd like to instance clusters of leaves and attach them to the ends of the branches. Hrm, how do I do that if the object is already done generating? A list of locations needs to be stored along with the object to denote where things can be attached... I think other engines call this a socket system, so I'll go with that!
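
Roughly what a socket ends up being (a sketch with my own names): a named, local-space transform stored on the generated object, so leaf clusters can be attached after the fact by composing transforms.

#include <string>
#include <vector>

struct Socket {
  std::string name;         // e.g. "branch_tip"
  float localTransform[16]; // where the attached object goes, in object space
};

struct GeneratedObject {
  // ... primitives, leaf clusters, etc.
  std::vector<Socket> sockets;
};

// The tree generator would push a socket at the end of each branch so leaf
// clusters can be attached after generation is done.
void AddBranchTipSocket(GeneratedObject &obj, const float localTransform[16]) {
  Socket s;
  s.name = "branch_tip";
  for (int i = 0; i < 16; i++) s.localTransform[i] = localTransform[i];
  obj.sockets.push_back(s);
}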

tree sockets
tree sockets with a different seed

Proud of myself, I set my next goal to bring the materials and image based lighting back.

IBL with global roughness/metalness and cluster coloring
add materials to the tree and attach wiffle boxes
add materials to the tree and attach spheres

Things are starting to look better, but the tile materials are so boring!

procgen more tiles
I love these randomly jittered orientations!

I have a goal of being able to perform CSG on object instances, but noticed that a cut would propagate its material instead of what I'd expect in some cases, which is to show the inside of the object.

allow cuts without applying material
add a 2 meter tall person at 500 splats per meter
add stb to the noises app

I had recently watched Wave Function Collapse in Bad North and was inspired to give WFC a try.

visualize uncertainty
doing something.. just not the right thing
closer

After making this interactive, it was hard not to play with it!

So that was a nice distraction, but the result doesn't really have immediate applicability to the main project, so I moved on to material blending.

replace primitive ids with material hash
before
graph colored material hashes
degenerate case - fully smooth + metallic red blended with rough green

Objects Made of Objects

modifying an existing object
modifying an existing object

Instances of Instances

instances of instances (graph)

Turns out instances of instances implemented via *AddChild have some really annoying properties, the biggest one being that you can form graphs. So I took a page out of Valve's handbook and limited entities to having a single parent.
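
A sketch of what the single-parent rule buys (illustrative names): reparenting always detaches from the previous parent, so the hierarchy stays a tree.

#include <cstdint>
#include <vector>

constexpr uint32_t kNoEntity = ~0u;

struct Entity {
  uint32_t parent = kNoEntity;
  std::vector<uint32_t> children;
};

// Reparent `child` under `parent`. A production version would also reject
// parenting an entity to one of its own descendants.
void SetParent(std::vector<Entity> &entities, uint32_t child, uint32_t parent) {
  Entity &c = entities[child];
  if (c.parent != kNoEntity) {
    // Detach from the old parent: a node can only ever have one parent.
    auto &siblings = entities[c.parent].children;
    for (size_t i = 0; i < siblings.size(); i++) {
      if (siblings[i] == child) { siblings.erase(siblings.begin() + i); break; }
    }
  }
  c.parent = parent;
  if (parent != kNoEntity) entities[parent].children.push_back(child);
}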

instances of instances (tree)
SDF Operations: shell and elongation

Debug Rendering

I had been adding a bunch of common code to Dust when I realized that I barely had any sort of debugging primitives (e.g., sphere, cone, box, etc.)

debugging lines / circles / points
debugging spheres / cylinders / boxes
debugging box w/ dimmed backfaces
debugging sphere / cylinder / capsule dimmed backfaces
debug text labels
eliminating bogus event models for Dust

To prove this system out, I built a little top down shooter game over ~2 weeks using the debug geometry and an entity system inspired by the source engine (via the wiki).

This is a WIP, if you found this and want more, let @tmpvar know!

August

I started thinking about all of this stuff I've been doing over the past couple of years and realized that from an external perspective it looks like I just vanished. I've grown very tired of social media websites and the psychological issues that they cause.

first post on the new tmpvar.com!

I also spent some time designing and cutting out some mechanical bits for a proof of concept human input device that works like a pen but has force feedback.

designing an encoder wheel
designing a modular arm segment
tumbled parts

September

added an overhead light to the cnc desk!

Simon Brown posted a tip about using the sign bit to differentiate between -0 and +0, which makes isosurface extraction more robust.

As soon as I saw it I knew it was the source of some issues that I had been seeing but hadn't been able to identify.

// JavaScript
const IsNegative = (function () {
  let isNegativeScratch = new DataView(new ArrayBuffer(8))
  return function IsNegative(value) {
    isNegativeScratch.setFloat64(0, value, true)
    let uints = isNegativeScratch.getUint32(4, true)
    return (uints & 1 << 31) != 0
  }
})();

// C
static inline bool
IsNegative(f32 a) {
  u32 *p = (u32 *)&a;
  return (*p >> 31) != 0;
}

// WGSL
fn IsNegative(v: f32) -> bool {
  let bits = bitcast<u32>(v);
  return (bits & (1<<31)) != 0;
}

Applying this to the chunker almost cut the number of collected leaves in half (1.7M -> 900K), making evaluation ~2x faster. What a huge win from such a simple concept! Thanks SJB!

negative bit trick for much win
negative bit trick grab bag testing
weird landscape made of spheres and perlin noise
weird landscape made of spheres and perlin noise

Radiance Cascades

On August 8th, Graphics Programming weekly - Issue 299 - August 6th, 2023 popped into my inbox and the first item was a talk by Alexander Sannikov at Exilecon. Mixed in with a bunch of other techniques that I really want to try was this notion of radiance cascades - a constant-time GI implementation. Given that I've been stuck using Image Based Lighting for the splat renderer, this was really interesting. I'd really like to add dynamic lighting.