... but GPUs aren't always the right choice

In the last post, the fictional Toilet Toilers Interactive found themselves in a bit of a pickle with their GPU physics-and-everything-else pipeline. They might be regretting some of their technical choices, but it’s worth looking back at some of the things they might have considered during development. Why might they have decided to walk such an unusual path, and why didn’t bepuphysics v2?

GPUs are indeed fast

Bepuphysics v2 is primarily sensitive to two architectural details: memory bandwidth and floating point throughput. Moving from a quad core 4-wide SIMD CPU with dual channel DDR3 memory like the 3770K to a 7700K with AVX2 and higher clocked DDR4 can give you huge speedups. That’s despite the fact that it’s still just a quad core and the general purpose IPC/clock improvements from Ivy Bridge to Kaby Lake aren’t exactly groundbreaking.

Suppose we wanted to go crazy, and suppose I also managed to address some potential scheduling issues on high core count systems. AMD’s EPYC 7601 could be a good choice. Even though Zen 1 doesn’t have the same per-core floating point throughput as contemporary Intel CPUs or Zen 2, the massive number of cores enables a respectable 600+ gflops. With eight channels of DDR4, you can pretty easily hit 3 to 4 times the effective memory bandwidth of a 7700K (120+ GBps).

Now consider a high end graphics card. The 7601 is on the premium end of the pricing spectrum, so we’ll look at a 2080 Ti: 13.4 tflops and 616 GBps of memory bandwidth. The Radeon Instinct MI60 (with a large chunk of HBM2) has about 1 TBps of bandwidth.

Even if you drop the budget into more reasonable territory, the gap between CPU and GPU persists. Graphics cards are simply designed for massive throughput and are ideal for highly parallel workloads that can make use of their extremely wide hardware.

Graphics are clearly a perfect fit for graphics hardware, but graphics cards have become progressively more general in their capabilities. We’ve moved from fixed function transforms up to a level of generality spanning cryptocurrency mining, protein folding, ray tracing, and yes, physics simulation.

So, if the Toilet Toilers thought they could get such a huge benefit by embracing GPU execution fully, why would I build bepuphysics v2 for the CPU?

Algorithms and architectures

Achieving decent multithreaded scaling generally involves phrasing the computation as a giant for loop where every iteration is independent of the others. Bonus points if you can guarantee that every iteration is doing the exact same type of work, just on different data. This applies to both CPUs and GPUs, but GPUs will tend to massively outperform CPUs in this use case.
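
As a minimal sketch of that shape (the types and the loop here are illustrative, not engine code), an embarrassingly parallel computation is just a loop whose iterations never touch each other’s data:

```cpp
// Minimal sketch of the 'giant independent for loop' shape: every iteration reads and
// writes only its own element and does the same kind of work, so it maps cleanly onto
// wide parallel hardware, whether CPU threads or GPU lanes.
#include <algorithm>
#include <execution>
#include <vector>

struct Body
{
    float position[3];
    float velocity[3];
};

void IntegratePositions(std::vector<Body>& bodies, float dt)
{
    std::for_each(std::execution::par_unseq, bodies.begin(), bodies.end(),
        [dt](Body& body)
        {
            // No iteration touches another body's data.
            for (int i = 0; i < 3; ++i)
                body.position[i] += body.velocity[i] * dt;
        });
}
```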

Case study: collision batching

We don’t always have the luxury of doing the exact same type of independent task a million times. Consider the narrow phase of collision detection. In bepuphysics v2, each worker generates an unpredictable stream of collision pairs. The type, number and order of the pairs are not known ahead of time. This introduces some complexity. One potential GPU implementation would be to have a pass dedicated to extracting all the jobs from the collision pair stream and adding them to big contiguous lists. After the pass completed, you’d have the list of every sphere-sphere pair in one spot, every sphere-capsule in another spot, and so on. It could look something like this:

[Figure: each worker’s stream of collision pairs being sorted into contiguous per-type batches (sphere-sphere, sphere-capsule, and so on).]

You could then dispatch the appropriate collision handler for each pair type. With a sufficient number of pairs, the GPU would achieve a decent level of occupancy.
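
As a rough illustration of that two-pass idea (written here as plain C++ rather than an actual compute shader, with made-up names), the first pass could append each pair to a contiguous per-type list through an atomic counter:

```cpp
// Illustrative only: a real version would be a compute shader, with these atomics
// standing in for the global atomic counters a GPU implementation would use.
#include <atomic>
#include <cstdint>
#include <vector>

enum class PairType : uint32_t { SphereSphere, SphereCapsule, Count };

struct CollisionPair
{
    uint32_t a, b;
    PairType type;
};

struct TypeBatch
{
    std::vector<CollisionPair> pairs;  // Resized to a worst-case capacity before the pass, like a GPU buffer.
    std::atomic<uint32_t> count{ 0 };  // Shared counter; every worker claims slots atomically.
};

// Pass 1: every worker appends each pair it finds to the contiguous list for that pair's type.
void BatchPair(TypeBatch* batchesByType, const CollisionPair& pair)
{
    auto& batch = batchesByType[static_cast<uint32_t>(pair.type)];
    uint32_t index = batch.count.fetch_add(1, std::memory_order_relaxed);
    batch.pairs[index] = pair;
}

// Pass 2 would then read each type's list back and run the matching collision routine
// over it, which is where the write-then-read bandwidth cost shows up.
```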

But there remain some concerns. On the GPU, you’ve got potentially thousands of workers. Are you going to write a single shared set of type batches with global atomic counters? And you spent all that time writing everything in the first pass, only to then go back and read it all again in the second pass- that’s a lot of memory bandwidth used!

You’re not yet out of options, but now things start getting more complicated. Maybe you could create an intermediate set of type batches that accumulates a small local set in thread group shared memory that gets flushed to the main set only periodically, cutting down the number of global atomic calls. Maybe you do something clever with wave operations.

But even if the bookkeeping costs aren’t a concern, the theoretically pointless write-then-read issue is still there. Maybe you could take advantage of some features that are implied by hardware accelerated ray tracing. Ray tracing with unpredictable target surface shaders is actually a very similar problem, and as hardware adapts to support ‘callable shaders’ efficiently, options could open.

In CPU-land, we can just stick each pair into a thread local type batch. When the type batch count reaches some execution threshold- say, 16 or 32 entries- flush it. In most cases, that means the data in the type batch is still in L1 cache, and failing that, it will almost always still be in L2 cache. Swapping tasks from pair collection to executing the pair processor for a type batch does carry some overhead, but for a CPU, we’re talking about a few dozen nanoseconds on the high end. That doesn’t really matter when the execution of the type batch is expected to take a handful of microseconds.
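
A sketch of that thread local batching pattern, using illustrative names rather than bepuphysics v2’s actual types:

```cpp
// Each worker owns its own batches; a batch gets processed as soon as it reaches a small
// threshold, while its contents are almost certainly still in L1 or L2.
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t FlushThreshold = 16;

struct CollisionPair { uint32_t a, b; };

struct ThreadLocalTypeBatch
{
    std::vector<CollisionPair> pairs;

    template<typename TPairProcessor>
    void Add(const CollisionPair& pair, TPairProcessor& processor)
    {
        pairs.push_back(pair);
        if (pairs.size() >= FlushThreshold)
        {
            // Still cache-hot; execute the pair processor for this type right away.
            processor.Execute(pairs.data(), pairs.size());
            pairs.clear();
        }
    }

    template<typename TPairProcessor>
    void Flush(TPairProcessor& processor)
    {
        // Handle whatever is left over when pair collection ends.
        if (!pairs.empty())
        {
            processor.Execute(pairs.data(), pairs.size());
            pairs.clear();
        }
    }
};
```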

As the workarounds showed, implementing something similar on a GPU is not fundamentally impossible, but it’s not natural for the device and you probably won’t be achieving that 13.4 teraflops number with a bunch of code like this (at least for a while). At a high level, CPUs are latency optimized and handle these sorts of small scale decisions easily, while GPUs are throughput optimized. That focus is reflected across the hardware, tooling, and programming model; going against the grain is uncomfortable.

Case study: dynamic hierarchy broad phase collision testing

Algorithms vary in their parallelizability. Some are stubbornly resistant to any kind of multithreading, others require a bit of effort to tease out the parallelism, and others are just embarrassingly parallel. In many cases, when there exist sequential and parallel algorithms for the same task, the more sequential version has a significant advantage that can only be overcome by throwing quite a bit of hardware at the parallel version.

Bepuphysics v2’s broad phase contains an interesting example of this phenomenon. It uses a dynamically updated binary tree. To find out which collision pairs exist, all leaves of this tree must somehow be tested against the tree.

The naive algorithm is very simple and is a decent fit for a GPU or other very wide machine: loop over every collidable and test its bounding box against the tree. Every test is completely independent.
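
Sketched against a made-up node layout (not bepuphysics v2’s actual tree), the naive approach is just an independent query per leaf:

```cpp
#include <cstdint>
#include <utility>
#include <vector>

struct BoundingBox { float min[3], max[3]; };

inline bool Intersects(const BoundingBox& a, const BoundingBox& b)
{
    for (int i = 0; i < 3; ++i)
        if (a.max[i] < b.min[i] || b.max[i] < a.min[i])
            return false;
    return true;
}

// A child is either another node (nonnegative index) or a leaf (encoded as ~leafIndex).
struct Child
{
    BoundingBox bounds;
    int32_t index;
};

struct Node { Child a, b; };

// Collects every leaf whose bounds overlap the query box.
void Query(const std::vector<Node>& nodes, const Child& child,
    const BoundingBox& query, std::vector<int32_t>& results)
{
    if (!Intersects(child.bounds, query))
        return;
    if (child.index < 0)
    {
        results.push_back(~child.index);
        return;
    }
    const Node& node = nodes[child.index];
    Query(nodes, node.a, query, results);
    Query(nodes, node.b, query, results);
}

// Every iteration is independent, which suits very wide hardware,
// but the tree's nodes get reloaded from the root for every single leaf.
void NaiveBroadPhase(const std::vector<Node>& nodes, const Child& root,
    const std::vector<BoundingBox>& leafBounds,
    std::vector<std::pair<int32_t, int32_t>>& pairs)
{
    for (int32_t i = 0; i < static_cast<int32_t>(leafBounds.size()); ++i)
    {
        std::vector<int32_t> overlaps;
        Query(nodes, root, leafBounds[i], overlaps);
        for (int32_t j : overlaps)
            if (j > i) // Skip the self overlap and the duplicate symmetric pair.
                pairs.emplace_back(i, j);
    }
}
```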

But note that we’re loading the tree’s nodes over and over again. The whole tree could be multiple megabytes, so tiny core-local caches won’t be able to serve most requests. Even without the extra cost of hitting more distant caches or main memory, that’s still a lot of bounding box tests. As always, there are some potential optimizations or workarounds, and as usual, they come with some complexity penalties.

There exists a very simple algorithm that avoids the need for most of this work. Consider two separate bounding trees- test the roots against each other; if they intersect, test all combinations of their children (for a pair of binary trees: childA0-childB0, childA0-childB1, childA1-childB0, childA1-childB1); for the pairs that intersect, recursively continue the same process. The result is all overlaps between leaves in tree A and tree B.

Now, we can go one step further: run that same process, except between a tree and itself. Note that there’s no need to do many of the intersection tests because any node obviously overlaps with itself. Not only has the number of intersection tests dropped considerably compared to the naive approach, but we’ve also eliminated all the redundant node loads.
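
Reusing the illustrative Node and Child types from the previous sketch, the tree-versus-tree test and its self test variant look roughly like this:

```cpp
// Continues the Node/Child sketch above: descend into child pairs only when their
// bounds overlap, reporting leaf-leaf overlaps along the way.
void TestChildren(const std::vector<Node>& nodes, const Child& a, const Child& b,
    std::vector<std::pair<int32_t, int32_t>>& pairs)
{
    if (!Intersects(a.bounds, b.bounds))
        return;
    if (a.index < 0 && b.index < 0)
    {
        pairs.emplace_back(~a.index, ~b.index); // Two leaves: that's an overlap.
    }
    else if (a.index < 0)
    {
        const Node& nodeB = nodes[b.index];
        TestChildren(nodes, a, nodeB.a, pairs);
        TestChildren(nodes, a, nodeB.b, pairs);
    }
    else if (b.index < 0)
    {
        const Node& nodeA = nodes[a.index];
        TestChildren(nodes, nodeA.a, b, pairs);
        TestChildren(nodes, nodeA.b, b, pairs);
    }
    else
    {
        const Node& nodeA = nodes[a.index];
        const Node& nodeB = nodes[b.index];
        TestChildren(nodes, nodeA.a, nodeB.a, pairs);
        TestChildren(nodes, nodeA.a, nodeB.b, pairs);
        TestChildren(nodes, nodeA.b, nodeB.a, pairs);
        TestChildren(nodes, nodeA.b, nodeB.b, pairs);
    }
}

// Self test: a node always overlaps itself, so no node-versus-itself box tests are needed;
// just recurse into each child's own subtree plus the single cross pair between them.
void SelfTest(const std::vector<Node>& nodes, int32_t nodeIndex,
    std::vector<std::pair<int32_t, int32_t>>& pairs)
{
    if (nodeIndex < 0)
        return; // A lone leaf contains no pairs.
    const Node& node = nodes[nodeIndex];
    SelfTest(nodes, node.a.index, pairs);
    SelfTest(nodes, node.b.index, pairs);
    TestChildren(nodes, node.a, node.b, pairs);
}
```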

This algorithm is not horribly sequential- it spawns independent work recursively- but it’s not embarrassingly parallel either. In other words, if you’ve got 64 fat cores, you can create enough worker tasks to fill the hardware cheaply. But, if your target architecture has many thousands of very weak threads, suddenly the setup work is significant.

(It’s worth mentioning that the ‘threads’ exposed by graphics APIs don’t map to the same sort of thread you get on an x86 CPU. They’re more like lanes in wide SIMD units. You can take advantage of this sometimes, but again, going against the grain comes with a complexity penalty.)

So what does all this mean? To naturally fit a GPU, you’ll generally want to pick algorithms that suit its execution style. Sometimes this means paying an algorithmic penalty to use a more parallel implementation. In these cases, you sacrifice some of that crazy compute power in order to use it at all.

Of course, even a handicapped GPU can still beat a CPU sometimes.

Case study: bookkeeping

The least exciting parts of a simulation can sometimes be the hardest to bring over to extremely wide hardware. Consider the process of adding constraints to the solver based on collision detection results. Every constraint addition requires multiple sequential steps, and separating them into even somewhat parallel tasks involves terrible complexity.

Each new constraint must identify a solver batch that does not share any bodies with the new constraint, then it must find (and possibly create or resize!) a type batch for the new constraint to belong in. And then it has to reach into the per-body constraint lists to ensure a traversable constraint graph. And so on. It’s tied for my least favorite part of the engine, and it’s all because I attempted to extract parallelism where there was very little available.
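
To make the sequential dependency concrete, here’s a heavily simplified sketch of those steps; none of these types or names correspond to bepuphysics v2’s internals:

```cpp
// Each step depends on the previous one, which is why constraint addition resists
// wide parallelization. All containers here are illustrative stand-ins.
#include <cstdint>
#include <unordered_map>
#include <unordered_set>
#include <vector>

struct TypeBatch
{
    int32_t typeId = 0;
    std::vector<int32_t> constraintHandles;
};

struct SolverBatch
{
    std::unordered_set<int32_t> referencedBodies;
    std::unordered_map<int32_t, TypeBatch> typeBatches; // Keyed by constraint type.
};

struct Solver
{
    std::vector<SolverBatch> batches;
    std::unordered_map<int32_t, std::vector<int32_t>> constraintsPerBody;

    void Add(int32_t constraintHandle, int32_t typeId, const std::vector<int32_t>& bodies)
    {
        // 1) Find the first solver batch that shares no bodies with the new constraint.
        SolverBatch* target = nullptr;
        for (auto& batch : batches)
        {
            bool sharesBody = false;
            for (int32_t body : bodies)
                if (batch.referencedBodies.count(body)) { sharesBody = true; break; }
            if (!sharesBody) { target = &batch; break; }
        }
        if (!target)
        {
            batches.emplace_back();
            target = &batches.back();
        }
        // 2) Find or create the type batch for this constraint type (may allocate or resize).
        TypeBatch& typeBatch = target->typeBatches[typeId];
        typeBatch.typeId = typeId;
        typeBatch.constraintHandles.push_back(constraintHandle);
        // 3) Update per-body constraint lists so the constraint graph stays traversable.
        for (int32_t body : bodies)
        {
            target->referencedBodies.insert(body);
            constraintsPerBody[body].push_back(constraintHandle);
        }
    }
};
```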

And then you have even more fun things like sleeping and waking. Beyond merely adding a bunch of constraints and updating all the related bodies to match, it also moves a bunch of bodies and updates the broad phase at the same time. (I’d link a bunch of stuff related to it, but I can’t be bothered to go find all the tendrils.)

Even with all that effort, the scaling is still pretty bad, especially sleep/wake, which bottlenecks badly on the broad phase modifications. I aim to improve that at some point, but I really, really don’t look forward to it. On the upside, our big fat latency optimized CPUs still do reasonably well. Bookkeeping is rarely a noticeable fraction of frame time unless you happen to be sleeping a 16000-body pile.

Without the benefit of big x86 cores to brute force through these locally sequential chunks, this work could become a serious problem. Trying to do all of this work on a few GPU ‘threads’ (in the HLSL sense of the word) is a nonstarter. Realistically, it would require a completely different approach to storage- or punting the work to a fatter core. It wouldn’t be enjoyable in any case.

The pain of asynchrony

Any time your simulation is executing out of step with some interacting system (like game logic), a blob of complexity is introduced. All interactions between the systems must be carefully managed. This applies to asynchronous use on both the CPU and GPU, but running a simulation on the GPU will tend to imply a thicker barrier. Even merely moving data across PCIe takes extra time.

Game logic often benefits from fairly tight integration with the simulation. Writing a callback that can modify the simulation state or report data on the fly gets a lot harder if that callback needs to execute on the GPU. In practice, a whole lot of simulation state will only be indirectly accessible. You can definitely still work with it, but everything gets a little bit harder.

There’s also the option of embracing it entirely like the Toilet Toilers did- executing even most of the game logic on the GPUs. I might call you nuts, but you could do that. The helpfulness, speed, and stability of tools in GPU land typically struggle to meet the expectations you might have from working in CPU land…

All the asynchronous pain disappears if the only consumer of your simulation is graphics and your graphics and simulation are synchronized. Noninteractive debris, many forms of cloth, particles and other decorative simulations are great use cases for GPU physics that avoid asynchrony’s suffering entirely. That just wasn’t my target use case for bepuphysics v2.

Comparative advantage

In many games, all the crazy horsepower that a GPU brings to the table is already spoken for. Rendering 4K@120hz with ray traced global illumination is still beyond all current hardware, and by the time we have the hardware for it, we’ll be trying to render dense lightfields for VR or something.

Is it worth turning down some of that graphical fanciness to run physics? If we assume that a GPU implementation is 4 times faster and that all the concerns above are dealt with, we could move a 6 millisecond CPU simulation to the GPU for a mere 1.5 milliseconds. If the simulation and rendering are both targeting 144hz, then you’ve cut your rendering budget by roughly 20% (1.5 ms out of a ~6.9 ms frame), but massively increased your CPU budget! … now what are you going to do with that CPU budget?

You could waste a bunch more time in the driver by spamming draw calls, I suppose. You could generate a wonderful number of draw calls with DX12 or Vulkan and multiple threads. Or you could use indirect rendering and a gpu driven pipeline…

Maybe you could do some fancy sound tracing and dynamic audio effects! But a lot of the queries would fit the simulation structures really well, and pulling that across PCIe every frame seems wasteful, and the actual math involved would fit the GPU…

Or maybe use some fancy machine learning techniques to animate your characters, but then GPUs (especially ones with dedicated hardware for it) do inference way faster than CPUs…

I don’t mean to imply that there’s nothing you could spend that CPU time on, just that a lot of heavy lifting does map well to the GPU. Despite that, there’s no point in leaving the CPU idle. It makes sense to give the CPU work that it can do reasonably well, even if the GPU could technically do it faster- because the GPU is doing something else already!

(Backend use cases can be interesting- the lack of rendering on the server could open up space for many more jobs. You’d still suffer from many of the other issues listed here, but at least you’d have the computational resources freed up.)

The deciding factor

I’ve gone through a variety of concerns with GPU simulations, but I’ve also tried to offer some workarounds in each case. It’s definitely possible to make a speedy GPU simulation- it has been done, after all.

But there’s one problem that I can’t find a way around. Making a complex GPU-heavy application run reliably across different hardware vendors, hardware generations, operating systems, and driver versions is extremely painful. Past a certain level of complexity, I’m not actually convinced that it’s possible without having a direct line to all the involved engineering teams and a few metric hecktons of dollars. I do not have that many dollars, and not many people know I exist; hitting a particularly dangerous driver bug can kill an entire line of research.

And, even if I manage to achieve a reasonable level of stability somehow, regressions in newer drivers- especially on older hardware- are not uncommon. The fate of the Toilet Toilers was only slightly exaggerated, and only in terms of temporal density.

I can try to report all the issues to all the different parties involved, but honestly, I can’t blame them if they don’t fix a problem even 5 years later. Some of these problems are just weird little corner cases. Some of them don’t apply to any modern hardware. It makes perfect sense to spend limited development resources on the projects that are actually going to drive many millions or billions of dollars in sales. (Maybe one day!)

You might notice that this line of thinking almost suggests that any GPU development is a lost cause for a small player, not just GPU physics. As frustrating as the ecosystem is, I’m not willing to go that far. There are a few important distinctions between graphics and physics.

In a physics simulation, the result of one time step feeds into the next, and into the next, and so on. If any stage fails at any point, it can cause the entire simulation to fail. For example, a single NaN will propagate through a constraint graph within a frame or two. Within another frame or two, the simulation will probably crash.

Graphics tend to be slightly more resilient. There are still types of bugs which can bring everything crashing down, certainly, but most of the milder bugs can be survived. They might just result in a bit of visual corruption here or there- once it’s out of any temporal accumulation windows, the problem may resolve.

In other words, in graphics-like tasks, data dependency chains tend to be short. Error does not have a chance to feed upon itself catastrophically, or the error is bounded.

Further, graphics engines often have lots of tuning knobs or multiple implementations of effects for different quality levels. Bugs can sometimes be worked around by fiddling with these knobs. Physics simulators usually don’t have as many degrees of freedom.

Finally, if you are working on graphics as a smaller developer, chances are you’re targeting just one API and operating system. Focusing on, say, DX12 on Win10 alone won’t be easy, but it’s not the nightmare that supporting other operating systems/APIs simultaneously would be. (The same can be said for some client-only GPU physics use cases, but if you want to run physics serverside, it’ll probably need to run on things that aren’t windows- because the server licensing model is the only thing worse than driver bugs.)

But even for graphics, the situation is nasty enough that using one of the super popular graphics engines is wise if you intend to target a variety of platforms. It’s a well-worn path that a thousand other developers have walked already; they have suffered for you.

Conclusion

If I built bepuphysics v2 to run on the GPU, I could probably make it faster. It would take a long time and I’d want to gouge out my eyes at many points during the process, but I could do it.

It wouldn’t match up with how I want to integrate it into other projects, and I have other things I can spend those GPU cycles on.

Plus, v2 performance already approaches (and occasionally exceeds) some GPU rigid body simulators, so I’m pretty okay with it.