Years ago, someone asked me if I would add the ability to write shader assembly to the graphics drivers I was writing. My reply was: you really don't want to do that. I've been working on adding switch-statement support to Warp3D Nova's drivers, and decided to explain why writing GPU code is hard:

  • Modern GPUs have Single-Instruction, Multiple-Thread (SIMT) architectures
  • This means that multiple threads execute the same instruction in lockstep (typically in groups of 32/64 threads)
  • Threads may need to take different code paths (e.g., if/else), but all threads in a group must execute the same instruction stream
  • So, you need to execute all paths, enabling/disabling threads using the EXEC mask
  • For loops need to keep track of which threads have exited the for-loop (using a loop active thread mask)
  • Functions need to keep track of which threads have exited (using a function active thread mask)
  • Pixel/fragment shaders may kill/discard pixels, so you need to keep track of which threads have permanently exited (using yet another active thread mask)
  • Even if you can handle that, you can have if/else nested inside for-loops, switch statements, and function calls, which in turn are nested inside something else
  • Control-flow instructions (if/else, etc.) end up distributed through the code
  • All branches, returns or other control-flow need to be analysed, so you can insert conditional
  • You need to "think in parallel" when writing SIMT code
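
To make the masking idea concrete, here is a minimal Python sketch that simulates one small group of threads running an if/else. It is not real GPU code and not taken from Warp3D Nova: `WAVE_SIZE`, `run_if_else`, and the list-of-booleans mask are names made up for illustration, and real hardware keeps the EXEC mask in a 32- or 64-bit register rather than a list.

```python
WAVE_SIZE = 8  # assumption: a tiny group for readability; real GPUs use 32/64


def run_if_else(values):
    """Simulate `if (v % 2 == 0) v = v / 2; else v = 3 * v + 1;` for one group."""
    assert len(values) == WAVE_SIZE
    out = list(values)
    exec_mask = [True] * WAVE_SIZE  # every lane is active on entry

    # Every active lane evaluates the condition.
    cond = [exec_mask[lane] and out[lane] % 2 == 0 for lane in range(WAVE_SIZE)]

    # "then" side: enable only the lanes whose condition was true.
    saved_mask = exec_mask
    exec_mask = cond
    for lane in range(WAVE_SIZE):
        if exec_mask[lane]:  # stands in for the hardware's per-lane masking
            out[lane] = out[lane] // 2

    # "else" side: flip to the lanes that were active but failed the condition.
    exec_mask = [saved_mask[lane] and not cond[lane] for lane in range(WAVE_SIZE)]
    for lane in range(WAVE_SIZE):
        if exec_mask[lane]:
            out[lane] = 3 * out[lane] + 1

    # End of the if/else: restore the mask that was live before it.
    exec_mask = saved_mask
    return out, exec_mask
```

Note that both sides of the if/else run for the whole group, one after the other. Forgetting to restore `saved_mask` at the end is exactly the kind of mistake that silently disables threads for the rest of the shader.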

The end result is that it's very easy to get wrong and hard to debug: you end up with masses of code, and that code doesn't do what it looks like it's doing (e.g., a conditional branch is not implemented as a conditional branch). The code looks single-threaded, but you're executing multiple threads at once, in lockstep.
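
As a rough illustration of why the code stops looking like what it does, here is a small Python sketch of a for-loop where each thread wants a different number of iterations. The function `run_loop` and the `loop_mask` list are hypothetical names for this example only; they stand in for the loop active thread mask described above, and are not how any particular driver implements it.

```python
def run_loop(trip_counts):
    """Simulate `for (i = 0; i < n; i++) acc += i;` where n differs per lane."""
    lanes = len(trip_counts)
    acc = [0] * lanes
    i = [0] * lanes
    loop_mask = [True] * lanes  # lanes that are still inside the loop
    group_iterations = 0        # how many times the whole group ran the body

    while True:
        # A lane whose exit test fails does not branch away; it is masked off.
        loop_mask = [
            loop_mask[lane] and i[lane] < trip_counts[lane]
            for lane in range(lanes)
        ]
        # The only real branch: leave once no lane is active any more.
        if not any(loop_mask):
            break
        for lane in range(lanes):
            if loop_mask[lane]:  # stands in for the hardware's per-lane masking
                acc[lane] += i[lane]
                i[lane] += 1
        group_iterations += 1

    return acc, group_iterations
```

For example, `run_loop([0, 1, 3, 5])` runs the loop body five times for the whole group, even though one lane wanted zero iterations. The per-thread `i < n` test in the source became a mask update, and the actual branch tests something that never appears in the source: whether any lane is still active.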

P.S. Watch the video for a fuller explanation.