Almost all modern GPUs use the Single-Instruction Multiple-Thread (SIMT) architecture. With SIMT, a group of threads (a wavefront, or warp in NVIDIA terms) executes the same instruction in lock-step. Apart from making it a pain to program, this also means that when threads take different code paths (e.g., an if/else), the GPU must execute each branch one after the other instead of simultaneously. This is rather inefficient, so why use SIMT?
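
To make that branch-serialization cost concrete, here is a minimal CUDA sketch (names like `divergent_kernel` are just illustrative, not from any particular codebase). Because even- and odd-numbered threads sit in the same warp, every warp straddles the branch, and the hardware runs the two paths one after the other, masking off the inactive lanes each time.

```cuda
#include <cuda_runtime.h>

// Every warp straddles this branch: even lanes take one path, odd lanes the
// other, so the hardware serialises the two paths, masking inactive lanes.
__global__ void divergent_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (i % 2 == 0) {
        data[i] *= 2.0f;   // runs first, odd lanes masked off
    } else {
        data[i] += 1.0f;   // runs second, even lanes masked off
    }
}

int main()
{
    const int n = 1 << 20;
    float *d_data = nullptr;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    divergent_kernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```

If the branch instead depended on something uniform across the warp (say, a constant passed in from the host), there would be no divergence at all, which is exactly the situation the coherence argument below relies on.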

Here's a hint: all major GPU vendors have switched to SIMT, including AMD and NVIDIA. Since AMD and NVIDIA are vying for the top spot, they'll have done their homework and picked the architecture that gives them the best performance.

Here are three ways SIMT makes up for the extra overhead of branching:

  1. Adjacent pixel similarity (coherence): Threads in a wavefront operate on neighbouring pixels (or vertices). Neighbouring pixels are often part of the same surface, so they share the same material properties and therefore tend to execute the same branches through the shader. When that happens, every thread follows the same code path and the divergence overhead disappears (see the first sketch after this list).
  2. Reduced instruction fetch bandwidth: Fetching instructions from VRAM takes up bandwidth, and fetching instructions for a thousand or more threads from completely different locations takes up even more. This can clog the memory (and even cache) bus, leaving the GPU waiting on instruction fetches. Having all threads in a wavefront execute the same instruction cuts the required fetch bandwidth by a factor of the wavefront size (typically 32 or 64 threads), which reduces the chance that bus congestion slows the GPU down.
  3. Latency hiding: The GPU keeps many wavefronts in flight and switches to another one whenever the current wavefront stalls on something like a texture fetch. This hides the latency by executing parts of other wavefronts while the stalled one waits. The net result is that the entire workload (e.g., all pixels on the screen) finishes sooner than if the GPU sat idle whenever an instruction was waiting on memory (see the second sketch after this list).
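
To illustrate point 1, here is a hedged sketch of a toy "pixel shader" written as a CUDA kernel. Everything in it (the `shade` kernel, the `Material` tags, the buffer names) is made up for illustration. The point is only that the branch depends on per-pixel data, yet because a warp covers a small tile of adjacent pixels that usually belong to one surface, all of its lanes typically read the same material and take the same path, so no serialization occurs.

```cuda
#include <cuda_runtime.h>

// Hypothetical material tags, purely for illustration.
enum Material { MATERIAL_DIFFUSE = 0, MATERIAL_METAL = 1 };

// Toy "pixel shader": one thread per pixel. The branch is data-dependent,
// but neighbouring pixels usually share a material, so whole warps tend to
// agree on which path to take and no divergence penalty is paid.
__global__ void shade(const int *material_id, float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    int pixel = y * width + x;

    if (material_id[pixel] == MATERIAL_METAL) {
        out[pixel] = 0.9f;   // stand-in for a "metal" shading path
    } else {
        out[pixel] = 0.3f;   // stand-in for a "diffuse" shading path
    }
}

int main()
{
    const int width = 1920, height = 1080;
    const int n = width * height;

    int   *d_material = nullptr;
    float *d_out = nullptr;
    cudaMalloc((void **)&d_material, n * sizeof(int));
    cudaMalloc((void **)&d_out,      n * sizeof(float));
    cudaMemset(d_material, 0, n * sizeof(int));   // pretend the whole frame is one diffuse surface

    dim3 block(8, 8);   // each block shades an 8x8 pixel tile
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    shade<<<grid, block>>>(d_material, d_out, width, height);
    cudaDeviceSynchronize();

    cudaFree(d_material);
    cudaFree(d_out);
    return 0;
}
```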
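
Point 3 is a property of the hardware scheduler rather than of any particular program, but the sketch below (again with made-up names) shows the shape of code that benefits from it: each thread issues a long-latency global load, and because the launch creates far more warps than the GPU can run at once, the scheduler always has other warps to issue from while some are stalled on memory.

```cuda
#include <cuda_runtime.h>

// Each thread performs one high-latency global load followed by cheap math.
// While a warp waits for its load, the SM's scheduler issues instructions
// from other resident warps, which is the latency hiding described in point 3.
__global__ void scale(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float v = in[i];           // long-latency memory access; the warp may stall here
    out[i] = v * 1.5f + 2.0f;  // cheap arithmetic once the data arrives
}

int main()
{
    const int n = 1 << 24;   // far more threads than the GPU has execution units
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc((void **)&d_in,  n * sizeof(float));
    cudaMalloc((void **)&d_out, n * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));

    // Launching many more warps than can execute at once gives the scheduler
    // plenty of other work to switch to whenever a warp stalls on memory.
    scale<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```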

The net result of all three effects is that, in most cases, the GPU's performance comes out ahead of what it would achieve with other architectures.