Alain Thellier recently ported his Cow3D demo/test-program to Warp3D Nova, which allows Warp3D Nova and Warp3D to be compared directly. Results have been coming in thick and fast, along with speculation as to why the results are what they are. I was expecting Nova to deliver a larger boost, and it actually does; it's just partially hidden. So, let's see if we can make sense of it...

Here's a sampling of what people are typically getting:

Hardware                     Warp3D    Warp3D Nova   Boost
Sam460ex + Radeon HD 7750    130 fps   352 fps       2.7x
A1-X1000 + Radeon R7 250X    154 fps   448 fps       2.9x
A1-X5000 + Radeon R9 280X    354 fps   775 fps       2.2x

So Nova gives the Cow3D scene a 2-3x performance boost. Not bad, although not as large as expected. People have also noted that high-end GPUs (e.g., Radeon R9 280X) get very similar results to mid-range GPUs (e.g., Radeon R7 250X) when paired with the same CPU.
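For anyone wanting to compare their own numbers, the boost column is just the Nova frame rate divided by the Warp3D frame rate. A minimal sketch using the figures from the table above (the `results` dictionary and the `boost` helper are names made up for this example):

```python
# Frame rates copied from the table above: (Warp3D fps, Warp3D Nova fps).
results = {
    "Sam460ex + Radeon HD 7750": (130, 352),
    "A1-X1000 + Radeon R7 250X": (154, 448),
    "A1-X5000 + Radeon R9 280X": (354, 775),
}

def boost(warp3d_fps, nova_fps):
    """Speed-up of Warp3D Nova over Warp3D, as a plain ratio."""
    return nova_fps / warp3d_fps

boosts = {hardware: boost(*fps) for hardware, fps in results.items()}

for hardware, ratio in boosts.items():
    print(f"{hardware}: {ratio:.1f}x")
```

Add your own machine to `results` to see where it lands relative to the others.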

So what's happening? Why does the GPU power not seem to matter? And why is the X5000 doing so much better? To make sense of this you need some understanding of what affects GPU performance.

GPU Performance and Bottlenecks

Modern GPUs are complex beasts containing multiple subsystems. Render commands and data pass through multiple steps before they finally become pixels on the screen. Each step has a maximum throughput, and if that limit is reached then that step becomes the bottleneck that limits performance.

Here are some of the bottlenecks that can be encountered:

  • CPU bottlenecks:
    • CPU power - becomes a bottleneck when the CPU performs lots of calculations to generate the next frame. This includes updating the poses of all objects, performing memory/object management and generating the GPU's command stream
    • PCIe bandwidth - commands and vertex/texture/shader-constant updates all need to be sent to the GPU. This becomes a bottleneck when the GPU processes incoming commands & data faster than the CPU can send them. On AmigaOS this limit is currently lower than it could be (as of Jan. 2017) because the CPU writes the data to VRAM rather than getting the GPU to fetch it using DMA (I would love to fix this, but it'll take time)
    • Render-calls/s - there's an upper limit to how many render calls the driver can send the GPU, which is a combination of CPU power and PCIe bandwidth. Apart from generating and sending the commands, there's also memory/object management to be done (locking/unlocking/tracking buffers)
  • GPU bottlenecks:
    • Command-processor speed - there is a limit on how many incoming commands/s the GPU can execute
    • VRAM bandwidth - reading/writing vertices, textures and pixels costs bandwidth; when the limit is reached then the GPU can go no faster
    • Vertex fetch rate - the vertex fetch unit can have its own upper limit independent of VRAM bandwidth (vertices/s)
    • Texel fetch rate - the texture fetch units also have their own maximum rate (texels/s)
    • Fill rate - the render output unit has a limit on how many pixels it can write (pixels/s)
    • GPU power - there's a maximum number of instructions per second a GPU can perform (GFLOPS)

As you can see, there are a lot of potential bottlenecks. Which one(s) you hit depends on what you're rendering:

  • Render lots of tiny objects with few vertices and you'll likely hit the render-calls/s limit
  • Render models with huge numbers of vertices and eventually the GPU's vertices/s limit will become a bottleneck
  • Render a few objects at very high resolution and the fill-rate may become the bottleneck
  • Finally, use large complicated shaders and you may hit the GPU's GFLOPS limit

The last 3 bottlenecks depend on how fast or slow your GPU is. Low-end GPUs will hit those limits much earlier than high-end ones.
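One way to picture this is to treat each step as having its own ceiling for a given scene, with the slowest step setting the frame rate. The sketch below is a deliberately simplified model, and every number in it is invented for illustration (none of them are measurements of real hardware). Real pipelines overlap their stages, so taking the minimum is only an approximation.

```python
def frame_rate(stage_limits):
    """Return (fps, bottleneck): the slowest stage caps the whole pipeline.

    stage_limits maps a stage name to the fps that stage alone could
    sustain for the scene being rendered.
    """
    bottleneck = min(stage_limits, key=stage_limits.get)
    return stage_limits[bottleneck], bottleneck

# Hypothetical CPU-side limits (invented numbers, same CPU in every case).
cpu_side = {"cpu_power": 900, "pcie_bandwidth": 450, "render_calls": 700}

# Hypothetical GPU-side limits for three classes of card (invented numbers).
low_end_gpu = {"vertex_rate": 300, "fill_rate": 800, "shader_gflops": 600}
mid_range_gpu = {"vertex_rate": 1500, "fill_rate": 2000, "shader_gflops": 1800}
high_end_gpu = {"vertex_rate": 4000, "fill_rate": 6000, "shader_gflops": 5500}

low = frame_rate({**cpu_side, **low_end_gpu})
mid = frame_rate({**cpu_side, **mid_range_gpu})
high = frame_rate({**cpu_side, **high_end_gpu})

print("low-end: ", low)
print("mid-range:", mid)
print("high-end: ", high)
```

With these made-up limits the mid-range and high-end cards report the same frame rate, because both are waiting on the same CPU-side stage, while the low-end card is held back by one of its own limits.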

Which Bottleneck Limits Cow3D's Performance?

Short answer: PCIe bandwidth is the main bottleneck. Medium and high-end GPUs have similar performance because they're all rendering faster than the CPU can feed them. Warp3D Nova reduces the bandwidth needed by keeping the Cow3D model's vertices in VRAM, but there are still commands and data to transfer. The command stream is bigger than you think (lots of state to set up and maintain), and there's also data such as the cow's orientation that gets sent to the GPU every frame.

That's only part of the picture, though. I didn't realise it initially, but there's also something else consuming PCIe bandwidth: the fps counter and info-bar at the top of the window. Text rendering is still done in software on the CPU and, consequently, is relatively slow. It also consumes a fair bit of PCIe bandwidth. Cow3D's fps counter measures the overall fps including the time taken to render the info-bar. It's not measuring purely the Warp3D/Nova performance; results are skewed by the time taken to render the info-bar text. 

PCIe bandwidth also explains why the A1-X5000's results are higher; it has a pretty good PCIe controller with higher throughput for CPU-based transfers.

The Actual Performance (Minus Info-Bar Skew)

Alain includes Cow3D's source-code, so I built a custom version that doesn't render the info-bar (but does output the fps to the console). Here's the result:

Hardware                     Warp3D    Warp3D Nova   Boost
A1-X1000 + Radeon HD 7770    218 fps   1396 fps      6.4x

Boom! Just like that, the performance boost more than doubles (as does the base Warp3D fps). The info-bar was skewing results quite a bit.
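To get a feel for how much time the info-bar was eating, the frame rates can be converted to per-frame times and subtracted. This is only a rough estimate: it assumes the A1-X1000 rows in the two tables are comparable, even though they were measured with different graphics cards (R7 250X vs. HD 7770) and possibly on different machines. The function name `info_bar_cost_ms` is made up for this sketch.

```python
def info_bar_cost_ms(fps_with_bar, fps_without_bar):
    """Per-frame time attributed to the info-bar, in milliseconds."""
    return 1000.0 / fps_with_bar - 1000.0 / fps_without_bar

# A1-X1000 figures from the two tables above.
# Assumption: the different GPUs in the two tables don't affect the result.
warp3d_cost = info_bar_cost_ms(154, 218)
nova_cost = info_bar_cost_ms(448, 1396)

# Fraction of the Nova frame time (with info-bar) spent on the info-bar.
nova_share = nova_cost / (1000.0 / 448)

print(f"Warp3D: about {warp3d_cost:.1f} ms per frame on the info-bar")
print(f"Nova:   about {nova_cost:.1f} ms per frame on the info-bar")
print(f"Nova:   about {nova_share:.0%} of each frame")
```

Under that assumption, roughly two-thirds of each Nova frame in the original test went to the info-bar rather than to rendering the cow.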

I'd love to see what you get on your hardware, so you can download the modified Cow3D here: Cow3D6-NoInfoBar.lha. Please post your results in the comments below.

So, There's No Point in Getting A High-End Card?

I've seen a few people jump to this conclusion, and I want to make a case for high-end cards. Sure, high-end cards won't make much difference for Cow3D, but that's just one test. If Alain doubled the number of vertices in the cow then lower-end cards would probably hit their vertices/s limit. At that point Cow3D's fps would be more GPU dependent. In fact, if you had a really low-end GPU (e.g., a Radeon R7 240) then you'd already notice lower fps than with better cards. Now, let's say we double the number of vertices again... or use complex shaders... or increase the resolution... or... I hope you get my point.

The bottom line is: Cow3D and other benchmarks only give you the performance under specific conditions. Getting 100 fps with one game engine won't tell you how well another game will perform. This is why benchmarks like 3DMark run a suite of tests, each probing different limits/conditions.

When more Warp3D-Nova/OpenGL-ES-2 software is released (and there are some in the works), it's highly likely that some will work better on higher-end graphics cards. You're welcome to buy the cheapest card you can find, but you may find that card limiting in future. Of course, I have no idea when that day will come...