Taborizing Drivers & Libraries (a.k.a., Optimizing Code for the e500 SPE)

If you've been following the AmigaOS world then you'll no doubt have heard of the upcoming A1222 (codenamed Tabor). It's a nice little board; one that I'm planning to build a laptop out of. Anyway, the P1022 CPU on-board has a peculiarity: it's got a Signal Processing Engine (SPE) for floating-point calculations that's incompatible with the standard PowerPC FPU. While there will be a fast FPU emulator that uses the SPE, for some code it's worth compiling an SPE-native version.

In particular, the graphics drivers I've written use the FPU (and/or AltiVec) to copy data to/from graphics memory. That's performance critical, and so should be optimized for the SPE (a.k.a., "Taborized"). Taborizing libraries isn't as straightforward as it sounds, and I encountered some challenges. To save you the time, here's everything you need to know if you have libraries you want to Taborize.

1. Compiling for the SPE

You'll need a version of GCC that supports the SPE. That is a version between 5 and 7, inclusive (they removed SPE support in newer versions). After that, it's a simple matter of setting the right command line options:

-mspe -mcpu=8540 -mfloat-gprs=double -mabi=spe

These will compile code for the SPE. Additionally, Daniel Müßener (GoldenCode) has given some extra parameters to avoid some compiler bugs (GCC 5.4.0 generated code causing structure element alignment issues):

-fno-inline-functions -fno-partial-inlining -fno-align-functions -fno-align-jumps 
-fno-align-loops -fno-align-labels -fno-inline-small-functions 
-fno-indirect-inlining

NOTE: This may or may not be necessary with the latest GCC. I'm publishing it just in case.
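
As a quick sanity check that the flags above actually reached a given source file, you can test the `__SPE__` macro, which GCC predefines when SPE support is enabled. The following is only a minimal sketch: `REQUIRE_SPE_BUILD` and `built_for_spe()` are made-up names for illustration, not part of GCC or the AmigaOS SDK.

```c
/* Hypothetical guard for a source file that must be built with the SPE flags.
 * REQUIRE_SPE_BUILD is an illustrative macro name (an assumption); you would
 * pass -DREQUIRE_SPE_BUILD only when compiling the SPE-specific files. */
#if defined(REQUIRE_SPE_BUILD) && !defined(__SPE__)
#error "This file must be compiled with the SPE flags (-mspe etc.)"
#endif

/* Returns 1 when this translation unit was compiled with SPE support enabled
 * (GCC predefines __SPE__ in that case), and 0 otherwise. */
int built_for_spe(void)
{
#if defined(__SPE__)
    return 1;
#else
    return 0;
#endif
}
```

With a guard like this, accidentally compiling an SPE-only file with the regular flags fails at build time instead of silently producing FPU code.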

2. Compile Just the Critical Bits for the SPE

Recompiling the whole library for SPE sounds like the easiest solution. However, you probably can't because the Application Binary Interface (ABI) is different. PowerPC processors pass floating-point parameters to functions via the floating-point registers. Well, those registers don't exist on the P1022 (e500 core). Therefore, if your library/driver has public functions with floats/doubles, then compiling the entire library for the SPE would make it incompatible with regular code.

So compile just the critical bits for the SPE, and make sure that every function called from external code (i.e., every function that is part of the library's/device's API) is compiled for the regular PowerPC.
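
One way to picture the split is sketched below. It assumes that only pointers and integers cross the boundary between the regular code and the SPE code, so that no float is ever passed or returned between the two ABIs. The names `MyLib_ScaleVertices` and `scale_vertices_spe` are hypothetical, and the two functions are shown in one listing only for brevity; in a real project they would live in separate files built with different flags.

```c
#include <assert.h>
#include <stddef.h>

/* --- Would live in e.g. File_spe.c, compiled with CFLAGS_SPE ---
 * Hypothetical internal routine. Its parameters are pointers and an integer
 * only, so float argument passing never comes into play; the scale factor
 * is read from memory instead. */
void scale_vertices_spe(float *dst, const float *src, size_t count,
                        const float *scale)
{
    assert(dst != NULL && src != NULL && scale != NULL);
    for (size_t i = 0; i < count; ++i)
        dst[i] = src[i] * (*scale);
}

/* --- Would live in e.g. File.c, compiled with the regular CFLAGS ---
 * Hypothetical public API function. It keeps its float parameter, so
 * callers built for the standard PowerPC ABI are unaffected. */
void MyLib_ScaleVertices(float *dst, const float *src, size_t count,
                         float scale)
{
    /* Hand the float over by address rather than by value. */
    scale_vertices_spe(dst, src, count, &scale);
}
```

Whether passing by pointer is the best fit depends on the library; the point is only that the public entry point stays on the regular ABI while the hot loop is built for the SPE.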

Often it's unnecessary to recompile the whole thing, anyway. In the W3D_SI driver's case, only the code generating vertices and writing them to VRAM needed optimizing, and it was already set up to use AltiVec-optimized code on machines with AltiVec. So, just that code was compiled for the SPE. There was also no need for a separate SPE version of the driver, because the SPE-optimized code sits nicely alongside the regular FPU and AltiVec code.
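
Having several code paths side by side usually comes down to picking a function pointer once at start-up. The sketch below is not the W3D_SI driver's actual code: the enum, the function names, and the idea that detection yields a `cpu_unit` value are all assumptions for illustration, and the three variants share a plain C loop as a stand-in for the real FPU/AltiVec/SPE implementations.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result of whatever CPU detection the driver performs. */
enum cpu_unit { CPU_UNIT_FPU, CPU_UNIT_ALTIVEC, CPU_UNIT_SPE };

typedef void (*copy_vertices_fn)(float *dst, const float *src, size_t count);

/* Stand-in body shared by all three variants in this sketch. */
static void copy_loop(float *dst, const float *src, size_t count)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < count; ++i)
        dst[i] = src[i];
}

/* In a real driver each variant would sit in its own source file,
 * built with the flags that match its target unit. */
void copy_vertices_fpu(float *dst, const float *src, size_t count)
{
    copy_loop(dst, src, count);
}

void copy_vertices_altivec(float *dst, const float *src, size_t count)
{
    copy_loop(dst, src, count);
}

void copy_vertices_spe(float *dst, const float *src, size_t count)
{
    copy_loop(dst, src, count);
}

/* Pick the implementation once; callers then go through the pointer. */
copy_vertices_fn select_copy_vertices(enum cpu_unit unit)
{
    switch (unit) {
    case CPU_UNIT_ALTIVEC: return copy_vertices_altivec;
    case CPU_UNIT_SPE:     return copy_vertices_spe;
    default:               return copy_vertices_fpu;
    }
}
```

Note that the function-pointer signature uses pointers and an integer only, which keeps the call safe across the FPU/SPE ABI difference described above.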

3. Forcing Correct Stack Alignment

Just when you thought you were done, a complication. Code generated for the SPE assumes that the stack is 16-byte aligned. Indeed, the PowerPC System V ABI specification says it should be so for all PowerPC code. Well, some programs don't respect this (e.g., OpenArena).

While regular PowerPC code doesn't mind, the SPE has instructions requiring an 8-byte alignment. And, yes, GCC uses those instructions in the function's prologue. So, when something like OpenArena misaligns the stack, it's "game over, man!"

The only solution is to force the stack to be correctly aligned when calling SPE code. GCC has an option to force stack alignment... but it's for x86/x64 only. There's nothing for PowerPC. So, I've written a tool that will patch the generated assembly code...
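
The arithmetic behind "realigning" is simple, even though the patcher has to do it in assembly. The C sketch below only illustrates that arithmetic and is not taken from the tool; `is_aligned` and `align_down` are hypothetical helper names. Since the PowerPC stack grows towards lower addresses, a misaligned stack pointer is rounded down.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if addr is a multiple of alignment (which must be a power of two). */
int is_aligned(uintptr_t addr, uintptr_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (addr & (alignment - 1)) == 0;
}

/* Rounds addr down to the nearest multiple of alignment (a power of two).
 * Rounding down matches a stack that grows towards lower addresses. */
uintptr_t align_down(uintptr_t addr, uintptr_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return addr & ~(alignment - 1);
}
```

For example, a stack pointer of 0x100C satisfies neither the 8-byte nor the 16-byte requirement, and `align_down(0x100C, 16)` gives 0x1000, which satisfies both.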

PatchForSPE - A Tool To Realign the Stack

This tool takes assembly code generated by GCC, and then patches the function prologues/epilogues to realign the stack. You use it as follows:

Download it: PatchForSPE.cpp
Either compile it separately, or add a rule to your makefile. For example:
```
PatchForSPE: PatchForSPE.cpp
	$(CXX) -o $@ $<
```
Add the following to CFLAGS_SPE (needed for the patcher to work):
```
 -mno-regnames
```
Generate assembly code instead of the output binary, with the -S option, and run it through PatchForSPE (hint: use -ggdb instead of -gstabs for debug info, or the GNU assembler might give warnings). An example makefile rule:
```
$(CFG)/File_spe.S: File.c File.h PatchForSPE
	$(CC) -c $(CFLAGS) $(CFLAGS_SPE) -S -o $@_in $<
	PatchForSPE $@_in $@
```

Finally, compile the assembly file to an object file:

File_spe.o: $(CFG)/File_spe.S
	$(CC) -c $(CFLAGS) $(CFLAGS_SPE) -o $@ $<

NOTE: These instructions are in PatchForSPE.cpp too.

How's the Performance?

I'm a little hesitant about publishing benchmarks, because I'm still using the old exception-based pure software FPU emulator. So, the performance boost will NOT be comparable to the fast FPU emulator that's coming. Nevertheless, here's some data:

Program	SPE Native	Old W3D_SI (fps)	Taborized W3D_SI (fps)
Cow3d (no info bar)	No	12	~144
Wings Remastered - Bombing	Yes	55	240-270
Wings Remastered - Strafing	Yes	12	95-120
VoxelBird (medium detail)	Yes	<16	~60
VoxelNoid (no shadows)	Yes	~30	~70

IMPORTANT: The comparison is between an SPE optimized W3D_SI, and the old version running via the old slow exception-based FPU emulator. Final performance with the fast FPU emulator may be different.

Wings, VoxelBird and VoxelNoid were all compiled for the SPE in these tests, so those tests show near fully native SPE performance. Performance is similar to a Sam460ex with the same graphics card.

Summary

In brief, to optimize a library for running on the Tabor/A1222:

Use a version of GCC that supports the SPE (version 5.4.0 or newer should work, up to GCC v7.x)
Compile only the bits that need optimizing for the SPE (i.e., those that use the FPU a lot and are performance critical). Functions that have float/double parameters and are part of the library's API should NOT be compiled for the SPE (for compatibility with generic PowerPC code)
Use the PatchForSPE tool to force stack realignment before SPE code is executed

I hope this information is useful. If you discover anything new related to "Taborizing" code, then leave a comment below.

EDIT: 2023/07/27 - Updating based on feedback from Andrea Palmaté & Daniel Müßener.

5 Comments

Hans de Ruiter 27/07/2023 3:33am (9 months ago)

@Daytona675x

Thanks. I'll update the code and blog post accordingly.
Daytona675x 26/07/2023 11:34am (9 months ago)

Yo!

I found GPR 11 to be unsafe to be used for the calculation in the prologue.
I have situations where gcc generates code in which GPR 11 is preloaded in the prologue before the stwu to be later used by some isel commands. In that case your patcher leads to funny behaviour, of course.

I found GPR 12 to be safe, at least in my progs.
Sinan Gürkan 16/04/2018 10:07pm (6 years ago)

@Hans

No it doesn't crash..but nothing seems to happen after Spencer logo screen..
Probably as you said, data processing is taking a long time..
Hans de Ruiter 14/04/2018 11:38am (6 years ago)

@Sinan

Yes, SPE optimized versions of the drivers are being worked on.

I'm surprised Spencer doesn't even start. If it's crashing, then that's something that needs to be looked at (although it will crash if it tries to use the existing MiniGL). If it starts and then appears to do nothing, then it may actually be working, but loading all the data is taking a long time. The existing exception-based FPU emulator is noticeably slow. For example, OpenArena takes a few minutes to load a level, and other games take a while to start up.
Sinan Gürkan 13/04/2018 4:08pm (6 years ago)

Dear Hans

Thanks for the information..Does that mean we will have SPE optimized versions of W3D_SI and W3D Nova soon from A-Eon?

Currently Freespace runs with 3 fps on Tabor while 3D Platform game Spencer doesn't even start...