If you've been following the AmigaOS world then you'll no doubt have heard of the upcoming A1222 (codenamed Tabor). It's a nice little board; one that I'm planning to build a laptop out of. Anyway, the P1022 CPU on-board has a peculiarity: it's got a Signal Processing Engine (SPE) for floating-point calculations that's incompatible with the standard PowerPC FPU. While there will be a fast FPU emulator that uses the SPE, for some code it's worth compiling an SPE-native version.
In particular, the graphics drivers I've written use the FPU (and/or AltiVec) to copy data to/from graphics memory. That's performance critical, and so should be optimized for the SPE (a.k.a., "Taborized"). Taborizing libraries isn't as straightforward as it sounds, and I encountered some challenges. To save you the time, here's everything you need to know if you have libraries you want to Taborize.
1. Compiling for the SPE
You'll need a version of GCC that supports the SPE. That is a version between 5 and 7, inclusive (they removed SPE support in newer versions). After that, it's a simple matter of setting the right command line options:
-mspe -mcpu=8540 -mfloat-gprs=double -mabi=spe
These will compile code for the SPE. Additionally, Daniel Müßener (GoldenCode) has given some extra parameters to avoid some compiler bugs (GCC 5.4.0 generated code causing structure element alignment issues):
-fno-inline-functions -fno-partial-inlining -fno-align-functions -fno-align-jumps
-fno-align-loops -fno-align-labels -fno-inline-small-functions
NOTE: This may or may not be necessary with the latest GCC. I'm publishing it just in case.
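To check that a given source file really was built with the SPE options above, a small compile-time probe can help. This is a minimal sketch, assuming that GCC's PowerPC back end predefines the `__SPE__` macro when `-mspe` is active (you can confirm this on your toolchain by dumping the predefined macros with `gcc -dM -E`). The function name `fp_variant` is hypothetical and only for illustration.

```c
/* Returns a label for the floating-point variant this file was built for.
 * Assumption: GCC predefines __SPE__ when compiling with -mspe.
 * On any other target (or without -mspe) the generic branch is taken. */
const char *fp_variant(void)
{
#if defined(__SPE__)
    return "SPE";
#else
    return "classic FPU";
#endif
}
```

Logging this string once at start-up from each separately compiled part of a library is a cheap way to spot a file that was accidentally built with the wrong set of flags.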
2. Compile Just the Critical Bits for the SPE
Recompiling the whole library for SPE sounds like the easiest solution. However, you probably can't because the Application Binary Interface (ABI) is different. PowerPC processors pass floating-point parameters to functions via the floating-point registers. Well, those registers don't exist on the P1022 (e500 core). Therefore, if your library/driver has public functions with floats/doubles, then compiling the entire library for the SPE would make it incompatible with regular code.
So compile just the critical bits for the SPE, and make sure that all functions called from external code (i.e., those that are part of the library's/device's API) are compiled for the regular PowerPC.
Often it's unnecessary to recompile the whole thing, anyway. In the W3D_SI driver's case, only the code generating vertices and writing them to VRAM needed optimizing, and it was already set up to use AltiVec-optimized code on machines with AltiVec. So, just that code was compiled for the SPE. There was also no need for a separate SPE version of the driver, because the SPE-optimized code sits nicely alongside the regular FPU and AltiVec code.
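One way to arrange this split can be sketched as follows. This is not the W3D_SI code; all names (`MyLib_ScaleVertices`, `scale_kernel_spe`, `select_kernels`, and so on) are hypothetical. The sketch assumes that pointers and integers are passed identically under both ABIs, so the internal kernels take floats only through pointers, and only the public entry point (built with the regular flags) takes a float by value. In a real build the two kernels would live in separate files, with the SPE one compiled using the SPE options.

```c
#include <stddef.h>

/* Signature shared by all kernels: pointers and integers only, so that a
 * caller built for the regular PowerPC ABI can call an SPE-built kernel
 * (assumption: only by-value float/double arguments differ between ABIs). */
typedef void (*scale_kernel_fn)(float *dst, const float *src,
                                const float *scale, size_t count);

/* Hypothetical generic kernel; would be built with the regular flags. */
static void scale_kernel_fpu(float *dst, const float *src,
                             const float *scale, size_t count)
{
    for (size_t i = 0; i < count; ++i)
        dst[i] = src[i] * *scale;
}

/* Hypothetical SPE kernel; same source, but in a real build it would sit
 * in its own file compiled with the SPE flags. */
static void scale_kernel_spe(float *dst, const float *src,
                             const float *scale, size_t count)
{
    for (size_t i = 0; i < count; ++i)
        dst[i] = src[i] * *scale;
}

/* Kernel currently in use; defaults to the generic version. */
static scale_kernel_fn g_scale_kernel = scale_kernel_fpu;

/* Called once at start-up. How the CPU is detected is left to the library. */
void select_kernels(int cpu_has_spe)
{
    g_scale_kernel = cpu_has_spe ? scale_kernel_spe : scale_kernel_fpu;
}

/* Public API function: takes a float by value, so it must be built with
 * the regular PowerPC flags. It hands the value on by pointer. */
void MyLib_ScaleVertices(float *dst, const float *src, float scale,
                         size_t count)
{
    g_scale_kernel(dst, src, &scale, count);
}
```

Selecting the kernel through a function pointer is what lets a single binary carry the FPU and SPE paths side by side, in the same way a driver might already switch to an AltiVec path.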
3. Forcing Correct Stack Alignment
Just when you thought you were done, a complication. Code generated for the SPE assumes that the stack is 16-byte aligned. Indeed, the PowerPC System V ABI specification says it should be so for all PowerPC code. Well, some programs don't respect this (e.g., OpenArena).
While regular PowerPC code doesn't mind, the SPE has instructions requiring an 8-byte alignment. And, yes, GCC uses those instructions in the function's prologue. So, when something like OpenArena misaligns the stack, it's "game over, man!"
The only solution is to force the stack to be correctly aligned when calling SPE code. GCC has an option to force stack alignment... but it's for x86/x64 only. There's nothing for PowerPC. So, I've written a tool that will patch the generated assembly code...
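The arithmetic behind "aligned" and "realigned" is simple enough to sketch in C. This is only an illustration of the idea, not code taken from the patching tool; the helper names `is_aligned` and `align_down` are hypothetical. Rounding down is the relevant direction because the PowerPC stack grows towards lower addresses.

```c
#include <stdint.h>

/* True if addr is a multiple of alignment.
 * alignment must be a power of two (e.g., 8 or 16). */
int is_aligned(uintptr_t addr, uintptr_t alignment)
{
    return (addr & (alignment - 1)) == 0;
}

/* Rounds addr down to the nearest multiple of alignment (a power of two).
 * For a downward-growing stack this moves the pointer into unused space. */
uintptr_t align_down(uintptr_t addr, uintptr_t alignment)
{
    return addr & ~(alignment - 1);
}
```

For example, a stack pointer of 0x1008 satisfies the SPE's 8-byte requirement but not the ABI's 16-byte rule, while 0x1004 satisfies neither; `align_down(0x1004, 16)` gives 0x1000, which satisfies both.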
PatchForSPE - A Tool To Realign the Stack
This tool takes assembly code generated by GCC, and then patches the function prologues/epilogues to realign the stack. You use it as follows:
- Download it: PatchForSPE.cpp
- Either compile it separately, or add a rule to your makefile. For example:
PatchForSPE: PatchForSPE.cpp
	$(CXX) -o $@ $<
- Add the following to CFLAGS_SPE (needed for the patcher to work):
- Generate assembly code instead of the output binary, with the -S option, and run it through PatchForSPE (hint: use -ggdb instead of -gstabs for debug info, or the GNU assembler might give warnings). An example makefile rule:
$(CFG)/File_spe.S: File.c File.h PatchForSPE
	$(CC) -c $(CFLAGS) $(CFLAGS_SPE) -S -o $@_in $<
	PatchForSPE $@_in $@
- Finally, compile the assembly file to an object file:
File_spe.o: $(CFG)/File_spe.S
	$(CC) -c $(CFLAGS) $(CFLAGS_SPE) -o $@ $<
NOTE: These instructions are in PatchForSPE.cpp too.
How's the Performance?
I'm a little hesitant about publishing benchmarks, because I'm still using the old exception-based pure software FPU emulator. So, the performance boost will NOT be comparable to the fast FPU emulator that's coming. Nevertheless, here's some data:
| Program | SPE Native | Old W3D_SI (fps) | Taborized W3D_SI (fps) |
| Cow3d (no info bar) | No | 12 | ~144 |
| VoxelBird (medium detail) | Yes | <16 | ~60 |
| VoxelNoid (no shadows) | Yes | ~30 | ~70 |
IMPORTANT: The comparison is between an SPE optimized W3D_SI, and the old version running via the old slow exception-based FPU emulator. Final performance with the fast FPU emulator may be different.
Wings, VoxelBird and VoxelNoid were all compiled for the SPE in these tests, so those tests show near fully native SPE performance. Performance is similar to a Sam460ex with the same graphics card.
In brief, to optimize a library for running on the Tabor/A1222:
- Use a version of GCC that supports the SPE (version 5.4.0 up to GCC v7.x should work)
- Compile only the bits that need optimizing for the SPE (i.e., those that use the FPU a lot and are performance critical). Functions that have float/double parameters and are part of the library's API should NOT be compiled for the SPE (for compatibility with generic PowerPC code)
- Use the PatchForSPE tool to force stack realignment before SPE code is executed
I hope this information is useful. If you discover anything new related to "Taborizing" code, then leave a comment below.
EDIT: 2023/07/27 - Updating based on feedback from Andrea Palmaté & Daniel Müßener.