The Icon Bar: News and features: Adventures in Optimisation - Walls
 

Adventures in Optimisation - Walls

Posted by Jeffrey Lee on 08:00, 25/5/2020
 
Continuing in my quest to optimise my RISC OS port of NBlood, I've recently been taking a look at the rendering code. Specifically, the top functions in my profiler's output: the vertical line drawing routines, which are used to draw walls and sprite-based entities.
 

The initial code

Profiling at 1280x1024 resolution on a BB-xM[1] showed that for a typical level, there are two functions taking up most of the CPU time: vlineasm4 and mvlineasm4. These are used by the sprite & wall rendering routines, which render the polygons as a series of vertical strips. Almost any programmer who's worked on software rendering will know that vertical rendering is generally bad for performance[2]. But for DOOM-like game engines with texture-mapped walls, vertical rendering can offer some advantages[3], which most likely explains its use within the BUILD engine.
 
To try and offset some of the performance impact of vertical rendering, the two routines already make use of a few optimisations:

  • Textures are stored in memory rotated 90 degrees, to optimise cache efficiency of the column-based rendering
  • The routines draw four columns at once: The top row of four pixels, followed by the second row of four pixels, followed by the third, etc.
  • There's a CLASSIC_SLICE_BY_4 #define which causes the inner loop to be unrolled 4 times
  • There's also some support for SIMD vector math, although this is only used for simplifying a 4-way addition operation (for advancing the texture coordinates of all four columns). I'm also not sure if any ARM targets are able to make use of these optimisations.
  • And as the asm part of the name implies, there are x86 and x64 assembler versions of the routines available (but no ARM, which is one of the reasons why the ARM performance is so poor)

 
The pixel and block render order of vlineasm4-based wall rendering

For the purpose of this article, rather than looking at vlineasm4 or mvlineasm4 directly, it's best to start with the much simpler vlineasm1 routine, which renders a single column at a time. For the case where the texture height is a power-of-two (logy!=0) the inner loop is nice and simple, and easy to convert to ARM code:
 

loop
  MOV     temp, vplc, LSR logy  ; Divide down the single, fixed-point texture coordinate
  LDRB    temp, [buf, temp]     ; Perform texture lookup within the texture column
  LDRB    temp, [pal, temp]     ; Perform palette lookup (used for lighting)
  STRB    temp, [p], ourbpl     ; Store pixel to screen and advance to next row
  ADD     vplc, vplc, vinc      ; Advance texture coordinate
  SUBS    cnt, cnt, #1          ; Decrement loop counter
  BGT     loop                  ; And go round again
A trivial conversion of vlineasm1's inner loop to ARM assembler

 
Note that all rendering is performed in 256 colours, and each column (potentially) uses a unique palette (to allow for smooth lighting changes across the wall). Texture size is arbitrary, so the LSR logy can't be performed as part of the LDR instruction (LDR/STR doesn't support register-specified shift amounts).
 
The difference between the vline routines and the mvline routines is that the mvline routines deal with masked textures: If a texel has the value 255, it's transparent, and rendering is skipped.
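As a sketch (in C, with illustrative names taken from the assembler listing above; this is not NBlood's actual code), the masked inner loop just adds a compare-and-skip around the palette lookup and store:

```c
#include <stdint.h>

/* Sketch of mvline's inner loop: a texel value of 255 marks a
   transparent pixel, so the palette lookup and screen write are
   skipped for it.  Names are illustrative, not NBlood's own. */
static void mvline1(const uint8_t *buf, const uint8_t *pal,
                    uint8_t *p, int ourbpl,
                    uint32_t vplc, uint32_t vinc, int logy, int cnt)
{
    while (cnt-- > 0) {
        uint8_t texel = buf[vplc >> logy]; /* texture column lookup */
        if (texel != 255)                  /* 255 == transparent */
            *p = pal[texel];               /* palette lookup (lighting) */
        p += ourbpl;                       /* advance to next screen row */
        vplc += vinc;                      /* advance texture coordinate */
    }
}
```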
 
Extra complexity comes in the tvline routines, which support translucent textures using a 256x256 lookup table to blend each texel into the screen. These support two different ways of indexing the lookup table: with transmode == 0 the texel provides the column index and the screen provides the row index. With transmode == 1 the two indices are swapped around. These translucent textures are only used infrequently, but to avoid big framerate drops when the player moves close to such a texture it's still important to make sure they get optimised as well.
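The two indexing modes can be sketched in C (tblend is a hypothetical helper; the real routines do this inline, per pixel):

```c
#include <stdint.h>

/* Sketch of tvline's blending: a 256x256 lookup table mixes the texel
   with the pixel already on screen.  transmode picks which value
   indexes the rows and which the columns.  Illustrative names only. */
static uint8_t tblend(uint8_t trans[256][256],
                      uint8_t texel, uint8_t screen, int transmode)
{
    /* transmode == 0: screen pixel picks the row, texel the column;
       transmode == 1: the two indices are swapped */
    return transmode ? trans[texel][screen] : trans[screen][texel];
}
```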
 
And if that wasn't enough, the engine also supports textures with non-power-of-two heights; these are typically used by sprite entities like trees and enemies.

The goal

Although my initial profiling was done on a BB-xM, my goal is actually to try and improve performance on the Raspberry Pi 1, where it looks like achieving playable performance at 640x480 should be possible.

The obvious problems

Before starting to work on optimising the code, there were a few obvious problems that I was already aware of:

  • Rendering four columns at a time requires 20 variables to be active in the inner loop (not including temporaries) - far too many for a 32bit ARM register file.
  • GCC's code generation isn't doing a very good job - the stack is being accessed far more frequently than is required
  • The CLASSIC_SLICE_BY_4 optimisation is just a simple loop unrolling: The four pixels in the first row are processed before the four pixels in the second row, so unless the compiler reorders the operations, the code will have to load new buf, pal, vplc and vinc values after each and every pixel
To determine the performance impact of vertical rendering vs. horizontal, one of the earliest tests I did was to take vlineasm1 and change the code so that ourbpl was 1 instead of the correct value. For the Pi 1, this increased the performance from 14 megapixels per second to 47. If I can achieve those kinds of numbers, that should be a major boost in performance (at 640x480 on a BB-xM, vlineasm4 and mvlineasm4 were accounting for around 30% of CPU time).

The final code

vline routines

Being the simplest of the bunch, the (power-of-two) vline routine was where I started, and where I probably spent most of my time. After writing a few different routines, exploring a few dead-ends, and eventually seeing sense, I settled on a new routine that's basically just a 16-wide version of vlineasm4:

  • There's still 4-way loop unrolling, but the processing order is changed so that only one column is processed at a time, reducing the number of times that the stack needs to be accessed to grab column parameters.
  • Parameters for each column are stored in an optimal manner, to allow them to be loaded with a single LDM (unlike the 4-wide C routines, which used separate arrays for each parameter)
  • Processing of pixels is interleaved; there are just enough registers spare to allow three pixels to be processed in parallel, which is perfect for hiding the three-cycle result latency that the ARM11 has for LDR instructions.
  • The stack is used as a scratchpad for building the 16x4 pixel block. This turned out to be faster than writing directly to the framebuffer - the ARM11 data cache only supports read allocation, so writes which miss the cache will go straight to main memory. If the code was writing to memory in linear order that might have been fine, but when skipping between multiple screen rows there's only so much the write buffer can do to merge the requests. (Note that in current versions of NBlood, the screen is actually an off-screen buffer in normal RAM, so the fact that RISC OS typically maps screen memory as non-cacheable is irrelevant)
  • Going 16-wide resulted in better performance than the 8-wide and 4-wide versions that I also tried. The ARM11 only saw a modest improvement in performance when switching from 8-wide to 16-wide, but the Iyonix, which I also tested on, saw a significant improvement: 35.5 cycles/pixel down to 25.5. The screen pointer is also quadword aligned (making it perfect for a 16 byte cache writeback granule), although I guess that's only likely to matter for CPUs with write-allocating caches.

 
The pixel and block render order of the new renderer (showing 4-wide columns for simplicity)
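
The scratchpad technique from the list above can be sketched in C (illustrative names, not the real assembler): build a 16x4 pixel block in a small temporary buffer, one column at a time, then copy it out a row at a time so the screen writes happen in linear, write-buffer-friendly order.

```c
#include <stdint.h>
#include <string.h>

#define BLKW 16  /* block width; the final code uses 16x4 blocks */

/* Sketch only: each column has its own texture data, palette and
   fixed-point coordinate, as described in the article. */
static void blit4rows(uint8_t *screen, int bpl,
                      const uint8_t **buf,   /* per-column texture data */
                      const uint8_t **pal,   /* per-column palettes */
                      uint32_t *vplc, const uint32_t *vinc, int logy)
{
    uint8_t tmp[4][BLKW];                    /* stack scratchpad */
    for (int x = 0; x < BLKW; x++) {         /* one column at a time */
        uint32_t v = vplc[x];
        for (int y = 0; y < 4; y++) {
            tmp[y][x] = pal[x][buf[x][v >> logy]];
            v += vinc[x];
        }
        vplc[x] = v;                         /* column keeps its coord */
    }
    for (int y = 0; y < 4; y++)              /* linear writes to screen */
        memcpy(screen + y * bpl, tmp[y], BLKW);
}
```

The real routine stores the rows with word stores from an LDM'd register block and interleaves three pixels to hide the ARM11's load latency; none of that detail is shown here.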

Since the old code only supported rendering 4-wide strips (and 1-wide columns for any leftover bits), I also had to rewrite those routines to cope with the new 16-wide code. The new structure is:

  1. Calculate the parameters for all the vertical lines and store them in a big array, in the right format for the assembler code. There's also a secondary array for storing the top & bottom Y coord of each line.
  2. Call a generic rendering routine (vlineblock), passing in a function pointer to the low-level assembler routine. This higher-level routine will break down the vertical line list into chunks which the assembler routine can process.
To try and squeeze out as much performance as possible, each assembler function is actually composed of five smaller routines, which cater for the following situations:
  • A 1-wide column of arbitrary height & alignment
  • A column of 2x4 blocks, halfword aligned
  • A column of 4x4 blocks, word aligned
  • A column of 8x4 blocks, doubleword aligned
  • A column of 16x4 blocks, quadword aligned
There's some more work which could be done to improve vlineblock so that it will render in bigger blocks where possible (e.g. when dealing with the sloping top & bottom of a 16-wide column), but that can wait for another day.
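The chunking that vlineblock performs can be sketched in C (a hypothetical helper; the real routine also tracks the per-line Y extents and dispatches through the function pointer):

```c
/* Sketch of a vlineblock-style split: given a run of columns [x0, x1),
   return the width of the next chunk, choosing the widest of the five
   sub-routine widths (16, 8, 4, 2, 1) that is both aligned at x0 and
   still fits.  Illustrative, not the actual dispatcher. */
static int next_chunk(int x0, int x1)
{
    static const int widths[] = {16, 8, 4, 2, 1};
    for (int i = 0; i < 5; i++) {
        int w = widths[i];
        if ((x0 & (w - 1)) == 0 && x0 + w <= x1)
            return w;           /* widest aligned chunk that fits */
    }
    return 0;                   /* x0 == x1: nothing left */
}
```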
 
One downside of having the assembler routines cope with so many different situations is that it makes the code rather large - each routine is over 3KB in size, which is sure to hurt I-Cache efficiency in some cases. It's also important to note that most of that code is generated by the use of macros - there's a macro for processing a 1x4 pixel strip which I just instantiate as many times as necessary to produce the core of each of the Nx4 routines.

mvline routines

For mvline I found that using a scratchpad on the stack added too much extra overhead. Instead it's more efficient to just conditionally write each byte directly to the screen.
 
Since the code is writing directly to the screen a byte at a time this means there's no need for vlineblock to align the screen pointer to the block width - which looks like it yields a small performance improvement, due to increased opportunities for the wider block sizes to be used.

tvline routines

Like mvline, for this I found that the most efficient method was to just write directly to the screen, one byte at a time. However I'll admit that I didn't try very hard to find alternatives, as I was starting to tire of working on the code!
 
Unlike mvline it looks like aligning the screen pointer to the block size yields the best performance - probably because (on ARM11) it does mean that the largest block size will match the cache writeback granule.
 
The original C version of tvline dealt with transmode on a per-pixel basis. This is obviously inefficient (the flag will have the same value for the entire wall), but because the assembler routines are already a bit on the big side I decided it would probably be best to do the same, in order to avoid putting too much pressure on the I-Cache. By loading transmode into the carry flag I found that it was possible to deal with it on a per-pixel basis with the only overhead being one extra instruction per pixel.

Non-power-of-two

For power-of-two textures, the texture coordinate is a 32bit fraction in the range [0,1). The 32bit width is important, to allow the coordinate to automatically wrap when it overflows.
 
For non-power-of-two textures, there are two possible methods:

  1. Keep the texture coordinate as a 32bit fraction, and convert to a byte offset by performing a wide multiply with the texture height. E.g. use UMULL and discard the lower 32 bits of the result (which will just be the fractional position within a texel)
  2. Change the range of the texture coordinate from [0,1) to [0,height) and represent it as a fixed point number, e.g. 16.16. This will avoid the need to use a multiply instruction, but will require one extra instruction (and use of the PSR) to deal with texture wrapping[4].
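Method 1 can be sketched in C: UMULL's behaviour is exactly a 64-bit multiply from which only the top word is kept (texel_index is a hypothetical helper, not NBlood code):

```c
#include <stdint.h>

/* Method 1: the texture coordinate is a 32-bit fraction of one full
   wrap, so multiplying by the height and keeping only the top 32 bits
   of the 64-bit product (what UMULL computes) gives the texel row.
   The discarded low half is the fractional position within a texel. */
static uint32_t texel_index(uint32_t vplc, uint32_t height)
{
    return (uint32_t)(((uint64_t)vplc * height) >> 32);
}
```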
When writing the code, I focused on trying to find a multiply-based solution, mainly because I hadn't yet realised the other solution would only require one extra instruction. With a few different multiply instructions available, I eventually settled on the ARMv5E SMLATT instruction:
  • Ordinary MUL would require an extra instruction to shift the normalised texture coordinate down (e.g. from 32 bit fraction to 16 bit) before the multiply - but I was keen to find a solution which wouldn't require any extra instructions.
  • UMULL performs a full 32x32 -> 64bit multiply, which makes it quite slow. It also requires an extra register to store the low half of the result (which we aren't actually interested in)
  • SMULxy performs a 16x16 -> 32bit multiply, with the useful ability to specify whether each parameter is taken from the top or bottom half of the corresponding source register. This avoids the problem MUL had, where an extra shift instruction would have been needed. Also, because it only performs a 16 bit multiply, on some CPUs it's faster than an ordinary MUL (and definitely faster than UMULL). However, there's one problem: the input texture coord is meant to be an unsigned number, but SMULxy performs a signed multiply, messing up the result for any texture coord >= 0.5 (and there's no unsigned version available).
For even-height textures I was able to solve the SMULxy signed multiply problem relatively quickly: just bias the texture coordinate by &80000000 (i.e. 0.5), so that the coordinate range is [-0.5,0.5). A complementary bias (of half the texture height) can then be applied to the texture pointer, so that the [-height/2,height/2) output of the multiply gets mapped back to [0,height).
 
But for odd-height textures the above approach doesn't work - an extra bias of half a texel is required. And for this to work properly, it must be applied after the multiply, not before. But after the multiply, the next instruction is the LDRB which loads the pixel - there's no way of adding a sub-texel offset to that. This is where SMLAxy comes in - it's a multiply-and-accumulate variant of SMULxy. The multiply result gets added to a 32bit accumulator, allowing the sub-texel bias to be applied for free.
 
Except, it isn't quite free, because now I need an extra register to store the half-texel bias value. Luckily SMLAxy came to my rescue again - I can pack the texture scale/size into the top half of a register, and the half-texel bias into the bottom half. Because they're both packed into the same register and the accumulator input is a 32bit value, it does mean that the texture scale/size from the top half of the register also gets added into the multiply result. But it's easy to counter that just by having the setup code subtract the texture height from the texture pointer.
 
So apart from needing to redo some of my instruction scheduling (SMLAxy has a higher result latency than MOV), I was able to add non-power-of-two support to the routines without adding any extra instructions or having to find a way of freeing up an extra register.
 
loop
  SMLATT  temp, vplc, magic, magic   ; Convert from normalised texture coord
  LDRB    temp, [buf, temp, ASR #16] ; Perform texture lookup within the texture column
  LDRB    temp, [pal, temp]          ; Perform palette lookup (used for lighting)
  STRB    temp, [p], ourbpl          ; Store pixel to screen and advance to next row
  ADD     vplc, vplc, vinc           ; Advance texture coordinate
  SUBS    cnt, cnt, #1               ; Decrement loop counter
  BGT     loop                       ; And go round again
Basic non-power-of-two rendering via SMLATT
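
The bias arithmetic can be checked with a small C model of SMLATT. Note this uses one consistent choice of biases worked out from the description above (coordinate biased by &80000000, magic packing the height and a half-texel bias of &8000, a net "pointer" bias of (h+1)/2 texels for odd heights); the actual setup code may distribute the biases differently.

```c
#include <stdint.h>

/* C model of ARM's SMLATT: multiply the signed top halves of two
   registers and add a 32-bit accumulator. */
static int32_t smlatt(uint32_t rn, uint32_t rm, uint32_t ra)
{
    int32_t prod = (int32_t)(int16_t)(rn >> 16) * (int16_t)(rm >> 16);
    return prod + (int32_t)ra;
}

/* Map a 32-bit fractional coordinate to a texel index for an ODD,
   non-power-of-two height h, using the SMLATT trick.  The (h+1)/2
   subtraction stands in for the bias applied to the texture pointer
   by the setup code.  One consistent choice of constants, for
   illustration only. */
static uint32_t npot_index(uint32_t coord, uint32_t h)
{
    uint32_t vplc  = coord + 0x80000000u;   /* bias coord by 0.5 */
    uint32_t magic = (h << 16) | 0x8000u;   /* height : half-texel bias */
    uint32_t idx   = (uint32_t)(smlatt(vplc, magic, magic) >> 16);
    return idx - (h + 1) / 2;               /* "texture pointer" bias */
}
```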

Profiling results

Presented below are the profiling results for my final routines, some facsimiles of the original routines, and some of the in-between experiments. Because I've tested on a wide variety of machines with different CPU speeds, I've decided to present the results in terms of CPU cycles per pixel. This will make it clearer where performance is being affected by the CPU architecture, rather than by clock speeds (although memory speed can still have an effect).
 
Results are shown for all CPUs where the number of CPU cycles can be easily measured - in practice this is the Iyonix and every machine released afterwards. Chances are I could have cobbled together something to measure StrongARM RiscPC performance, but since running NBlood on a RiscPC isn't feasible I decided it wouldn't be worth the effort (and the SMLATT-based non-power-of-two rendering wouldn't work anyway).

The basics

First, a few tables showing my initial experiments with the vline routines. While writing up this article I've realised that the data I gathered for these routines isn't particularly great for drawing comparisons, but it should be good enough to show some of the basic issues.
 

CPU        | vlineasm1_reference | vlineasm4_reference | Improvement
XScale     | 187.6               | 86.1                | 2.18x
ARM11      | 49.7                | 43.6                | 1.14x
Cortex-A7  | 68.7                | 23.4                | 2.94x
Cortex-A8  | 119.3               | 33.1                | 3.6x
Cortex-A9  | 17.8                | 16.9                | 1.05x
Cortex-A15 | 18                  | 11.7                | 1.54x
Cortex-A53 | 46.5                | 16.8                | 2.77x
Cortex-A72 | 18.3                | 9.8                 | 1.87x

 
Here we can see the relative performance of a 1-wide vertical line routine and a 4-wide vertical line routine; these are C implementations which should be equivalent to the non-CLASSIC_SLICE_BY_4 NBlood routines. All CPUs see a performance improvement when switching from 1-wide to 4-wide, some of them a very significant one. However it's important to realise that the test framework for this wasn't using them to draw a complete wall - it was only using them to draw individual 1-wide or 4-wide columns, each with completely random parameters. So the performance seen here isn't very representative of the wall rendering performance.
 
CPU        | vlineasm1_reference | vlineasm1_wrap | Improvement
XScale     | 187.6               | 182.6          | 1.03x
ARM11      | 49.7                | 48.9           | 1.02x
Cortex-A7  | 68.7                | 15.5           | 4.43x
Cortex-A8  | 119.3               | 24.8           | 4.81x
Cortex-A9  | 17.8                | 16.7           | 1.07x
Cortex-A15 | 18                  | 16.6           | 1.08x
Cortex-A53 | 46.5                | 11.3           | 4.12x
Cortex-A72 | 18.3                | 11.7           | 1.56x

 
Here we have the performance of individual 1-wide vertical lines (vlineasm1_reference) versus walls drawn from 1-wide vertical lines (vlineasm1_wrap). As you can see, this results in a big performance improvement to around half the CPUs, and a minor performance improvement to the rest. A quick look at the relevant docs suggests that we're dealing with three basic types of CPU: in-order with read-allocating caches (XScale, ARM11), in-order with write-allocating caches (A7, A8, A53), and out-of-order with write-allocating caches (A9, A15, A72). It's the CPUs with write-allocating caches which see the big improvements, suggesting that they're seeing improved performance due to better cache utilisation. However the out-of-order CPUs see much smaller improvements - suggesting that maybe they're capable of hiding most of the stalls caused by vlineasm1_reference's poor cache utilisation. A closer look at the full performance counter data might be enough to reveal what's really going on.
 
CPU        | vlineasm1_reference | vlineasm1_linear | Improvement
XScale     | 187.6               | 10.9             | 17.21x
ARM11      | 49.7                | 15.4             | 3.23x
Cortex-A7  | 68.7                | 11               | 6.25x
Cortex-A8  | 119.3               | 8.4              | 14.2x
Cortex-A9  | 17.8                | 6.7              | 2.66x
Cortex-A15 | 18                  | 4                | 4.5x
Cortex-A53 | 46.5                | 9.9              | 4.7x
Cortex-A72 | 18.3                | 4.3              | 4.26x

 
In an attempt to determine the theoretical maximum performance, I made a simple change to the vertical line routine: Make it render a horizontal line instead of a vertical one. Even though the code is still only writing a single byte at a time, the fact that it's doing it in a simple horizontal strip means that the CPU is easily able to coalesce the writes in the write buffer. So all the tested CPUs, even those which only support read-allocation, will see significantly improved performance. The code still isn't fully optimised (it's just a simple assembler loop with no unrolling, and little consideration for pipeline stalls), but it provided me with a good target to aim for.

Blocks

The next step I took was to experiment with different block sizes, and methods of rendering those blocks (direct to screen, order of pixels within blocks, etc.). Again, the data I've gathered here isn't particularly great, but it's good enough to show where improvements were found.
 

CPU        | vlineasm1_wrap | rect1asm | Improvement
XScale     | 182.6          | 253.6    | 0.72x
ARM11      | 48.9           | 81.8     | 0.6x
Cortex-A7  | 15.5           | 19.6     | 0.79x
Cortex-A8  | 24.8           | 20.1     | 1.23x
Cortex-A9  | 16.7           | 37.4     | 0.45x
Cortex-A15 | 16.6           | 19.6     | 0.85x
Cortex-A53 | 11.3           | 16.4     | 0.69x
Cortex-A72 | 11.7           | 10       | 1.17x

 
After building a list of vertical line segments describing a wall, the performance of rendering each vertical line in turn (vlineasm1_wrap) versus rendering one screen row at a time (rect1asm). Conclusion: Don't do that.

 
rect4x4 variants

Here are the results for a number of different routines which render rectangles by breaking them down into 4x4 blocks. vlineasm1_wrap (baseline) and vlineasm1_linear (theoretical limit) are both included for reference. Note that the chart is using a logarithmic scale. The different rect4x4 routines are:

  • rect4x4 - C routine which builds a 4x4 pixel block on the stack before performing a (word-based) copy to the screen. The code is written to process the data one column at a time, although I haven't checked whether the compiler is reordering things behind my back. The blocks are built up into the full rectangle by having the outer loop step through the rows, while the inner loop steps through columns.
  • rect4x4asm - rect4x4, but converted to assembler. Some basic interleaving of operations is performed, so two pixels are being processed at once.
  • rect4x4asm3 - rect4x4asm, but with better interleaving so that three pixels are processed at once. This is close to optimal interleaving for ARM11.
  • rect4x4asm3vert - rect4x4asm3, but the outer loop processes columns, while the inner loop processes rows.
  • rect4x4asm4 - rect4x4asm3, but writing direct to screen instead of using a temp buffer on the stack.
As can be seen, rect4x4asm3vert generally gives the best performance out of the different rect4x4 routines, sometimes reaching (or beating) the "theoretical maximum" vlineasm1_linear.

 
Different block sizes

Variants of rect4x4 using different block sizes. The "vert" versions have the outer loop process columns while the inner loop processes the NxM blocks within those columns, like rect4x4asm3vert. Although different machines see different results, the general trend in the non-vert routines is that making them wider improves performance, but the best increase comes from making them taller. With the "vert" routines, the general rule seems to be that "smaller is better", except on XScale, where wider is better. However I think it's safe to blame most of this on GCC's poor code generation, because the assembler versions of the routines tell a different story:


 
C versus assembler

The C rectNx4vert routines and their assembler versions, along with the "theoretical limit" vlineasm1_linear. As you can see, the assembler versions are much better than GCC's efforts, and rect16x4asm1vert in particular is able to beat vlineasm1_linear on most CPUs. Note that all the assembler routines use the same basic instruction scheduling as rect4x4asm3vert.
 

CPU        | vlineasm1_linear | vblockasm
XScale     | 10.9             | 28.6
ARM11      | 15.4             | 12.9
Cortex-A7  | 11               | 9.5
Cortex-A8  | 8.4              | 9.5
Cortex-A9  | 6.7              | 7.2
Cortex-A15 | 4                | 4.1
Cortex-A53 | 9.9              | 6.5
Cortex-A72 | 4.3              | 3.7

 
Combining rect4x4asm3vert, rect8x4asm1vert, rect16x4asm1vert, and a few other routines (for 2x4 blocks and 1-wide vertical lines) yields vblockasm, my final optimised version of the (non-power-of-two) vlineasm routine. The performance figures given above are for arbitrary walls, so will be exercising all the different code paths. As you can see, the performance is still reaching (or beating) the "theoretical maximum" vlineasm1_linear on all but XScale.

Now add masking

With the basic wall rendering tackled, my next task was to add masking support. This gave lots of scope for experimentation with different techniques. For these tests, I've been using (random) textures which are approximately 50% transparent, with the solid/transparent areas clumped together in blocks - a reasonable approximation for the kinds of textures that are used in Blood.


 
Masking experiments

These are all assembler routines.

  • mblock1asm1 - 1-wide vertical lines, rendered straight to screen, with ARM11-optimised loop unrolling / scheduling. A simple baseline to compare against.
  • mblock4x4asm1 - Rendering in vertical columns of 4x4 blocks, using the stack as a temporary buffer (as per rect4x4asm3vert). However it also builds a mask word indicating which pixels within the block are transparent, which is then used to allow either (a) the whole block to be stored to screen (a word at a time) if all pixels are visible, (b) the whole block to be skipped if all pixels are transparent, or (c) conditional writes on a per-pixel basis for all other cases. Although the conditional code is about 50 instructions long, the fact that many blocks fall into either the "fully visible" or "fully transparent" case helps to make this one of the faster optimised routines.
  • mblock4x4asm2 - As per mblock4x4asm1, but the (c) case performs a word-based read-modify-write operation on the screen, using the ARMv6 SEL instruction to select which bytes to update. This makes the conditional case shorter, but also considerably slower.
  • mblock4x4asm3 - As per mblock4x4asm2, but using the more traditional AND, BIC, ORR approach for selectively updating pixels in each word. This gives roughly the same performance as mblock4x4asm2, suggesting that the performance issues with the two routines are down to stalls caused by reading from screen, and not down to the constant PSR updates (via MSR) required for the SEL instruction to be used.
  • mblock4x4asm4 - Like the other mblock4x4 routines, but conditionally writing each byte directly to screen instead of using a temporary buffer on the stack. This generally yields better performance than the other routines seen so far, most likely due to the shorter code sequence.
One other experiment I tried was to add some PLD instructions. However (on ARM11) this only resulted in making things slower. Checking the ARM1176JZF-S TRM reveals the cause - outside of secure state, PLD is treated as a NOP. Possibly PLD would have been useful on other CPUs, but since I'm mainly interested in maximising ARM11 performance I didn't check.
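
The mask-word classification used by mblock4x4asm1 can be sketched in C (illustrative code, not the actual assembler; the mask here has one bit per pixel, set for transparent):

```c
#include <stdint.h>
#include <string.h>

/* Sketch of the three store paths: skip a fully-transparent block,
   store a fully-visible block a row at a time (the assembler uses
   word stores), or fall back to conditional per-pixel writes. */
static void store_block4x4(uint8_t *screen, int bpl,
                           uint8_t tmp[4][4], uint16_t mask)
{
    if (mask == 0xFFFF)              /* all 16 pixels transparent */
        return;                      /* skip the whole block */
    if (mask == 0) {                 /* fully visible: whole rows */
        for (int y = 0; y < 4; y++)
            memcpy(screen + y * bpl, tmp[y], 4);
        return;
    }
    for (int y = 0; y < 4; y++)      /* mixed: conditional writes */
        for (int x = 0; x < 4; x++)
            if (!(mask & (1u << (y * 4 + x))))
                screen[y * bpl + x] = tmp[y][x];
}
```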
 
Ultimately, it's mblock4x4asm4 which I used as the template for the final routine.

 
Masking block sizes

mblock8x4asm4 and mblock16x4asm4 show the effect of increasing the block width. mvblockasm is the final routine (coping with widths of 1, 2, 4, 8, 16, as per vblockasm). vblockasm is included for reference, showing that the final result is a routine that's generally only a little bit slower than the non-masked version.

Translucency

The worst-case for this routine is that every pixel rendered requires a read-modify-write of the screen. In the end, I only tried two optimisations.


 
Translucency experiments

  • tvlineasm1_reference - 1-wide vertical lines, rendering straight to screen, implemented in C. A simple baseline to compare against.
  • tvblockasm - The final code, using the same strategy as mvblockasm.
  • tvblock2_16x4asm - An experiment with storing the unblended texels in a temporary buffer on the stack. This allows the screen reads/writes to be deferred, resulting in fewer instructions being executed for 100% transparent areas, and more optimal screen memory access for the visible areas: the temp buffer is processed in 4x1 blocks, instead of the 1x4 blocks that tvblockasm effectively uses. The graph shows this as (mostly) being slightly faster than tvblockasm, but I believe a direct comparison of just the 16x4 processing in the two routines showed that tvblockasm was slightly faster (hence it making the final cut).
mvblockasm and vblockasm have also been included for reference.
 
Also notable is that tvblockasm deals with transmode inline, by using a couple of conditional ORR's to perform the two different array lookup calculations (keyed off of the carry flag, to allow Z to be used for the "is this pixel fully transparent?" test). Meanwhile tvblock2_16x4asm utilises two different code paths in order to select between the two different modes.

Final figures

Here are the performance figures for the full set of routines (in CPU cycles/pixel). As previously mentioned, the non-power-of-two routines are basically just slightly tweaked versions of the power-of-two routines, using SMLATT for texture coordinate generation, and with slightly tweaked instruction scheduling to account for the extra 1 cycle delay this causes on ARM11.
 

CPU        | vblockasm | vblocknasm | mvblockasm | mvblocknasm | tvblockasm | tvblocknasm
XScale     | 28.6      | 27.6       | 33.1       | 29.6        | 76.3       | 71.1
ARM11      | 12.9      | 13.3       | 14.8       | 14.8        | 50.4       | 50.5
Cortex-A7  | 9.5       | 11.2       | 10.1       | 10.8        | 22.9       | 24.8
Cortex-A8  | 9.5       | 11.6       | 9.9        | 12.1        | 27.1       | 29.7
Cortex-A9  | 7.2       | 8.3        | 9.4        | 8.3         | 19.4       | 20.1
Cortex-A15 | 4.1       | 4.3        | 6.4        | 5.8         | 11.6       | 11.5
Cortex-A53 | 6.5       | 8          | 7.9        | 8.1         | 19.9       | 21.6
Cortex-A72 | 3.7       | 3.9        | 3.9        | 4.1         | 8.9        | 9.3
  • vblockasm - Basic wall rendering, power-of-two
  • vblocknasm - Basic wall rendering, non-power-of-two
  • mvblockasm - Masked wall rendering, power-of-two
  • mvblocknasm - Masked wall rendering, non-power-of-two
  • tvblockasm - Translucent wall rendering, power-of-two
  • tvblocknasm - Translucent wall rendering, non-power-of-two
As can be seen, the non-power-of-two routines only see a minor performance drop compared to the power-of-two routines (except on XScale, where they somehow manage to be faster).
 
Going back to NBlood on the Pi 1, with the new code installed there's been a fairly healthy performance improvement. In the opening areas of the first level the screen is now rendering in around 7ms (previously around 9-10ms), and the worst-case location of that level (looking through the cemetery gates) now takes around 11-12ms (previously 15-16ms).

What we've learned

  • GCC 4.7.4 isn't great at optimising routines where the inner loop requires more registers than are available (for ARM, at least)
  • GNU as's macro features are more advanced than I previously thought
  • Always write test code! I wasted several hours chasing shadows because lack of testing early on led to bugs being introduced in the code which resulted in misleading profiling results
  • Care needs to be taken when optimising write-heavy code for CPUs with non write-allocating caches
  • If you try hard enough, it's possible to turn a signed multiply instruction into an unsigned one
  • "CPU cycles per unit of work" is a good way of comparing how performance differs across multiple CPUs (expect to read more about that in a future article)

Footnotes

  1. 1280x1024 is probably a bit optimistic for a BB-xM, but for now it's a useful stress test to see which bits need optimising
  2. Framebuffers are typically one contiguous block of memory, which is first sub-divided into rows, and then each row is sub-divided into the pixels within that row. Modern CPUs provide best memory throughput when memory is accessed in a linear fashion, which is what a row-based renderer will be doing: pixel (x,y) is directly followed in memory by pixel (x+1,y). But with column-based rendering, pixel (x,y) is hundreds or thousands of bytes away from pixel (x,y+1), severely impairing cache efficiency and triggering many more main bus accesses.
  3. With vertical walls, and no ability to pitch or roll the camera, walls will always be projected to the screen as a trapezium with the two parallel edges being completely vertical. And because the textures are never rotated, all pixels within a screen column will come from a single texture column. This means that rotating the textures 90 degrees in memory and performing vertical rendering will greatly improve the cache efficiency of texture lookups (compared to horizontal rendering, where regardless of texture orientation, horizontally adjacent screen pixels are likely to have texture coordinates that differ in both the X and Y axis).
  4. At first glance you might think it'll need two (or more) instructions to deal with coordinate wrapping, but if the data is prepared correctly it's possible to do it with just one:
      LDRB    temp, [buf, vplc, ASR #16] ; Perform texture lookup within the texture column
      ADDS    vplc, vplc, vinc           ; Advance texture coord and check for wrap
      ADDMI   vplc, vplc, vsize          ; Correct texture coord after wrap
    • If vinc is negative, nothing special is needed; just set vsize to the texture size (shifted left by e.g. 16 bits), and ensure abs(vinc) < vsize.
    • If the texture increment is positive, configure things so that the wrap occurs when vplc overflows from &7fffffff to &80000000. This will require you to bias the texture coordinate and texture pointer by the appropriate amount, and invert vsize (so the texture size will be subtracted instead of added). E.g. if you're using 16.16 fixed point, you'll want to add (&8000-abs(vsize))<<16 to the initial texture coordinate, and subtract &8000-abs(vsize) from the texture pointer.
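The scheme in this footnote can be modelled in C (unsigned arithmetic stands in for the ADDS/ADDMI flag test; the bias constants below follow the footnote's 16.16 recipe and are for illustration):

```c
#include <stdint.h>

/* One step of the single-instruction wrap: add the increment, then
   add vsize if the sign bit ended up set (the MI condition).  For a
   negative vinc, vsize is the (positive) texture size; for a positive
   vinc, the coordinate is pre-biased and vsize is negated, so the
   wrap shows up as overflow into the sign bit. */
static uint32_t step_wrap(uint32_t vplc, uint32_t vinc, uint32_t vsize)
{
    vplc += vinc;                  /* ADDS vplc, vplc, vinc */
    if (vplc & 0x80000000u)        /* MI: we wrapped */
        vplc += vsize;             /* ADDMI vplc, vplc, vsize */
    return vplc;
}
```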

 
Robert Sprowson Message #124856, posted by Elesar at 19:17, 25/5/2020
I get the impression you're having as much fun pulling apart the algorithm as playing the resulting game! Pity ARM11 was too early to have NEON as that sounds good at pixel shovelling.
 
Jeffrey Lee Message #124857, posted by Phlamethrower at 08:41, 26/5/2020, in reply to message #124856
There's definitely some scope for using NEON, but I think it'll be limited to using the NEON register file as the scratchpad instead of using the stack. It's definitely something for me to look into if I decide to make a NEON optimised build of the game, but I didn't think it was significant enough for me to experiment with for this article.

I think the biggest gains would come from scatter-gather load/store instructions, but they're only introduced in the ARMv8 (AArch64) Scalable Vector Extension.
 
Michael Kübel Message #124858, posted by Kuemmel at 11:49, 26/5/2020, in reply to message #124857
Very interesting read, Jeffrey! I always thought almost all polygon-based software renderers would take a horizontal approach, but maybe that's better for gouraud shading (or similar) and not in general for texture-based rendering.

Regarding NEON, at some point you wrote you ran out of registers, though I wonder if some VMOV Dx,Rx,Ry/Rx,Ry,Dx could help, but this instruction has got some heavy execution latency also. For masking and translucency there should be a big benefit though.

[Edited by Kuemmel at 12:51, 26/5/2020]
 
