Tuesday, September 11, 2012

...more tricks, more speed!

Very recently, I've further optimized the VGA smooth scaling that DSx86 uses, and Pate already announced on his blog the next release will feature the faster code. If you haven't read about the first wave of optimizations, please read about it in my August posts.

This time I exploited two more tricks, and the result is that the speed increased from 79% faster to 114% faster compared to the original code. Of course, this isn't bad at all.

The former optimization comes from the observation that there's no way to use a base register plus a shifted register offset addressing scheme when accessing halfwords, whereas this is very common when accessing words instead. This means we need a separate shift instruction to calculate the offset if we want to access a halfword in a lookup table while, on the contrary, we can access a word in a lookup table using a single instruction. Thus, if we define a 256-words temporary space in the stack (it takes 1 KB) for the lookup table and copy there each palette RGB values as whole words, we will later save one instruction per input pixel when we access them. So this fragment of code:

ldrb r3, [r1], #1  @ read first pixel value
ldrb r4, [r1], #1  @ read second pixel value
ldrb r5, [r1], #1  @ read third pixel value
lsl r3, #1         @ calculate offset (1st pixel)
ldrh r3, [r11, r3] @ read first pixel RGB color
lsl r4, #1         @ calculate offset (2nd pixel)
ldrh r4, [r11, r4] @ read second pixel RGB color
lsl r5, #1         @ calculate offset (3rd pixel)
ldrh r5, [r11, r5] @ read third pixel RGB color

turns into this shorter one:

ldrb r3, [r1], #1  @ read first pixel value
ldrb r4, [r1], #1  @ read second pixel value
ldrb r5, [r1], #1  @ read third pixel value
ldr r3, [r11, r3, lsl #2] @ read first pixel RGB color
ldr r4, [r11, r4, lsl #2] @ read second pixel RGB color
ldr r5, [r11, r5, lsl #2] @ read second pixel RGB color

Since the code performs 5 lookups per loop in total, this optimization saved 5 instructions, shortening the whole loop to 27 instructions only and increasing the speed to 98%.

The latter optimization done uses the very well known trick of the loop unrolling. Since the results are so good, I thought it was worth spending some code space. The loop has been unrolled 8 times, thus now it processes 40 input pixel each iteration before encountering the costly (3 cycles) branch instruction. Even this simple optimization proved to be very effective in terms of performance improvement.

Monday, September 03, 2012

Hardware generated smooth scaling

In my August posts I focused on the improvements done with DSx86's ARM ASM smooth scaling routine where I did my best to make it as fast as possible, knowing that every CPU cycle saved there would turn useful in the emulation main loop. Then it took me a few more months to realize that actually the same result can be achieved by properly programming the NDS 2D graphical core. So here's how I did it.

The smooth scaling routine takes groups of five 256-color pixels on the same line and turns them into four 32K-color pixels on the DS screen by performing many palette lookups and regular/weighted averages, as we've seen already. The DS 2D core, on the other hand, can perform alpha blending between two backgrounds, without requiring any effort from the CPU. This alpha blending feature can achieve nothing less than an average between each pixel of the first background and the corresponding pixel on the second background, returning a 32K-color image.(1) Additionally, the 2D core can also perform background scaling. We need to exploit both these features.

Let's define the 5 original pixels as p0-p4, and the resulting 4 output pixels as r0-r3. What we need to get is:

r0 as the sum of 3/4 p0 and 1/4 p1
r1 as the sum of 1/2 p1 and 1/2 p2
r2 as the sum of 1/4 p2 and 3/4 p3
r3 as p4

If we could blend 4 backgrounds together we could simply copy specific pixels in the 4 backgrounds to obtain this (please check that each output pixel is exactly as expected):

BG0: p0 p1 p2 p4
BG1: p0 p1 p3 p4
BG2: p0 p2 p3 p4
BG3: p1 p2 p3 p4

Since the 2D core can do background scaling, we don't even need to copy specific pixels. Each background can be generated the way we need it starting from the unmodified original image stored in Video RAM using the scaling features. Thus, we program the 2D core to skip one source pixel each group of five, and choose which pixel has to be skipped.
For example, to generate each of the backgrounds (the code does that for BG2), we have to program the background affine matrix to scale a 320-pixel wide image in a 256-pixel wide background:

REG_BG2PA = (320 << 8) / 256;
REG_BG2PB = 0;
REG_BG2PC = 0;
REG_BG2PD = (1 << 8);


Then we should tell the 2D core to skip pixel p1. This is accomplished by using the reference point X coordinate register:

REG_BG2X = (3 << 8) / 4;

You can think of this register as if it was a sort of a counter of the fractional part. We initialize it to a precise value (3/4, in this case) and after each output pixel has been generated, 1/4 gets added to this counter. (It's because 320 divided by 256 gives 1 plus a fractional part of 1/4). When the counter reaches the unit, the scaling process skips one pixel of the original image, and in this case this will happen after processing one pixel. We can also tell the 2D core to use the same 320x200 bitmap for all the backgrounds, then program different reference point X coordinate values for each background.

Unfortunately, what we can't ask the 2D core is to blend all 4 backgrounds at the same time. However, we can make it blend 2 of these backgrounds each frame, and blend the other 2 backgrounds the next frame, at 60 frames per second.(2) The LCD screen and our retinas will average the 2 generated images, providing in fact the expected result.

DSx86 actually uses a slightly different implementation. It performs vertical scaling at the same time (200 lines down to 192 in VGA "Mode 13h" and 240 lines down to 192 in VGA "Mode X", using different affine matrices) in the so-called 'Jitter' mode.

(1) The DS screen output supports 18bpp color, and alpha blending is probably performed with even more precision.
(2) Since only BG2 and BG3 support bitmap backgrounds, the code will blend these two, redefining them as needed on each frame.