When you reach the top, keep climbing
Posted by Jim Leonard on March 15, 2011
(Rather than break up the discussion, I’ve edited this entry with the promised timing information at the end of the post.)
First off, you owe it to yourself to check out Paku Paku, the astonishingly great pac-man clone written by Jason Knight. Why astonishingly great? Because, as a hobbyist retrogaming project, it does everything right:
- Uses a 160×100 16-color tweakmode on CGA, PCjr/Tandy, EGA, VGA, and MCGA, despite only VGA being capable of a truly native 160×100 resolution
- Plays multi-voice sound and music through the PC speaker, Tandy/PCjr 3-voice chip, Gameblaster CMS, and Adlib (yes, CMS support!)
- Runs on any machine, even a slow stock 128K PCjr
- Has convincing game mechanics (ghosts have personalities, etc.)
- Comes will full Pascal+ASM source code
This is just as good a job, if not better, than I like to do with my retroprogramming stunts. Very impressive work!
One of the things I love about coding for the 8088/8086 is that all timings and behavior are known. Like other old platforms like the C64, Apple II, ZX Spectrum, etc. (or embedded platforms), it truly is possible to write the “best” code for a particular situation — no unpredictable caches or unknown architectures screwing up your optimization. Whenever I see a bit of 808x assembly that I like, I try to see if it can be reworked to be “best”. I downloaded Paku Paku just as much for the opportunity to read the source code as for the opportunity to play the game (which I did play, on my trusty IBM 5160).
On Mike Brutman’s PCjr programming forum, a discussion of optimizing for the 8088 broke out, with Jason giving his masked sprite routine inner loop as an example of how to do things fast:
lodsw mov bx,ax mov ax,es:[di] and al,bh or al,bl stosw
It takes advantage of his sprite/mask format by loading a byte of sprite data and a byte of the sprite mask with a single instruction, then it loads the existing screen byte, AND’s the sprite mask out of the background, OR’s the sprite data into the background, then writes the background data. It takes advantage of many 808x architecture quirks, such as the magic 1-byte LODS and STOS instructions (which read a word into/write a word out of AX and then auto-increment the SI or DI registers, setting up for the next load/store) , and the 808x’s affinity for the accumulator (AX, for which many operations are faster than for other registers). In the larger function, it’s unrolled, specialized for the size of the sprite. It’s pretty tight code.
However, one line (“MOV BX,AX”) bugged me, as it also bugged the author:
The sprite data format is stored as byteMask:byteData words which I point to with DS:SI for LODSW… which I then move to BX (which sucks, but is still faster than MOV reg16,mem; add SI,2) so I can use bh as the mask and bl as the data.
So, was that code “best”? Is there no faster way to write a masked sprite in 160×100 tweaked text mode on the 8088?
First, let’s look at his original code, with timings and size:
lodsw 16c 1b mov bx,ax 2c 2b mov ax,es:[di] 10c 3b and al,bh 3c 2b or al,bl 3c 2b stosw 15c 1b -------------------------- subtotal: 49c 11b total cycles (4c per byte): 93 cycles
On 8088, reading a byte of memory takes 4 cycles, whether it’s “MOV AX,mem” or the MOV instruction opcode itself. That’s why smaller slower code can sometimes win over larger faster code on 808x. So it’s important to take the size of the code into account when optimizing for speed.
Some background knowledge of how Paku Paku works can help us: The game does all drawing to an off-screen buffer that mirrors the video buffer, and when the screen needs to be updated, only the changed memory is copied to the video buffer. Because Jason does all drawing to an off-screen buffer in system RAM, and the video buffer is smaller than the size of a segment, you have room left over in that segment to store other stuff. So if you store your sprite data in that same segment after where the video buffer ends, you can get DS to point to both screen buffer AND sprite data. Doing that lets us point BX to the offset where the sprite is (it was originally meant to be an index register after all), and use the unused DX register to hold the sprite/mask. We can then rewrite the unrolled inner loop to this:
mov dx,[bx] 8+5=13c 2b ;load sprite data/mask lodsw 16c 1b ;load existing screen pixels and al,dh 3c 2b ;mask out sprite or al,dl 3c 2b ;or sprite data stosw 15c 1b ;store modified screen pixels inc bx 3c 2b ;move to next sprite data grouping -------------------------- subtotal: 53c 10b total cycles (4c per byte): 93 cycles
Although we saved a byte, it’s a wash — exactly the same number of cycles in practice. However, since he is already unrolling the sprite loop for extra speed, we can change INC BX to just some fixed offset in the loop every time we need to read more sprite data, like this:
mov dx,[bx+1] (next iteration) mov dx,[bx+2] (next iteration) mov dx,[bx+3]
By adding a fixed offset, we can get rid of the INC BX:
mov dx,[bx+NUM] 12+9=21c 3b ; "NUM" being the iteration in the loop at this point lodsw 16c 1b and al,dh 3c 2b or al,dl 3c 2b stosw 15c 1b ---------------------------- subtotal: 58c 9b total cycles (4c per byte): 94 cycles
We shaved two bytes off of the original, but we’re one cycle longer than the original. While the smaller code is most likely faster because of the 8088’s 4-byte prefetch queue, it’s frustrating from a purely theoretical standpoint.
Reverse-engineer extraordinaire Andrew Jenner thinks two steps ahead of me and provides the final optimization that not only gets the cycle count down, but frees up two registers (DX and SI) in the process. He writes only what is necessary, and since we need to skip over every other byte when writing in 160×100 mode, manually updates the DI index register to do so. The end result is obtuse to look at, but undeniably the fastest:
mov ax,[bx+NUM] 12+9=21c 3b ; “NUM” being the iteration in the loop at this point and al,[di] 9+5=14c 2b or al,ah 3c 2b stosb 11c 1b inc di 3c 1b ---------------------------- subtotal: 52c 9b total cycles (4c per byte): 88 cycles
…successfully squeezing blood from a stone.
Is this truly “best”? I think so. But to prove it, we have to time the code running on the real hardware. Thanks to Abrash’s Zen Timer, we have the following results:
- Jason’s original code as listed above, repeated three times to plot a 5×5 sprite: 48 microseconds
- My code block, three times with [bx], [bx+1], [bx+2]: 41 microseconds
- Andrew’s optimization, also written with [bx], [bx+1], [bx+2]: 37 microseconds
And just to make your head spin, check the comments for this entry — the resulting discussion shows that if you’re willing to rearrange both your sprite data and your thinking, you can get things even faster!