Optimizing for the 8088 and 8086 CPU: Part 2
Posted by Trixter on January 11, 2013
Welcome back to our little crash course on how to optimize code for maximum speed on the 8088 and 8086 CPU. Let’s jump right back in with a quick way to transmogrify the contents of a buffer.
Lost In Translation
Everyone knows that lookup tables trade memory for speed; if you can calculate all possible values for a particular operation, then performing the “calculation” is reduced to retrieving a value from the table. Intel thought of this when they were designing instructions to help with strings and created XLAT, which stands for “translate”. XLAT replaces AL with whatever AL is referencing in a table pointed to by DS:BX, and it does it in a single byte and takes only 11 cycles to do so. So you can replace this:
XOR AH,AH       ;zero-extend AL into AX
MOV SI,AX       ;use AX as an index
MOV AL,[BX+SI]  ;fetch table entry

…with a single XLAT.
Think of XLAT as “MOV AL,DS:[BX+AL]” (except that [BX+AL] is illegal on the 8088 and 8086.) Intel meant for XLAT to be “useful for translating characters from one code to another, the classic example being ASCII to EBCDIC or the reverse,” but nothing says you have to use it with strings. I’ve used it as a replacement for isolating nybbles, because the 8088 is slow at shifting if you have to do it more than a few times. For example, I can replace this:
MOV CL,4   ;set shift count
SHR AL,CL  ;AL = AL >> 4
with just XLAT, with BX pointing to a table that has all 256 values shifted right by 4. While this burns 256 bytes for the table, the end result is a shift that finishes in half the time of Intel's actual shift instructions. This has saved my bacon in inner loops.
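To make the trick concrete, here is a minimal C sketch of the table the XLAT version relies on: entry i simply holds i shifted right by 4, so a single table lookup stands in for the MOV CL,4 / SHR AL,CL pair. The names here are illustrative, not from the article.

```c
/* 256-entry lookup table: entry i holds i >> 4, so one XLAT-style
 * lookup replaces a 4-bit shift. */
unsigned char shr4_table[256];

void build_shr4_table(void)
{
    for (int i = 0; i < 256; i++)
        shr4_table[i] = (unsigned char)(i >> 4);
}
```

The same pattern works for any unary byte operation — precompute all 256 results once, then every "calculation" is a single lookup.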
Conditional branches (jumping to another location based on the result of a test) are a necessary evil of most programming languages. Unfortunately, they cost you in two ways. The first is the obvious cost — a jump takes some time to perform, which is never quick. The second cost is more subtle, but just as real: When a jump is performed, everything in the prefetch queue gets jettisoned, because it is no longer the correct stream of instructions to be executed. You really want the prefetch queue to be as full as possible most of the time, because when it isn't, the CPU has to wait around for its BIU (Bus Interface Unit) to grab instructions. This is less of a problem on the 8088 because it is hard to keep the queue full, but on the 8086, where it's easy to keep the queue full, this is a real penalty.
For those who need an example of why this is important, let’s say you were a bit of a nerd who loved to research everything you did (not really a stretch because you’re reading this guide right now!), and your girlfriend wanted to visit an art museum. To save time, you fetched a set of art books to read on the way there before it was time to leave. But just before you leave, your girlfriend changes her mind and wants to go to a baseball game instead. The art books you’re carrying now don’t apply to where you’re going, so you dump them and start fetching books on baseball instead, and now you’re both going to be late because it takes time to fetch books. Your girlfriend changing her mind is a conditional branch being taken, and dumping the first set of books because they don’t apply any more is emptying the prefetch queue.
Because of this property, you should always be thinking of ways to eliminate branches. This can range from obvious solutions to decidedly non-linear thinking to come up with a clever solution. This is difficult to visualize, so let's look at a few examples. Let's say you needed to set a register value to either 0 if it is even, or 4 if it is odd. Here's a straightforward way to do this:
TEST AL,1      ;test if first bit set
JZ @even       ;if not, it is an even number
MOV AL,4       ;odd number, set it to 4
JMP @continue  ;continue execution
@even:
MOV AL,0       ;even number, set it to 0
@continue:     ;execution continues
This works, but no matter what the number is, you’re going to take a jump. Here’s a branchless way of doing the same thing, recognizing that 4 is a power of two:
AND AL,1   ;AL = either 0 or 1
SHL AL,1   ;Shift AL left (AL = AL * 2)
SHL AL,1   ;Shift AL left (AL = AL * 2)
           ;AL = 0 if it was originally even, 4 if odd
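If the identity isn't obvious, the AND/SHL/SHL sequence is just computing (x & 1) << 2 — here it is as a C sketch, with each line mapped back to its instruction:

```c
/* Branchless "0 if even, 4 if odd": the AND/SHL/SHL sequence
 * computes (x & 1) << 2 without any jump. */
unsigned char even0_odd4(unsigned char x)
{
    x &= 1;   /* AND AL,1: keep only the even/odd bit */
    x <<= 1;  /* SHL AL,1 */
    x <<= 1;  /* SHL AL,1: bit 0 has been moved to bit 2 */
    return x;
}
```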
Let’s look at another common operation, converting a hex nybble value in AL to a printable ASCII character. First, the naive way:
CMP AL,10   ;Digit or letter?
JB @1       ;Jump if a digit
ADD AL,7    ;Adjust for letters
@1:
ADD AL,'0'  ;Convert to hex printable char
This works, sure, but if you stop and think about the problem for a few minutes, you might come up with a really cool branchless way to do the same thing, such as Norbert Juffa did:
CMP AL,10   ;if x<10, set CF=1
SBB AL,69h  ;0-9: 96h..9Fh, A-F: A1h..A6h
DAS         ;0-9: subtract 66h -> 30h-39h
            ;A-F: subtract 60h -> 41h-46h
(Of course, XLAT is the fastest way, but if space is at a premium or BX isn't free, XLAT isn't an option.)
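To see why the CMP/SBB/DAS sequence works, here is a C sketch that simulates it for inputs 0..15. The flag behavior is modeled by hand (CF from the CMP, the low-nibble borrow from the SBB), and the function name is illustrative:

```c
/* Simulate CMP AL,10 / SBB AL,69h / DAS for a nybble in 0..15. */
unsigned char nybble_to_ascii(unsigned char al)
{
    int cf = (al < 10);                   /* CMP AL,10: CF=1 when AL < 10 */
    int af = ((al & 0x0F) - 9 - cf) < 0;  /* low-nibble borrow of the SBB */
    al = (unsigned char)(al - 0x69 - cf); /* SBB AL,69h: 96h..9Fh or A1h..A6h */
    if ((al & 0x0F) > 9 || af)            /* DAS low-nibble adjust */
        al -= 6;
    al -= 0x60;                           /* DAS high adjust; CF is always
                                             set here for inputs 0..15 */
    return al;                            /* '0'..'9' or 'A'..'F' */
}
```

Digits end up having 66h subtracted (landing on 30h–39h) while letters only have 60h subtracted (landing on 41h–46h) — the CMP's carry steers which path DAS takes, with no branch in sight.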
Let’s not forget about LOOP. LOOP is a branching instruction as well, and every time it loops, you lose 17 cycles. If you know you’ll be looping only a small number of times, unroll the loop. So this:
MOV CX,4
@trans:
LODSB        ;load byte from DS:SI to AL; inc SI
XLAT         ;translate using table at DS:BX
STOSB        ;store byte in AL to ES:DI; inc DI
LOOP @trans  ;decrease CX and loop while CX != 0
…turns into this:
LODSB  ;load byte from DS:SI to AL; inc SI
XLAT   ;translate using table at DS:BX
STOSB  ;store byte in AL to ES:DI; inc DI
LODSB  ;(do it three more times)
XLAT
STOSB
LODSB
XLAT
STOSB
LODSB
XLAT
STOSB
By unrolling the loop, we not only save 17 cycles per jump taken, but we have a better chance of keeping the prefetch queue full (and all three of those are 1-byte opcodes, so it’s definitely full).
If you absolutely must branch, at least try to arrange your code so that the common cases “fall through” the conditional branch. This is because a conditional branch takes 4 cycles if the jump is not taken, but 16 if it jumps. Arranging your code properly requires advance knowledge of the problem you’re trying to solve. For example, if you are processing some data and only one value needs special handling, arrange your code like this:
CMP reg,immed
JE @specialhandling
That way, most of the common cases will “fall through” in 4 cycles, with only the special case needing the full 16 cycles to jump to a new location for handling.
Random Optimization Tip Firehose
Use spare registers to preserve/restore variables when possible instead of the stack. If you need to preserve something and AX is free, XCHG AX,reg is smaller and faster than PUSH reg; POP reg.
Did you know JCXZ exists? JCXZ is a freebie left over from Intel implementing LOOP, and it jumps if CX=0, or falls through if it isn’t. So you can replace this:
OR CX,CX       ;test if CX is 0
JZ @somewhere  ;jump if it is

…with just:

JCXZ @somewhere  ;jump if CX is 0
Putting It All Together
Here are some random examples that illustrate what we’ve been talking about.
Example 1: All programming can be viewed as an exercise in caching (Terje Mathisen)
Translating a buffer from one set of values to another? String ops and XLAT are your friend:
@trans:
LODSB        ;load byte from DS:SI to AL; inc SI
XLAT         ;translate using table at DS:BX
STOSB        ;store byte in AL to ES:DI; inc DI
LOOP @trans  ;decrease CX and loop while CX != 0
Hey wait a second, we have 16-bit versions of those! We can use the 16-bit versions to speed things up by loading and storing two bytes at a time:
@trans:
LODSW        ;load word from DS:SI to AX; SI=SI+2
XLAT         ;translate AL using table at DS:BX
XCHG AH,AL   ;swap AH and AL
XLAT         ;translate AL again
XCHG AH,AL   ;swap AH and AL
STOSW        ;store word in AX to ES:DI; DI=DI+2
LOOP @trans  ;decrease CX and loop while CX != 0
While the 8086 benefits more from this because it can snag both bytes in 4 cycles, it is still a speed win on the 8088 because we are loading and storing double the bytes with the same 1-byte load/store string instructions. And although we have three more instructions sandwiched in between, the prefetch queue helps out just enough that the overall speed is faster than the first example.
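The word-at-a-time loop does exactly the same work as translating one byte at a time; it just processes two table lookups per iteration. Here's a C sketch of that structure, assuming an even-length buffer to match the word count loaded into CX (names are illustrative):

```c
#include <stddef.h>

/* Translate src into dst two bytes per iteration, mirroring the
 * LODSW / XLAT / XCHG / XLAT / XCHG / STOSW loop. Assumes len is even. */
void translate_words(const unsigned char *src, unsigned char *dst,
                     const unsigned char table[256], size_t len)
{
    for (size_t i = 0; i < len; i += 2) {
        dst[i]     = table[src[i]];     /* XLAT on the low byte (AL)       */
        dst[i + 1] = table[src[i + 1]]; /* XCHG AH,AL then XLAT on AH's byte */
    }
}
```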
Example 2: Working in 16 bits, branchless
Copying a block of memory around and you don’t know how large the block will be beforehand? The general-purpose way is this:
MOV CX,immed  ;CX = number of bytes to copy
REP MOVSB     ;copy CX bytes from DS:SI to ES:DI
That handles both even and odd lengths, but as we saw from the previous example, it is faster to do things 16 bits at a time. Here’s a way to use the 16-bit string ops and also correctly handle odd lengths, and all without branches:
MOV CX,immed
SHR CX,1   ;CX = CX / 2; even/odd bit went into carry
REP MOVSW  ;Copy by words
ADC CX,CX  ;Add carry to CX; CX = 1 if odd, 0 if even
REP MOVSB  ;Will copy either 1 byte if odd or do nothing if even
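The arithmetic behind the SHR/ADC pair is just splitting the byte count into a word count plus an optional trailing byte. A C sketch of the same branchless split (function name is illustrative):

```c
#include <string.h>
#include <stddef.h>

/* Copy n bytes as n/2 words plus an optional trailing byte, mirroring
 * SHR CX,1 / REP MOVSW / ADC CX,CX / REP MOVSB. No branch on parity:
 * a zero-length tail copy simply does nothing, like REP with CX=0. */
void copy_block(void *dst, const void *src, size_t n)
{
    size_t words = n >> 1; /* SHR CX,1: halve the count...              */
    size_t tail  = n & 1;  /* ...and the odd bit is the "carry"         */
    memcpy(dst, src, words * 2);                      /* REP MOVSW      */
    memcpy((char *)dst + words * 2,
           (const char *)src + words * 2, tail);      /* REP MOVSB: 1 or 0 */
}
```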
Example 3: Instruction reordering
Try to reorganize code with the prefetch queue in mind if changing the order of instructions doesn’t affect what your code is trying to do. Here’s a sample “before” sequence:
INC SI
ADD SI,BX
MUL CX
CMP AL,3
JBE loc
In the above, we notice that MUL takes a long time, there is a prefetch-queue-emptying jump shortly after it, and the instructions above the MUL neither depend on nor affect MUL's result. So by simply rearranging some code to come after the MUL…
MUL CX
INC SI
ADD SI,BX
CMP AL,3
JBE loc
…we can guarantee the prefetch queue will fill up while the slow MUL is executing, and then what’s in the queue gets executed faster before the queue eventually gets emptied.
Reference Materials and Further Reading
Thus ends our coverage of general tips and tricks. Believe it or not, this article is really only a beginner's guide to 808x optimization; for the most comprehensive coverage of writing fast 808x code, I highly recommend Michael Abrash's seminal work on the subject, Zen of Assembly Language. That book, as well as other helpful references for programming the 8088 (and CGA too) including Intel's original documentation on the 8086, can be found inside ftp://ftp.oldskool.org/pub/misc/8088%20Programming.rar if you can't find a dead tree copy elsewhere (everything in that archive is long out of print). All PDFs inside the archive have been OCR'd and are searchable. (One note about Intel's 8086 documentation: All timings are for both byte and word transfers. If you're counting cycles on 8088, be sure to add 4 additional cycles for word transfers.)
There are some additional techniques for gaining even more speed on the IBM PC that are not specifically 8088-related, such as temporarily disabling DRAM refresh, copying memory using the DMA controller, or getting the entire machine into lockstep so that you can use software timing for events instead of interrupts. Those are highly specialized and beyond the scope of this guide, so I won’t be covering them any time soon, sorry. (They’re also difficult to pull off without crashing the computer, and don’t work on machines that aren’t truly 100% compatible with the original IBM PC, so you’re not missing much.)
In Part 3, I’ll show you how I utilized all of the above knowledge to implement realtime decompression on the 8088 (and by “realtime” I mean “decompression faster than REP MOVSW”). Not a misprint! Stay tuned.