Oldskooler Ramblings

the unlikely child born of the home computer wars

Optimizing for the 8088 and 8086 CPU: Part 2

Posted by Trixter on January 11, 2013


Welcome back to our little crash course on how to optimize code for maximum speed on the 8088 and 8086 CPU. Let’s jump right back in with a quick way to transmogrify the contents of a buffer.

Lost In Translation

Everyone knows that lookup tables trade memory for speed; if you can calculate all possible values for a particular operation, then performing the “calculation” is reduced to retrieving a value from the table. Intel thought of this when they were designing instructions to help with strings and created XLAT, which stands for “translate”. XLAT replaces AL with whatever AL is referencing in a table pointed to by DS:BX, and it does so in a single byte, taking only 11 cycles. So you can replace this:

XOR AH,AH
MOV SI,AX
MOV AL,[BX+SI]

with this:

XLAT

Think of XLAT as “MOV AL,DS:[BX+AL]” (except that [BX+AL] is illegal on the 8088 and 8086.)  Intel meant for XLAT to be “useful for translating characters from one code to another, the classic example being ASCII to EBCDIC or the reverse,” but nothing says you have to use it with strings. I’ve used it as a replacement for isolating nybbles, because the 8088 is slow at shifting if you have to do it more than a few times. For example, I can replace this:

MOV CL,4
SHR AL,CL

with just XLAT, with BX pointing to a table that holds all 256 values shifted right by 4.  While this burns 256 bytes for the table, the end result is a shift that finishes in half the time of Intel’s actual shift instructions.  This has saved my bacon in inner loops.
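To make the trick concrete, here is a minimal Python sketch of the same idea (the table name is mine): precompute all 256 results of “shift right by 4” so the shift collapses into a single table lookup, just as XLAT collapses it into a single instruction.

```python
# 256-byte table of every possible "shift right by 4" result,
# playing the role of the table BX points at for XLAT.
SHR4_TABLE = bytes(b >> 4 for b in range(256))

def high_nybble(al: int) -> int:
    # Equivalent of XLAT: AL is replaced by SHR4_TABLE[AL].
    return SHR4_TABLE[al]
```

The lookup gives exactly the same answer as the shift for every possible byte, which is what makes the substitution safe.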

Branching Out

Conditional branches (jumping to another location based on the result of a test) are a necessary evil of most programming languages. Unfortunately, they cost you in two ways. The first is the obvious cost — a jump takes some time to perform, which is never quick. The second cost is more subtle, but just as real: When a jump is performed, everything in the prefetch queue gets jettisoned, because it is no longer the correct stream of instructions to be executed. You really want the prefetch queue to be as full as possible most of the time, because when it isn’t, the CPU has to wait around for its BIU to grab instructions. This is less of a problem on the 8088 because it is hard to keep the queue full, but on the 8086 where it’s easy to keep the queue full, this is a real penalty.

For those who need an example of why this is important, let’s say you were a bit of a nerd who loved to research everything you did (not really a stretch because you’re reading this guide right now!), and your girlfriend wanted to visit an art museum. To save time, you fetched a set of art books to read on the way there before it was time to leave. But just before you leave, your girlfriend changes her mind and wants to go to a baseball game instead. The art books you’re carrying now don’t apply to where you’re going, so you dump them and start fetching books on baseball instead, and now you’re both going to be late because it takes time to fetch books. Your girlfriend changing her mind is a conditional branch being taken, and dumping the first set of books because they don’t apply any more is emptying the prefetch queue.

Because of this property, you should always be thinking of ways to eliminate branches. Solutions range from the obvious all the way to decidedly non-linear thinking. This is easier to show than to describe, so let’s look at a few examples. Let’s say you needed to set a register value to 0 if it is even, or 4 if it is odd. Here’s a straightforward way to do this:

  TEST AL,1     ;test if first bit set
  JZ @even      ;if not, it is an even number
  MOV AL,4      ;odd number, set it to 4
  JMP @continue ;continue execution
@even:
  MOV AL,0      ;even number, set it to 0
@continue:      ;execution continues

This works, but no matter what the number is, you’re going to take a jump. Here’s a branchless way of doing the same thing, recognizing that 4 is a power of two:

AND AL,1        ;AL = either 0 or 1
SHL AL,1        ;Shift AL left (AL = AL * 2)
SHL AL,1        ;Shift AL left (AL = AL * 2)
                ;AL = 0 if it was originally even, 4 if odd
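In high-level terms, the branchless sequence does this (a Python sketch of the same logic, not part of the original code):

```python
def even0_odd4(al: int) -> int:
    # AND AL,1 isolates the low (even/odd) bit.
    al &= 1        # AL = 0 if even, 1 if odd
    # Two SHL AL,1 in a row multiply by 4.
    al <<= 1       # AL = AL * 2
    al <<= 1       # AL = AL * 2  -> 0 or 4
    return al
```

No test, no jump: the answer falls out of the arithmetic itself.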

Let’s look at another common operation, converting a hex nybble value in AL to a printable ASCII character. First, the naive way:

  CMP AL,10     ;Digit or letter?
  JB @1         ;Jump if a digit
  ADD AL,7      ;Adjust for letters
@1:
  ADD AL,'0'    ;Convert to hex printable char

This works, sure, but if you stop and think about the problem for a few minutes, you might come up with a really cool branchless way to do the same thing, such as Norbert Juffa did:

CMP AL,10       ;if AL < 10, set carry flag
SBB AL,69h      ;0-9: subtract 6Ah -> 96h-9Fh
                ;A-F: subtract 69h -> A1h-A6h
DAS             ;0-9: subtract 66h -> 30h-39h
                ;A-F: subtract 60h -> 41h-46h
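If the flag juggling is hard to follow, here is a Python model (my own sketch) of what the CMP/SBB/DAS sequence does for inputs 0–15, with the carry and auxiliary-carry flags tracked explicitly:

```python
def nybble_to_ascii(al: int) -> str:
    # CMP AL,10: CF is set when AL < 10 (a borrow would occur)
    cf = 1 if al < 10 else 0
    # SBB AL,69h: subtract 69h plus the borrow
    sub = 0x69 + cf
    af = (al & 0x0F) < (sub & 0x0F)   # borrow out of the low nybble
    al = (al - sub) & 0xFF            # CF is set again here for al in 0..15
    # DAS: subtract 6 if the low nybble borrowed (or is > 9),
    # then subtract 60h because CF was set by the SBB
    if af or (al & 0x0F) > 9:
        al = (al - 6) & 0xFF
    al = (al - 0x60) & 0xFF
    return chr(al)
```

Running it over all sixteen inputs yields exactly “0123456789ABCDEF”, which is why the three-instruction sequence needs no branch at all.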

(Of course, XLAT is the fastest way, but if space is at a premium or BX isn’t free, XLAT isn’t an option.)

Let’s not forget about LOOP. LOOP is a branching instruction as well, and every time it loops, you lose 17 cycles. If you know you’ll be looping only a small number of times, unroll the loop. So this:

  MOV CX,4
@trans:
  LODSB         ;load byte from DS:SI to AL; inc SI
  XLAT          ;translate using table at DS:BX
  STOSB         ;store byte in AL to ES:DI; inc DI
  loop @trans   ;decrease CX and loop while CX != 0

…turns into this:

LODSB           ;load byte from DS:SI to AL; inc SI
XLAT            ;translate using table at DS:BX
STOSB           ;store byte in AL to ES:DI; inc DI
LODSB           ;(do it three more times)
XLAT
STOSB
LODSB
XLAT
STOSB
LODSB
XLAT
STOSB

By unrolling the loop, we not only save 17 cycles per jump taken, but we have a better chance of keeping the prefetch queue full (and all three of those are 1-byte opcodes, so it’s definitely full).

If you absolutely must branch, at least try to arrange your code so that the common cases “fall through” the conditional branch. This is because a conditional branch takes 4 cycles if the jump is not taken, but 16 if it jumps. Arranging your code properly requires advance knowledge of the problem you’re trying to solve. For example, if you are processing some data and only one value needs special handling, arrange your code like this:

CMP reg,immed
JE @specialhandling

That way, most of the common cases will “fall through” in 4 cycles, with only the special case needing the full 16 cycles to jump to a new location for handling.

Random Optimization Tip Firehose

Use spare registers to preserve/restore variables when possible instead of the stack. If you need to preserve something and AX is free, XCHG AX,reg is smaller and faster than PUSH reg; POP reg.

Did you know JCXZ exists? JCXZ is a freebie left over from Intel implementing LOOP, and it jumps if CX=0, or falls through if it isn’t. So you can replace this:

OR CX,CX        ;test if CX is 0
JZ @somewhere   ;jump if it is

with this:

JCXZ @somewhere

Putting It All Together

Here are some random examples that illustrate what we’ve been talking about.

Example 1: All programming can be viewed as an exercise in caching (Terje Mathisen)

Translating a buffer from one set of values to another? String ops and XLAT are your friend:

@trans:
  LODSB         ;load byte from DS:SI to AL; inc SI
  XLAT          ;translate using table at DS:BX
  STOSB         ;store byte in AL to ES:DI; inc DI
  loop @trans   ;decrease CX and loop while CX != 0

Hey wait a second, we have 16-bit versions of those! We can use the 16-bit versions to speed things up by loading and storing two bytes at a time:

@trans:
  LODSW         ;load word from DS:SI to AX; SI=SI+2
  XLAT          ;translate AL using table at DS:BX
  XCHG AH,AL    ;swap AH and AL
  XLAT          ;translate AL again
  XCHG AH,AL    ;swap AH and AL
  STOSW         ;store word in AX to ES:DI; DI=DI+2
  loop @trans   ;decrease CX and loop while CX != 0

While the 8086 benefits more from this because it can snag both bytes in 4 cycles, it is still a speed win on the 8088 because we are loading and storing twice the bytes with the same 1-byte load/store string instructions. And although we have three more instructions sandwiched in between, the prefetch queue helps out just enough that the overall speed is faster than the first example.
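For reference, the word-at-a-time loop boils down to this (a Python sketch; the function and table names are mine): pull two bytes per load, run each through the table, and store two bytes per write.

```python
def translate_words(src: bytes, table: bytes) -> bytes:
    assert len(src) % 2 == 0           # CX counts words here, not bytes
    out = bytearray()
    for i in range(0, len(src), 2):
        al, ah = src[i], src[i + 1]    # LODSW loads both bytes of the word
        al = table[al]                 # XLAT translates the low byte
        ah = table[ah]                 # XCHG AH,AL / XLAT / XCHG AH,AL
        out += bytes((al, ah))         # STOSW writes both bytes back
    return bytes(out)
```

The result is byte-for-byte identical to the LODSB/XLAT/STOSB version; only the number of load/store operations changes.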

Example 2: Working in 16 bits, branchless

Copying a block of memory around and you don’t know how large the block will be beforehand? The general-purpose way is this:

MOV CX,immed
REP MOVSB

That handles both even and odd lengths, but as we saw from the previous example, it is faster to do things 16 bits at a time. Here’s a way to use the 16-bit string ops and also correctly handle odd lengths, and all without branches:

MOV CX,immed
SHR CX,1        ;CX = CX / 2; even/odd bit went into carry
REP MOVSW       ;Copy by words
ADC CX,CX       ;Add carry to CX; CX = 1 if odd, 0 if even
REP MOVSB       ;Will copy either 1 byte if odd or do nothing if even
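In high-level terms, the branchless copy does this (a Python sketch of the same logic; the function name is mine):

```python
def copy_block(src: bytes, n: int) -> bytes:
    # SHR CX,1: halve the count; the even/odd bit drops into the carry flag
    words = n >> 1
    carry = n & 1
    out = bytearray(src[:words * 2])             # REP MOVSW copies by words
    out += src[words * 2 : words * 2 + carry]    # ADC CX,CX / REP MOVSB:
    return bytes(out)                            # copies 1 byte or nothing
```

Odd or even, every byte arrives, and no conditional jump was ever taken.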

Example 3: Instruction reordering

Try to reorganize code with the prefetch queue in mind if changing the order of instructions doesn’t affect what your code is trying to do. Here’s a sample “before” sequence:

INC SI
ADD SI,BX
MUL CX
CMP AL,3
JBE loc

In the above, we notice that MUL takes a long time, there is a prefetch-queue-emptying jump shortly after it, and the instructions above the MUL neither depend on nor affect MUL’s result. So by simply rearranging some code to come after the MUL…

MUL CX
INC SI
ADD SI,BX
CMP AL,3
JBE loc

…we can guarantee the prefetch queue will fill up while the slow MUL is executing, and then what’s in the queue gets executed faster before the queue eventually gets emptied.

Reference Materials and Further Reading

Thus ends our coverage of general tips and tricks. Believe it or not, this article is really only a beginner’s guide to 808x optimization; for the most comprehensive coverage of writing fast 808x code, I highly recommend Michael Abrash’s seminal work on the subject, Zen of Assembly Language. That book, as well as other helpful references for programming the 8088 (and CGA too) including Intel’s original documentation on the 8086, can be found inside ftp://ftp.oldskool.org/pub/misc/8088%20Programming.rar if you can’t find a dead tree copy elsewhere (everything in that archive is long out of print). All PDFs inside the archive have been OCR’d and are searchable.  (One note about Intel’s 8086 documentation: All timings are for both byte and word transfers.  If you’re counting cycles on 8088, be sure to add 4 additional cycles for word transfers.)

There are some additional techniques for gaining even more speed on the IBM PC that are not specifically 8088-related, such as temporarily disabling DRAM refresh, copying memory using the DMA controller, or getting the entire machine into lockstep so that you can use software timing for events instead of interrupts. Those are highly specialized and beyond the scope of this guide, so I won’t be covering them any time soon, sorry.  (They’re also difficult to pull off without crashing the computer, and don’t work on machines that aren’t truly 100% compatible with the original IBM PC, so you’re not missing much.)

In Part 3, I’ll show you how I utilized all of the above knowledge to implement realtime decompression on the 8088 (and by “realtime” I mean “decompression faster than REP MOVSW”). Not a misprint! Stay tuned.

11 Responses to “Optimizing for the 8088 and 8086 CPU: Part 2”

  1. Terje Mathisen said

    I have a dirty trick for you, for when you’re storing or moving a possibly odd number of bytes:

    As long as it is safe to overwrite one additional target byte (maybe because you are in the middle of filling it anyway, as in LZ4), you can do the following:

    ; Match block starting at [di-2], so store words:
    mov ax,es:[di-2]
    inc cx ; Round up for the shift
    shr cx,1
    rep stosw ; Will include any final odd byte
    adc di,-1 ; Subtract unless original count was even!

    Another possible micro-optimization:

    Instead of testing for reg == 255, you can INC and check for zero:

    inc al
    jz more_bytes

    instead of
    cmp al,255
    je more_bytes

    • Trixter said

      Good catches! I am going to combine these along with Peter’s suggestions (to me over email) and put out a second release. He’s managed to squeeze out another 0.8% and I imagine these will bring that to a full 1% :-)

      I had never considered the possibility of overwriting the target in the name of speed — this is OK for LZ4 because LZ4 streams always end with five literals. I can’t wait to try this out tonight and time the result.

      • Trixter said

        I just realized the “inc al” trick is only a win if it’s “inc ax” because that’s the 1-byte form. But I’m pretty sure AH=0 so I should still be able to use this. I keep forgetting that INC doesn’t affect the carry flag!

        • Terje Mathisen said

          The INC trick only works with a full register, I actually had AH=255 for one of my attempts, that was an obvious (without actually running any code) loss compared to all the other places where I needed AH=0.

          I’m trying to avoid the need for XLAT:

          I know that 4 unrolled SHR CX,1 operations do take a while but the XLAT table reduces the maximum _input_ buffer size to 63.5K, and that might not be safe for all inputs.

          Putting the table in SS works of course, but then you need a segment override as well for the XLAT…

          • Trixter said

            Not sure if you checked the actual code yet but I put the table in the code segment on purpose so that the input buffer could be the full 64K (and not require any setup). So it’s CS: XLAT in the code, and the override is only a 6 cycle penalty. It’s still better than both SHR AL,1 4 times *and* MOV CL,4; SHR AL,CL.

        • you’d need to cbw before you can inc ax.

          • Trixter said

            Ah, true.

            I tried both that and just inc al and both were the same speed as the existing code.

            I tried the adc di,-1 last night and it was a win — it was only 7 microseconds on my test suite, but a win is a win :-) so Terje will get the credit for that in the next release which I hope to have published in the next 48 hours.

      • Terje Mathisen said

        I (intentionally) haven’t looked at your code yet, but I have a question: Have you done the obvious thing and followed both of the outlined extra-len loops with copies of the rest of the code, in order to avoid the need to jump back?

        Since lengths over 14 are very rare, I would assume this optimization to be correspondingly minimal!

        PS. I do agree that the LZ4 is beautiful, the algorithm lends itself to quite elegant code. :-)

        • Trixter said

          I did not, as the extra-len loops are called so infrequently. I did unroll it by half once, so that the “are we done yet” branch is also the branch that goes to the top of the loop again. It compiles to 446 bytes, no need to unroll again for a case that gets called maybe 10-20 times per 3000 :-)

          I am testing and timing your suggestions tonight; don’t download the code until the next version is done, else you might be suggesting things that have already been implemented. I will blog about the next version when it’s uploaded.

  2. nit: In example #1, when doing the `lodsw` instead of:

    LODSW ;load word from DS:SI to AL; SI=SI+2

    the comment should read (AX instead of AL):

    LODSW ;load word from DS:SI to AX; SI=SI+2

    (and the same for stosw)
