Optimizing for the 8088 and 8086 CPU: Part 1
Posted by Trixter on January 10, 2013
There is a small but slowly growing hobby around retroprogramming for old PCs and compatibles. This hobby has existed for decades for other platforms, as evidenced by the active demoscenes on each retro platform, but the IBM PC (and other 4.77MHz 8088 compatibles) has only recently started to gain that same sort of attention. As a public service to the 8088 retroprogramming community — “All four of you, huh?” — I’ve decided to write a crash-course on optimizing your code for maximum speed on the 8088. This information is targeted to people who already know either modern x86 assembly or assembly for other CPUs, and are programming for the 8088 or 8086 for the first time (or the first time in a long while).
Before we begin, let me clarify that while I’m using “8088” throughout most of this text, what I am writing applies equally to the 8086. The 8086 and 8088 are functionally identical, with the 8086 being slightly faster due to having a 16-bit bus and a larger prefetch queue, both of which are covered later. Despite the extra speed, what holds for 8088 optimization also applies to the 8086, so you can just equate the two for the remainder of this guide.
Contrary to what you might think about a CPU old enough to be Justin Bieber’s father, it is possible to wring acceptable speed out of an 8088 if you understand the situation most 8088s are forced into (the slow RAM of the IBM PC) and how to deal with it, as well as the CISC-like advantages the chip has. By understanding both, it is possible to write assembly code that can run faster than the best 6502 or Z80 code at similar clock speeds. Let’s look at both sides, and because I like hearing good news after bad, we’ll start with the bad news.
Disadvantages of the 8088
Slow RAM access. While other CPUs of the 1970s enjoy single-cycle access to a byte of memory, the 8088 takes 4 cycles to access a byte. The 8086 is a little better, and can access either one or two bytes in 4 cycles.
Tiny prefetch queue. The 8088 is made up of two halves, the Execution Unit (EU) and the Bus Interface Unit (BIU). They can work more or less independently, with the BIU grabbing the next instruction opcodes while the EU works on the previous ones. The only drawback to this arrangement is that the BIU only has a 4-byte buffer (the prefetch queue), so it can only “cache” up to 4 bytes in advance to feed the EU. As you can imagine, it is empty most of the time, because instructions execute faster than they can be fetched thanks to the sucky access time I mentioned in the previous paragraph. The 8086, again, is a little better; it has a 6-byte queue that it can usually keep full thanks to its faster RAM access.
Register specialization. The 8088 has four general-purpose registers, but all four of them are tied to specialized functions that you can’t use if you’re using them for, well, general purposes. For example, the LOOP instruction gives you a tight loop, but it uses CX as the counter, so you can’t use CX for anything else inside the loop.
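To make the CX conflict concrete, here is a minimal sketch (my own illustration, not taken from any particular program). It sums the high nibbles of 100 bytes at DS:SI. LOOP needs CX as its counter, and on the 8088 a multi-bit shift needs its count in CL, so the two fight over the same register and the loop counter has to be saved and restored on every pass. It assumes SI already points at the data.

```asm
        MOV  CX,100         ; LOOP implicitly counts down CX
        XOR  AX,AX          ; AX = running total
next:   MOV  BL,[SI]        ; fetch the next byte
        XOR  BH,BH          ; BX = byte, zero-extended
        PUSH CX             ; shift counts above 1 must live in CL,
        MOV  CL,4           ;   so the loop counter has to step aside
        SHR  BX,CL          ; BX = high nibble of the byte
        POP  CX             ; bring the loop counter back
        ADD  AX,BX          ; accumulate
        INC  SI             ; advance to the next byte
        LOOP next           ; decrement CX, jump if CX is not 0
```

One way out is to keep the count in another register and end the loop with DEC/JNZ instead of LOOP, which frees CL for the shift.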
Advantages of the 8088
CISC architecture. CISC was all the rage at Intel in 1976, so they built in some “metainstructions” that do several things with a single opcode. For example, MOVSB will copy a byte from DS:SI to ES:DI, then advance both SI and DI so you can do it again, and it does all three in much less time than if you did it yourself. Couple MOVSB with the REP prefix and you can do this repeatedly at high speed. XLAT is another one, which will replace a register value with a like value in a translation table. We’ll cover some of the better ones later in this guide.
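As a small sketch of XLAT in action, this turns the low nibble of AL into its ASCII hex digit with a 16-byte table. The table name `hexdigits` is made up for the example, and the code assumes DS addresses the segment holding the table.

```asm
hexdigits DB '0123456789ABCDEF' ; hypothetical lookup table in the data segment

        MOV  BX,OFFSET hexdigits ; BX = base of the translation table
        AND  AL,0Fh              ; keep only the low nibble (0..15)
        XLATB                    ; AL = byte at DS:[BX+AL], i.e. the hex digit
```

The lookup itself is a single 1-byte opcode; doing the same thing by hand means zero-extending AL into an index register and then issuing a separate MOV.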
Multiplication and division. Unlike most home computer CPUs of its era, the 8088 has built-in MUL and DIV instructions. If you need to do 16-bit multiplies or divisions where the operands/divisors are not known beforehand, nothing beats them. (If you need to mul/div smaller values, however, they’re slower than they should be and you can usually beat them using Quarter Square Multiplication or shift-and-adding.)
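Here is a minimal shift-and-add sketch for the case where one operand is a known constant, in this example multiplying AX by 10. It clobbers BX, and any overflow past 16 bits is simply lost, so treat it as an illustration rather than a drop-in replacement for MUL.

```asm
        SHL  AX,1       ; AX = x * 2
        MOV  BX,AX      ; BX = x * 2 (saved for later)
        SHL  AX,1       ; AX = x * 4
        SHL  AX,1       ; AX = x * 8
        ADD  AX,BX      ; AX = x * 8 + x * 2 = x * 10
```

That is five 2-byte instructions, so even when fetching the 10 bytes dominates (roughly 40 cycles at 4 cycles per byte), it comes in far under the 100+ cycles the Intel manual lists for a 16-bit register MUL.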
Register specialization. Hey, didn’t we mention this previously as a negative? Yes, but sometimes the specialization works in our favor. For example, some opcode encodings that use the accumulator (AX) are only a single byte in size, and as we’ll see later on, smaller is better. Heck, let’s start now:
Smaller Is Better
Because it takes 4 cycles to read a byte, and because the prefetch queue is so tiny, smaller code is usually better. When writing or optimizing your code, keep the following chart handy, which lists all general-purpose 8088 1-byte opcodes. (This is not a full list, just a list of what I find most useful. For a full list, read the 8086 Family Users Manual from Intel.) If you can replace something you’re doing with one of these instructions, it’s almost always a win.
|Opcode (hex)||Mnemonic||What it does|
|37||AAA||ASCII adjust AL (carry into AH) after addition|
|3F||AAS||ASCII adjust AL (borrow from AH) after subtraction|
|27||DAA||Decimal adjust AL after addition|
|2F||DAS||Decimal adjust AL after subtraction|
|98||CBW||Convert byte into word (AH = top bit of AL)|
|99||CWD||Convert word to doubleword (DX = top bit of AX)|
|F8||CLC||Clear carry flag|
|F9||STC||Set carry flag|
|F5||CMC||Complement carry flag|
|EC||IN AL,DX||Input byte from port DX into AL|
|9F||LAHF||Load: AH = flags SF ZF xx AF xx PF xx CF|
|9E||SAHF||Store AH into flags SF ZF xx AF xx PF xx CF|
|EE||OUT DX,AL||Output byte AL to port number DX|
|0E||PUSH CS||Set [SP-2] to CS, then decrement SP by 2|
|1E||PUSH DS||Set [SP-2] to DS, then decrement SP by 2|
|06||PUSH ES||Set [SP-2] to ES, then decrement SP by 2|
|16||PUSH SS||Set [SP-2] to SS, then decrement SP by 2|
|1F||POP DS||Set DS to top of stack, increment SP by 2|
|07||POP ES||Set ES to top of stack, increment SP by 2|
|17||POP SS||Set SS to top of stack, increment SP by 2|
|9C||PUSHF||Set [SP-2] to flags register, then decrement SP by 2|
|9D||POPF||Set flags register to top of stack, increment SP by 2|
|C3||RETN||Return to near caller (pop offset only)|
|D7||XLATB||Set AL to memory byte DS:[BX + unsigned AL]|
Something handy to print out and keep next to you while you code.
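As a sketch of the kind of swap the chart makes possible, here are two classic substitutions. Each pair is a set of alternatives, not a sequence, and each 1-byte form only gives the same result under the assumption noted in the comments.

```asm
; Zero-extend AL into AX, when AL is known to be below 80h
        XOR  AH,AH      ; 2 bytes
        CBW             ; 1 byte (98h): AH = sign bit of AL, which is 0 here

; Clear DX before a 16-bit DIV, when AX is known to be below 8000h
        XOR  DX,DX      ; 2 bytes
        CWD             ; 1 byte (99h): DX = sign bit of AX, which is 0 here
```

Every byte dropped is one fewer 4-cycle fetch, which is where the win comes from on an 8088 whose prefetch queue is usually empty.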
Register specializations suck, but when it comes to the accumulator (AX), Intel built in optimized forms of instructions that are one byte shorter, one cycle faster, or both. Try to reorganize your code so that AX or AL can be used for these optimized forms, especially in an inner loop: (“accum” here means either AX or AL, and “immed” means any immediate value)
|Instruction||What it does|
|ADC accum,immed||Add with carry|
|SBB accum,immed||Subtract with borrow|
|AND accum,immed||Logical AND|
|OR accum,immed||Logical OR|
|XOR accum,immed||Logical Exclusive-OR|
|IN AL,DX||Read from port|
|OUT DX,AL||Write to port|
|MOV mem,accum||Copy to memory|
|MOV accum,mem||Copy to register|
|CMP accum,immed||Compare (perform subtraction, but only set flags)|
|TEST accum,immed||Test (perform logical AND, but only set flags)|
|XCHG reg,AX||Exchange values|
That last one is a doozy; XCHG reg,AX is 1 byte and 3 cycles. This was part of Intel’s plan to pair both LOCK and XCHG together as a way to implement atomic semaphores, so they optimized it in the silicon. (See page 2-18 of the 8086 Family Users Manual for details.)
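To see the savings, here is a side-by-side sketch of a few accumulator forms next to the same operation on BX/BL, with the encoded bytes in the comments. `counter` is a hypothetical word variable, and `lo hi` stands for its 16-bit address.

```asm
        AND  AX,0FF0h       ; 25 F0 0F        3 bytes (accumulator short form)
        AND  BX,0FF0h       ; 81 E3 F0 0F     4 bytes

        CMP  AL,20h         ; 3C 20           2 bytes (accumulator short form)
        CMP  BL,20h         ; 80 FB 20        3 bytes

        MOV  AX,[counter]   ; A1 lo hi        3 bytes (accumulator short form)
        MOV  BX,[counter]   ; 8B 1E lo hi     4 bytes

        XCHG AX,BX          ; 93              1 byte  (accumulator short form)
        XCHG CX,BX          ; 87 + ModRM byte 2 bytes
```

In each pair the accumulator version drops the ModRM byte, so routing hot values through AX or AL in an inner loop saves a byte per instruction.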
It’s Called A Coprocessor For A Reason
If your project is going to work with floating point and needs both speed and accuracy, read this section. If you know you’ll never need that, skip to the next section.
Everyone knows that the 8087 math coprocessor is much faster than the 8088 if you need to perform IEEE floating point math operations (including square roots, tangents, arctangents, etc.). What people seem to forget is the word “coprocessor” in the name. It’s a true coprocessor, which means it can be crunching away on an operation in the background while the 8088 is off doing something else.
This is HUGE. If your program needs the accuracy of IEEE floating point and has to do a lot of difficult slow stuff with it, you can essentially get background computing of floating point for free. The sequence of operations for 8088 code is essentially this:
- Load 8087’s stack with values
- Give it a command
- Go off and do whatever you want
- When you’re ready for the result, issue a WAIT
- Pop your result(s) off the 8087 stack
This almost feels like cheating.
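The five steps above can be sketched like this, using a square root as the slow operation. `value` and `result` are hypothetical 64-bit reals and `counter` is a hypothetical word variable; the busy loop is just a stand-in for whatever useful integer work you have. Most assemblers targeting the 8087 already emit a WAIT in front of each coprocessor instruction, so the explicit FWAIT before FSTP is there mainly to mirror the list.

```asm
        FLD   QWORD PTR [value]     ; 1. load the 8087's stack
        FSQRT                       ; 2. give it a command; the 8087 grinds away on its own

        MOV   CX,1000               ; 3. meanwhile the 8088 does unrelated work,
busy:   INC   WORD PTR [counter]    ;    leaving the 8087 and its operands alone
        LOOP  busy

        FWAIT                       ; 4. block until the 8087 has finished
        FSTP  QWORD PTR [result]    ; 5. pop the result off the 8087 stack into memory
        FWAIT                       ; let the store complete before the 8088 reads result
```

The final FWAIT matters because the store to memory is performed by the 8087, so the 8088 should not read `result` until the coprocessor is idle again.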
Stringing Up The CPU
Another byproduct of Intel’s CISC rage was the string instructions. These are ludicrously powerful in the right circumstances, and you should use them whenever possible. Intel called them “string operations” because they were designed to assist in text string manipulation. Each of these opcodes is golden for three reasons:
- They are 1 byte long
- They perform multiple operations faster than the individual steps would take
- They can be automatically REPeated without using any jump/loop instructions
Here are the five string instructions:
LODSB – Load byte from DS:SI into AL, then advance SI
STOSB – Store byte in AL to ES:DI, then advance DI
MOVSB – Copy byte from DS:SI to ES:DI, then advance both SI and DI
SCASB – Load byte from ES:DI and compare it to AL (sets flags equal to a subtraction), then advance DI
CMPSB – Compare byte from DS:SI to byte at ES:DI (sets flags equal to a subtraction), then advance both SI and DI
But wait, that’s not all! You also get 16-bit versions of the same instructions!
LODSW – Load word from DS:SI into AX, then advance SI +2
STOSW – Store word in AX to ES:DI, then advance DI +2
MOVSW – Copy word from DS:SI to ES:DI, then advance both SI and DI +2
SCASW – Load word from ES:DI and compare it to AX (sets flags equal to a subtraction), then advance DI +2
CMPSW – Compare word from DS:SI to word at ES:DI (sets flags equal to a subtraction), then advance both SI and DI +2
These can be used individually, but really shine when they are used with a REP prefix, which will repeat them CX times (meaning, REP checks CX and, if it is not 0, runs the string instruction, decrements CX, and keeps going until CX reaches 0). The compare instructions, SCAS and CMPS, are used with additional repeat prefixes: REPE/REPZ (repeat while equal/zero) and REPNE/REPNZ (repeat while not equal/not zero), so that the loop ends (or continues) based on the result of the comparison.
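Here is a minimal sketch of plain REP at work: one block copy and one block fill. The buffer names `source` and `dest` are hypothetical, and the fill assumes a color text-mode screen at segment B800h. Note the CLD first: the direction flag decides whether SI and DI advance upward or downward, so clear it unless you really want to run backwards.

```asm
        CLD                     ; direction flag clear: SI/DI count upward

        ; Copy: move 1000 words from DS:SI to ES:DI
        PUSH DS
        POP  ES                 ; ES = DS (two 1-byte opcodes)
        MOV  SI,OFFSET source   ; hypothetical source buffer
        MOV  DI,OFFSET dest     ; hypothetical destination buffer
        MOV  CX,1000            ; number of words to copy
        REP  MOVSW              ; copy, advancing SI and DI by 2 each time

        ; Fill: blank an 80x25 text screen (2000 character/attribute words)
        MOV  AX,0B800h          ; assumed: color text-mode video segment
        MOV  ES,AX
        XOR  DI,DI              ; ES:DI = start of screen memory
        MOV  AX,0720h           ; space character (20h) with attribute 07h
        MOV  CX,2000            ; number of words to store
        REP  STOSW              ; store AX to ES:DI, CX times
```

Each transfer is a single REP-prefixed instruction, with no jump or loop opcode to fetch on every pass.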
There’s a lot you can make fun of the 8088 for, but nobody makes fun of the string instructions. If you need to copy memory around, scan a buffer for a value, fill a buffer with a certain value, or compare two buffers for equality, they are an order of magnitude faster than doing things the long way. I mean, seriously, a single CMPSW done the long way would look like this:
    PUSH AX          ; CMPSW doesn't change any registers, so we can't either
    PUSH BX
    MOV  AX,DS:[SI]  ; Load DS:SI somewhere
    ADD  SI,2        ; Advance SI
    MOV  BX,ES:[DI]  ; Load ES:DI somewhere
    ADD  DI,2        ; Advance DI
    CMP  AX,BX       ; Do the comparison (sets flags equal to subtraction)
    POP  BX          ; CMPSW doesn't change any registers, so we can't either
    POP  AX
Maybe now you’ll understand why I love the string opcodes so much!
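For the scan and compare cases, here is a hedged sketch of the conditional repeat prefixes. It assumes ES:DI (and, for the compare, DS:SI) already point at the buffers; the labels and buffer sizes are made up for the example.

```asm
        CLD                     ; direction flag clear: SI/DI count upward

        ; Scan: look for the first zero byte in the 256 bytes at ES:DI
        XOR  AL,AL              ; AL = value to look for
        MOV  CX,256             ; maximum number of bytes to check
        REPNE SCASB             ; keep going while byte <> AL and CX <> 0
        JNE  no_zero            ; ZF clear: all 256 bytes checked, no match
        DEC  DI                 ; DI stopped one past the match, so back up to it
no_zero:

        ; Compare: are the 100 words at DS:SI and ES:DI identical?
        ; (assumes SI and DI have been pointed at the two buffers)
        MOV  CX,100             ; number of words to compare
        REPE CMPSW              ; keep going while words match and CX <> 0
        JNE  differ             ; ZF clear: stopped on a mismatch
        ; ... fall through here when all 100 words matched
differ:
```

In both cases the flags left by the last comparison tell you why the loop stopped, which is why the test after the REP is on ZF rather than on CX.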
This is the end of Part 1 of our crash course. In Part 2, I’ll continue with various tips and examples. In Part 3, I’ll present a case study that shows what kind of benefit you can realize from taking the time to optimize for speed.