Oldskooler Ramblings

the unlikely child born of the home computer wars

Optimizing for the 8088 and 8086 CPU: Part 1

Posted by Trixter on January 10, 2013


There is a small but slowly growing hobby around retroprogramming for old PCs and compatibles. This hobby has existed for decades for other platforms, as evidenced by the active demoscenes on each retro platform, but the IBM PC (and other 4.77MHz 8088 compatibles) has only recently started to gain that same sort of attention. As a public service to the 8088 retroprogramming community — “All four of you, huh?” — I’ve decided to write a crash-course on optimizing your code for maximum speed on the 8088. This information is targeted to people who already know either modern x86 assembly or assembly for other CPUs, and are programming for the 8088 or 8086 for the first time (or the first time in a long while).

Before we begin, let me clarify that while I’m using “8088” throughout most of this text, what I am writing applies equally to the 8086. The 8086 and 8088 are functionally identical, with the 8086 being slightly faster due to having a 16-bit bus and a larger prefetch queue, both of which are covered later. Despite the extra speed, what holds for 8088 optimization also applies to the 8086, so you can just equate the two for the remainder of this guide.

Contrary to what you might think about a CPU old enough to be Justin Bieber’s father, it is possible to wring acceptable speed out of an 8088 if you understand the situation most 8088s are forced into (the slow RAM of the IBM PC) and how to deal with it, as well as the CISC-like advantages the chip has. By understanding both, it is possible to write assembly code that can run faster than the best 6502 or Z80 code at similar clock speeds. Let’s look at both sides, and because I like hearing good news after bad, we’ll start with the bad news.

Disadvantages of the 8088

Slow RAM access. While some other CPUs of the 1970s, such as the 6502, enjoy single-cycle access to a byte of memory, the 8088 takes 4 cycles to access a byte. The 8086 is a little better, and can access either one or two bytes in 4 cycles.

Tiny prefetch queue. The 8088 is made up of two halves, the Execution Unit (EU) and the Bus Interface Unit (BIU). They can work more or less independently, with the BIU grabbing the next instruction opcodes while the EU works on the previous ones. The only drawback to this arrangement is that the BIU only has a 4-byte buffer (the prefetch queue), so it can only “cache” up to 4 bytes in advance to feed the EU. As you can imagine, it is empty most of the time, because instructions execute faster than they can be fetched thanks to the sucky access time I mentioned in the previous paragraph. The 8086, again, is a little better; it has a 6-byte queue that it can usually keep full thanks to its faster RAM access.

Register specialization. The 8088 has four general-purpose registers, but all four of them are tied to specialized functions that you can’t use if you’re using them for, well, general purposes. For example, it’s possible to do a tight loop of some operations, but that uses CX as the counter, so you can’t use CX inside the loop.
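Here is a rough sketch of that conflict. On the 8088, a shift by more than one bit takes its count in CL, while LOOP counts down in CX, so the two fight over the same register. The task (shifting 100 bytes right by 3 bits), the labels, and the choice of DX as a substitute counter are all made up for illustration, and the syntax is generic MASM/TASM style. The two versions are alternatives, not meant to run back to back.

```asm
; Assumes DS:SI already points at a (hypothetical) 100-byte buffer.

; Version A: LOOP owns CX, so CL is not available as a shift count.
        MOV   CX,100            ; loop counter
NextA:  SHR   BYTE PTR [SI],1   ; the 8088 can only shift by 1 or by CL,
        SHR   BYTE PTR [SI],1   ; and CL is busy, so shift one bit at a
        SHR   BYTE PTR [SI],1   ; time (three read-modify-write accesses)
        INC   SI
        LOOP  NextA             ; decrement CX, jump if CX is not 0

; Version B: count in DX instead, which frees CL for the shift.
        MOV   CL,3              ; shift count
        MOV   DX,100            ; loop counter
NextB:  SHR   BYTE PTR [SI],CL  ; one memory read-modify-write per byte
        INC   SI
        DEC   DX                ; DEC + JNZ is 3 bytes; LOOP is 2
        JNZ   NextB
```

Version B spends one extra byte on loop control but touches memory once per element instead of three times, which is usually the better trade on a machine this starved for bus bandwidth.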

Advantages of the 8088

CISC architecture. CISC was all the rage at Intel in 1976, so they built in some “metainstructions” that do several things with a single opcode. For example, MOVSB will copy a byte from DS:SI to ES:DI, then advance both SI and DI so you can do it again, and it does all three in much less time than if you did it yourself. Couple MOVSB with the REP prefix and you can do this repeatedly at high speed. XLAT is another one, which replaces the value in AL with the byte at that index in a translation table pointed to by BX. We’ll cover some of the better ones later in this guide.

Multiplication and division. Unlike most home computer CPUs of its era, the 8088 has a built-in MUL and DIV. If you need to do 16-bit multiplies or divisions where the operands/divisors are not known beforehand, nothing beats them. (If you need to mul/div smaller values, however, they’re slower than they should be and you can usually beat them using Quarter Square Multiplication or shift-and-adding.)
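As a minimal sketch of shift-and-adding, here is a multiply by the constant 10, using the fact that 10 = 8 + 2. The constant and the use of BX as scratch are my own choices for illustration; the sketch assumes the result fits in 16 bits.

```asm
; AX = AX * 10 without MUL. Clobbers BX.
        SHL   AX,1          ; AX = x*2
        MOV   BX,AX         ; BX = x*2 (saved for later)
        SHL   AX,1          ; AX = x*4
        SHL   AX,1          ; AX = x*8
        ADD   AX,BX         ; AX = x*8 + x*2 = x*10
```

That is five 2-byte register-to-register instructions. Even when the time is dominated by fetching those 10 bytes, it should come in well under a 16-bit MUL, which Intel’s published timings put at over 100 cycles.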

Register specialization. Hey, didn’t we mention this previously as a negative? Yes, but sometimes the specialization works in our favor. For example, some opcode encodings that use the accumulator (AX) are only a single byte in size, and as we’ll see later on, smaller is better. Heck, let’s start now:

Smaller Is Better

Because it takes 4 cycles to read a byte, and because the prefetch queue is so tiny, smaller code is usually better. When writing or optimizing your code, keep the following chart handy, which lists the general-purpose 8088 1-byte opcodes I find most useful. (This is not a full list. For that, read the 8086 Family User’s Manual from Intel.) If you can replace something you’re doing with one of these instructions, it’s almost always a win.

Opcode  Instruction  Description
37      AAA          ASCII adjust AL (carry into AH) after addition
3F      AAS          ASCII adjust AL (borrow from AH) after subtraction
27      DAA          Decimal adjust AL after addition
2F      DAS          Decimal adjust AL after subtraction
98      CBW          Convert byte into word (AH = top bit of AL)
99      CWD          Convert word to doubleword (DX = top bit of AX)
F8      CLC          Clear carry flag
F9      STC          Set carry flag
F5      CMC          Complement carry flag
EC      IN AL,DX     Input byte from port DX into AL
9F      LAHF         Load: AH = flags SF ZF xx AF xx PF xx CF
9E      SAHF         Store AH into flags SF ZF xx AF xx PF xx CF
EE      OUT DX,AL    Output byte AL to port number DX
0E      PUSH CS      Set [SP-2] to CS, then decrement SP by 2
1E      PUSH DS      Set [SP-2] to DS, then decrement SP by 2
06      PUSH ES      Set [SP-2] to ES, then decrement SP by 2
16      PUSH SS      Set [SP-2] to SS, then decrement SP by 2
1F      POP DS       Set DS to top of stack, increment SP by 2
07      POP ES       Set ES to top of stack, increment SP by 2
17      POP SS       Set SS to top of stack, increment SP by 2
9C      PUSHF        Set [SP-2] to flags register, then decrement SP by 2
9D      POPF         Set flags register to top of stack, increment SP by 2
C3      RETN         Return to near caller (pop offset only)
D7      XLATB        Set AL to memory byte DS:[BX + unsigned AL]

Something handy to print out and keep next to you while you code.
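To make the chart concrete, here are two substitutions, sketched in generic MASM/TASM syntax. Table is a hypothetical 256-byte lookup table in the data segment, and the byte counts in the comments are for the standard 8086 encodings of these forms.

```asm
; Table lookup the long way: AL = Table[AL]. Clobbers BX.
        MOV   BL,AL             ; 2 bytes
        XOR   BH,BH             ; 2 bytes: BX = zero-extended index
        MOV   AL,[Table+BX]     ; 4 bytes (opcode, ModRM, 16-bit offset)

; The same lookup with XLAT, assuming BX was pointed at the table once,
; outside the inner loop.
        MOV   BX,OFFSET Table   ; 3 bytes, paid only once
        XLAT                    ; 1 byte per lookup: AL = DS:[BX+AL]

; Widening AL to AX when AL is known to be in 0..7Fh (or is signed):
        CBW                     ; 1 byte, versus 2 for MOV AH,0 or XOR AH,AH
```

The XLAT version only pays off if you can afford to keep BX parked on the table, which is the register-specialization trade-off all over again.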

Accumulating Speed

Register specializations suck, but when it comes to the accumulator (AX), Intel built in optimized forms of instructions that are one byte shorter, one cycle faster, or both. Try to reorganize your code so that AX or AL can be used for these optimized forms, especially in an inner loop: (“accum” here means either AX or AL, and “immed” means any immediate value)

Instruction        Description
ADD accum,immed    Add
SUB accum,immed    Subtract
ADC accum,immed    Add with carry
SBB accum,immed    Subtract with borrow
AND accum,immed    Logical AND
OR accum,immed     Logical OR
XOR accum,immed    Logical Exclusive-OR
IN AL,DX           Read from port
OUT DX,AL          Write to port
MOV mem,accum      Copy to memory
MOV accum,mem      Copy to register
CMP accum,immed    Compare (perform subtraction, but only set flags)
TEST accum,immed   Test (perform logical AND, but only set flags)
XCHG reg,AX        Exchange values

That last one is a doozy; XCHG reg,AX is 1 byte and 3 cycles. This was part of Intel’s plan to pair both LOCK and XCHG together as a way to implement atomic semaphores, so they optimized it in the silicon. (See page 2-18 of the 8086 Family User’s Manual for details.)
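To see the savings, compare the encodings side by side. This is only a sketch: Var is a hypothetical byte variable, “lo hi” stands for its 16-bit offset, and the hex bytes are the standard 8086 encodings. Any reasonable assembler picks the short accumulator form for you, so your job is just to get the data into AX or AL.

```asm
        ADD   BX,1234h      ; 81 C3 34 12   4 bytes
        ADD   AX,1234h      ; 05 34 12      3 bytes

        CMP   BL,20h        ; 80 FB 20      3 bytes
        CMP   AL,20h        ; 3C 20         2 bytes

        MOV   BL,[Var]      ; 8A 1E lo hi   4 bytes
        MOV   AL,[Var]      ; A0 lo hi      3 bytes

; Swapping AX and BX through a temporary (clobbers CX): 6 bytes
        MOV   CX,AX         ; 8B C8
        MOV   AX,BX         ; 8B C3
        MOV   BX,CX         ; 8B D9
; The same swap with the accumulator form: 1 byte, no scratch register
        XCHG  AX,BX         ; 93
```

At 4 cycles per byte fetched, each byte saved in an inner loop is worth roughly as much as an entire simple register instruction.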

It’s Called A Coprocessor For A Reason

If your project is going to work with floating point and needs both speed and accuracy, read this section. If you know you’ll never need that, skip to the next section.

Everyone knows that the 8087 math coprocessor is much faster than the 8088 if you need to perform IEEE floating point math operations (including square roots, tangents, arctangents, etc.). What people seem to forget is the word “coprocessor” in the name. It’s a true coprocessor, which means it can be crunching away on an operation in the background while the 8088 is off doing something else.

This is HUGE. If your program needs the accuracy of IEEE floating point and has to do a lot of difficult slow stuff with it, you can essentially get background computing of floating point for free. The sequence of operations for 8088 code is essentially this:

  1. Load 8087’s stack with values
  2. Give it a command
  3. Go off and do whatever you want
  4. When you’re ready for the result, issue a WAIT
  5. Pop your result(s) off the 8087 stack

This almost feels like cheating.
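The five steps above can be sketched like this. Radicand and Result are hypothetical 32-bit real variables, and BusyWork is a hypothetical routine that touches neither them nor the 8087. The explicit WAITs are there because the 8088 does not check on its own whether the 8087 is still busy before handing it another instruction. Some assemblers insert these WAITs automatically when targeting the 8087, in which case the extra ones are redundant but harmless; that behavior is an assumption you should check against your own assembler.

```asm
        WAIT                         ; make sure the 8087 is idle
        FLD   DWORD PTR [Radicand]   ; 1. push the operand onto the 8087 stack
        WAIT                         ; let the load finish
        FSQRT                        ; 2. start the slow operation
        CALL  BusyWork               ; 3. the 8088 runs this while the 8087 computes
        WAIT                         ; 4. block until the 8087 has finished
        FSTP  DWORD PTR [Result]     ; 5. pop the result into memory
        WAIT                         ; let the store finish before the 8088 reads Result
```

The more work you can pack into step 3, the closer the floating-point operation gets to being free.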

Stringing Up The CPU

Another byproduct of Intel’s CISC rage was the string instructions. These are ludicrously powerful in the right circumstances, and you should use them whenever possible. Intel called them “string operations” because they were designed to assist in text string manipulation. Each of these opcodes is golden for three reasons:

  • They are 1 byte long
  • They perform multiple operations faster than the individual steps would take
  • They can be automatically REPeated without using any jump/loop instructions

Here are the five string instructions:

LODSB – Load byte from DS:SI into AL, then advance SI
STOSB – Store byte in AL to ES:DI, then advance DI
MOVSB – Copy byte from DS:SI to ES:DI, then advance both SI and DI
SCASB – Load byte from ES:DI and compare it to AL (sets flags equal to a subtraction), then advance DI
CMPSB – Compare byte from DS:SI to byte at ES:DI (sets flags equal to a subtraction), then advance both SI and DI

But wait, that’s not all! You also get 16-bit versions of the same instructions!

LODSW – Load word from DS:SI into AX, then advance SI +2
STOSW – Store word in AX to ES:DI, then advance DI +2
MOVSW – Copy word from DS:SI to ES:DI, then advance both SI and DI +2
SCASW – Load word from ES:DI and compare it to AX (sets flags equal to a subtraction), then advance DI +2
CMPSW – Compare word from DS:SI to word at ES:DI (sets flags equal to a subtraction), then advance both SI and DI +2

These can be called individually, but really shine when they are used with a REP prefix, which will repeat them CX times (meaning, REP checks CX and stops if it is 0; otherwise it runs the string instruction, decrements CX, and checks again). The last two are used with additional repeat prefixes: REPE/REPZ (repeat while equal/zero) and REPNE/REPNZ (repeat while not equal/not zero), so that the loop ends (or continues) based on the result of the comparison. In all of the above, “advance” assumes the direction flag is clear (CLD); if it is set (STD), the registers are decremented instead.
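A few typical uses, as sketches rather than drop-in routines. Each fragment assumes the segment registers and pointers named in its comment have already been set up, and the fill value and counts are made-up numbers.

```asm
        CLD                     ; direction flag clear: SI/DI count upward

; Fill: store AX into CX words starting at ES:DI
        MOV   AX,0720h          ; hypothetical fill value
        MOV   CX,2000           ; number of words
        REP   STOSW

; Copy: move CX words from DS:SI to ES:DI
        MOV   CX,1000           ; number of words
        REP   MOVSW

; Search: length of a zero-terminated string at ES:DI
; (assumes a terminator exists within the next 65535 bytes)
        XOR   AL,AL             ; byte to look for
        MOV   CX,0FFFFh         ; maximum number of bytes to scan
        REPNE SCASB             ; stops just after the matching byte
        NOT   CX                ; CX = bytes scanned, terminator included
        DEC   CX                ; CX = string length
```

None of these need a jump or loop instruction, so once the 2-byte REP sequence has been fetched, the bus is free to spend all of its time moving your data.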

There’s a lot you can make fun of the 8088 for, but nobody makes fun of the string instructions. If you need to copy memory around, scan a buffer for a value, fill a buffer to a certain value, or compare two buffers for equality, they are an order of magnitude faster than doing things the long way. I mean, seriously, a single CMPSW done the long way would look like this:

PUSH AX        ; CMPSW doesn't change any registers, so we can't either
PUSH BX
MOV AX,DS:[SI] ; Load DS:SI somewhere
ADD SI,2       ; Advance SI
MOV BX,ES:[DI] ; Load ES:DI somewhere
ADD DI,2       ; Advance DI
CMP AX,BX      ; Do the comparison (sets flags equal to subtraction)
POP BX         ; CMPSW doesn't change any registers, so we can't either
POP AX

Maybe now you’ll understand why I love the string opcodes so much!
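And here is the payoff: comparing two whole buffers takes only a handful of bytes. This is a sketch that assumes DS:SI and ES:DI already point at the buffers; Count (a nonzero word count) and the Differ label are hypothetical names.

```asm
        CLD                     ; compare upward through memory
        MOV   CX,Count          ; number of words to compare (must not be 0)
        REPE  CMPSW             ; repeat while the words match and CX is not 0
        JNE   Differ            ; ZF clear: SI-2 and DI-2 address the mismatch
        ; falling through here means all Count words were equal
```

If Count could be zero, REPE CMPSW would not execute at all and the flags would be left over from whatever came before, so guard that case separately.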

This is the end of Part 1 of our crash course. In Part 2, I’ll continue with various tips and examples. In Part 3, I’ll present a case study that shows what kind of benefit you can realize from taking the time to optimize for speed.

28 Responses to “Optimizing for the 8088 and 8086 CPU: Part 1”

  1. Covoxer said

    > By understanding both, it is possible to write assembly code that can run as fast as the best 6502 or Z80 code at similar clock speeds.

    This statement is FALSE. With optimizations you can write code that is much faster than the best competing 8-bit CPU’s of that time. Writing lame code would result in about the same performance as optimized code for 8-bit CPU’s. I have already commented (with examples) this confusion of yours sometime ago here: https://trixter.oldskool.org/2011/06/04/at-a-disadvantage/

    Also, can you name any other personal computers of that era running at “similar clock speed”?

    > While other CPUs of the 1970s enjoy single-cycle access to a byte of memory, the 8088 takes 4 cycles to access a byte.

    This statement is FALSE again! How many cycles (I suppose you actually mean “clocks”?) did it take to access memory in 8080 or Z80?

    6502 had 1 clock access, but that limited most systems to 1MHz, resulting in slower memory access than 8088 (yes, the great C=64 had slower memory than the lame 8088 IBM PC !).
    It is also important to mention that many of the 8-bit personal computers of the era shared RAM between CPU and Video, often resulting in tremendous drop in actual memory speed.

    Your bashing of the 8088 is really unfair. In reality, IBM PC was at least as fast as any 8-bit personal computer of that time, even if you didn’t optimize the code. All these optimizations (most of them were not even considered to be optimizations, it was straightforward assembly programming, like using 8 bit memory addressing in 6502, which you’d hardly call an optimizing) resulted in much (in some cases – many times) faster code than any 8 bit PC could possibly achieve.

    • Roy Jacobs said

      How can you claim, in all seriousness, someone who is clearly in love with the platform to be ‘bashing’ it?

      • Covoxer said

        He, he. :-) Well, if this is the case, the reason may be to exaggerate something (the effect of the described optimizations for example – if you don’t use it, your code would be crawling slower than 6502 [lie]) or the overall success of the 8088 programmers (the slower PC was, the more impressive all the software running on it would look).
        I don’t know. But what I do know, is a real world performance of the 8088 in IBM PC and how it compares to the rest of the PC’s of the era. And it’s not hard to prove.
        Anyway, for whatever reason, Trixter is unfairly bashing 8088 here as he did here: https://trixter.oldskool.org/2011/06/04/at-a-disadvantage/
        Here’s quote: “The original IBM PC, despite appearances and bias on the part of both consumers and marketing, was actually the slowest popular personal computer on the market at the time of its release, even compared to the Apple II and Atari 400.”
        If that’s not bashing, then what is it?

        • Trixter said

          It’s meant to generate conversation and prove me wrong.

          Roy correctly read between the lines. I love the 8088. (Actually, I love the 8086 — the 8088 is hobbled.)

          • Covoxer said

            So we have a conversation. :-)

            Yes, of course 8086 was much better. Well, at least 8088 had not suffered this bus truncation as badly as 68008 did.

    • Trixter said

      I’ve updated the article to say “*faster* than the best 6502 or z80 code” since you feel so strongly about it. However, with the 6502 enjoying single-clock access to memory, I just can’t agree in spirit. 4-clock access to a byte on 8088 really hobbles the entire machine. Many common 8088 opcode forms are 2-4 bytes long; aren’t all 6502 opcodes 1 byte in size?

      Most people trying to program on 8088 for fun today are using compilers, so the code is fairly slow. Even compiled C code is slow. And if you decide to convert to naive assembler, it is STILL no guarantee it will run acceptably. That was the motivation for my guide, to show people that if you’re going to produce something fantastic on an 8088 PC, you’re really going to have to put the effort in.

      With the exception of Microsoft Flight Simulator, I would love to be proven wrong and shown some examples of games that ran significantly worse on C64/Z80 and/or significantly better on 8088. And remember, I’m talking about a 4.77Mhz 8088 with CGA, nothing faster. I really can’t think of any, with maybe the exception of Elite and maybe Stunt Track Racer.

  2. Covoxer said

    > The only drawback to this arrangement is that the BIU only has a 4-byte buffer (the prefetch queue).

    This is a drawback comparing to what? All the other competing CPU’s had 1 byte for these purposes. How can 4 bytes be a disadvantage to 1 byte?

    > The 8088 has four general-purpose registers, but all four of them are tied to specialized functions that you can’t use if you’re using them for

    It is a disadvantage comparing to what? All the competing CPU’s had accumulator ISA. How can accumulator based instruction set be better than what we have in 8088?
    Besides, comparing to 8 bit CPU’s, you should mention that 8088 had EIGHT general purpose 8 bit registers: al, ah, bl, bh, cl, ch, dl, dh.

    > For example, it’s possible to do a tight loop of some operations, but that uses CX as the counter, so you can’t use CX inside the loop.

    Do you imply that 6502 and Z80 were better in this respect?

    • Trixter said

      You are taking my comparison to the 6502/Z80 at the beginning of the guide and applying it to the rest of the guide. This was never implied by me. As soon as the guide starts proper, I am no longer referring to other 8-bit platforms.

      Yes, 8088 had 8 8-bit registers. Didn’t 6502 have 256 registers using zero-page? If not the same thing, disregard.

      Z80 has similar registers to the 8088 and it also has an alternate set that you can swap in and out. Z80 was not limited memory-wise the way the PC was, and ran at 3.5MHz in common implementations (Spectrum). I think it’s a valid argument to say the Z80 was faster than 8088, but if I am wrong about the Z80, I don’t mind being corrected.

      I think you’re misinterpreting the spirit of the guide. The intended spirit of the guide is to list all of the common pitfalls and how to get around them, and to motivate people to WANT to get around them in an energetic writing style. I’m a demoscener, not business systems developer.

      • Covoxer said

        Ah, you meant disadvantages to 8086. Sorry, got you wrong.

        No, 6502’s 256 registers are not the same since they require extra memory access and you can’t store op result in any of them, only in accumulator.

        Yes, Z80 at 3.5MHz is much closer to 8088 at 4.77MHz. But I would still vote for 8088 in general case. Z80 was still an accumulator based ISA. It had next to no 16 bit support. It had fewer instructions (no multiplication for example). The second register set was useful for quick context switching but was next to useless in means of doubling number of general purpose registers since you couldn’t easily use both banks simultaneously, only through memory. Also 8088 had more advanced addressing modes. And it had no 4 bytes queue. In our “in disadvantage” discussion, I have posted three more or less useful code samples. In two of them, Z80 was as fast as 8088 (3.5 vs 4.77 MHz), in the third one, Z80 was about 15 times slower due to microcoded multiplication in 8088 (all that comparing 3.5MHz Z80 with 4.77MHz 8088).
        Also, Dhrystone was about 4 times slower on 4MHz Z80 than on 4.77MHz 8088. Basically, suggesting that one could bother less about code optimization on 8088 than Z80. ;-)

        • Trixter said

          In the third part (which I’m still writing; part 2 is done and will be posted soon), I plan to cover how I was able to start with an assembler routine and, through iterative redesign and changes, was able to speed it up roughly 25%. I think that 25% is significant enough that it is worth optimizing 8088 code. Unfortunately, as most software written suggests, most people thought 8088 was “good enough” and didn’t spend too much time optimizing.

          I’ve seen C64 demos that display a rotating environment-mapped torus. Yes, it’s roughly 2fps, and I know its effective resolution is probably 80×50 or 40×25, but it’s still realtime. It makes me weep for my platform since I can’t see how that is possible on 8088.

          • Covoxer said

            Can you give me a link to that C64 demo you are talking about?
            I’m pretty sure that if it doesn’t use any VIC II features it can be done on PC (except for CGA colors limitation that is).

            • Trixter said

              I’m afraid I can’t find the name of the demo, although Mathematica by Reflex has a gouraud-shaded (no textures) torus at a decent framerate around the 7:15 mark.

              • Covoxer said

                The environment mapped sphere was a famous Amiga demo. Not sure about environment mapped torus on C64.

                The torus in Mathematica is not a rendering of the 3D object (obviously). It is simply a plot of the torus equation (where Z is used for the pixel color). With the use of lookup tables it can be very quick (like rotating cubes or fractals). No doubts this can be implemented on IBM PC (running in text mode – you’d have about the same resolution ;-) ).

            • Trixter said

              Mathematica was a bad example, sorry. For a real one, look at numen by taQuart. Some of the stuff in there is amazing for 1.77 MHz. They were very kind to release the sources, so I’ll have a look (now I have to learn 6502!)

              How can you use lookup tables to plot fractals? I’ve seen some neat julias on 8bit (both orbitals and the normal plot) but I can’t see how to use lookup tables to speed them up. I can draw one frame of mandelbrot 40×25 in about a second using 32-bit integer math…

  3. Optimus said

    Wow! I will keep reading these. There are enough opcodes and information I didn’t even know before. I wonder how these extrapolate to 386 programming. I say this because this is the oldest PC I have in my room right now and would like to learn how cycle counting works there, if it’s necessary, where I can gain from. But then again, I can read some chapters from Abrash’s Black Book for this and keep reading your guide :)

    • Trixter said

      Some of the advice in the guide applies to all optimization (such as reading/writing memory as little as possible), but be careful — much of the guide is appropriate for the 8086 and 8088 ONLY. For example, you definitely do NOT want to use XLAT, LOOP, or any of the BCD stuff on Pentium and higher as they are much slower. Even the very next CPU up, the 80186, has features like shifting and rotating using immediate values, and on the 80286 and higher the MUL/DIV are much much faster and you should use them for almost everything possible.

      Once you hit 386, it’s a whole new game. 32-bit-wide registers, additional segment registers, more addressing modes (which means LEA gains tremendous power), and if you work in protected mode you have true 32-bit pointers (and more speed, since you don’t need 66h/67h modifiers in front of everything). There’s a lot you have to learn/re-learn. If you plan on targeting a 386 or higher, this guide is NOT what you should be reading (except as a historical curiosity, or maybe to get some ideas that *do* apply to your target platform).

      • The 386SX is so severely memory constrained that many of the 8088 style size optimizations that would result in a loss on 386DX become a win on the 386SX. Curiously, in the modern x64 CPU bloatware era, aggressive space optimizations may actually run faster than their book numbers would indicate. If you can keep code in the small L1’s it’s going to be like 8-10X faster than a fully unrolled size porked “speed optimized” version that blows out the L1 and winds up flogging the L2.

  4. Optimus said

    Then again, do you know any PC emulator that tries to be more precise in cycles and can target a 8088/8086?
    I read all these discussions whether Z80/6502 or 8088 is better and as a Z80/6502 coder it makes me curious to try coding something on the 8088 and compare, but in an emulator that I can test performance fairly correctly. Dosbox wouldn’t make it for this.

    • Trixter said

      So far, PCem is the only emulator I’ve used where the author made a conscious effort to be cycle-exact. The end result is not quite cycle-exact, but there is a lot of attention to detail (he attempts to emulate CGA snow, for example) and it is as good as you will get without being in front of a real machine. Link to PCem is in my previous post.

  7. Mikkel Christiansen said

    PUSH AX ; CMPSW doesn’t change any registers, so we can’t either
    PUSH BX
    MOV AX,DS:[SI] ; Load DS:SI somewhere
    ADD SI,2 ; Advance SI
    MOV BX,ES:[DI] ; Load ES:DI somewhere
    ADD DI,2 ; Advance DI
    CMP AX,BX ; Do the comparison (sets flags equal to subtraction)
    POP BX ; CMPSW doesn’t change any registers, so we can’t either
    POP AX

    You don’t seem to take your own advice.

    PUSH AX ; CMPSW doesn’t change any registers, so we can’t either
    MOV AX,DS:[SI] ; Load DS:SI somewhere
    ADD SI,2 ; Advance SI
    CMP AX,ES:[DI] ; Do the comparison (sets flags equal to subtraction)
    LAHF ; Save flags
    ADD DI,2 ; Advance DI
    SAHF ; Restore flags
    POP AX ; CMPSW doesn’t change any registers, so we can’t either
