
Optimizing for the 8088 and 8086 CPU: Part 1

Posted by Trixter on January 10, 2013

There is a small but slowly growing hobby around retroprogramming for old PCs and compatibles. This hobby has existed for decades for other platforms, as evidenced by the active demoscenes on each retro platform, but the IBM PC (and other 4.77MHz 8088 compatibles) has only recently started to gain that same sort of attention. As a public service to the 8088 retroprogramming community — “All four of you, huh?” — I’ve decided to write a crash-course on optimizing your code for maximum speed on the 8088. This information is targeted to people who already know either modern x86 assembly or assembly for other CPUs, and are programming for the 8088 or 8086 for the first time (or the first time in a long while).

Before we begin, let me clarify that while I’m using “8088” throughout most of this text, what I am writing applies equally to the 8086. The 8086 and 8088 are functionally identical, with the 8086 being slightly faster due to having a 16-bit bus and a larger prefetch queue, both of which are covered later. Despite the extra speed, what holds for 8088 optimization also applies to the 8086, so you can just equate the two for the remainder of this guide.

Contrary to what you might think about a CPU old enough to be Justin Bieber’s father, it is possible to wring acceptable speed out of an 8088 if you understand the situation most 8088s are forced into (the slow RAM of the IBM PC) and how to deal with it, as well as the CISC-like advantages the chip has. By understanding both, it is possible to write assembly code that can run faster than the best 6502 or Z80 code at similar clock speeds. Let’s look at both sides, and because I like hearing good news after bad, we’ll start with the bad news.

Disadvantages of the 8088

Slow RAM access. While other CPUs of the 1970s enjoy single-cycle access to a byte of memory, the 8088 takes 4 cycles to access a byte. The 8086 is a little better, and can access either one or two bytes in 4 cycles.

Tiny prefetch queue. The 8088 is made up of two halves, the Execution Unit (EU) and the Bus Interface Unit (BIU). They can work more or less independently, with the BIU grabbing the next instruction opcodes while the EU works on the previous ones. The only drawback to this arrangement is that the BIU only has a 4-byte buffer (the prefetch queue). So it can only “cache” up to 4 bytes in advance to feed the EU. As you can imagine, it is empty most of the time, because instructions execute faster than they can be fetched thanks to the sucky access time I mentioned in the previous paragraph. The 8086, again, is a little better; it has a 6-byte queue that it can usually keep full thanks to its faster RAM access.

Register specialization. The 8088 has four general-purpose registers, but all four of them are tied to specialized functions that you can’t use if you’re using them for, well, general purposes. For example, it’s possible to do a tight loop of some operations, but that uses CX as the counter, so you can’t use CX inside the loop.

Advantages of the 8088

CISC architecture. CISC was all the rage at Intel in 1976, so they built in some “metainstructions” that do several things with a single opcode. For example, MOVSB will copy a byte from DS:SI to ES:DI, then advance both SI and DI so you can do it again, and it does all three in much less time than if you did it yourself. Couple MOVSB with the REP prefix and you can do this repeatedly at high speed. XLAT is another one, which replaces the value in AL with the byte found at that index in a translation table pointed to by BX. We’ll cover some of the better ones later in this guide.
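To make the MOVSB comparison concrete, here is a rough sketch of copying CX bytes both ways. It is a fragment in the same MASM-style syntax as the other listings in this guide, and it assumes DS:SI and ES:DI already point at the source and destination and that CX is not zero. The label name is made up for illustration.

```asm
; The long way: 9 bytes of code per pass through the loop
copy_loop:
        MOV  AL,[SI]        ; 2 bytes: load byte from DS:SI
        MOV  ES:[DI],AL     ; 3 bytes: store byte to ES:DI (segment override)
        INC  SI             ; 1 byte
        INC  DI             ; 1 byte
        LOOP copy_loop      ; 2 bytes: decrement CX, jump if CX is not 0

; The string way: 2 bytes total
        CLD                 ; make sure SI and DI count upward
        REP  MOVSB          ; copy CX bytes from DS:SI to ES:DI
```

The hand-written loop has to be refetched on every pass because each jump empties the prefetch queue, while REP MOVSB is fetched once and then spends its bus time on the data alone.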

Multiplication and division. Unlike most home computer CPUs of its era, the 8088 has a built-in MUL and DIV. If you need to do 16-bit multiplies or divisions where the operands/divisors are not known beforehand, nothing beats them. (If you need to mul/div smaller values, however, they’re slower than they should be and you can usually beat them using Quarter Square Multiplication or shift-and-adding.)
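As an illustration of shift-and-adding, here is a minimal sketch that multiplies AX by the constant 10. It assumes the result fits in 16 bits and that BX is free to use as scratch; both assumptions are mine, not a requirement of the technique.

```asm
; AX = AX * 10, built from x*8 + x*2
        SHL  AX,1           ; AX = x*2
        MOV  BX,AX          ; BX = x*2 (saved for later)
        SHL  AX,1           ; AX = x*4
        SHL  AX,1           ; AX = x*8
        ADD  AX,BX          ; AX = x*8 + x*2 = x*10
```

The 8088 can only shift by 1 or by CL, which is why the shifts are written out one at a time. Each of the five instructions is 2 bytes and takes only a few clocks to execute, whereas a 16-bit MUL is documented at well over 100 clocks.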

Register specialization. Hey, didn’t we mention this previously as a negative? Yes, but sometimes the specialization works in our favor. For example, some opcode encodings that use the accumulator (AX) are only a single byte in size, and as we’ll see later on, smaller is better. Heck, let’s start now:

Smaller Is Better

Because it takes 4 cycles to read a byte, and because the prefetch queue is so tiny, smaller code is usually better. When writing or optimizing your code, keep the following chart handy, which lists all general-purpose 8088 1-byte opcodes. (This is not a full list, just a list of what I find most useful. For a full list, read the 8086 Family Users Manual from Intel.) If you can replace something you’re doing with one of these instructions, it’s almost always a win.

Opcode	Instruction	Description
37	AAA	ASCII adjust AL (carry into AH) after addition
3F	AAS	ASCII adjust AL (borrow from AH) after subtraction
27	DAA	Decimal adjust AL after addition
2F	DAS	Decimal adjust AL after subtraction
98	CBW	Convert byte into word (AH = top bit of AL)
99	CWD	Convert word to doubleword (DX = top bit of AX)
F8	CLC	Clear carry flag
F9	STC	Set carry flag
F5	CMC	Complement carry flag
EC	IN AL,DX	Input byte from port DX into AL
9F	LAHF	Load: AH = flags SF ZF xx AF xx PF xx CF
9E	SAHF	Store AH into flags SF ZF xx AF xx PF xx CF
EE	OUT DX,AL	Output byte AL to port number DX
0E	PUSH CS	Set [SP-2] to CS, then decrement SP by 2
1E	PUSH DS	Set [SP-2] to DS, then decrement SP by 2
06	PUSH ES	Set [SP-2] to ES, then decrement SP by 2
16	PUSH SS	Set [SP-2] to SS, then decrement SP by 2
1F	POP DS	Set DS to top of stack, increment SP by 2
07	POP ES	Set ES to top of stack, increment SP by 2
17	POP SS	Set SS to top of stack, increment SP by 2
9C	PUSHF	Set [SP-2] to flags register, then decrement SP by 2
9D	POPF	Set flags register to top of stack, increment SP by 2
C3	RETN	Return to near caller (pop offset only)
D7	XLATB	Set AL to memory byte DS:[BX + unsigned AL]

Something handy to print out and keep next to you while you code.
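Here is a small sketch of the kind of substitution the chart invites. The CBW and CWD tricks only work under the stated assumptions about the top bit, and hex_table is a hypothetical 16-byte table containing '0123456789ABCDEF' that I am assuming lives in the data segment.

```asm
; Zero AH when AL is known to be below 80h (top bit clear)
        MOV  AH,0           ; 2 bytes
        CBW                 ; 1 byte, same result under that assumption

; Zero DX when AX is known to be below 8000h, e.g. before a 16-bit DIV
        XOR  DX,DX          ; 2 bytes
        CWD                 ; 1 byte, same result under that assumption

; Convert the low nibble of AL to an ASCII hex digit
        MOV  BX,OFFSET hex_table
        AND  AL,0Fh         ; keep only the low nibble
        XLATB               ; 1 byte: AL = DS:[BX + AL]
```

Note that CBW and CWD sign-extend, so if the top bit might be set they are not a safe replacement for clearing the register.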

Accumulating Speed

Register specializations suck, but when it comes to the accumulator (AX), Intel built in optimized forms of instructions that are one byte shorter, one cycle faster, or both. Try to reorganize your code so that AX or AL can be used for these optimized forms, especially in an inner loop: (“accum” here means either AX or AL, and “immed” means any immediate value)

Instruction	Description
ADD accum,immed	Add
SUB accum,immed	Subtract
ADC accum,immed	Add with carry
SBB accum,immed	Subtract with borrow
AND accum,immed	Logical AND
OR accum,immed	Logical OR
XOR accum,immed	Logical Exclusive-OR
IN AL,DX	Read from port
OUT DX,AL	Write to port
MOV mem,accum	Copy to memory
MOV accum,mem	Copy to register
CMP accum,immed	Compare (perform subtraction, but only set flags)
TEST accum,immed	Test (perform logical AND, but only set flags)
XCHG reg,AX	Exchange values

That last one is a doozy; XCHG reg,AX is 1 byte and 3 cycles. This was part of Intel’s plan to pair both LOCK and XCHG together as a way to implement atomic semaphores, so they optimized it in the silicon. (See page 2-18 of the 8086 Family Users Manual for details.)
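To see the savings, compare the encodings of a few general forms against their accumulator forms. The byte values in the comments are the standard 8086 encodings; the address 1000h and the constants are arbitrary examples.

```asm
        ADD  BX,1234h       ; 81 C3 34 12   4 bytes
        ADD  AX,1234h       ; 05 34 12      3 bytes

        CMP  BL,0Dh         ; 80 FB 0D      3 bytes
        CMP  AL,0Dh         ; 3C 0D         2 bytes

        MOV  BX,DS:[1000h]  ; 8B 1E 00 10   4 bytes
        MOV  AX,DS:[1000h]  ; A1 00 10      3 bytes

        MOV  AX,BX          ; 2 bytes, copies BX into AX
        XCHG AX,BX          ; 93            1 byte, swaps AX and BX
```

If a value sits in another register and needs several immediate operations in a row, one option is to XCHG it into AX, do the work with the short forms, and XCHG it back.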

It’s Called A Coprocessor For A Reason

If your project is going to work with floating point and needs both speed and accuracy, read this section. If you know you’ll never need that, skip to the next section.

Everyone knows that the 8087 math coprocessor is much faster than the 8088 if you need to perform IEEE floating point math operations (including square roots, tangents, arctangents, etc.). What people seem to forget is the word “coprocessor” in the name. It’s a true coprocessor, which means it can be crunching away on an operation in the background while the 8088 is off doing something else.

This is HUGE. If your program needs the accuracy of IEEE floating point and has to do a lot of difficult slow stuff with it, you can essentially get background computing of floating point for free. The sequence of operations for 8088 code is essentially this:

Load 8087’s stack with values
Give it a command
Go off and do whatever you want
When you’re ready for the result, issue a WAIT
Pop your result(s) off the 8087 stack

This almost feels like cheating.
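The steps above can be sketched as follows. This is a fragment under assumptions, not a tested routine: value and result are hypothetical 32-bit variables in the data segment, and a square root is used only because it is one of the slower 8087 operations.

```asm
        FLD   DWORD PTR value    ; 1. load the 8087 stack
        FSQRT                    ; 2. give it a command; the 8087 starts working
        ; 3. ... unrelated 8088 integer work goes here ...
        FSTP  DWORD PTR result   ; 4/5. store the result to memory and pop it
        FWAIT                    ; wait until the 8087 has finished the store
        MOV   AX,WORD PTR result ; now the 8088 can safely read the value
```

Most assemblers targeting the 8087 emit a WAIT in front of every coprocessor instruction, so the synchronization in step 4 usually happens implicitly before the FSTP. The explicit FWAIT after the store still matters, because the 8088 would otherwise be free to read result before the 8087 has written it.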

Stringing Up The CPU

The string instructions were another byproduct of Intel’s CISC rage. These are ludicrously powerful in the right circumstances, and you should use them whenever possible. Intel called them “string operations” because they were designed to assist in text string manipulation. Each of these opcodes is golden for three reasons:

They are 1 byte long
They perform multiple operations faster than the individual steps would take
They can be automatically REPeated without using any jump/loop instructions

Here are the five string instructions:

LODSB – Load byte from DS:SI into AL, then advance SI
STOSB – Store byte in AL to ES:DI, then advance DI
MOVSB – Copy byte from DS:SI to ES:DI, then advance both SI and DI
SCASB – Load byte from ES:DI and compare it to AL (sets flags equal to a subtraction), then advance DI
CMPSB – Compare byte from DS:SI to byte at ES:DI (sets flags equal to a subtraction), then advance both SI and DI

But wait, that’s not all! You also get 16-bit versions of the same instructions!

LODSW – Load word from DS:SI into AX, then advance SI +2
STOSW – Store word in AX to ES:DI, then advance DI +2
MOVSW – Copy word from DS:SI to ES:DI, then advance both SI and DI +2
SCASW – Load word from ES:DI and compare it to AX (sets flags equal to a subtraction), then advance DI +2
CMPSW – Compare word from DS:SI to word at ES:DI (sets flags equal to a subtraction), then advance both SI and DI +2

These can be called individually, but really shine when they are used with a REP prefix, which repeats them CX times (meaning, REP does nothing if CX is already 0; otherwise it runs the string instruction, decrements CX, and stops once CX reaches 0). SCAS and CMPS are used with additional repeat prefixes: REPE/REPZ (repeat while equal/zero) and REPNE/REPNZ (repeat while not equal/not zero), so that the loop ends (or continues) based on the result of the comparison.
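Two short sketches of the repeat prefixes in use. Both are fragments that assume ES:DI already points at the buffer; the fill value, the counts, and the label names are examples of my own choosing.

```asm
; Fill 2000 words at ES:DI with the value in AX
        CLD                 ; string instructions count upward
        MOV  AX,0720h       ; example value: a space with attribute 07h
        MOV  CX,2000        ; number of words to store
        REP  STOSW

; Search up to 256 bytes at ES:DI for the first zero byte
        CLD
        XOR  AL,AL          ; AL = 0, the value to look for
        MOV  CX,256         ; maximum number of bytes to scan
        REPNE SCASB         ; stops after a match or when CX reaches 0
        JNE  not_found      ; ZF = 0 means no byte matched
        DEC  DI             ; DI was advanced past the match; step back to it
        ; ... ES:DI now points at the zero byte ...
        JMP  SHORT scan_done
not_found:
        ; ... handle the "no zero byte" case ...
scan_done:
```

The CLD is there because the direction of "advance" depends on the direction flag; with the flag set, SI and DI count downward instead.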

There’s a lot you can make fun of the 8088 for, but nobody makes fun of the string instructions. If you need to copy memory around, scan a buffer for a value, fill a buffer with a certain value, or compare two buffers for equality, they are an order of magnitude faster than doing things the long way. I mean, seriously, a single CMPSW done the long way would look like this:

PUSH AX        ; CMPSW doesn't change any registers, so we can't either
PUSH BX
MOV AX,DS:[SI] ; Load DS:SI somewhere
ADD SI,2       ; Advance SI
MOV BX,ES:[DI] ; Load ES:DI somewhere
ADD DI,2       ; Advance DI
CMP AX,BX      ; Do the comparison (sets flags equal to subtraction)
POP BX         ; CMPSW doesn't change any registers, so we can't either
POP AX

Maybe now you’ll understand why I love the string opcodes so much!
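For contrast, here is a sketch of comparing two whole buffers with the string opcode itself. It assumes DS:SI and ES:DI point at the buffers and CX holds the word count; the label names are hypothetical.

```asm
        CLD                 ; compare upward through both buffers
        CMP  AX,AX          ; force ZF = 1 so a zero-length compare reads as "equal"
        REPE CMPSW          ; repeat while the words match and CX is not 0
        JNE  buffers_differ ; ZF = 0: a mismatch ended the loop early
        ; ... the buffers are equal ...
        JMP  SHORT compare_done
buffers_differ:
        ; ... SI and DI point just past the first mismatching words ...
compare_done:
```

The CMP AX,AX is a guard for the CX = 0 case, where REPE CMPSW executes nothing and leaves the flags as they were.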

This is the end of Part 1 of our crash course. In Part 2, I’ll continue with various tips and examples. In Part 3, I’ll present a case study that shows what kind of benefit you can realize from taking the time to optimize for speed.

This entry was posted on January 10, 2013 at 3:00 pm and is filed under Programming, Vintage Computing.

28 Responses to “Optimizing for the 8088 and 8086 CPU: Part 1”

Covoxer said

January 11, 2013 at 2:09 am
> By understanding both, it is possible to write assembly code that can run as fast as the best 6502 or Z80 code at similar clock speeds.

This statement is FALSE. With optimizations you can write code that is much faster than the best competing 8-bit CPUs of that time. Writing lame code would result in about the same performance as optimized code for 8-bit CPUs. I have already commented (with examples) on this confusion of yours some time ago here: https://trixter.oldskool.org/2011/06/04/at-a-disadvantage/

Also, can you name any other personal computers of that era running at “similar clock speed”?

> While other CPUs of the 1970s enjoy single-cycle access to a byte of memory, the 8088 takes 4 cycles to access a byte.

This statement is FALSE again! How many cycles (I suppose you actually mean “clocks”?) did it take to access memory in 8080 or Z80?

6502 had 1 clock access, but that limited most systems to 1MHz, resulting in slower memory access than 8088 (yes, the great C=64 had slower memory than the lame 8088 IBM PC !).
It is also important to mention that many of the 8-bit personal computers of the era shared RAM between CPU and Video, often resulting in tremendous drop in actual memory speed.

Your bashing of the 8088 is really unfair. In reality, IBM PC was at least as fast as any 8-bit personal computer of that time, even if you didn’t optimize the code. All these optimizations (most of them were not even considered to be optimizations, it was straightforward assembly programming, like using 8 bit memory addressing in 6502, which you’d hardly call an optimization) resulted in much (in some cases – many times) faster code than any 8 bit PC could possibly achieve.

- Roy Jacobs said
  
  January 11, 2013 at 3:33 am
  How can you claim, in all seriousness, someone who is clearly in love with the platform to be ‘bashing’ it?
  
  - Covoxer said
    
    January 11, 2013 at 5:04 am
    He, he. :-) Well, if this is the case, the reason may be to exaggerate something (the effect of the described optimizations for example – if you don’t use it, your code would be crawling slower than 6502 [lie]) or the overall success of the 8088 programmers (the slower PC was, the more impressive all the software running on it would look).
    I don’t know. But what I do know, is a real world performance of the 8088 in IBM PC and how it compares to the rest of the PC’s of the era. And it’s not hard to prove.
    Anyway, for whatever reason, Trixter is unfairly bashing 8088 here as he did here: https://trixter.oldskool.org/2011/06/04/at-a-disadvantage/
    Here’s quote: “The original IBM PC, despite appearances and bias on the part of both consumers and marketing, was actually the slowest popular personal computer on the market at the time of its release, even compared to the Apple II and Atari 400.”
    If that’s not bashing, then what is it?
    
    - Trixter said
      
      January 11, 2013 at 11:53 am
      It’s meant to generate conversation and prove me wrong.
      
      Roy correctly read between the lines. I love the 8088. (Actually, I love the 8086 — the 8088 is hobbled.)
      
      - Covoxer said
        
        January 11, 2013 at 12:55 pm
        So we have a conversation. :-)
        
        Yes, of course 8086 was much better. Well, at least 8088 had not suffered this bus truncation as badly as 68008 did.
        
- Trixter said
  
  January 11, 2013 at 9:27 am
  I’ve updated the article to say “*faster* than the best 6502 or z80 code” since you feel so strongly about it. However, with the 6502 enjoying single-clock access to memory, I just can’t agree in spirit. 4-clock access to a byte on 8088 really hobbles the entire machine. Many common 8088 opcode forms are 2-4 bytes long; aren’t all 6502 opcodes 1 byte in size?
  
  Most people trying to program on 8088 for fun today are using compilers, so the code is fairly slow. Even compiled C code is slow. And if you decide to convert to naive assembler, it is STILL no guarantee it will run acceptably. That was the motivation for my guide, to show people that if you’re going to produce something fantastic on an 8088 PC, you’re really going to have to put the effort in.
  
  With the exception of Microsoft Flight Simulator, I would love to be proven wrong and shown some examples of games that ran significantly worse on C64/Z80 and/or significantly better on 8088. And remember, I’m talking about a 4.77Mhz 8088 with CGA, nothing faster. I really can’t think of any, with maybe the exception of Elite and maybe Stunt Track Racer.
  
Covoxer said

January 11, 2013 at 2:21 am
> The only drawback to this arrangement is that the BIU only has a 4-byte buffer (the prefetch queue).

This is a drawback comparing to what? All the other competing CPU’s had 1 byte for these purposes. How can 4 bytes be a disadvantage to 1 byte?

> The 8088 has four general-purpose registers, but all four of them are tied to specialized functions that you can’t use if you’re using them for

It is a disadvantage comparing to what? All the competing CPU’s had accumulator ISA. How can accumulator based instruction set be better than what we have in 8088?
Besides, comparing to 8 bit CPU’s, you should mention that 8088 had EIGHT general purpose 8 bit registers: al, ah, bl, bh, cl, ch, dl, dh.

> For example, it’s possible to do a tight loop of some operations, but that uses CX as the counter, so you can’t use CX inside the loop.

Do you imply that 6502 and Z80 were better in this respect?

- Trixter said
  
  January 11, 2013 at 11:45 am
  You are taking my comparison to the 6502/Z80 at the beginning of the guide and applying it to the rest of the guide. This was never implied by me. As soon as the guide starts proper, I am no longer referring to other 8-bit platforms.
  
  Yes, 8088 had 8 8-bit registers. Didn’t 6502 have 256 registers using zero-page? If not the same thing, disregard.
  
  Z80 has similar registers to the 8088 and it also has an alternate set that you can swap in and out. Z80 was not limited memory-wise that the PC was, and ran at 3.5Mhz in common implementations (Spectrum). I think it’s a valid argument to say the Z80 was faster than 8088, but if I am wrong about the Z80, I don’t mind being corrected.
  
  I think you’re misinterpreting the spirit of the guide. The intended spirit of the guide is to list all of the common pitfalls and how to get around them, and to motivate people to WANT to get around them in an energetic writing style. I’m a demoscener, not business systems developer.
  
  - Covoxer said
    
    January 11, 2013 at 1:14 pm
    Ah, you meant disadvantages to 8086. Sorry, got you wrong.
    
    No, 6502’s 256 registers are not the same since they require extra memory access and you can’t store op result in any of them, only in accumulator.
    
    Yes, Z80 at 3.5MHz is much closer to 8088 at 4.77MHz. But I would still vote for 8088 in general case. Z80 was still an accumulator based ISA. It had next to no 16 bit support. It had fewer instructions (no multiplication for example). The second register set was useful for quick context switching but was next to useless in means of doubling number of general purpose registers since you couldn’t easily use both banks simultaneously, only through memory. Also 8088 had more advanced addressing modes. And it had no 4-byte queue. In our “in disadvantage” discussion, I have posted three more or less useful code samples. In two of them, Z80 was as fast as 8088 (3.5 vs 4.77 MHz), in the third one, Z80 was about 15 times slower due to microcoded multiplication in 8088 (all that comparing 3.5MHz Z80 with 4.77MHz 8088).
    Also, Dhrystone was about 4 times slower on 4MHz Z80 than on 4.77MHz 8088. Basically, suggesting that one could bother less about code optimization on 8088 than Z80. ;-)
    
    - Trixter said
      
      January 11, 2013 at 1:38 pm
      In the third part (which I’m still writing; part 2 is done and will be posted soon), I plan to cover how I was able to start with an assembler routine and, through iterative redesign and changes, was able to speed it up roughly 25%. I think that 25% is significant enough that it is worth optimizing 8088 code. Unfortunately, as most software written suggests, most people thought 8088 was “good enough” and didn’t spend too much time optimizing.
      
      I’ve seen C64 demos that display a rotating environment-mapped torus. Yes, it’s roughly 2fps, and I know its effective resolution is probably 80×50 or 40×25, but it’s still realtime. It makes me weep for my platform since I can’t see how that is possible on 8088.
      
      - Covoxer said
        
        January 12, 2013 at 1:47 am
        Can you give me a link to that C64 demo you are talking about?
        I’m pretty sure that if it doesn’t use any VIC II features it can be done on PC (except for CGA colors limitation that is).
        
        
        Trixter said
        
        January 12, 2013 at 12:17 pm
        I’m afraid I can’t find the name of the demo, although Mathematica by Reflex has a gouraud-shaded (no textures) torus at a decent framerate around the 7:15 mark.
        
        
        Covoxer said
        
        January 13, 2013 at 2:36 am
        The environment mapped sphere was a famous Amiga demo. Not sure about environment mapped torus on C64.
        
        The torus in Mathematica is not a rendering of the 3D object (obviously). It is simply a plot of the torus equation (where Z is used for the pixel color). With the use of lookup tables it can be very quick (like rotating cubes or fractals). No doubts this can be implemented on IBM PC (running in text mode – you’d have about the same resolution ;-) ).
        
        Trixter said
        
        January 13, 2013 at 11:13 am
        Mathematica was a bad example, sorry. For a real one, look at numen by taQuart. Some of the stuff in there is amazing for 1.77 MHz. They were very kind to release the sources, so I’ll have a look (now I have to learn 6502!)
        
        How can you use lookup tables to plot fractals? I’ve seen some neat julias on 8bit (both orbitals and the normal plot) but I can’t see how to use lookup tables to speed them up. I can draw one frame of mandelbrot 40×25 in about a second using 32-bit integer math…
        
        
        Covoxer said
        
        January 14, 2013 at 12:45 am
        He, he, let us know about your findings! :-)
        
        I was not clear about fractals. I was talking about that rotating fractal textured cube in Mathematica. Not sure about using lookup tables for mandelbrot rendering, but isn’t 32 bit integer an overkill for 40×25?
        
        Trixter said
        
        January 16, 2013 at 7:14 pm
        A discussion of what 8-bit demos have (do not have) environment mapping, and how it might be achieved, is now here: http://www.pouet.net/topic.php?which=9205
Optimus said

January 11, 2013 at 8:30 am
Wow! I will keep reading these. There are enough opcodes and information I didn’t even know before. I wonder how these extrapolate to 386 programming. I say this because this is the oldest PC I have in my room right now and would like to learn how cycle counting works there, if it’s necessary, where I can gain from. But then again, I can read some chapters from Abrash’s Black Book for this and keep reading your guide :)

- Trixter said
  
  January 11, 2013 at 12:19 pm
  Some of the advice in the guide applies to all optimization (such as reading/writing memory as little as possible), but be careful — much of the guide is appropriate for the 8086 and 8088 ONLY. For example, you definitely do NOT want to use XLAT, LOOP, or any of the BCD stuff on Pentium and higher as they are much slower. Even the very next CPU up, the 80186, has features like shifting and rotating using immediate values, and on the 80286 and higher the MUL/DIV are much much faster and you should use them for almost everything possible.
  
  Once you hit 386, it’s a whole new game. 32-bit-wide registers, additional segment registers, more addressing modes (which means LEA gains tremendous power), and if you work in protected mode you have true 32-bit pointers (and more speed, since you don’t need 66h/67h modifiers in front of everything). There’s a lot you have to learn/re-learn. If you plan on targeting a 386 or higher, this guide is NOT what you should be reading (except as a historical curiosity, or maybe to get some ideas that *do* apply to your target platform).
  
  - Purp (@PurpAv) said
    
    April 23, 2013 at 2:45 am
    The 386SX is so severely memory constrained that many of the 8088 style size optimizations that would result in a loss on 386DX become a win on the 386SX. Curiously, in the modern x64 CPU bloatware era, aggressive space optimizations may actually run faster than their book numbers would indicate. If you can keep code in the small L1’s it’s going to be like 8-10X faster than a fully unrolled size porked “speed optimized” version that blows out the L1 and winds up flogging the L2.
    
    - Trixter said
      
      April 23, 2013 at 8:50 am
      This supports my favorite Terje quote: “All programming can be viewed as an exercise in caching.”
      
Optimus said

January 11, 2013 at 8:36 am
Then again, do you know any PC emulator that tries to be more precise in cycles and can target a 8088/8086?
I read all these discussions whether Z80/6502 or 8088 is better and as a Z80/6502 coder it makes me curious to try coding something on the 8088 and compare, but in an emulator that I can test performance fairly correctly. Dosbox wouldn’t make it for this.

- Trixter said
  
  January 11, 2013 at 11:46 am
  So far, PCem is the only emulator I’ve used where the author made a conscious effort to be cycle-exact. The end result is not quite cycle-exact, but there is a lot of attention to detail (he attempts to emulate CGA snow, for example) and it is as good as you will get without being in front of a real machine. Link to PCem is in my previous post.
  
Mikkel Christiansen said

March 11, 2013 at 5:44 pm
PUSH AX ; CMPSW doesn’t change any registers, so we can’t either
PUSH BX
MOV AX,DS:[SI] ; Load DS:SI somewhere
ADD SI,2 ; Advance SI
MOV BX,ES:[DI] ; Load ES:DI somewhere
ADD DI,2 ; Advance DI
CMP AX,BX ; Do the comparison (sets flags equal to subtraction)
POP BX ; CMPSW doesn’t change any registers, so we can’t either
POP AX

You don’t seem to take your own advice.

PUSH AX ; CMPSW doesn’t change any registers, so we can’t either
MOV AX,DS:[SI] ; Load DS:SI somewhere
ADD SI,2 ; Advance SI
CMP AX,ES:[DI] ; Do the comparison (sets flags equal to subtraction)
LAHF ; Save flags
ADD DI,2 ; Advance DI
SAHF ; Restore flags
POP AX ; CMPSW doesn’t change any registers, so we can’t either

- Trixter said
  
  March 14, 2013 at 8:47 am
  Nobody’s perfect. Thanks for the correct CMPSW expansion.
  
Oldskooler Ramblings

the unlikely child born of the home computer wars
