Optimizing for the 8088 and 8086 CPU, Part 3: A Case Study In Speed
Posted by Trixter on January 18, 2013
In this final part of 8088 optimization posts, I present a case study for 8088 optimization. Specifically, I’ll cover a problem that I needed to solve, how I solved it, how long it took to optimize my solution for 8088, and what the performance benefit was from that optimization.
For the TL;DR crowd who will likely skip to the end of the article: Through clever 8088 optimization and a thorough understanding of our problem, a decompression routine was created that not only beats all known methods previously created for 8088, but can actually exceed the speed of a memcopy given the right input. But to see how this was achieved, you’ll have to grab a snack and settle down for 10 minutes.
Now, on with our case study.
My interest in speed optimization stems from being a lifelong fan of the demoscene. I’ve written some productions and unleashed a few cheap tricks here and there, but my core strength has actually been “glue” — understanding the underlying hardware and marrying the code to it. For example, while I coded a few effects in Explicit, my core role was actually assembling the demo itself — the timer/interrupt system, the music library, the 3D library, asset management, decompression, debugging, event handling, etc. all fell to me. And that’s fine, I enjoy doing it. (Next time you run Explicit, hold down the ALT key and you’ll see some realtime debug stats overlaid onscreen while the demo is running.) So this is why I tend to write the underlying framework of a demo before actually writing the individual effects, because if the framework is solid, the rest is “easy”. To me, anyway.
For many years, I’ve been threatening to write a proper demo for the 8088. 8088+CGA has been done a few times back in the day by Sourcerers, and there was a nice slideshow+music demo done for the Tandy 1000 as part of the 8086 compo, but there really hasn’t been a proper demo for the 8088 — 3D, graphics hardware tricks, nice design, the works. Every year I get closer, writing another piece of something that will eventually turn into a demo. A month ago, that piece turned out to be a need for fast decompression. (What I am going to use fast decompression for is left as an exercise for the democoder!) And by fast, I mean fast enough to be considered transparent: The user/programmer should not be aware that the decompression step took any time at all. Democoder parlance for this is “realtime” decompression.
So, a proper definition of the problem we’re trying to solve would be:
- We need to store data on disk in a form that takes up less space than its normal working state (ie. compression must actually occur — you can’t shrink 16 bytes out of a 64KB file and then call it a day)
- To be useful for most situations, we must be able to decompress the data with a minimum of work for the programmer (ie. one simple call with no special preparation or preexisting state required)
- Decompression must be fast enough to be considered transparent to the programmer (ie. take not much more time than a straight copy of the uncompressed data would)
For that last stipulation to be quantifiable, we need to decide what “fast enough” is. For our purposes, we will define “fast enough” to be roughly 4x slower than a straight memory copy (which the 8088/8086 implements as a REP MOVSW). That kind of speed is respectable on any platform, so this seems like a fair starting point to gauge our progress.
The (de)Compression Method
So, what compression format will we write a decompression routine for? What formats lend themselves to fast decompression? What formats are even possible to decompress on our lowly 8088 with RAM that is measured in kilobytes instead of gigabytes? The answer, it turns out, is obvious — but only if you know what to look for. Like compression itself, we have to take a long time researching our problem before coming to a short conclusion.
A Quick Primer
Before picking a compression format to use, I will offer this extremely short primer to readers unfamiliar with compression basics, so that the rest of the article will make sense. If you’re already familiar with what compression is, skip this part.
For our purposes, compression is the process of removing the redundancy out of a file so that it consists of only the unique parts and not the repeated or redundant parts. Consider this sentence:
She sells seashells by the seashore. (36 characters)
You can see that both “sells” and “seashells” share a sequence of letters (“ells “), as does “seashells” and “seashore” (“seash”). If we notice this and assign a custom code to both pieces (ie. AA=”ells ” and BB=”seash”), and then replace those portions of the sentence with the codes, we can shorten the space the sentence takes up:
She sAABBAAby the BBore. (24 characters)
Was actual compression achieved? Yes, because while the codes and what they represent take up space, we got to use them twice each. A code takes up two characters, but it represents five characters. So the sentence+codes take up 24 characters in compressed form, and the code definitions take up 10 more characters, so our total space usage is 34 characters compared to the original 36.
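The bookkeeping above is easy to check mechanically. Here is the same substitution done in a couple of lines of Python (the language is used purely for illustration; the article’s actual code is all assembly):

```python
sentence = "She sells seashells by the seashore."

# Apply the two codes from the text: AA stands for "ells " and BB for "seash"
coded = sentence.replace("ells ", "AA").replace("seash", "BB")

print(coded)                    # She sAABBAAby the BBore.
print(len(sentence), len(coded))  # 36 24
```

Adding the 10 characters of code definitions to the 24-character coded sentence gives the 34-versus-36 total described above.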
Saving two characters is nothing to write home about, but that was a single sentence. When you extend the idea to an entire novel, you begin to see how the coded sequences can get used more and more often until you have real space savings on your hands. We only used our codes twice, but across an entire novel, and combined with other codes, they’ll get used thousands of times, saving bytes here and there each time.
A Quick Primer (For Programmers)
There are many ways to compress data but the fastest methods are almost universally based on LZ77. LZ77 produces a compressed format that boils down to two types of sequences:
- Data that must be inserted directly into the target buffer (because we couldn’t find smaller ways to represent it), called LITERALS
- Data we already have somewhere that can be copied into the target buffer, called MATCHES
In our quest for high-speed decompression, we will only be considering LZ77-style compression methods. Not only are they fast to decode, but they’re also very easy to understand.
Again, the above is oversimplified for our purposes. For an excellent write-up on compression methods for low-resource platforms such as 8-bit CPUs, consult Pasi Ojala’s pucrunch page.
What To Compare Against
If you asked most DOS users in the early 1990s what the fastest compression/decompression software was, most of them would likely answer PKZIP. PKZIP was popular precisely because of its speed and reasonably good compression, and was so popular that it’s a format much of the world still uses more than 20 years later. PKWare released a Data Compression Library (DCL) of assembler-optimized routines for compression and decompression in 1991 for commercial use, which perform at roughly the same speed and ratio as PKZIP.
If this library exists, why are we bothering to write our own decompression routine? Why not just use the DCL for our needs? We’re writing our own because we want to explore what the state of the art can be for the 8088, and also because we aren’t sure what the state of the art actually is. Since the DCL was considered by many to be the best overall solution in terms of compression ratio and speed, that is the benchmark we will compare ourselves to. (It also helps that the DCL’s execution speed can be exactly timed just like we will be timing our code — more on timing later in this article.) In other words, if we can beat the DCL, we must be doing well.
The Current State Of The Art
There are many high-speed compression/decompression open-source libraries in use today. The oldest one still in wide use is UCL, written in the mid 1990s; there are newer ones written in the last five years that routinely compete for the top speed spot, such as QuickLZ, Snappy, and LZ4. My original idea was to examine all four of these to see what made the most sense to implement on the 8088. UCL was appealing because it had assembler code already written for the 8088/8086 as part of the library, but it was full of conditional branches and bit shifting/isolation, something the 8088 just isn’t very good at; there is also no published UCL data format (as far as I could find) so if I ran into issues, I’d likely be out of luck. QuickLZ is C code that assumes the underlying architecture is 32-bit or better, so while I could have written a decompression routine for QuickLZ data, it would have involved a lot of 32-bit memory values, adding to the complexity of the implementation. So both of these were out pretty quickly.
Snappy is Google’s library, and it looked very promising, mostly because it was implemented the way I would have written my own compression format: Simple code expansion, variable lengths, codes that are packed into data aligned by bytes. Source and format are fully documented, easy to understand, and in use by Google. No real downsides, right?
Snappy was going to be my first choice, but then I researched LZ4. As soon as I saw LZ4, I knew immediately it was the correct choice for the 8088 and got very excited. Not only was it the right choice for ease of implementation on 8088, but a unique design decision for the format lent itself to speed that I hadn’t seen before. Here’s what got me so excited:
Interleaved literal-match structure: This is the innovative discovery of LZ4. Yann Collet (the author of LZ4) realized that all compressed LZ77 data consisted of both literal runs and match runs, so rather than try to devise complicated codes that branch based on run type, he organized the format so that there are always literals, then matches, then literals, then matches, etc. repeating until the end of the compressed data. This eliminates a lot of literal/matchcode-based branching. There are really only three branches ever required for decompressing LZ4: Two length calculations (and only if they’re needed), and a check to see if we’re out of codes to expand. Those three are the only major conditional branches! This is REALLY HUGE and is why LZ4 is at the top of the speed charts for nearly every platform.
64KB window size, 16-bit offsets: A 64K window size means that decompressing LZ4 doesn’t need any special handling — if you only have 64K to decompress (likely on our platform), you don’t have to worry about switching segments after the first 64K. And a 16-bit offset is a single word read, no unpacking. It’s like LZ4 was specifically created for Intel memory models, but it wasn’t — it was actually created in the very early 1990s for Yann’s HP48 calculator, which used a CPU based on nybbles. Which explains the code packing in the next advantage:
Simple codes: The literal lengths and match lengths are, if 14 or lower, stored as two nybbles in the same byte. That means both can be read with a single byte read and then cached as necessary. Even if the lengths are larger than 14, it only takes a few byte reads to get the full length. No wacky code unpacking like LZS or higher LZO levels.
No setup: All it takes to decompress one buffer into another is both buffers — no hash tables to initialize, no dictionary to load or build, no need to know the length of the decompressed size beforehand, etc.
Optimal Parsing: This is the secret sauce. It has nothing to do with how we will write our decompression routine, but it does have to do with decent compression ratios. And also speed, because LZ77-style compression schemes decompress in mostly linear time. Meaning, the smaller a file is, the faster it will decompress (ie. there is less actual work to do). LZ4_HC is the LZ4 variant that will take as much time as it needs to find the absolute perfect representation of the source data given the supplied encoding, and it is included in the LZ4 library and example programs.
All of the above makes LZ4 a winner for our architecture. But is it a winner for our workflow? If LZ4 was proprietary and difficult to use, that could be a problem. With stable vetted code, a BSD license, and Windows and Linux binaries for the compressor, it is a good fit for most cross-development workflows. (Most modern hobbyist retroprogramming is cross-development, with only speed-sensitive code requiring the actual hardware for testing.)
A Naive Implementation
Starting with only the format description and format diagram, I was able to code up a naive but functional decompressor in about half an hour in assembly. Normally I would have written it up in a HLL first to make sure my understanding was sound, but the format was so simple I thought that assembly would have been just as fast to code, and it was. I won’t reproduce 200 lines of assembler code here (a link to the finished code will come later), but it might help to know how I organized the routine:
DS:SI – Location of compressed data (the source)
ES:DI – Location of decompressed data (the target)
AX – scratch
BX – translation table
CX – loops, literal/match run lengths
DX – “cache” for literal and match length token
BP – size of compressed data (so we know when to stop!)
Here is what compressed LZ4 data looks like:
Yes, it really is this simple, which means you should be able to code up a naive implementation just from looking at the above diagram and reading the format explanation.
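In case the diagram doesn’t come through, the same structure can be written out as a short decoder sketch in Python. This follows the published LZ4 block format — a token byte holding two nybble lengths, a run of literals, then a 16-bit little-endian offset and match — and mirrors the logic of the routine, not its register assignments:

```python
def lz4_decompress(src):
    """Decode one raw LZ4 block (no frame or header), returning the bytes."""
    dst = bytearray()
    i = 0
    while i < len(src):
        token = src[i]; i += 1
        # Upper nybble: literal run length (15 means "more length bytes follow")
        lit_len = token >> 4
        if lit_len == 15:
            while True:
                b = src[i]; i += 1
                lit_len += b
                if b != 255:
                    break
        dst += src[i:i + lit_len]          # copy the literal run
        i += lit_len
        if i >= len(src):                  # the final token carries only literals
            break
        # 16-bit little-endian offset back into data we already decompressed
        offset = src[i] | (src[i + 1] << 8)
        i += 2
        # Lower nybble: match length, with an implied minimum of 4
        match_len = (token & 0x0F) + 4
        if match_len == 19:                # nybble was 15: extended length follows
            while True:
                b = src[i]; i += 1
                match_len += b
                if b != 255:
                    break
        # Byte-by-byte copy so overlapping matches (e.g. offset 1 = a run) work
        for _ in range(match_len):
            dst.append(dst[-offset])
    return bytes(dst)
```

For example, the six compressed bytes 35 ‘a’ ‘b’ ‘c’ 03 00 (token, three literals, offset 3) expand to the twelve bytes “abcabcabcabc”.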
The #1 suggestion I can give anyone trying to optimize code has nothing to do with either assembler or the 8088: Benchmark, benchmark, benchmark. Know exactly how fast your code runs on the hardware you are targeting. If you don’t, you don’t have a basis for comparison and are flying blind when you optimize.
Following my own advice, I wrote a test harness for the decompression code and linked it with a modified version of Michael Abrash’s Zen Timer, which gives nearly perfect microsecond timing of sections of code using the 8253 PIT built into (or emulated in) every PC. I then created a variety of “test subjects” to run through the decompressor, and made sure the test harness compared the output of the routine with the original source material to verify it wasn’t getting mangled on output. I chose three typical files to do most of my timing with — “typical” in that they were the most likely things I would be decompressing: Compiled code, plain-English text, and a hand-drawn CGA picture. Each had their strengths and weaknesses and provided different stresses to the routine: The picture had a lot of large matches, the text had a lot of small matches, and the binary didn’t have much of either. The picture also had a lot of “runs” where a single value was repeated. Measuring with the Zen Timer, all three datatypes took a total of 530434 microseconds (usecs), or 0.53 seconds, to decompress.
After checking the output was correct for my test data, I then proceeded to write a harness for PKWare’s DCL that did the same thing: Compressed all three test subjects into its own format, then decompressed all three to memory and recorded the decompression time. While I knew the DCL traded some size for speed, it was still the fastest general-purpose algorithm in use at the time so I was expecting some real competition. To my surprise, the DCL took 7.75 seconds to decompress the test data — over 14 times slower!
Thoroughly satisfied, I almost stopped right there, as my goal of beating the DCL was already accomplished. But I just couldn’t help wondering what a proper optimization would do, and how both would perform on more varied types of data (was my sample size too small?), so I decided to go through with the optimization exercise anyway.
Chapter 27, In Which It Blows Up In My Face
I’m not going to cover every optimization in meticulous detail, because that would likely lose what few readers have made it to this point. However, one optimization attempt deserves special attention for the fact that it uncovered a bug in all Intel x86 CPUs made through the 80286, and illustrates the importance of having a proper test harness for your code.
Looking closer at how I chose to copy matched data around in the ES:DI destination buffer, it seemed wasteful to preserve DS, point DS to the match, then restore DS when I was only copying a handful of bytes in some cases. Most memory operations on 8088 can take a segment override, even the string operations in some cases, so with a single prefix you can turn this:
MOVSB ; Copy a byte from DS:SI to ES:DI

into this:

ES: MOVSB ; Copy a byte from ES:SI to ES:DI
So, I got rid of the DS setup/restore code, changed REP MOVSB to REP ES: MOVSB, and patted myself on the back. Imagine my horror when a test run showed the output buffer had entire chunks that were either incorrect or in the wrong place. Worse, on a subsequent run, it was still wrong but in different places! I spent hours single-stepping through the code in a debugger, looking for things like uninitialized registers or incorrect compressed data. Finding nothing wrong, I then spent another hour testing my hard disk and RAM to see if my 30-yr-old IBM 5160 was finally starting to die.
Finding nothing, and noting that the code worked if I reverted my changes, I looked at REP ES: MOVSB and wondered if I was encoding it improperly. So I switched the order of the prefixes to ES: REP MOVSB. This resulted in the jaw-dropping result of mangling the entire output buffer. What the hell was going on?
With some research, and clarification from Robert Wessel on comp.lang.asm.x86, I confirmed that all CPUs from the 8086 through the 80286 have a bug where returning from an interrupt only resumes the most recent prefix, not both. So after an interrupt, a REP ES: MOVSB continues as ES: MOVSB which copies a single byte and then stops. My second attempt made things worse, as an interrupted ES: REP MOVSB continues as REP MOVSB which, without the ES: override, copies data from the wrong location!
One way to fix this is to disable interrupts before you run that code, then re-enable when done, but I didn’t want to do that because you never know if someone will need interrupts to run while decompression is occurring. (I know I do, anyway.) I decided to work around the problem like this:
again: REP ES: MOVSB ; could get interrupted and bug out...
       JCXZ next     ; continue if REP completed
       LOOP again    ; keep trying if REP never completed
next:
With a fix in place, and valid output confirmed, I timed the code. After all that heartache, I found that the overall routine slowed down by 8209 cycles, because the ES: override adds two cycles to every byte copy. The entire concept was a bust, so I reverted the code.
This just goes to show you why timing your code is important. In your head or on paper, some optimizations can seem like a win, but conditions outside your understanding or control can affect the final result.
Some optimizations, using the information provided in part 1 and part 2 of this guide, were easy to implement and usually didn’t result in a lot of code shuffling around. Here’s an abbreviated table of what was done and how many microseconds each change saved. (Remember, the three test cases took roughly 500000 usec to execute, so every 5000 usecs saved is 1% shaved off the execution time.)
| Optimization | µs saved |
|---|---|
| Replacing “ADD CX,4” with “MOVSW; MOVSW” | 2573 |
| Isolating the upper nybble using a translation table instead of “MOV CL,4; SHR AL,CL” | 4087 |
| Rearranging register use so that AL/AX is used for scratch more than DX is | 8414 |
| Changing “MOV DX,AX” to “XCHG DX,AX” | 8712 |
| Changing “CMP CL,immed” to “CMP AL,immed” | 3430 |
| Rearranging the length build to fall through for the common case, and jump only for the uncommon case | 12146 |
| Changing all instances of “MOV reg,AX” to “XCHG reg,AX” | 4028 |
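To make the translation-table entry concrete: on paper, “MOV CL,4” plus “SHR AL,CL” costs roughly 28 cycles on the 8088, while XLAT is a single lookup (about 11 cycles) through a 256-byte table pointed at by BX. The idea can be sketched in Python — the table contents below are mine, reconstructed from the description rather than copied from the routine:

```python
# A 256-byte table indexed by the token byte, one entry per possible value,
# standing in for the 8088's XLAT instruction (BX points at the table).
UPPER_NYBBLE = bytes(b >> 4 for b in range(256))

def literal_length(token):
    # One table lookup instead of "MOV CL,4; SHR AL,CL"
    return UPPER_NYBBLE[token]
```

A memory lookup trades 256 bytes of table space for the shift sequence — a classic space-for-time exchange that pays off on a CPU where shifts are slow.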
Not everything I touched turned to gold. Here is a summary of optimization attempts that failed and were subsequently rolled back:
- Eliminating a “nothing to do” check (turns out the check-and-jump saved us some time after all)
- Eliminating the REP MOVSW optimization for literal copies (I thought the special handling wasted time, but the MOVSW gains are worth the trouble)
- Handling very short literal lengths manually with individual MOV’s (as opposed to letting REP MOVS do it)
- Handling very short match lengths manually with individual ES: MOVSW statements (as opposed to letting the REP MOVS do it)
With 43390 usecs saved, I was a man possessed: I wanted to speed it up further, but had run out of 8088 tricks. The only way I was going to continue on this journey was to gain further insight into the data I was dealing with, so I wrote an LZ4 data analyzer.
Not So Obvious
One of my favorite quotes from Zen of Assembly Language: “The best optimizer is between your ears.” It alludes to dumb/blind optimization — what I was doing up until this point, working like a human peephole optimizer — being inferior to “smart” optimization, where you analyze the problem itself to come up with different ways to solve it.
I had made some assumptions about the structure and frequency of LZ4 data, so to confirm those assumptions, I wrote a simple program to walk through my compressed sample set and print out some stats: What the shortest lengths were, the longest, the averages, and anything else that might give me some ideas. Some sample output:
====== robotron.lz4 Compressed size: 21040 Decompressed size: 40704 =====
Literals: Minimum: 0 Average: 3 Maximum: 385 Total: 11336 # of events: 3139
Matches: Minimum: 4 Average: 9 Maximum: 512 Total: 28634 # of events: 3098
1-byte runs: Minimum: 4 Average: 31 Maximum: 217 Total: 593 # of events: 19
2-byte runs: Minimum: 4 Average: 6 Maximum: 14 Total: 141 # of events: 21
Looking at various outputs, I was struck by how many match offsets of 1 and 2 there were. Offsets that small, with lengths larger than the offsets, essentially translate into RLE sequences. LZ77 code usually deals with this by just copying the same bytes over and over, but x86 architecture has a string opcode that can do this more efficiently: REP STOS. Put a value into AL or AX, a count into CX, and off you go, filling memory with that value until the count runs out. What is especially notable about STOS is that it is faster than every other instruction for putting a value in memory, faster than even a raw REP MOVSW copy. So if you can use it, you can exceed memcopy speeds for short sections. Adding special handling for offsets of 1 and 2 and redirecting them to special-case code that used STOS resulted in a larger-than-expected 32281 usec savings all by itself — that’s the largest single savings out of all the attempts — and it also resulted in a special surprise which I’ll cover later.
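The special case is easy to express at a high level: a match whose offset is smaller than its length is really a repeating fill, so offsets of 1 and 2 can be redirected to fill code (REP STOSB/REP STOSW in the routine). A Python sketch of the idea — the function name and shape are mine, not the routine’s:

```python
def copy_match(dst, offset, length):
    """Expand one LZ77/LZ4 match into dst (a bytearray)."""
    if offset == 1:
        # Run of a single repeated byte: REP STOSB territory on the 8088
        dst += dst[-1:] * length
    elif offset == 2:
        # Run of a repeated byte pair: REP STOSW, plus one odd byte if needed
        dst += dst[-2:] * (length // 2) + dst[-2:-1] * (length % 2)
    else:
        # General case: byte-by-byte so overlapping copies stay correct
        for _ in range(length):
            dst.append(dst[-offset])
```

The fill branches produce exactly the same output as the byte-by-byte loop; they just let the hardware do it with its fastest store instruction.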
After I ran out of ideas looking at the analyzer output, on a hunch I looked at the LZ4 format description a second time and noticed that, by design, the last five bytes are always literals. The current structure of the code would check for this exit condition, then process matches, then jump to the next token and continue. By unrolling some code to move the “are we done?” test to the end of the loop instead of the middle, I was able to combine both the test and the loop jump into a single instruction. This saved cycles every iteration, leading to the second largest savings at 10946 usecs.
The analyzer also confirmed some of my suspicions, such as most lengths were 14 or less and would rarely need to jump to the “build a longer length” section of the code. By structuring the code such that the most common cases “fell through”, I was saving cycles each iteration.
So, when all was said and done, what was our time savings?
After all of the above work (which incidentally took about ten hours of free time spread out over a few weeks), we managed to take an already fast routine and shave 95856 usecs from it — an 18% improvement over where we started. Was 18% worth the effort? I definitely think so. It may not sound like much, but think about the savings like this: If a user would wait a full minute for decompression of a lot of data to finish, they now only have to wait 50 seconds. What was a 5-minute task now only takes 4 minutes. (And believe me, on an 8088, it wasn’t uncommon to be sitting in front of the machine waiting 5 minutes for something to finish.)
But that’s not the end of the story. Looking at more varied data than our three test subjects, we make an astonishing discovery (lower numbers are better):
| Data Type | Filename | memcpy() time in µs (REP MOVSW) | Decompression Speed (x slower than memcpy) |
|---|---|---|---|
| Small text file | text.txt | 13716 | 3.17 |
| Large text file | largetxt.txt | 156457 | 3.17 |
| Sparse compiled binary | robotron.com | 112036 | 2.75 |
| Dense compiled binary | linewars.exe | 169924 | 2.16 |
For data that has a lot of 1-byte and 2-byte runs — runs that we specifically catch and handle with REP STOSW, as noted previously — we see that decompression exceeds REP MOVSW copy time. It is actually faster to decompress such data than it is to merely copy it. That is crazy.
I’m still smug from the speed victory over the DCL, so let’s kick it while it’s down and compare its speed to ours (lower numbers are better):
| Data Type | LZ4 Decompression Speed (x slower than memcpy) | PKUNZIP “Explode” Speed (x slower than memcpy) |
|---|---|---|
| Small text file | 3.17 | 61.21 |
| Large text file | 3.17 | 54.43 |
| Sparse compiled binary | 2.75 | 53.51 |
| Dense compiled binary | 2.16 | 70.08 |
| Uncompressable data | 1.39 | 116.14 (no special handling) |
Now, all this doesn’t mean anything if the actual compression LZ4 provides is horrible. Thanks to optimal parsing, it’s not — in fact, compared to the DCL’s Implode, it’s extremely close. LZ4 and Implode differ by only a few percentage points.
Conclusion and Downloads
Hopefully I’ve illustrated why optimization can be worth the effort. Having transparent decompression can solve a lot of problems in any program that has to manage a lot of assets. And maybe I’m crazy, but I found it extremely fun to see just how much blood I could squeeze from a stone.
Can the routine be sped up further? It can, actually, but doing so would require a lot of additional work on the part of the programmer (such as storing the compressed data on the stack and using POP to retrieve data). I also toyed with the idea of translating the LZ4 compressed data into a slightly different format that was more optimal to parse, but some quick tests showed that the transformed data is slightly larger as a result, which resulted in more data to read, which took up more time than the cycles saved by a more efficient implementation. About the only thing I can think of is that it might make sense to build literal/match lengths using REP SCASW looking for a word that wasn’t FFFF, but only if you knew beforehand that your compressed data had large lengths, which is unlikely.
So, as is, I believe this to be the very fastest implementation of LZ4 decompression possible on an 8088 CPU. All this in 436 bytes of compiled code (180 for the code and 256 for a translation table), and without requiring the programmer to jump through hoops to use it. This was a blast.
All code, binaries, and examples are available under what I’m calling the DSL (Demoscene License) so leech away. I’ve also made two versions of the Zen Timer available; an older package that extends support to C and Turbo Pascal, and a newer package that drops TP support but adds 32-bit protected mode C compiler support. Both packages have a “long timer” mode that can time periods larger than 54ms, albeit with a hair of jitter due to background interrupt noise.
Now, anyone want to help teach me fast 3D? This demo won’t write itself, ya know.