Oldskooler Ramblings

the unlikely child born of the home computer wars

8088 Domination Post-Mortem, Part 1

Posted by Jim Leonard on June 19, 2014


These next few posts are a technical write-up of how I created 8088 Domination, which is a program that displays fairly decent full-motion video on a 1981 IBM PC. Before reading further, it might be worthwhile to familiarize yourself with the production itself if you haven’t seen it. Here is video of what it looks like running on a 1981 IBM PC 5150 connected to a TV via the composite output jack:

…and if you doubt the above video’s authenticity, you can download the production and run it on your own vintage hardware as proof.

I would like to apologize in advance for switching between past tense (ie. my thought process during the design phase) and present tense (when describing how to achieve certain objectives) throughout this write-up.

Background

I’ve been fascinated by desktop digital video for decades. Starting with the first Video For Windows tools in the early 1990s, using early capture hardware (like the Media Vision Pro Movie Spectrum, which captured directly into MSV1/CRAM format), I’ve toyed with desktop video and its possibilities.

Of course, I’ve also been fascinated with demos since the early 1990s, and when you love two things that much, you find ways to mash them together. Many of my demoscene productions have had video as a part or whole of them in some way. Grind (1993) was a modplayer that displayed dancing Spring Break girls in perfect time to the music; the end-part of Explicit (1996) has dancing clubbers embedded in cubes, also dancing in perfect time to the music; and of course 8088 Corruption (2004) which displayed FMV on a 1981 IBM PC and took a chunk out of my 15 minutes.

8088 Corruption was not recognized for amazing quality video, but rather its novel solution to the problem of trying to display FMV on such a limited platform. That solution was to use 40×25 text mode, since that was the only mode represented in an amount of video memory that could be completely changed every frame without distracting artifacts or overtaxing the system.

I gave a talk in 2007 that explained 8088 Corruption in detail, and in that talk I explained that displaying FMV using CGA in graphics mode would be impossible. This is because CGA graphics mode uses 8x the amount of video memory that 8088 Corruption was handling. Even a simple calculation assuming 24fps video reveals that the amount of data needing to be updated per second (24fps * 16KB = 384KB/s) is outside of the IBM PC’s capability: CGA RAM can only be changed at a rate of 240KB/s, and most hard drives of the era operate at roughly 90KB/s. It sure felt impossible, so that’s why I said it.

Then I thought about the problem for 7 years.

Adjusting Expectations

I came to several realizations over the past few years thinking about this “impossible” problem. The most basic realization is that it isn’t always necessary to change every single pixel on the screen to show convincing FMV. If only a portion of the frame is in motion, such as in traditional animation, or a “talking heads” newscast, not every pixel is changing onscreen. But even if all pixels in a frame are changing, they might only be changing just a little bit… and if they are changing that little, the viewer might not notice if we cheat and don’t change them at all. So, we need to let go of the idea that we must change the entire screen every frame.  With the right picture content, we can get away with changing only a portion of screen memory, and it will look like we’re changing the entire screen.

delta_anim

Left: Black and White=differences, Grey=unchanged. Right: What the user sees after applying only the changes

Another realization was found re-evaluating what kind of bandwidth is available on the system. Previously, I was using the “40×25 text mode 2K of RAM” definition as the limit of what can be changed onscreen. But “2K of data changed” isn’t the real limitation; rather, it’s “not exceeding the time it takes to alter 2K”. In other words, the real limitation is not exceeding a specific time quantum per frame updating screen memory, no matter how much screen memory you’re changing.

This lead to a similar thought about disk I/O. In Corruption, I treated disk bandwidth as a series of fixed-length reads at a frequency of the framerate I needed to produce. As a result, Corruption uses fixed-length video+audio packets, so that (framerate * packetsize) is guaranteed to be less than the available disk bandwidth. But there’s no reason to adhere to that; the real limitation is the disk I/O itself, and I could read any amount of disk I wanted in a given time period as long as I didn’t exceed what the hardware could provide. This lead to the realization that video+audio packets could be of a variable length.

The combination of these and other ideas led to the design of a system that could fulfill the original requirement of “FMV in graphics mode”, which means we need a system that:

  • Finds differences between frames and updates screen memory with only those differences
  • Keeps track of how much time it takes to change screen memory and doesn’t exceed it
  • Keeps track of how much disk space is used and doesn’t exceed the transfer rates of the disk subsystem
  • Most importantly, degrades gracefully if we run out of CPU time or disk bandwidth before the source image is finished displaying

Because the IBM PC is not fast enough to make these decisions realtime (nor has the bandwidth to do so), all of these constraints needed to be handled during the encoding phase, when we have nearly unlimited time. That also means that most (if not all) of these constraints are really just data management problems, which I’ll get to later. But before we write even a single line of code, we need to back up a little and see how to convert 24-bit color pictures into something that doesn’t look like total crap on a CGA card.

CGA FMV Graphics Conversion

A quick primer on CGA: CGA was the first color graphics standard for the IBM PC, and contains 16KB of memory to hold screen contents. That memory can be presented as a few different graphics modes that appear different, but are actually all the same under the hood: 80 “byte columns” wide by 200 lines. Those 80 byte-columns can be sliced up a few different ways:

  • 640×200 with 2 colors (1 bit per pixel)
  • 320×200 with 4 colors (2 bits per pixel)
  • 160×200 with 16 colors (4 bits per pixel)

Let’s see what those look like in practice. Here’s a pretty hummingbird:

hum_original

…and here’s that hummingbird converted to our three basic choices without any fancy processing:

hum_160_nodither

160×200, 16 colors

hum_320_nodither

320×200, 4 colors, palette 1

640x200, 2 colors

640×200, 2 colors

 

Already the graphics programmers in the audience are screaming DITHER THAT SHIT, and they would normally be correct. However, most dithers that people are familiar with are error-diffusing dithers, which means that error differences are propagated throughout the entire picture. This is not a good idea for a system that is going to be recording changes between frames, because if a single pixel changes in the source material, every single pixel in the dither algorithm’s row scanning order after the change is going to change as well. What we need is a so-called stable or localized dither, where the value of each output pixel does not depend on its neighbors. That way, when only a local area of the source material changes, only that area of the destination dither will change as well.

While there are some modified, alternate dithers that keep error propagation to a spatial minimum (ie. Riemersma dither, which dithers along a Hilbert curve), the only truly localized dithering algorithm I know of is ordered dithering, sometimes called Bayer dithering. Some of the best publicly-available research on color ordered dithering is by Bisqwit (“Yliluoma’s arbitrary-palette positional dithering algorithm“) so I won’t try to explain the concepts here; consult that instead.

Applying even a rudimentary ordered dither to our sources now shows this for our hummingbird converted to our three graphics modes:

160x200, 16 colors, ordered dither

160×200, 16 colors, ordered dither

320x200, 4 colors, palette 1, ordered dither

320×200, 4 colors, palette 1, ordered dither

640x200, 2 colors, ordered dither

640×200, 2 colors, ordered dither

Much nicer, and we don’t have to worry about the dither introducing massive differences between frames if only one pixel changes in the source.

While all three of those modes look viable in their own way, for Domination I had to make the decision to discard the middle one (320×200, 4 colors). This is because the 16-color and 2-color modes display properly via the composite output jack on the CGA card, and I knew I’d have to present at the compo with the native hardware hooked up to a projector.  So my choices were limited to 160x200x16 and 640x200x2.

Managing Ch-Ch-Changes

The very fastest way to change screen memory on the 8088 is to either copy a linear sequence of bytes using REP MOVSB, or set a linear sequence of bytes all to the same value using REP STOSB. With this in mind, I chose to keep track of each change as a starting and ending offset of changed bytes, because that was closest to the exact method I would be using to “replay” those changes. I adopted a terminology to reduce confusion throughout the source code: I coined the term “delta” to refer to an area of changed bytes, and further classified deltas into “slices” (a delta where not all bytes are the same) and “runs” (deltas where all bytes are the same). This was important because “runs” can fill screen memory extremely quickly using REP STOSB, so it was worth keeping track of them. Here is the actual data structure I used:

  TDelta=object(TObject)
    startOfs:word;         {starting offset of delta in buffer (first changed byte)}
    endOfs:word;           {ending offset of delta in buffer (last changed byte)}
                           {rest are calculated during insert or other operations:}
    dtype:deltaType;       {slice or run}
    fillvalue:byte;        {for informational purposes only, not used for anything}
    blength:word;          {length of delta in bytes}
    numPixelsChanged:word; {# of pixels changed in the entire delta}
    REPcycleCost:real;     {different whether slice or run}
    frozen:boolean;        {flag used to control various optimization phases}

    Constructor Init(starto,endo:word);
    Destructor Done; virtual;
  end;

All of these deltas needed a good structure to collect and manage them, so I decided to use a Turbo Pascal 7 data structure called a TSortedCollection, which is analogous to a sorted linked list. Deltas were added to this list and maintained in sorted order using a variety of criteria in the following priority:

  1. Larger/Longer Deltas placed before smaller/shorter ones
  2. Given the same length, “runs” placed before “slices”
  3. All runs sorted by their fill value
  4. Given the same length, all deltas sorted by their interlaced start offset in memory

The sort order seems like an odd thing to mention here, doesn’t it? I’ve done so because it was a contributing factor to the success of the project, as it solved many data processing problems later on. It was so helpful that I think it’s worth going over in detail:

  1. Larger/Longer Deltas placed before smaller/shorter ones: As we process deltas from the top of the list to the bottom, the largest areas of the screen will change first until we run out of bandwidth. If we decide to completely jettison the list entirely before it is finished due to bandwidth starvation, only the smallest changes at the bottom of the list — the ones that don’t matter very much visually, as they only change a few pixels — are discarded.
  2. Given the same length, “runs” placed before “slices”: This gave priority to runs, because setting an area of screen memory with REP STOSB is both faster and smaller than trying to copy the same area with REP MOVSB.
  3. All runs sorted by their fill value: This came in handy during the encoding phase, where I was able to set a fill value once and then cache that value for subsequent runs if applicable. I would not have been able to do that if all the similar fill values weren’t right next to each other.
  4. Given the same length, all deltas sorted by their interlaced start offset in memory: This was an aesthetic choice; if I had to change an area where all deltas were exactly the same length (like a rectangular area of the screen), this allowed the changes to “paint” in somewhat linear top-to-bottom order.  This is necessary because CGA’s framebuffer is not organized in a linear fashion, but is interlaced.

For the curious, this is what the sort operator looked like:

Function TDeltas.Compare;
{
We want to maximize motion fidelity, so we are going to make our lives
easier and sort deltas by size.
Note that the comparisons are reversed -- that's because we want the largest
deltas at the beginning of the list.  (-1 means beginning of collection.)
- We add a second comparison so that deltas of the same length are sorted
by cycle execution time; this keeps runs prioritized over slices in terms
of what to prioritize if our bandwidth gets starved.
- We add a third comparison so that runs of the same run value
are kept together, so that we can cache the run value.
- We can optionally add a fourth comparison so that, all things being equal,
deltas are sorted by start offset compensated for CGA interlaced memory layout
(this is what I'm colloquially calling "Dave Murry interlace handling"
based on how Dave handled it in the PC booter Sierra Championship Boxing).
}

var
  k1so,k2so:word;

begin
  k1so:=PDelta(Key1)^.startofs;
  k2so:=PDelta(Key2)^.startofs;
  if MurryHandling then begin
    {if k1so > screenRAMsize div 2 then k1so:=k1so-(screenRAMsize div 2);
    if k2so > screenRAMsize div 2 then k2so:=k2so-(screenRAMsize div 2);}
    k1so:=k1so AND $1fff;
    k2so:=k2so AND $1fff;
  end;
  {sort by delta length}
  if PDelta(Key1)^.blength > PDelta(Key2)^.blength
    then Compare := -1
    else if PDelta(Key1)^.blength < PDelta(Key2)^.blength
      then Compare := 1
      {sort runs at a higher priority than slices}
      else if PDelta(Key1)^.REPcycleCost  PDelta(Key2)^.REPcycleCost
          then Compare := 1
          {sort runs by fill value}
          else if PDelta(Key1)^.fillvalue > PDelta(Key2)^.fillvalue
            then Compare := -1
            else if PDelta(Key1)^.fillvalue < PDelta(Key2)^.fillvalue
              then Compare := 1
              {sort deltas by start offset}
              else if k1so  k2so
                  then Compare := 1
                  else Compare := 0;
end;

With the data structures finalized, encoding one video frame’s worth of changes becomes a simple matter of doing the following:

  1. Scan the preexisting frame and new source image for differences, and insert each found difference into the delta list data structure
  2. Determine the amount of CPU time and disk bandwidth we have available to paint our changes (a “byte pool” and a “cycle pool”)
  3. For every delta in the list, starting with the first and iterating to the last:
    1. Estimate the cycle and byte cost to paint this delta
    2. If our cycle or byte pools won’t be exhausted by these costs, encode the delta to disk and then remove it from the list, then updated the pools
    3. If one of the pools will be exhausted, skip this delta and move to the next one, which will likely be smaller or faster (or both) and have a better chance of not exhausting a pool
  4. If we empty the list, we’re done with the frame
  5. If we reach the end of the list, we are out of “time” for this frame and need to either:
    1. Spend the next frame going through the list again to finish up so that the existing source image has a chance to fully display, or:
    2. Throw away whatever is left and start working on a new source image

That’s pretty much it. When I release the source in a few weeks, you’ll see that the code implements the above logic very closely. However, there’s one very important detail that I haven’t explained yet, and it’s critical to playback performance: What does “encode the delta to disk” mean exactly, and how do we play an encoded delta back to screen memory as quickly as possible? Head on over to the conclusion of  our thrilling tale, where I’ll explain how I came up with a unique solution to that problem.

42 Responses to “8088 Domination Post-Mortem, Part 1”

  1. Terje Mathisen said

    These posts have brought back a lot of memories! :-)

    I started out writing 8088 asm code in 1982, at that time this was pretty much required since we didn’t have any useful drivers for any of the hw on the PC.

    Your video code looks a lot like what we used to call compiled sprites. Even if your code could assume that every single target offset was statically known, the need to keep 8088 code extremely compact means that you have to use ES:DI as the screen buffer target at all times, right?

    As you wrote, from that point it becomes “simply” an issue of how to cheat intelligently. :-)

    I did something similar in a terminal emulator I wrote around 1983, where the screen buffer was maintained as a list of lines to avoid the need for block copies to scroll the display. Physical screen updates were limited to whatever would fit in the time budget, but as soon as there was a tiny gap in the incoming serial data stream the display would become perfect. The effective result was that nobody every realized that I didn’t always show every single intermediate update on the screen.

    Terje

    Oldskooler Ramblings wrote: > WordPress.com > Trixter posted: “These next few posts are a technical write-up of how > I created 8088 Domination, which is a program that displays fairly > decent full-motion video on a 1981 IBM PC. Before reading further, it > might be worthwhile to familiarize yourself with the production i” >

    • Trixter said

      If anyone could have figured out I was outputting compiled code before I had written it up, it would of course have been you :-) I’m flattered you approve. You, and Abrash, have always inspired me.

  2. WTF? I just tried to add a lengty comment and all I got was: “sorry, this comment could not be posted”.

  3. Luckily, it’s still in my clipboard, here is is again:

    The dynamic rendering is impressive. But what is the input? You certainly are not using some advanced compressed video format here. Did you use any “standard” format, though, or did you have to use your own encoding? I wonder if a format like the early Quicktime video formats would work here, but I guess you had to optimize the input to the max and couldn’t process anything “ordinary” here, right?

  4. Good job! I must say, you’re the first person to get me to load what is ostensibly a video file into a disassembler (hint, hint) ;)

    I did catch an encoder bug though. At 4:48 (right in the middle of Bad Apple), the screen should turn white and stay white – instead, a black frame got inserted there somehow. Encoder doesn’t like all-white frames? (yes, I may have abused Bad Apple for my own video processing experiments a bit too much… *cough*)

    • Trixter said

      Good catch on the “format” :-) And you’re right, my source files don’t go black but there is black at that point in the video. Good catch! I’ll see if I can fix the bug before releasing the code.

    • Trixter said

      Finally found the bug that created the black frame — it was in my bitmap reader. I was assuming that all B&W bitmap palettes were black, then white; it turns out that the conversion program I was using would sometimes swap palette index entries. Very strange, but I fixed my .bmp loader and it looks good now.

  5. nice one

  6. Peter Lund said

    Could you reuse (groups of) slices to cut back on the bandwidth requirements from the harddisk?

    • Trixter said

      Yes; doing so would be a crude form of vector quantization. However, doing so would also take up CPU time processing a list, even if it was just a table lookup, and CPU time (or fast memory) is something we don’t have a lot of on this platform.

  7. Couldn’t help but hack together a “player” for your format to understand it better: http://mrcn.st/t/domplay.c (horribly hacky; audio is just piped to stdout, no color rendering, lame FPS “sync”, but hey, it does mostly play).

    All in all, very cool trick :-)

    • Trixter said

      DANG complete write-up only a few hours old and already a generic C player that properly interprets the subset of opcodes I used. I tip my hat to you sir!

      • I wrote the player with the info from Part 1 only; my timestamp was 10:17am while Part 2 was posted at 3:13 pm :-) (I was bored that evening)

        I did hear “streaming code” on IRC in passing from someone who knew someone who listened to you at the competition though; with that tip, after staring at the file format it became pretty clear what was going on. It’s one of those times when after seeing someone do it you go “duh – of course, this is how you’d do it” – but it still takes a special someone to come up with it. Hats off to you for pulling this off!

        (FWIW, my own Bad Apple video demo involves lasers: http://youtu.be/5A9Eh6D-K_g though you may find the purely-code LASE demo more traditional: http://www.pouet.net/prod.php?which=55697 )

  8. Phil Mayes said

    You’re probably going to cover this in Pt.II, but I wanted to comment on disk I/O. Using the file system subjects you to seek and rotational latencies. You can minimize seek latency by ensuring the file is contiguous. Rotational latency is minimized by using a disk cache with read-ahead. If you want to go lower, you can find where the data is; initiate I/O; set up an interrupt on completion. This is better than a cache because you will avoid a memory copy. Source: I wrote a commercial disk cache in the 80s. Ask if you want more details.

    • Trixter said

      I agree with what you wrote except how read-ahead would help in this case. Memory accesses on 8088 are so slow that no matter how good the cache is, the double-buffering created by the cache cancels out any speed gains. In part 2 of the write-up, I mention that I aligned all reads to sector boundaries and also requested reads that were multiples of sector boundaries; this allows DOS to perform reads directly to my target buffer without any middleman, and is as fast as I could make it.

      I hear you about setting up the transfer and then just having it go in the background, but I needed to keep my production compatible with both the real vintage hardware as well as DOSBox (emulators), so I had to go through DOS.

      • Phil Mayes said

        I still think read-ahead will help you, but I’ll wait for full details in Pt.II to see.

        • Trixter said

          Sorry, I think our comments/replies are getting interleaved out of order :) A shorter answer to the read-ahead comment from me would be “My alignment to sectors and memory handling is already a manual read-ahead”. And I couldn’t code something that interacted with the disk controller directly because I wanted to keep it compatible with all vintage hardware as well as emulators.

          I’m curious, what disk cache did you code? If it was Lightning or Norton Cache, I tip my hat to you, as I have always admired those.

  9. […] 8088 Domination Post-Mortem, Part 1 […]

  10. You had me at “DITHER THAT SHIT”. Having grown up with CGA on an IBM 5150 (and as someone who has followed the progress of video encoding techniques and who loves to hack), this story warms my heart on so many levels. Commendable hacking, sir.

    • Trixter said

      Thanks for the kind words. We had the same experiences and have the same interests; I try to pick projects (and write about them) in a way that would impress “teenage me”. This project succeeded :)

  11. Very nice work. Welcome to the Bad Apple demo club! I’m the author of a Bad Apple demo for the TI-84+ SE calculator (https://www.youtube.com/watch?v=6pAeWf3NPNU). The screen is 96×64 and 1bpp, so it is nothing for the 15MHZ Z80 to completely overdraw while leaving enough time for audio. I’ve gotten requests to make a version for the TI-84+ CSE, which is a whole different challenge, as the screen is 320×240@16bpp, so that same CPU just doesn’t have the throughput to fully draw frames at 30fps. However, seeing this writeup has inspired me to look into doing something similar to make Bad Apple happen, even if it is without the audio and I have to use the 160×240 mode of the LCD controller.

    • Trixter said

      I’m really impressed with the music! Who transposed/converted that? Whoever did it did a great job. What are the capabilities of the TI sound hardware? If not multichannel, how are you producing multichannel music on it?

      • I posted a video of an earlier version to 4chan which did not sound very good, and a user there made me a remix. We’ll never know who exactly it was.

        The calculators have no proper sound hardware, but I was able to use the two I/O pins on the link port. Bit banging the sound probably takes 75% of the CPU time. Because I am playing two square waves on each pin, I need to have three different output states: 0, 0.5, and 1. The pins are digital, so I approximate 0.5 by switching between 0 and 1 at ~16KHz. If you are interested in these kinds of techniques, I recommend you look into 1 bit ZX Spectrum music.

        • Trixter said

          Familiar with 1-bit, thanks. Clever using the I/O ports!

          This is also one of the few times I’ve heard of 4chan being used for good instead of evil.

  12. i can only imagine how different the 80s would have been if this was around back then…. people would have been swapping porn with 5 1/4″ floppy disks! and movie releases with a 10lb box of floppies! although that probably would be untenable… probably the primary use would have been cartoon shorts and porn.

    but would it have been possible to ENCODE the video with the hardware at the time? would imagine they would have needed a mainframe….

  13. […] after [Trixter] said full motion video using the graphics mode of a CGA adapter was impossible, he’s improved on his earlier work. Now, it’s possible to display video at 640×200 resolution at 30 frames per second on a […]

  14. […] after [Trixter] said full motion video using the graphics mode of a CGA adapter was impossible, he’s improved on his earlier work. Now, it’s possible to display video at […]

  15. […] after [Trixter] said full motion video using the graphics mode of a CGA adapter was impossible, he’s improved on his earlier work. Now, it’s possible to display video at 640×200 resolution at 30 frames per second on a […]

  16. […] details start here, and are continued with a link at the bottom. The author does a great job of explaining how s/he […]

  17. Now someone ought to do a VGA ModeX version

    • Trixter said

      Definitely! But the system I describe in these posts may not be the best choice, as the higher number of colors increases the number of changes per frame dramatically. If I were designing a system for VGA, it would likely be based on vector quantization that would stream in codebook replacements over time. And, it would require at least a 286-class machine so that bandwidth to the VGA card were 16-bit instead of 8-bit.

      If 386 and higher were the target, Cinepak is pretty hard to beat in terms of decoding speed (it’s VQ-based). And it’s also worth looking into Smacker as well once you hit 33MHz or faster.

  18. […] gracias a la curiosidad del genial scener Jim Leonard (trixter), que redacta en su blog la complejidad de lograr tal hazaña en su demo 8088 Domination. Mierda, se me caen las […]

  19. […] at 30fps. Which runs great because it is black and white. [Leonard] wrote a really good article at oldshool so if you like little more technical info you should read that. Now the […]

  20. […] http://trixter.oldskool.org/2014/06/19/8088-domination-post-mortem-part-1/Blog entry – Part 1 […]

  21. O_O

  22. http://t0.rbxcdn.com/e998396ffba71b3b657ca10dc7186769

  23. Awesome!, I know for sure that the hardware was very limited at that time! Congratulation, you are a genius.

  24. […] Veja aqui todos os detalhes técnicos da façanha. Aparentemente, ele reinventou parte da tecnologia de compressão de vídeo MPEG. […]

  25. […] 8088 Domination Post-Mortem, Part 1 […]

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
Follow

Get every new post delivered to your Inbox.

Join 335 other followers

%d bloggers like this: