When you reach the top, keep climbing


Posted by Trixter on March 15, 2011

(Rather than break up the discussion, I’ve edited this entry with the promised timing information at the end of the post.)

First off, you owe it to yourself to check out Paku Paku, the astonishingly great Pac-Man clone written by Jason Knight. Why astonishingly great? Because, as a hobbyist retrogaming project, it does everything right:

- Uses a 160×100 16-color tweakmode on CGA, PCjr/Tandy, EGA, VGA, and MCGA, despite only VGA being capable of a truly native 160×100 resolution
- Plays multi-voice sound and music through the PC speaker, Tandy/PCjr 3-voice chip, Gameblaster CMS, and Adlib (yes, CMS support!)
- Runs on any machine, even a slow stock 128K PCjr
- Has convincing game mechanics (ghosts have personalities, etc.)
- Comes with full Pascal+ASM source code

This is as good a job as, if not better than, the one I like to do with my retroprogramming stunts. Very impressive work!

One of the things I love about coding for the 8088/8086 is that all timings and behavior are known. As with other old platforms such as the C64, Apple II, ZX Spectrum, etc. (or embedded platforms), it truly is possible to write the “best” code for a particular situation: no unpredictable caches or unknown architectures screwing up your optimization. Whenever I see a bit of 808x assembly that I like, I try to see if it can be reworked to be “best”. I downloaded Paku Paku just as much for the opportunity to read the source code as for the opportunity to play the game (which I did play, on my trusty IBM 5160).

On Mike Brutman’s PCjr programming forum, a discussion of optimizing for the 8088 broke out, with Jason giving his masked sprite routine inner loop as an example of how to do things fast:

lodsw
mov  bx,ax
mov  ax,es:[di]
and  al,bh
or   al,bl
stosw

It takes advantage of his sprite/mask format by loading a byte of sprite data and a byte of sprite mask with a single instruction. It then loads the existing screen byte, ANDs the sprite mask out of the background, ORs the sprite data into the background, and writes the result back. It exploits several 808x architecture quirks, such as the magic 1-byte LODS and STOS instructions (which read a word into, or write a word out of, AX and then auto-increment SI or DI, setting up for the next load/store), and the 808x’s affinity for the accumulator (AX, for which many operations are faster than for other registers). In the larger function, it’s unrolled, specialized for the size of the sprite. It’s pretty tight code.
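
To make the data flow concrete, here is a minimal Python model of what that inner loop does to memory. It is only a sketch: the function name and the sample values are made up for illustration, and it models the effect on bytes, not the timing.

```python
def plot_masked(screen, offset, sprite_words):
    """Model of the lodsw / mov bx,ax / and / or / stosw sequence.

    screen       -- bytearray standing in for the ES:DI buffer
    offset       -- starting DI value
    sprite_words -- 16-bit values as LODSW would load them:
                    high byte = mask (BH), low byte = data (BL)
    """
    for word in sprite_words:
        mask = (word >> 8) & 0xFF   # BH after "mov bx,ax"
        data = word & 0xFF          # BL after "mov bx,ax"
        # "mov ax,es:[di]" reads two bytes, but only AL (the first) is modified.
        screen[offset] = (screen[offset] & mask) | data
        # "stosw" writes both bytes back (the second unchanged) and adds 2 to DI.
        offset += 2
    return offset


# Hypothetical sample: two screen words, two sprite words.
screen = bytearray([0x12, 0xDD, 0x34, 0xDD])
end = plot_masked(screen, 0, [0xF005, 0x0F60])
print(screen.hex(), end)
```

The second byte of every word is read and written back untouched, which is exactly the cost the later variants try to avoid.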

However, one line (“MOV BX,AX”) bugged me, as it also bugged the author:

The sprite data format is stored as byteMask:byteData words which I point to with DS:SI for LODSW… which I then move to BX (which sucks, but is still faster than MOV reg16,mem; add SI,2) so I can use bh as the mask and bl as the data.

So, was that code “best”? Is there no faster way to write a masked sprite in 160×100 tweaked text mode on the 8088?

First, let’s look at his original code, with timings and size:

lodsw            16c 1b
mov  bx,ax       2c  2b
mov  ax,es:[di]  10c 3b
and  al,bh       3c  2b
or   al,bl       3c  2b
stosw            15c 1b
--------------------------
subtotal:        49c 11b
total cycles (4c per byte): 93 cycles

On 8088, reading a byte of memory takes 4 cycles, whether it’s “MOV AX,mem” or the MOV instruction opcode itself. That’s why smaller slower code can sometimes win over larger faster code on 808x. So it’s important to take the size of the code into account when optimizing for speed.
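
The total in the listing above follows from that rule of thumb. As a rough sketch (the tuple layout and names are my own, and the per-instruction numbers are simply copied from the listing), the arithmetic is:

```python
BUS_CYCLES_PER_BYTE = 4  # per the rule above: every code byte fetched costs 4 cycles


def total_cycles(listing):
    """listing: iterable of (mnemonic, execution_cycles, code_bytes)."""
    cycles = sum(c for _, c, _ in listing)
    size = sum(b for _, _, b in listing)
    return cycles, size, cycles + BUS_CYCLES_PER_BYTE * size


# Figures taken verbatim from the listing of the original routine.
ORIGINAL = [
    ("lodsw",          16, 1),
    ("mov bx,ax",       2, 2),
    ("mov ax,es:[di]", 10, 3),
    ("and al,bh",       3, 2),
    ("or al,bl",        3, 2),
    ("stosw",          15, 1),
]

print(total_cycles(ORIGINAL))  # (49, 11, 93)
```

This is a deliberately crude model: it charges every code byte in full and ignores any overlap between fetching and executing, so treat it as a way to compare variants, not as a prediction of real timings.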

Some background knowledge of how Paku Paku works can help us: The game does all drawing to an off-screen buffer that mirrors the video buffer, and when the screen needs to be updated, only the changed memory is copied to the video buffer. Because Jason does all drawing to an off-screen buffer in system RAM, and the video buffer is smaller than the size of a segment, you have room left over in that segment to store other stuff. So if you store your sprite data in that same segment after where the video buffer ends, you can get DS to point to both screen buffer AND sprite data. Doing that lets us point BX to the offset where the sprite is (it was originally meant to be an index register after all), and use the unused DX register to hold the sprite/mask. We can then rewrite the unrolled inner loop to this:

mov  dx,[bx]     8+5=13c 2b ;load sprite data/mask
lodsw            16c     1b ;load existing screen pixels
and  al,dh       3c      2b ;mask out sprite
or   al,dl       3c      2b ;or sprite data
stosw            15c     1b ;store modified screen pixels
inc  bx          3c      2b ;move to next sprite data grouping
--------------------------
subtotal:        53c     10b
total cycles (4c per byte): 93 cycles

Although we saved a byte, it’s a wash — exactly the same number of cycles in practice. However, since he is already unrolling the sprite loop for extra speed, we can change INC BX to just some fixed offset in the loop every time we need to read more sprite data, like this:

mov dx,[bx+1]
(next iteration)
mov dx,[bx+2]
(next iteration)
mov dx,[bx+3]

By adding a fixed offset, we can get rid of the INC BX:

mov  dx,[bx+NUM] 12+9=21c 3b ; "NUM" being the iteration in the loop at this point
lodsw            16c      1b
and  al,dh       3c       2b
or   al,dl       3c       2b
stosw            15c      1b
----------------------------
subtotal:        58c      9b
total cycles (4c per byte): 94 cycles

We shaved two bytes off of the original, but we’re one cycle longer than the original. While the smaller code is most likely faster because of the 8088’s 4-byte prefetch queue, it’s frustrating from a purely theoretical standpoint.

Reverse-engineer extraordinaire Andrew Jenner thinks two steps ahead of me and provides the final optimization that not only gets the cycle count down, but frees up two registers (DX and SI) in the process. He writes only what is necessary, and since we need to skip over every other byte when writing in 160×100 mode, manually updates the DI index register to do so. The end result is obtuse to look at, but undeniably the fastest:

mov ax,[bx+NUM]  12+9=21c 3b ; “NUM” being the iteration in the loop at this point
and al,[di]      9+5=14c  2b
or al,ah         3c       2b
stosb            11c      1b
inc di           3c       1b
----------------------------
subtotal:        52c      9b
total cycles (4c per byte): 88 cycles

…successfully squeezing blood from a stone.

Is this truly “best”? I think so. But to prove it, we have to time the code running on the real hardware. Thanks to Abrash’s Zen Timer, we have the following results:

Jason’s original code as listed above, repeated three times to plot a 5×5 sprite: 48 microseconds
My code block, three times with [bx], [bx+1], [bx+2]: 41 microseconds
Andrew’s optimization, also written with [bx], [bx+1], [bx+2]: 37 microseconds
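
To compare these measurements with the paper counts, it helps to convert microseconds back into cycles per block. The sketch below assumes the timings were taken on a stock 4.77 MHz 8088 (14.31818 MHz / 3); that clock is an assumption on my part, as are the names used.

```python
CPU_HZ = 14_318_180 / 3  # assumed: stock 4.77 MHz 8088 clock


def cycles_per_block(microseconds, repeats=3):
    """Convert a measured time for `repeats` blocks into cycles per block."""
    return microseconds * 1e-6 * CPU_HZ / repeats


MEASURED_US = {"original": 48, "fixed offset": 41, "byte store": 37}

for name, us in MEASURED_US.items():
    saved = (MEASURED_US["original"] - us) / MEASURED_US["original"]
    print(f"{name}: ~{cycles_per_block(us):.1f} cycles per block, "
          f"{saved:.0%} less time than the original")
```

Under that assumption the measured figures come out lower than the simple "cycles plus 4 per byte" totals, which suggests that some instruction fetching overlaps with execution.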

And just to make your head spin, check the comments for this entry — the resulting discussion shows that if you’re willing to rearrange both your sprite data and your thinking, you can get things even faster!

This entry was posted on March 15, 2011 at 7:03 pm and is filed under Gaming, Programming, Vintage Computing.

24 Responses to “When you reach the top, keep climbing”

Andrew Jenner said

March 16, 2011 at 12:13 am
I think your timings are a bit off: “mov dx,[bx+NUM]” is 12+9 on 8088 because it’s a word transfer and stosw is 15c for a total of 94c.

Best I’ve got without compiled sprites (which I’m going to assume take up too much space) is 88c:

mov  ax,[bx+NUM]  12+9=21c 3b ; “NUM” being the iteration in the loop at this point
and  al,[di]      9+5=14c  2b
or   al,ah        3c       2b
stosb             11c      1b
inc  di           3c       1b
----------------------------
subtotal:         52c      9b

This also frees up dx and si. Once again this doesn’t change the mask/sprite format though.

If you’re feeling really tricky (and your run is long enough to make it worthwhile) you could disable interrupts and use “pop ax” for the word load, saving 21 cycles.

I suspect that eliminating the off-screen buffer to video memory copy (and writing directly to video memory) is going to bring much bigger savings overall though, especially if you don’t do any sort of snow suppression.

- Trixter said
  
  March 16, 2011 at 11:40 am
  You’re right about STOSW, my bad. I need to revise my blog post and do some proper timings. I might completely erase and rewrite the above.
  
  BTW you don’t need to disable interrupts to use POP AX for loading data — you only need to disable interrupts to set up your data on the stack. Once that’s done you can keep interrupts enabled and POP away.
  
  Unfortunately, I fear that the time you save POPing data is lost getting the data set up into a buffer that can also be your stack. Or, you can store it that way (like, leave 64 bytes inbetween every sprite chunk so that the system can have up to a 64-byte stack while interrupts are enabled), but then you’re taking up so much space you might as well go with compiled sprites anyway.
  
  - Andrew Jenner said
    
    March 16, 2011 at 2:02 pm
    Well, if you don’t disable interrupts your SS data below SP is going to get corrupted. So you do need to disable interrupts if you want to be able to use that sprite data again the next time you draw.
    
    - Trixter said
      
      March 16, 2011 at 9:00 pm
      Once again, I have a brainfart. You’re right, the sprite data would get overwritten, so it would need to be copied to a temp buffer every time. And since the run is too short for this application, the copy would wipe out the POP time savings.
      
- Chris said
  
  March 20, 2011 at 7:48 pm
  From the “it’s a small world” file:
  
  The CMS programming information booklet used to add support to Paku Paku was provided by me on the Vintage Computer Forum. It came with a Sound Blaster 1.5 with CMS upgrade that I won on eBay 10+ years ago… after a bidding war with a Mr. Trixter.
  
  - Trixter said
    
    March 20, 2011 at 9:02 pm
    Could we have a link to that file/forum posting? I’d like to add CMS support to MONOTONE but the PCGPE information wasn’t completely accurate if memory serves. I want to know everything the noise channels and envelopes can do, for example.
    
    - Chris said
      
      March 20, 2011 at 9:51 pm
      http://www.vintage-computer.com/vcforum/showthread.php?23831
      
Jason Knight said

March 16, 2011 at 10:35 pm
I’m not getting how a stosb OFF word-boundary is going to be faster.

You guys do realize that to work, that stosb would have to be off boundary, right? That’s why I went with STOSW in the first place.

On my Tandy 1000 (8088D), that ‘optimized’ version comes in 5% slower than the original, probably because that version is not hitting the prefetch right AND the off-boundary penalty.

… since the value we’re changing is in AL after a WORD read… one little two little three little endians…

Though yes, that would be great if the byte being set was on the word boundary… The eight clocks saved by doing byte operations are unfortunately lost immediately not just by the word boundary, but by the extra inc operation. Sitting here figuring in the prefetch I’m seeing 58 clocks raw, somewhere around ~66 clocks with prefetch and code reads figured in… the original is 50 clocks and with prefetch in the equation it’s around 56ish.

Also, since DI needs to be pointing at ES, not DS, the and al,[di] is pulling the wrong memory location… so you need that extra 2 clocks for the segment override. You also fail to update BX, which means unless I’m going to unroll the entire loop…

See the full code for _tile5 for example:
```
	mov  cx,5

@tileLoop:

	lodsw
	mov  bx,ax
	mov  ax,es:[di]
	and  al,bh
	or   al,bl
	stosw

	lodsw
	mov  bx,ax
	mov  ax,es:[di]
	and  al,bh
	or   al,bl
	stosw

	lodsw
	mov  bx,ax
	mov  ax,es:[di]
	and  al,bh
	or   al,bl
	stosw

	add  di,154

	loop @tileLoop
```
That’s really not going to help a whole lot unless I further unroll — and that’s just too much code for something simple. The one where I’m doing two iterations inside the loop (_tile3) and only looping 3 times may be worth unrolling, but I’m really not sold on it. (especially since the next version of the code is going to set the vertical height to be blitted on the fly)

It might seem weird to read and write a byte you aren’t even changing… but it actually ends up faster thanks to fewer values to play with, lack of EA calculations, lack of boundary issues, brute efficiency of lodsw/stosw, and the advantage of mov on an accumulator vs. AND…

Oh, and I’m not ungrateful on this. Even if said code isn’t as useful as one would hope, you’ve gotten me thinking on ways to improve my Tandy-specific 160×200 port… which is going to become the official PCjr version since I just can’t trust the Jr. to handle the CGA text-mode code properly… Thankfully I now have the 160×100 backbuffer blitting to the 160×200 Tandy/Jr buffer just as fast (if not faster) than the textmode code since, being pixel-packed, STOSW can do four pixels at a time… even if I have to do a mov es:[di+bx],al before the STOSW to add the even numbered scanlines.
```
	mov  bx,$2000
	mov  cx,5

@tileLoop:

	lodsw
	mov  dx,es:[di]
	and  dl,ah
	or   dl,al
	lodsw
	and  dh,ah
	or   dh,al
	mov  ax,dx
	mov  es:[di+bx],ax
	stosw

	lodsw
	mov  dl,es:[di]
	and  dl,ah
	or   dl,al
	mov  al,dl
	mov  es:[di+bx],al
	stosb
	add  di,77

	loop @tileLoop
```
Wondering if it would be worth it to change the data format so that all sprites are 7 pixels wide (8 pixels/4 bytes storage size), just to simplify the code and get rid of the pesky stosb on the second internal iteration. (for consistency)…

- Jason Knight said
  
  March 16, 2011 at 11:11 pm
  Oh, and the reason that change to a larger byte-width might be better? I could change the mask interlace to every other word instead of every other byte… meaning I could do one AND and one OR instead of two each… though only the Tandy version would see any advantage from that.
  
  Actually, no… the back-buffer I’m rendering as pixel packed, so that could boost things more.
  
  Though really Paku is fast enough – I am more worried about my next game that’s going to have two or three times as many sprites on screen — though I will gain some speed by not having to preserve the playfield as it’s going to be simpler.
  
- Jason Knight said
  
  March 16, 2011 at 11:33 pm
  Ooh, yeah.. that change to word-width mask interlace instead of byte-width makes the buffer blits:
```
	lodsw
	and  ax,es:[di]
	mov  bx,ax
	lodsw
	or   ax,bx
	stosw
```
  Which is a MAJOR improvement… Though it will mean I have to re-encode all my bitmaps.
  
- Andrew Jenner said
  
  March 16, 2011 at 11:39 pm
  Oops – it’s easy to make code fast if you don’t mind it being broken… However, even with the extra byte and 2-cycle penalty for the ES: override, I think my version ought to be faster (either that or I’m not understanding how the 8088 works properly).
  
  The 8088 doesn’t have a word misalignment penalty does it? That doesn’t really make much sense with an 8-bit bus and makes even less sense with byte accesses.
  
  I’ve worked out (by hand) what I think the prefetch queue, Bus Interface Unit and Execution Unit are doing each cycle and (with the ES: fix), I get 68 cycles for your original, 61 cycles for Trixter’s version and 56 for mine. These correspond fairly closely with what Trixter measured using his machine and Zen timer. So it’s odd that you’re getting something different (maybe I’m misunderstanding how the prefetch queue works and Trixter tested it on a V20 or something, and we only got the same answers by accident).
  
  According to the method I’m using, it seems that the 8088 is IO bound for almost every cycle, so the performance is directly proportional to the number of bytes read+written: 17 for yours (11 code + 6 data), 15 for Trixter’s (9 code + 6 data) and 14 for mine (10 code + 4 data). I put my working here (sorry if it’s incomprehensible) – I’d be interested to see how your analysis differs from mine. My email address is andrew@reenigne.org if you’d like to take it offline.
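
  The “proportional to bytes transferred” idea can be sketched as a simple lower bound. This is only an illustration of the paragraph above: the function name is hypothetical, and the byte counts are the ones quoted there.

```python
def bus_bound_cycles(code_bytes, data_bytes, cycles_per_byte=4):
    """Lower bound on run time if the bus moves one byte every 4 cycles."""
    return (code_bytes + data_bytes) * cycles_per_byte


# (code bytes, data bytes) as quoted in the comment above.
VARIANTS = {"Jason": (11, 6), "Trixter": (9, 6), "Andrew": (10, 4)}

for name, (code, data) in VARIANTS.items():
    print(name, bus_bound_cycles(code, data))
```

  The bound gives 68, 60 and 56 cycles, against hand-worked figures of 68, 61 and 56, so under this model only the middle variant spends a cycle not waiting on the bus.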
  
  I’m working on making a software implementation of this method in order to accurately predict the performance of 8088 code, so I’m gathering datapoints about how the hardware actually works (I don’t currently have an 8088 machine to try things on, and documentation seems to be a bit thin on the ground).
  
  I certainly don’t blame you for not wanting to unroll the whole thing though.
  
  - Jason Knight said
    
    March 17, 2011 at 10:16 am
    First, I don’t mind keeping this online — no need to take it to e-mail. It’s a very complex subject and I’m sure all three of us are making little mistakes and miscalculations all over the place… Keeping it where people can find it may save someone time later on.
    
    Looking at your calculations I see one flaw right away; AX is the accumulator — just as 286/higher has no effective address calculations, the accumulator on the 8088 is immune to EA’s calc delays — which is why when possible it’s often faster to do two mov — mov accum,mem; mov reg, accum … than it is to move directly to a register, depending on the prefix. The more complex the EA, the more true this often is. As such the mov ax,es:[di] is still a flat 14 clocks — you don’t add EA to it… unfortunately this advantage is mostly limited to MOV, which is where that “and ax,es:[di]” might be at a disadvantage. (though that base 9 might make it a wash with the second opcode)
    
    … and screwy as it sounds even with the 8 bit data path, the 8088 is slower accessing bytes off boundary or words that cross word boundaries… and this DOES affect accumulator operations on some 8088’s but not others. Great example of this is the P8088 used in many 5150’s ends up a fraction slower than the AMD 8088D used in some 5160’s – though that penalty typically is offset by the changes to the memory system that reduces the refresh wait penalty. With the array of 8088’s from Intel, AMD, OKI, NEC, etc… it’s best to assume that penalties like off-boundary and cross-boundary exist…
    
    Hell, I’m testing on a NEC 8088D-2, which is supposed to be a vanilla 8088 knockoff, but I wouldn’t bank on it. I could be optimizing to one target while neglecting the others.
    
    - Andrew Jenner said
      
      March 17, 2011 at 11:07 am
      That’s very interesting – I didn’t know about the lack of EA cycles when using mov and the accumulator. I know about the A0/A1/A2/A3 opcodes which transfer between the accumulator and memory in 10/14 cycles without using an EA, but they only work for constant addresses, not [DI], so won’t be used here. However, if my calculations are correct, it shouldn’t make any difference whether this MOV is 14 or 19 cycles, since the bottleneck is the bus, not the EU. You can transfer at most 1 byte every 4 cycles, so code which transfers 17 bytes is never going to take less than 68 cycles.
      
      If I put together some testcases to try to measure the effect of using the accumulator on EA calculations and the effect of word-alignment on memory accesses, would you mind running them on your machine and sending me the results so that I can adjust my model?
      
    - Andrew Jenner said
      
      March 21, 2011 at 12:26 am
      Jason, would you mind running a little experiment for me on your 8088 machines? I’ve put asm source and a binary here. This is loosely based on Michael Abrash’s precision Zen timer, but with some simplifications which should make it easier to get even more precise timings.
      
      This program runs up to 1000 iterations of “mov ax,[di]” and “mov bx,[di]” in steps of 100 iterations. According to my understanding, they should both take 17 cycles per instruction (12+5) on the EU and 16 cycles per instruction (2 bytes code, 2 bytes data) on the BIU, so should be EU-bound. If you’re right and there’s no EA penalty for “mov ax,[di]” then the first one should take 16 cycles per instruction (might be less on the EU but it’ll be IO bound).
      
      17 cycles works out at (17*100/4 =) an increase of 425 clock ticks when we add 100 instructions. Adjusted for DRAM refresh cycles gives us (425*72/68 =) 450. Similarly, 16 cycles works out at 400 clock ticks raw or about 424 clock ticks taking into account the DRAM refresh.
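
      The tick arithmetic above can be written out as a short sketch. The function name is made up; the divide-by-4 reflects the timer counting once per four CPU cycles, and the 72/68 refresh factor is taken straight from the paragraph above, not derived here.

```python
def ticks_per_batch(cycles_per_instruction, instructions=100, refresh_adjust=True):
    """Expected increase in timer ticks when `instructions` more are added."""
    raw = cycles_per_instruction * instructions / 4  # one timer tick per 4 CPU cycles
    # DRAM refresh factor as quoted in the comment: 72/68.
    return raw * 72 / 68 if refresh_adjust else raw


for cpi in (17, 16):
    print(cpi, ticks_per_batch(cpi, refresh_adjust=False), round(ticks_per_batch(cpi)))
```

      So a measured step near 450 ticks per 100 instructions would point at 17 cycles per instruction, and a step near 424 at 16.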
      
      I tested this on DOSBox and MESS – DOSBox gave about 40 clock ticks per 100 instructions (I think it’s 386-class). MESS gave 325 on the nose – this is because MESS uses the instruction timings from HELPPC, but it doesn’t simulate the bus, the prefetch queue or DRAM refresh, and it uses the 8086 timings even when it’s supposed to be emulating an 8088 (these instructions are documented at 12+5 cycles on 8088, 8+5 cycles on 8086).
      
      I’d be very interested to see what these do on your machines.
      
      - Jason Knight said
        
        March 21, 2011 at 10:31 am
        Results are interesting tested across three machines…
        
        Tandy 1000 HX (Siemens 8088-P-2): at 7.16 MHz both result in 4106, while at 4.77 MHz it comes in at 4808… I suspect your code is being memory bus locked as the clock speed isn’t making that big a difference. Not too surprising with the 4 byte opcode and 2 byte memory op on a 17 clock operation, since that hangs the clock 12 cycles on the first call and 3 clocks each thereafter waiting on the BIU. 2 bytes data + 3 bytes opcode = 20 clocks for the BIU to the 17 clocks on the EU.
        
        Tandy 1000 SX (NEC D8088D-2): at 7.16 MHz I get 3982 for AX and 4106 for BX; down at 4.77 MHz I get 4706 for AX and 4808 for BX.
        
        Sharp PC 7000 (Intel 8086): at 7.16 MHz I get 3534 for AX and 4088 for BX… down at 4.77 MHz it comes in at 4808 for AX and 5026 for BX.
        
        (I thought the 7000 was an 8088. I was wrong, it’s an 8086.)
        
        The Siemens is most likely a 1:1 knockoff of the original Intel design, so I suspect that you are correct in that on a real Intel 8088 it is NOT EA immune. The NEC, even though it’s NOT a V20, appears to have some optimizations over the Siemens that net a bit of a speed boost on AX operations… The 8086 appears to also treat all AX MOV operations faster, though it appears in both cases if you do the math, we’re not seeing EA eliminated, it’s just lower… and not by the same amount.
        
        I guess it’s NOT EA being ignored, as it is compiled to the same opcode for AX or BX; it’s just that on some 8088 clones and the 8086 there’s some other optimization going on.
        
        I really do wish more opcode lists and assembly guides explained in detail the difference between:
        
        mov acc,mem
        mov mem,acc
        
        and
        
        mov reg,mem
        mov mem,reg
        
        As nothing I’ve ever read has clearly stated the former cannot be used with displacements or is some sort of different opcode.
        
        Hell, it could come down to a compiler difference if it’s not using the latter when it can. I’ll see if I can get your example to compile in TASM for a laugh, see if the result is different.
        
        What did you compile that with anyhow? It’s sad, but I’ve seen performance differences between assemblers (!?!) which something like this could help explain.
        
        
        Andrew Jenner said
        
        March 21, 2011 at 1:00 pm
        Very interesting – thanks for that. It’s not surprising that the code is bus-bound at 7.16MHz since it’s only just EU-bound at 4.77MHz. The 8086 results are surprising – I’d have thought that they would be 13 cycles just as in MESS (8 BIU cycles for 2 bus words and 13 EU cycles) so I’m not sure what’s going on there (maybe there’s a way to put an 8086 on an 8-bit bus and have it act like an 8088?)
        
        If you send me the other counts (for 100-900) I can do a more detailed analysis and hopefully eliminate the startup and DRAM refresh effects completely so we can see the exact cycles per instruction count (you can redirect the output of mincount to a file by running “mincount >results.txt” if that makes it easier).
        
        Based on the data you’ve provided, I agree with your analysis though.
        
        Yes, the HELPPC asm.txt file (and others based on it) don’t say that the “mov accum,mem” form is a different opcode to “mov reg,mem” and is offset only. I think I learnt that through looking at the encoding tables on sandpile.org and/or the Intel manuals which give all the instruction encodings. The short version of the story is that there are two ways to encode “MOV AX,[data]”:
        
        A1 iw: MOV AX,[iw] – 3 bytes, 14 EU cycles on 8088, no EA penalty
        8B /r: MOV rm,rmw – 4 bytes, 12+6 EU cycles on 8088 including EA penalty.
        
        I think most assemblers should know to pick the first one when they can (A86, which is what I used to assemble mintimer, does). I can well imagine that some very simple assemblers might not though, or might have other performance differences (using a long jump where a short one would do or failing to use the other shorter opcodes like 40-5F, 91-97 and B0-BF).
        
        There is only one opcode for “MOV AX,[DI]”, though, so using a different assembler won’t make any difference to the timings in this case.
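
        The two encodings can be spelled out byte by byte. The sketch below builds them by hand from the standard 8086 ModRM layout (mod, reg, r/m); the helper names are hypothetical and this is not how any particular assembler is implemented.

```python
AX, BX = 0b000, 0b011        # 16-bit register numbers used in the reg field
RM_DIRECT, RM_DI = 0b110, 0b101  # r/m values with mod=00: direct address, [DI]


def modrm(mod, reg, rm):
    return (mod << 6) | (reg << 3) | rm


def mov_ax_direct_short(addr):
    """A1 iw: accumulator-only form, 3 bytes."""
    return bytes([0xA1, addr & 0xFF, addr >> 8])


def mov_ax_direct_long(addr):
    """8B /r with a direct address: general form, 4 bytes."""
    return bytes([0x8B, modrm(0b00, AX, RM_DIRECT), addr & 0xFF, addr >> 8])


def mov_reg_from_di(reg):
    """8B /r with [DI]: the only encoding, whatever the destination register."""
    return bytes([0x8B, modrm(0b00, reg, RM_DI)])


print(mov_ax_direct_short(0x1234).hex())  # a13412
print(mov_ax_direct_long(0x1234).hex())   # 8b063412
print(mov_reg_from_di(AX).hex(), mov_reg_from_di(BX).hex())  # 8b05 8b1d
```

        Both [DI] loads are two bytes from the same opcode family, so any speed difference between the AX and BX cases cannot come from the assembler.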
        
        
        Jason Knight said
        
        March 21, 2011 at 2:49 pm
        The assembler differences were something I’ve noticed a lot — the bit about missing when a short jump would be better is something I’ve seen MASM screw up a LOT, which is why I started using TASM in the first place.
        
        Though right now for ASM I’m using the inline-compiler in turbo pascal 7 — so who knows what optimizations that may or may not be doing. I’ve had a few cases where TP7 tries to compile a short jump when the target is actually too far away, resulting in jumps to the wrong part of the code!
        
        I’ll pull the full timings from your program when I get a chance.
Jason Knight said

March 19, 2011 at 7:59 pm
I think I figured out why mine is faster than you guys — I’m profiling the code WITH two back-to-back iterations and a loop.

I was sitting here calculating it, and the fetch for the loop and no prefetch after LOOP… with a 3 byte long opcode at the start, is making it come in at 165 clocks on paper, with an extra 8 clocks because of a cache stall. (If you have an opcode larger than fits into the prefetch in the interval it takes for each byte, any extra bytes take TRIPLE the read time; don’t ask why, it just does…) Then to make it work in-program I had to put in an “add bx,2” for the loop, which is a 4/4 (4 bytes, 4 cycle execute): death on cache and causing a second stall.

So what was 143 on paper ends up 172… when my original was 165.

BUT… Here’s a laugh — we’ve been talking about a routine that hasn’t been used in gameplay since version 1.0 — I had already scrapped that for gameplay in 1.2 — the gameplay version of that routine went:
```
	lodsw
	mov  bx,ax
	mov  al,es:[di]
	and  al,bh
	or   al,bl
	stosb
```
For writing to the backbuffer, and:

@loop:
	movsb
	inc  di
	movsb
	add  di,bx
	add  si,$004E
	loop @loop

For copying the backbuffer to screen….

Though your suggestions DID help once I got rid of trying to use the mov at the start — I fail to see how a 3 byte 21 cycle opcode is going to be faster than a 1 byte 16 cycle one… The trick with the “AND” though really makes a difference — so now the backbuffer blit is going to be:
```
	lodsw
	and  ah,es:[di]
	or   al,ah
	stosb
```
for backbuffer. I also made a change to screen where I’m storing the $004E in BX — which speeds it up since fetch2 for add di,bx ends up a hell of a lot faster than the four byte add di,$004E, as it avoids a cache stall before trying to fetch loop.

I also unrolled the 4×3 blit — NOT going to even try to unroll the 8×6 one.

Which I guess means I am going to release a version 1.5

It’s funny how a “mov bx,immed” followed by two “add di,bx” can be faster than two “add di,immed” — my stupid little “null a 4×3 tile area on the byte boundary” routine ended up 50% faster when I changed it from this:
```
	mov  cx,3
@loop:
	mov  es:[di],ax
	add  di,$50
	loop @loop
```
to this:
```
	mov  bx,$50
	mov  es:[di],ax
	add  di,bx
	mov  es:[di],ax
	add  di,bx
	mov  es:[di],ax
```
Though I suspect the 17 clocks per loop are as much to blame for the speed increase.

I made a spreadsheet in OoO for figuring out the clocks. I’ll clean it up tomorrow and post a link for you folks to have a look-see. It’s probably not perfect, but it seems consistent in testing on three different 8088’s… A Siemens 8088-P-2, a NEC D8088D-2, and an actual Intel. (forgot I had one in the Sharp PC-7000)

- Jason Knight said
  
  March 19, 2011 at 8:01 pm
  Doh… maybe it would be faster instead of add immed to do di+$50? It is an accumulator operation and so immune to EA…
  
  - Jason Knight said
    
    March 19, 2011 at 8:12 pm
    Oh yeah, much better:
```
mov  es:[di],ax
mov  es:[di+$50],ax
mov  es:[di+$A0],ax
```
    No loops, no add’s… and less code too. THERE’S someplace the offset made sense.
    
    could probably make that faster by playing with DS and BX, but it’s TP, having to preserve DS usually makes it better to just use ES.
    
Jason Knight said

March 21, 2011 at 1:55 pm
For reference, I just released 1.5 which makes use of our discussion here. I’ve also backlinked to this page on the various announcements…

Me being a firm believer in giving credit when others help and all :D

http://www.cutcodedown.com/retroGames/paku_1_5.rar

Can’t believe what a train wreck the distro version of 1.4 ended up compared to the 1.4 beta binary and hard source. I’ve got to stop trying to use extra “tools” to manage code, it NEVER works out for me… but then I’m the guy who can’t even stand color syntax highlighting as it makes the code hard for me to read.

Terje Mathisen said

November 2, 2011 at 2:23 am
It was really fun to read all the comments here, I haven’t worked at this level on 8088 since around 1985 or so!

There is one thing I haven’t seen mentioned, and that is the possibility to update both the back buffer (which you do need, since reading from the screen buffer really sucks) and the screen buffer at once in the sprite code.

lodsw
mov bx,ax
mov al,es:[di]
and al,bh
or al,bl
stosb

If you can keep both the back buffer and the sprites in the same segment, then ES can point to the screen, DX saves the initial word leaving BX free, and allowing the update to write to both (both buffers must be at the same offset of course).

mov [di],al
stosb

There’s probably a very good reason why this cannot work, besides snow suppression?

Jason Knight said

November 9, 2011 at 12:14 pm
There are a few reasons…

1) the backbuffer exists so we DON’T write to the screen at the same time, so sprites can be layered. All of the sprites have to be layered atop each other with their transparencies BEFORE being shown on the screen. It’s why it’s a three layer composite: playfield, back-buffer and screen. We restore the modified (and ONLY the modified) parts of the playfield to the back-buffer, layer the sprites on the backbuffer, then show all the parts of the backbuffer that are different.

On real CGA systems at 8088 speeds you don’t have the time to write to CGA memory the entire buffers or sprites and keep your framerate up — this is further compounded by the CGA’s RAM being about half as fast as system RAM…. This really kicks PCjr owners in the crotch since their system RAM is shared with the CGA so the bottom 128k performs at the slower CGA speeds. (making the Jr. Slower than a real PC or XT)

2) the backbuffer is pixel packed, the screen the foreground/background byte is packed, but every other byte is a character code — specifically 0xDD — so they’re not the same format nor use the same pointers. Making the backbuffers the same format would make the blits slower, hard to deal with, and increase the program’s memory footprint by 4k.

3) memory is allocated via Turbo Pascal’s heap, meaning each major getmem/new is going to have its own segment.

4) LODSW only works with DS… so you’re going to have a different segment ANYWAYS.

5) Snow suppression doesn’t even play into it; it has snow, as there’s not enough time in the hretrace or vretrace to do what I’m doing. Originally I was trying to avoid it, but realistically without snow in text mode the most you can manage is around 15 bytes (realistically more like 11 bytes) per 35hz refresh, which is way too slow to do any game in really.
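
The format difference in point 2 can be illustrated with a small sketch. It assumes that each packed back-buffer byte can be used directly as the attribute byte of one text cell, with 0xDD as the character byte before it; that fits the description above but is my assumption, and the function name is made up.

```python
CHAR_CODE = 0xDD  # character byte used for every cell, per point 2


def expand_row(packed_row):
    """Interleave packed back-buffer bytes with the fixed character code.

    Assumed layout: text-mode memory is (character, attribute) pairs,
    and each back-buffer byte is already a valid attribute byte.
    """
    out = bytearray()
    for attr in packed_row:
        out.append(CHAR_CODE)  # even address: character
        out.append(attr)       # odd address: attribute holding two pixels
    return bytes(out)


print(expand_row(bytes([0x1F, 0xA4])).hex())  # dd1fdda4
```

In practice the character bytes only need to be written once, which is why a copy to the screen can touch every other byte and leave the rest alone.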

P.S. Version 1.6 was released last night — the game has a new home at:

http://www.deathshadow.com/pakuPaku

Terje Mathisen said

November 10, 2011 at 2:55 am
Thanks for the reply!

Re some of your points:

1) Later on, the EGA/VGA suffered from an even larger handicap speedwise vs system RAM, I believe reading from the frame buffer was 3-5 x slower than regular memory. :-(

It was so slow that it was faster to scroll a screen by overwriting everything than to use the BIOS function to scroll up by a line. It was far better to allocate a larger buffer and then just update the starting pointer…

What I did for my terminal emulator was to update the back buffer during serial stream decoding, then copy any modified parts to the real frame buffer only when idle (or when too long had passed since the last update).

2) OK, that’s a good reason.

3&4) I used TP for many years, that’s no reason to not have your own suballocator: I.e. you grab a full segment, then split it into the various components that would be helped by sharing the same segment address!

Avoiding segment reg loads helps even more on later x86 versions…

5) I did snow suppression by only copying a few bytes during hretrace, and a bunch more during vretrace, but I had (on average) much less data to update in the frame buffer.



Oldskooler Ramblings

the unlikely child born of the home computer wars
