Quite welcome. In the next year or so, there might be a new series of 8088 optimization articles.

Doesn’t work on NEC V20/V30, and is almost an order of magnitude slower than XLAT (AAM on 808x takes 83 cycles).

Of course, I haven’t checked the compatibility (does it exist on the NEC?) or the timings, but it seems like it would be an improvement for some code paths, at least.

It also frees BX for BP replacement.

No, that would not work. Peter’s code would return to the string instruction with CX unchanged as opposed to your code which returns with CX decremented by 1.

Thanks for a very interesting article!

Loading EAX would be safe, but writing EAX to [ebx+2] could write 5 bytes past the current end of the output buffer…

This is still fine, since LZ4 always ends with a literal, so even if this is the last match and it is just 4 bytes long, the extra byte will be overwritten by the final copy.

OTOH, using AX doubles the chances of avoiding misaligned load and/or store operations.

mov ax,[ebx]

mov [ebx+2],ax

you’d save two bytes by using eax in both cases (in 32-bit code each 16-bit access needs an operand-size prefix).

Nice! You’ve saved me the work if I ever move up to 32-bit prot mode (unlikely, as I’m still exploring what 8088 can do, but never say never)

The idea is that I can turn 1-byte RLE into 2-byte by simply duplicating the last byte in the output buffer, and I can turn 2-byte into 4-byte if I start by duplicating the last word!

; EDX has match offset, ECX count (>= 4), EBX = EDI-EDX

cmp edx, 2

jb do_stosb

je do_stosw

cmp edx,4

je do_stosd

; Use REP MOVS to copy the (possibly overlapping) range

do_movs:

lea edx,[ebx+ecx]

xchg ebx,esi ; Save ESI and point it at match source

cmp edx,edi

ja overlapping_range

mov edx,ecx

shr ecx,2

and edx,3

rep movsd

mov ecx,edx

overlapping_range:

rep movsb

mov esi,ebx

jmp start_literal

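do_stosb: ; Offset 1: duplicate the last byte

mov al,[ebx]

mov [edi],al ; EDI = EBX+1, so [ebx] is now a valid word; fall through

do_stosw: ; Offset 2 (or fall-through from above): duplicate the last word

mov ax,[ebx]

mov [ebx+2],ax ; [ebx]..[ebx+3] now hold a full 4-byte pattern; fall through

do_stosd:

xor edx,edx

mov eax,[ebx] ; All four bytes will be valid!

sub edx,ecx ; EDX = -count

add ecx,3

and edx,3 ; EDX = (-count) AND 3 = bytes of overshoot

shr ecx,2 ; ECX = (count+3)/4 = dwords to store, rounded up

rep stosd

sub edi,edx ; Back EDI up over the overshoot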

start_literal:
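For anyone who wants to check the rounding arithmetic in the do_stosd path without firing up an assembler, here is a rough Python model of the three STOS cases above. It is only a sketch: `copy_match_stos` and `naive_copy` are names made up for this illustration, the output buffer is modelled as a `bytearray`, and the scratch space added at the end stands in for the assumption that the real buffer has room for the overshoot.

```python
def naive_copy(out, offset, count):
    """Reference: copy `count` bytes one at a time from `offset` bytes back."""
    for _ in range(count):
        out.append(out[-offset])


def copy_match_stos(out, offset, count):
    """Model of do_stosb / do_stosw / do_stosd for offsets 1, 2 and 4."""
    assert offset in (1, 2, 4) and count >= 4 and len(out) >= offset
    edi = len(out)               # EDI: current end of output
    ebx = edi - offset           # EBX = EDI - EDX (match source)
    out.extend(bytes(count + 8)) # scratch room for the overshoot (assumption)

    if offset == 1:              # do_stosb: duplicate the last byte
        out[edi] = out[ebx]
    if offset <= 2:              # do_stosw: duplicate the last word
        out[ebx + 2:ebx + 4] = out[ebx:ebx + 2]

    pattern = bytes(out[ebx:ebx + 4])  # mov eax,[ebx]
    overshoot = (-count) & 3           # sub edx,ecx / and edx,3
    dwords = (count + 3) >> 2          # add ecx,3 / shr ecx,2
    for i in range(dwords):            # rep stosd
        out[edi + 4 * i:edi + 4 * i + 4] = pattern
    edi += 4 * dwords - overshoot      # sub edi,edx
    del out[edi:]                      # drop everything past the new EDI
```

Running both functions on the same starting buffer for every offset in (1, 2, 4) and a range of counts should give identical results, which is a quick way to convince yourself that backing EDI up by `(-count) & 3` is the right correction.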

Coming from you, that means very much to me, thanks!

Detecting one- and two-byte match loops that could be turned into rep stos was a very nice idea!

Ah, I see what you mean. Yes, that would work and saves 5 cycles from eliminating the JCXZ fall-through, and INC CX is a single-byte instruction. I can’t time this right now but I’m pretty sure it’s a win in both speed and size. Nice :-)

I knew I had quarter-square multiplication on the brain from somewhere!

I realised this morning that it’s the result of non-stream mode. The command line in compress.bat is necessary to avoid that problem. I should have read the documentation.

You don’t have the jcxz in the second case.

@again:

rep …

inc cx

loop @again

if the rep completed, cx will be zero, inc->cx=1, loop->cx=0 again and fall through.

if the rep did not complete, then inc->cx=cx+1, loop->cx=cx-1, and it continues from @again with the correct count.
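The register arithmetic in those two cases is easy to check with a tiny model. This is only a sketch of what INC CX followed by LOOP does to a 16-bit CX; `inc_then_loop` is a made-up name, and it says nothing about how the CPU resumes an interrupted REP, only about the count and the branch.

```python
def inc_then_loop(cx):
    """Model INC CX followed by LOOP on a 16-bit CX.

    Returns (new_cx, branch_taken).
    """
    cx = (cx + 1) & 0xFFFF   # INC CX (wraps at 16 bits)
    cx = (cx - 1) & 0xFFFF   # LOOP decrements CX first...
    return cx, cx != 0       # ...then jumps only if CX is nonzero


# REP completed: CX was 0, so we fall through with CX still 0.
print(inc_then_loop(0))
# REP did not complete: CX is left as it was and the branch is taken.
print(inc_then_loop(5))
```

So the pair leaves CX untouched in every case and acts as a two-instruction "jump if CX is not zero".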

My most recent blog post is about the quarter-square method! http://www.reenigne.org/blog/multiplying-faster-with-squares/
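For readers who have not met it, the identity usually meant by that name is a·b = ⌊(a+b)²/4⌋ − ⌊(a−b)²/4⌋, which turns a multiply into two table lookups and a subtraction. A minimal sketch for unsigned 8-bit operands follows; the table name and `mul8` are invented for this example and are not taken from the linked post.

```python
# floor(n*n/4) for every possible a+b when a and b are 0..255
QUARTER_SQUARES = [n * n // 4 for n in range(511)]


def mul8(a, b):
    """Multiply two unsigned 8-bit values using only lookups and a subtract."""
    # a+b and a-b always have the same parity, so the two floors discard
    # the same fraction and the difference is exactly a*b.
    return QUARTER_SQUARES[a + b] - QUARTER_SQUARES[abs(a - b)]
```

On a machine with a slow MUL, the table costs memory but can replace the multiply with a couple of indexed loads.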

Normally you do rotations by multiplying a 3-element vector with a 3×3 matrix, which is 9 multiplications. However, if you only rotate about two axes you can get a guaranteed zero in one of the matrix elements, which takes you down to 8 multiplications per vertex. If you can exploit some symmetries in the model you’re drawing you can probably get it down even further.
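To make the "guaranteed zero" concrete, here is a small sketch assuming the two rotations are about Y (angle b) followed by X (angle a); a different axis pair or order moves the zero to another element. The function names are made up for this illustration.

```python
import math


def two_axis_matrix(a, b):
    """Combined matrix Rx(a) * Ry(b): rotate about Y by b, then about X by a."""
    ca, sa = math.cos(a), math.sin(a)
    cb, sb = math.cos(b), math.sin(b)
    return [
        [cb,       0.0, sb],        # the middle element is always zero
        [sa * sb,  ca,  -sa * cb],
        [-ca * sb, sa,  ca * cb],
    ]


def rotate8(m, v):
    """Apply the matrix with 8 multiplications by skipping m[0][1]."""
    x, y, z = v
    return (
        m[0][0] * x + m[0][2] * z,                # 2 multiplications
        m[1][0] * x + m[1][1] * y + m[1][2] * z,  # 3 multiplications
        m[2][0] * x + m[2][1] * y + m[2][2] * z,  # 3 multiplications
    )
```

The matrix is built once per frame, so the trig calls are not part of the per-vertex cost; only the 8 multiplications in `rotate8` are.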

Given three points A, B and C in a 2D space, there’s a really easy way to tell whether visiting the points in the order A, B, C has you going clockwise or anticlockwise – it’s just the sign of the cross product of the vectors AB and AC. That orientation corresponds to whether the triangle that you’re drawing is facing towards or away from you (assuming that the order is the same for all your triangles – clockwise is just a convention). Obviously that only helps if you want to draw only one side of each triangle, for example if you’re drawing a solid object – if you’re drawing a thin surface then it doesn’t help because you need to draw both sides of each triangle anyway.
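In code the test is a single expression. This is a minimal sketch with points as (x, y) tuples; `orientation` is a name chosen for the example, and the sign convention assumes the Y axis points up (with screen coordinates where Y grows downwards, clockwise and anticlockwise swap).

```python
def orientation(a, b, c):
    """Z component of the cross product AB x AC.

    Positive: A, B, C run anticlockwise (Y-up axes).
    Negative: clockwise.  Zero: the points are collinear.
    """
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def is_back_facing(a, b, c):
    """Cull test, assuming front faces are stored anticlockwise."""
    return orientation(a, b, c) <= 0
```

With integer screen coordinates this is two multiplications and a handful of subtractions per triangle.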

Knowing that (and that you can use the same “orientation” method to figure out if a point is inside or outside a triangle) was actually what got me my first job I think – it certainly seemed to impress the interviewer, as it was a better solution than the one he knew!
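The inside/outside version can be sketched like this: a point is inside the triangle if it lies on the same side of all three edges. The names are hypothetical, and this particular formulation (which counts points on an edge as inside and works for either winding order) is just one common way to write it.

```python
def cross(a, b, c):
    """Z component of AB x AC for 2D points given as (x, y) tuples."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def point_in_triangle(p, a, b, c):
    """True if P is inside or on the edge of triangle ABC."""
    d1 = cross(a, b, p)
    d2 = cross(b, c, p)
    d3 = cross(c, a, p)
    has_neg = d1 < 0 or d2 < 0 or d3 < 0
    has_pos = d1 > 0 or d2 > 0 or d3 > 0
    # Mixed signs mean P is on different sides of different edges: outside.
    return not (has_neg and has_pos)
```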

I hadn’t seen your post, but I just now read it, and I think the lightbulb started to flicker a little. When I decide to set aside a few weeks and dedicate myself to the task, I’ll show you what I’ve got and maybe we can take it from there. I have seen all sorts of tricks like multiplication via the quarter-square method, performing fewer calcs if not rotating about all three axes, storing vertices in clockwise order to help with culling and rendering, etc. but I only have a partial understanding of those methods.

Drawing points and lines and filling a scanline quickly — that I’ve got down pat, no worries there.

Your second example is faster, but it doesn’t do the same thing as the former. If the REP completes, CX will be 0, whereas your INC CX will roll it around to FFFF and the loop will start over and run for the maximum.

I responded to your private email, and some of your suggestions did increase speed, so there will be a second release of the code soon :-)

I definitely want to help teach you fast 3D. Have you read my blog post about deriving the equations for 3D graphics (http://www.reenigne.org/blog/equations-for-3d-graphics)? I think after that it’s mostly just a question of optimization by massive application of lookup tables, the exact set of lookup tables you use depending on just what kind of effect you’re doing.

Then there’s the question of drawing points, lines and triangles quickly, which is a whole other optimization problem. I’ve played about a little with some of these routines and there are lots of interesting optimizations that can be done there too.

JCXZ next ; continue if REP completed

LOOP @again ; keep trying if REP never completed

next:

faster than

INC CX ; compensate for LOOP effect

LOOP @again ; keep trying if REP never completed

?