Compiler Bugs?
I've been fairly lazy with working on personal coding projects over the past month, but I can say at least that some progress has been made on things. Small progress. Some bits of optimization work and bug fixing with various drawing functions such as bit-blits. Upon discovering some of the bugs I then had to spend time fixing, I realized that I really needed to re-prioritize making some kind of test suite for libDGL... which was something I kept putting off.
Anyway, first off, to follow up on some unanswered questions I had from my last post, I realized that the semi-lacking code inlining behaviour of Watcom C 10.0 was just how it worked. I suspect it's probably a bug. According to all the documentation I had read, the /oe compiler option should have been able to adjust the maximum size of functions that the compiler would consider for inlining. The default setting is fairly small, and upon bumping it up I noticed absolutely no difference. Didn't matter what I set it to. Hrm. Spending a bunch of time tweaking my code to see if it was just a matter of helping the compiler out by giving it code it "likes" better proved equally fruitless. Just a limitation (or bug) of that particular version of the compiler.
Some time later I had the opportunity to pick up a brand-new-in-box copy of Watcom 11.0. I actually wasn't originally intending on getting this at all since I've read multiple comments that people seem to think 10.x is the "definitive" version in terms of features and stability. But since I happened across it for cheap, I figured "meh, why not." If nothing else, now I could make use of super easy inline assembly via _asm blocks. This is when I started rewriting a number of my drawing routines' inner loops and such into straight assembly. This wasn't even really required as I was actually fairly happy with the performance I'd been getting from the straight-C implementations, but I figured why not, it's now easy for me to do this.
One thing I noticed with Watcom 11.0 is that by default, using just /oe, the inlining behaviour worked basically identically to what I saw with 10.0. However, increasing the size using, say, /oe=40 (default is 20) actually made a difference. So definitely a bug in 10.0.
Well, just yesterday I was fixing up a bug in the way I calculated the width of bit-blits after clipping was taken into account, and in how I decided whether the blit can be done using rep movsd alone or using both rep movsd and a single small rep movsb. (This particular bug was also what made me realize that I really needed some kind of test suite, like right now... I had made such a silly oversight in this code, heh.) Upon some more thorough testing once I had finished, I realized I had run into everyone's favourite type of bug: my code worked wonderfully when compiling with debug settings, but not with release optimizations!
Anyway, it took me a little bit to figure out what was going on, but it appears to be a bug with how the compiler handles inline assembly that really has absolutely shattered my confidence in using this feature with Watcom going forward. It seems like this probably is pretty uncommon (I don't have this issue in any of my other routines), but even so... I don't want to have to second guess the compiler.
Anyway, so here's how I had written my surface_blit_region_f routine. This routine does no clipping itself (it assumes the source/destination regions are pre-clipped). As well, it's a solid bit-blit (no transparency handling). I realized there were basically 3 different scenarios where this would be called:
- The source region has a width that is an even multiple of 4. Only rep movsd is needed. This is probably the most common scenario since most graphics in games have dimensions that are powers of two like 16x16, 32x32, 64x64, etc.
- The source region has a width > 4, with a remaining number of pixels <= 3. rep movsd and a single remaining rep movsb can be used. Probably the second most common scenario, especially when you have a partially clipped image.
- The source region has a width < 4. A single rep movsb can be used. Probably the least common scenario, would likely occur only when an image is almost completely clipped off the screen as I don't think many games used image sizes of 3x3, 2x2, etc, but I guess once in a while it happens.
I originally had this all handled as a single loop that would intelligently issue as many rep movsd copies as were needed and then a rep movsb if needed. Performance was pretty good. As expected, splitting the code up into 3 different loops matching the above scenarios didn't improve performance by much, but I did get a little bit of a boost. Every bit is nice.
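For illustration only, the three-way split described above can be sketched in portable C. This is not the libDGL code: classify_width, blit_rows and the enum names are hypothetical, and memcpy stands in for the rep movsd / rep movsb inner loops.

```c
#include <assert.h>
#include <string.h>

typedef unsigned char byte;

/* Hypothetical names: which of the three loops a given width needs. */
enum blit_case { BLIT_DWORDS_ONLY, BLIT_DWORDS_AND_BYTES, BLIT_BYTES_ONLY };

enum blit_case classify_width(int width) {
    int width_4 = width / 4;          /* number of whole 4-pixel runs */
    int width_remainder = width & 3;  /* 0..3 leftover pixels */

    assert(width > 0);
    if (width_4 && !width_remainder) return BLIT_DWORDS_ONLY;
    if (width_4 && width_remainder)  return BLIT_DWORDS_AND_BYTES;
    return BLIT_BYTES_ONLY;
}

/* Portable stand-in for the inner loops: copy `lines` rows of `width`
   bytes, then skip ahead to the start of the next row in each buffer.
   src_y_inc and dest_y_inc are (surface width - blit width). */
void blit_rows(byte *dest, const byte *src, int width, int lines,
               int src_y_inc, int dest_y_inc) {
    while (lines-- > 0) {
        memcpy(dest, src, (size_t)width);
        src  += width + src_y_inc;
        dest += width + dest_y_inc;
    }
}
```

With this split, a 16-pixel-wide region lands in the first case, a 13-pixel-wide one in the second, and anything narrower than 4 pixels in the third.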
Anyway, here's the code I ended up with:
void
Initial testing was good (when using debugging compiler options)! Then I switched to release optimizations and ran through more scenarios and noticed problems... eventually when I thought to look at the assembly output, I noticed this:
; void surface_blit_region_f(const SURFACE *src,
; SURFACE *dest,
; int src_x,
; int src_y,
; int src_width,
; int src_height,
; int dest_x,
; int dest_y) {
; const byte *psrc;
; byte *pdest;
; int lines;
surface_blit_region_f_:
push esi
push edi
push ebp
mov ebp,esp
sub esp,0000001cH
L53: mov esi,ebx
mov ebx,ecx
mov ecx,dword ptr +10H[ebp]
; int src_y_inc = src->width - src_width;
mov edi,dword ptr [eax]
sub edi,ecx
mov dword ptr -10H[ebp],edi
; int dest_y_inc = dest->width - src_width;
; int width_4, width_remainder;
;
; psrc = (const byte*)surface_pointer(src, src_x, src_y);
; pdest = (byte*)surface_pointer(dest, dest_x, dest_y);
mov edi,dword ptr [edx]
sub edi,ecx
mov dword ptr -0cH[ebp],edi
L54: imul ebx,dword ptr [eax]
mov eax,dword ptr +8H[eax]
add esi,ebx
add eax,esi
mov dword ptr -1cH[ebp],eax
L55: mov ebx,dword ptr +1cH[ebp]
mov eax,dword ptr [edx]
imul eax,ebx
mov esi,dword ptr +18H[ebp]
mov edx,dword ptr +8H[edx]
add eax,esi
add eax,edx
mov dword ptr -18H[ebp],eax
; lines = src_height;
;
mov eax,dword ptr +14H[ebp]
mov dword ptr -14H[ebp],eax
mov edx,ecx
; width_4 = src_width / 4;
mov eax,ecx
sar edx,1fH
shl edx,02H
sbb eax,edx
sar eax,02H
mov dword ptr -8H[ebp],eax
; width_remainder = src_width & 3;
;
and ecx,00000003H
mov dword ptr -4H[ebp],ecx
; if (width_4 && !width_remainder) {
; // width is a multiple of 4 (no remainder)
mov edi,dword ptr -8H[ebp]
L56: test edi,edi
je short L57
cmp dword ptr -4H[ebp],00000000H
jne short L57
; _asm {
; mov esi, psrc
; mov edi, pdest
;
; mov eax, width_4 // eax = number of 4-pixel runs (dwords)
;
; mov edx, lines // edx = line loop counter
; test edx, edx // make sure there is >0 lines to draw
; draw_line:
; jz done // if no more lines to draw, then we're done
;
; mov ecx, eax // draw all 4-pixel runs (dwords)
; rep movsd
;
; add esi, src_y_inc // move to next line
; add edi, dest_y_inc
; dec edx // decrease line loop counter
; jmp draw_line
; done:
; }
;
mov esi,dword ptr -1cH[ebp]
mov edi,dword ptr -18H[ebp]
mov eax,dword ptr -8H[ebp]
mov edx,dword ptr -14H[ebp]
test edx,edx
je short L58
mov ecx,eax
repe movsd
add esi,dword ptr -10H[ebp]
; } else if (width_4 && width_remainder) {
; // width is >= 4 and there is a remainder ( <= 3 )
jmp short L63
L57: cmp dword ptr -8H[ebp],00000000H
je short L61
DB 83H,7dH,0fcH,00H
je short L61
; _asm {
; mov esi, psrc
; mov edi, pdest
;
; mov eax, width_4 // eax = number of 4-pixel runs (dwords)
; mov ebx, width_remainder // ebx = remaining number of pixels
;
; mov edx, lines // edx = line loop counter
; test edx, edx // make sure there is >0 lines to draw
; draw_line:
; jz done // if no more lines to draw, then we're done
;
; mov ecx, eax // draw all 4-pixel runs (dwords)
; rep movsd
; mov ecx, ebx // draw remaining pixels ( <= 3 bytes )
; rep movsb
;
; add esi, src_y_inc // move to next line
; add edi, dest_y_inc
; dec edx // decrease line loop counter
; jmp draw_line
; done:
; }
;
mov esi,dword ptr -1cH[ebp]
mov edi,dword ptr -18H[ebp]
mov eax,dword ptr -8H[ebp]
mov ebx,dword ptr -4H[ebp]
mov edx,dword ptr -14H[ebp]
test edx,edx
L59: je short L60
mov ecx,eax
repe movsd
mov ecx,ebx
repe movsb
add esi,dword ptr -10H[ebp]
add edi,dword ptr -0cH[ebp]
dec edx
jmp short L59
; } else {
; // width is <= 3
L60: jmp short L64
; _asm {
; mov esi, psrc
; mov edi, pdest
;
; mov eax, width_remainder // ebx = number of pixels to draw (bytes)
;
; mov edx, lines // edx = line loop counter
; test edx, edx // make sure there is >0 lines to draw
; draw_line:
; jz done // if no more lines to draw, then we're done
;
; mov ecx, ebx // draw pixels (bytes)
; rep movsb
;
; add esi, src_y_inc // move to next line
; add edi, dest_y_inc
; dec edx // decrease line loop counter
; jmp draw_line
; done:
; }
; }
L61: mov esi,dword ptr -1cH[ebp]
mov edi,dword ptr -18H[ebp]
mov eax,dword ptr -4H[ebp]
mov edx,dword ptr -14H[ebp]
test edx,edx
L62: je short L64
mov ecx,ebx
repe movsb
add esi,dword ptr -10H[ebp]
L63: add edi,dword ptr -0cH[ebp]
dec edx
jmp short L62
; }
;
L64: mov esp,ebp
pop ebp
pop edi
pop esi
ret 0010H
At first glance, this may look fine. But look at the code for the first scenario blit within the width_4 && !width_remainder condition:
mov esi,dword ptr -1cH[ebp]
mov edi,dword ptr -18H[ebp]
mov eax,dword ptr -8H[ebp]
mov edx,dword ptr -14H[ebp]
test edx,edx
je short L58
mov ecx,eax
repe movsd
add esi,dword ptr -10H[ebp]
; } else if (width_4 && width_remainder) {
; // width is >= 4 and there is a remainder ( <= 3 )
jmp short L63
L57: cmp dword ptr -8H[ebp],00000000H
je short L61
DB 83H,7dH,0fcH,00H
je short L61
Uhh, what? The compiler appears to have just chopped off the last bit of the blit loop and then headed on to the following else if. Well, what's at the L63 label that it jumps to...
; _asm {
; mov esi, psrc
; mov edi, pdest
;
; mov eax, width_remainder // ebx = number of pixels to draw (bytes)
;
; mov edx, lines // edx = line loop counter
; test edx, edx // make sure there is >0 lines to draw
; draw_line:
; jz done // if no more lines to draw, then we're done
;
; mov ecx, ebx // draw pixels (bytes)
; rep movsb
;
; add esi, src_y_inc // move to next line
; add edi, dest_y_inc
; dec edx // decrease line loop counter
; jmp draw_line
; done:
; }
; }
L61: mov esi,dword ptr -1cH[ebp]
mov edi,dword ptr -18H[ebp]
mov eax,dword ptr -4H[ebp]
mov edx,dword ptr -14H[ebp]
test edx,edx
L62: je short L64
mov ecx,ebx
repe movsb
add esi,dword ptr -10H[ebp]
L63: add edi,dword ptr -0cH[ebp]
dec edx
jmp short L62
Huh. It jumps into the end of the third blit scenario's loop. Of course, then it jumps back to L62 and continues running the wrong blit from that point on.
So obviously at this point the logical conclusion is that the compiler was mixed up because I had three _asm blocks with identical labels. This was actually something I had used elsewhere with no problems, as Watcom seems able to make labels within any _asm block unique to that block only. But what I was seeing here seemed to indicate that this was maybe not a bulletproof feature. So I changed all the labels to be uniquely named and noticed no change whatsoever! Huh?
A short while later I just decided to try adding nops at random places. Confusingly enough, that seemed to do the trick:
; _asm {
; mov esi, psrc
; mov edi, pdest
;
; mov eax, width_4 // eax = number of 4-pixel runs (dwords)
;
; mov edx, lines // edx = line loop counter
; test edx, edx // make sure there is >0 lines to draw
; draw_line:
; jz done // if no more lines to draw, then we're done
;
; mov ecx, eax // draw all 4-pixel runs (dwords)
; rep movsd
;
; nop
; add esi, src_y_inc // move to next line
; add edi, dest_y_inc
; dec edx // decrease line loop counter
; jmp draw_line
; done:
; }
;
mov esi,dword ptr -1cH[ebp]
mov edi,dword ptr -18H[ebp]
mov eax,dword ptr -8H[ebp]
mov edx,dword ptr -14H[ebp]
test edx,edx
L57: je short L58
mov ecx,eax
repe movsd
nop
add esi,dword ptr -10H[ebp]
add edi,dword ptr -0cH[ebp]
dec edx
jmp short L57
I experimented with the placement of the nop a little more and it can be pretty much anywhere after L57 as shown above (and before the final jmp of course) and it "fixes" the problem. What the heck?
Another thing I tried was rearranging where I do the jz (or je):
; _asm {
; mov esi, psrc
; mov edi, pdest
;
; mov ebx, width_4 // ebx = number of 4-pixel runs (dwords)
;
; mov edx, lines // edx = line loop counter
; test edx, edx // make sure there is >0 lines to draw
; jz done // if no more lines to draw, then we're done
; draw_line:
; mov ecx, ebx // draw all 4-pixel runs (dwords)
; rep movsd
;
; add esi, src_y_inc // move to next line
; add edi, dest_y_inc
; dec edx // decrease line loop counter
; jz done // if no more lines to draw, then we're done
; jmp draw_line
; done:
; }
;
mov esi,dword ptr -1cH[ebp]
mov edi,dword ptr -18H[ebp]
mov ebx,dword ptr -8H[ebp]
mov edx,dword ptr -14H[ebp]
test edx,edx
je short L58
L57: mov ecx,ebx
repe movsd
add esi,dword ptr -10H[ebp]
add edi,dword ptr -0cH[ebp]
dec edx
je short L58
jmp short L57
So that solves the problem too.
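In C terms, the two loop shapes are roughly a top-tested while loop versus a guarded do-while. The sketch below only counts iterations to show that the two shapes do the same amount of work; the function names are made up for this example and the loop body is a stand-in for the actual rep movsd and pointer adjustments.

```c
#include <assert.h>

/* Shape of the original asm: the exit test sits at the top of the loop,
   and the bottom is an unconditional jump back to it. */
int count_lines_top_tested(int lines) {
    int drawn = 0;
    assert(lines >= 0);
    while (lines != 0) {   /* draw_line: jz done */
        drawn++;           /* stand-in for rep movsd + pointer adds */
        lines--;           /* dec edx */
    }                      /* jmp draw_line */
    return drawn;
}

/* Shape of the rearranged asm: one guard before the loop, then the exit
   test at the bottom, right after the decrement. */
int count_lines_bottom_tested(int lines) {
    int drawn = 0;
    assert(lines >= 0);
    if (lines != 0) {          /* test edx, edx / jz done */
        do {
            drawn++;           /* stand-in for rep movsd + pointer adds */
            lines--;           /* dec edx */
        } while (lines != 0);  /* jz done / jmp draw_line */
    }
    return drawn;
}
```

Both versions run the body exactly `lines` times, so the rearrangement should not change behaviour; it only changes where the conditional jump sits relative to the loop label.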
Of course, I don't like this at all. Why does this particular piece of code cause the compiler to mess up like this? I fear I may never know! One last thing I wanted to try was the Watcom-recommended approach to inline assembly: using #pragma aux instead of _asm. This was also improved in 11.0, allowing you to refer to your C variables just as I was doing here with _asm. Of course, the syntax is far uglier, but it does have the added benefit of allowing the compiler to stitch together your assembly with the surrounding code a bit better:
void
And the relevant compiler generated output:
; if (width_4 && !width_remainder) {
; // width is a multiple of 4 (no remainder)
; extern void _inner_blit4(byte *dest, const byte *src, int width4, int lines);
; #pragma aux _inner_blit4 = \
; " test edx, edx" \
; "draw_line:" \
; " jz done" \
; "" \
; " mov ecx, eax" \
; " rep movsd" \
; "" \
; " add esi, src_y_inc" \
; " add edi, dest_y_inc" \
; " dec edx" \
; " jmp draw_line" \
; "done:" \
; parm [edi] [esi] [eax] [edx] \
; modify [ecx];
;
L56: test eax,eax
je short L57
test ebx,ebx
jne short L57
; _inner_blit4(pdest, psrc, width_4, lines);
;
mov edx,edi
mov edi,ecx
test edx,edx
je short L58
mov ecx,eax
repe movsd
add esi,dword ptr -8H[ebp]
; } else if (width_4 && width_remainder) {
; // width is >= 4 and there is a remainder ( <= 3 )
; extern void _inner_blit4r(byte *dest, const byte *src, int width4, int remainder, int lines);
; #pragma aux _inner_blit4r = \
; " test edx, edx" \
; "draw_line:" \
; " jz done" \
; "" \
; " mov ecx, eax" \
; " rep movsd" \
; " mov ecx, ebx" \
; " rep movsb" \
; "" \
; " add esi, src_y_inc" \
; " add edi, dest_y_inc" \
; " dec edx" \
; " jmp draw_line" \
; "done:" \
; parm [edi] [esi] [eax] [ebx] [edx] \
; modify [ecx];
;
jmp short L63
L57: test eax,eax
je short L61
test ebx,ebx
DB 74H,21H
So, still results in the same bug. Bleh.
Just to take another opportunity to dump a bunch more code in this post, here's my surface_blit_sprite_region_f function which is basically the same idea as surface_blit_region_f, except that as its name suggests, it deals with transparency and skips over source pixels that are colour zero. But the same idea of splitting it up into three separate blit loops for the different scenarios outlined above is still there, complete with three separate _asm blocks:
void
And the code that the compiler generated:
; void surface_blit_sprite_region_f(const SURFACE *src,
; SURFACE *dest,
; int src_x,
; int src_y,
; int src_width,
; int src_height,
; int dest_x,
; int dest_y) {
; const byte *psrc;
; byte *pdest;
; byte pixel;
; int src_y_inc, dest_y_inc;
; int width, width_4, width_remainder;
; int lines_left;
; int x;
;
; psrc = (const byte*)surface_pointer(src, src_x, src_y);
surface_blit_sprite_region_f_:
push esi
push edi
push ebp
mov ebp,esp
sub esp,00000020H
L69: mov esi,dword ptr [eax]
imul ecx,esi
add ebx,ecx
mov ecx,dword ptr +8H[eax]
add ecx,ebx
mov dword ptr -20H[ebp],ecx
; src_y_inc = src->width;
; pdest = (byte*)surface_pointer(dest, dest_x, dest_y);
mov dword ptr -18H[ebp],esi
L70: mov edi,dword ptr +1cH[ebp]
mov eax,dword ptr [edx]
imul eax,edi
add eax,dword ptr +18H[ebp]
mov ecx,dword ptr +8H[edx]
add eax,ecx
mov dword ptr -1cH[ebp],eax
; dest_y_inc = dest->width;
mov eax,dword ptr [edx]
mov dword ptr -14H[ebp],eax
; width = src_width;
mov eax,dword ptr +10H[ebp]
mov dword ptr -10H[ebp],eax
; lines_left = src_height;
mov eax,dword ptr +14H[ebp]
mov dword ptr -4H[ebp],eax
; src_y_inc -= width;
mov eax,dword ptr -10H[ebp]
sub dword ptr -18H[ebp],eax
; dest_y_inc -= width;
;
mov eax,dword ptr -10H[ebp]
sub dword ptr -14H[ebp],eax
; width_4 = width / 4;
mov eax,dword ptr -10H[ebp]
mov edx,dword ptr -10H[ebp]
sar edx,1fH
shl edx,02H
sbb eax,edx
sar eax,02H
mov dword ptr -0cH[ebp],eax
; width_remainder = width & 3;
;
mov eax,dword ptr -10H[ebp]
and eax,00000003H
mov dword ptr -8H[ebp],eax
; if (width_4 && !width_remainder) {
; // width is a multiple of 4 (no remainder)
mov edi,dword ptr -0cH[ebp]
L71: test edi,edi
je short L79
cmp dword ptr -8H[ebp],00000000H
jne short L79
; _asm {
; mov esi, psrc
; mov edi, pdest
;
; mov ebx, width_4 // get number of 4-pixel runs to be drawn
; mov ecx, lines_left
; test ecx, ecx // make sure there is >0 lines to be drawn
; draw_line:
; jz done
;
; start_4_run:
; mov edx, ebx // dx = counter of 4-pixel runs left to draw
; draw_px_0:
; mov al, [esi]+0 // load src pixel
; test al, al
; jz draw_px_1 // if it is color 0, skip it
; mov [edi]+0, al // otherwise, draw it onto dest
; draw_px_1:
; mov al, [esi]+1
; test al, al
; jz draw_px_2
; mov [edi]+1, al
; draw_px_2:
; mov al, [esi]+2
; test al, al
; jz draw_px_3
; mov [edi]+2, al
; draw_px_3:
; mov al, [esi]+3
; test al, al
; jz end_4_run
; mov [edi]+3, al
; end_4_run:
; add esi, 4 // move src and dest up 4 pixels
; add edi, 4
; dec edx // decrease 4-pixel run loop counter
; jnz draw_px_0 // if there are still more runs, draw them
;
; end_line:
; add esi, src_y_inc // move src and dest to start of next line
; add edi, dest_y_inc
; dec ecx // decrease line loop counter
; jmp draw_line
; done:
; }
;
;
mov esi,dword ptr -20H[ebp]
mov edi,dword ptr -1cH[ebp]
mov ebx,dword ptr -0cH[ebp]
mov ecx,dword ptr -4H[ebp]
test ecx,ecx
L72: je short L78
mov edx,ebx
L73: mov al,byte ptr [esi]
test al,al
je short L74
mov byte ptr [edi],al
L74: mov al,byte ptr +1H[esi]
test al,al
je short L75
mov byte ptr +1H[edi],al
L75: mov al,byte ptr +2H[esi]
test al,al
je short L76
mov byte ptr +2H[edi],al
L76: mov al,byte ptr +3H[esi]
test al,al
je short L77
mov byte ptr +3H[edi],al
L77: add esi,00000004H
add edi,00000004H
dec edx
jne short L73
add esi,dword ptr -18H[ebp]
add edi,dword ptr -14H[ebp]
dec ecx
jmp short L72
; } else if (width_4 && width_remainder) {
; // width is >= 4 and there is a remainder ( <= 3 )
L78: jmp near ptr L96
L79: cmp dword ptr -0cH[ebp],00000000H
je near ptr L91
cmp dword ptr -8H[ebp],00000000H
je near ptr L91
; _asm {
; mov esi, psrc
; mov edi, pdest
;
; mov ebx, width_4 // get number of 4-pixel runs to be drawn
; mov ecx, lines_left
; test ecx, ecx // make sure there is >0 lines to be drawn
; draw_line:
; jz done
;
; test ebx, ebx
; jz start_remainder_run // if no 4-pixel runs, just draw remainder
;
; start_4_run: // draw 4-pixel runs first
; mov edx, ebx // dx = counter of 4-pixel runs left to draw
; draw_px_0:
; mov al, [esi]+0 // load src pixel
; test al, al
; jz draw_px_1 // if it is color 0, skip it
; mov [edi]+0, al // otherwise, draw it onto dest
; draw_px_1:
; mov al, [esi]+1
; test al, al
; jz draw_px_2
; mov [edi]+1, al
; draw_px_2:
; mov al, [esi]+2
; test al, al
; jz draw_px_3
; mov [edi]+2, al
; draw_px_3:
; mov al, [esi]+3
; test al, al
; jz end_4_run
; mov [edi]+3, al
; end_4_run:
; add esi, 4 // move src and dest up 4 pixels
; add edi, 4
; dec edx // decrease 4-pixel run loop counter
; jnz draw_px_0 // if there are still more runs, draw them
;
; start_remainder_run: // now draw remaining pixels ( <= 3 pixels )
; mov edx, width_remainder // dx = counter of remaining pixels
; test edx, edx
; jz end_line // if no remaining pixels, goto line end
;
; draw_pixel:
; mov al, [esi] // load pixel
; inc esi
; test al, al // if zero, skip to next pixel
; jz end_pixel
; mov [edi], al // else, draw pixel
; end_pixel:
; inc edi
; dec edx
; jz end_line // loop while (x)
; jmp draw_pixel
;
; end_line:
; add esi, src_y_inc // move src and dest to start of next line
; add edi, dest_y_inc
; dec ecx // decrease line loop counter
; jmp draw_line
; done:
; }
;
; } else {
; // width is <= 3
mov esi,dword ptr -20H[ebp]
mov edi,dword ptr -1cH[ebp]
mov ebx,dword ptr -0cH[ebp]
mov ecx,dword ptr -4H[ebp]
test ecx,ecx
L80: je short L90
test ebx,ebx
je short L86
mov edx,ebx
L81: mov al,byte ptr [esi]
test al,al
je short L82
mov byte ptr [edi],al
L82: mov al,byte ptr +1H[esi]
test al,al
je short L83
mov byte ptr +1H[edi],al
L83: mov al,byte ptr +2H[esi]
test al,al
je short L84
mov byte ptr +2H[edi],al
L84: mov al,byte ptr +3H[esi]
test al,al
je short L85
mov byte ptr +3H[edi],al
L85: add esi,00000004H
add edi,00000004H
dec edx
jne short L81
L86: mov edx,dword ptr -8H[ebp]
test edx,edx
je short L89
L87: mov al,byte ptr [esi]
inc esi
test al,al
je short L88
mov byte ptr [edi],al
L88: inc edi
dec edx
je short L89
jmp short L87
L89: add esi,dword ptr -18H[ebp]
add edi,dword ptr -14H[ebp]
dec ecx
jmp short L80
L90: mov esp,ebp
pop ebp
pop edi
pop esi
ret 0010H
; _asm {
; mov esi, psrc
; mov edi, pdest
;
; mov ebx, width // get number of pixels to be drawn
; mov ecx, lines_left
; test ecx, ecx // make sure there is >0 lines to be drawn
; draw_line:
; jz done
;
; mov edx, ebx // dx = counter of remaining pixels
; draw_pixel:
; mov al, [esi] // load pixel
; inc esi
; test al, al // if zero, skip to next pixel
; jz end_pixel
; mov [edi], al // else, draw pixel
; end_pixel:
; inc edi
; dec edx
; jz end_line // loop while (x)
; jmp draw_pixel
;
; end_line:
; add esi, src_y_inc // move src and dest to start of next line
; add edi, dest_y_inc
; dec ecx // decrease line loop counter
; jmp draw_line
; done:
; }
; }
L91: mov esi,dword ptr -20H[ebp]
mov edi,dword ptr -1cH[ebp]
mov ebx,dword ptr -10H[ebp]
mov ecx,dword ptr -4H[ebp]
test ecx,ecx
L92: je short L96
mov edx,ebx
L93: mov al,byte ptr [esi]
inc esi
test al,al
je short L94
mov byte ptr [edi],al
L94: inc edi
dec edx
je short L95
jmp short L93
L95: add esi,dword ptr -18H[ebp]
add edi,dword ptr -14H[ebp]
dec ecx
jmp short L92
; }
;
L96: mov esp,ebp
pop ebp
pop edi
pop esi
ret 0010H
surface_blit_sprite_region_f passes the exact same set of tests (under both debug and release optimizations) that I ran surface_blit_region_f through, with no problems at all. Aarrghh!
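As a rough idea of the kind of test that catches this class of bug, here is a minimal sketch in plain C. It is not the actual libDGL test code: every name in it is hypothetical, and the blit signature is an assumed simplification that takes pre-computed pointers and per-line increments. It sweeps widths that hit all three scenarios and compares each result against a naive byte-at-a-time reference, with a guard value pre-filled so stray writes show up too.

```c
#include <assert.h>
#include <string.h>

typedef unsigned char byte;

enum { SRC_W = 16, DEST_W = 16, MAX_H = 4, GUARD = 0xCD };

/* Assumed signature of a pre-clipped inner blit under test. */
typedef void (*blit_fn)(byte *dest, const byte *src, int width, int lines,
                        int src_y_inc, int dest_y_inc);

/* Deliberately naive reference: one byte at a time. */
void blit_reference(byte *dest, const byte *src, int width, int lines,
                    int src_y_inc, int dest_y_inc) {
    int x, y;
    for (y = 0; y < lines; y++) {
        for (x = 0; x < width; x++) *dest++ = *src++;
        src += src_y_inc;
        dest += dest_y_inc;
    }
}

/* Stand-in for an optimized routine that behaves correctly. */
void blit_memcpy(byte *dest, const byte *src, int width, int lines,
                 int src_y_inc, int dest_y_inc) {
    while (lines-- > 0) {
        memcpy(dest, src, (size_t)width);
        src  += width + src_y_inc;
        dest += width + dest_y_inc;
    }
}

/* Stand-in for a broken routine: only the first line is drawn. */
void blit_first_line_only(byte *dest, const byte *src, int width, int lines,
                          int src_y_inc, int dest_y_inc) {
    (void)src_y_inc;
    (void)dest_y_inc;
    if (lines > 0) memcpy(dest, src, (size_t)width);
}

/* Returns how many (width, lines) combinations disagree with the reference. */
int count_blit_mismatches(blit_fn fn) {
    byte src[SRC_W * MAX_H], expected[DEST_W * MAX_H], actual[DEST_W * MAX_H];
    int width, lines, i, failures = 0;

    /* Source values 1..64 never collide with the guard value. */
    for (i = 0; i < SRC_W * MAX_H; i++) src[i] = (byte)(i + 1);

    for (width = 1; width <= SRC_W; width++) {
        for (lines = 0; lines <= MAX_H; lines++) {
            memset(expected, GUARD, sizeof expected);
            memset(actual, GUARD, sizeof actual);
            blit_reference(expected, src, width, lines,
                           SRC_W - width, DEST_W - width);
            fn(actual, src, width, lines, SRC_W - width, DEST_W - width);
            if (memcmp(expected, actual, sizeof expected) != 0) failures++;
        }
    }
    return failures;
}
```

The point of running the same sweep under both debug and release settings is that the reference loop is simple enough that it is unlikely to be affected by the same problem as the routine under test.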
So what is the takeaway? Well, I won't be relying on inline assembly anymore. Realistically, I wasn't planning on having gobs of assembly in libDGL anyway. Just for some inner loops and such that I wanted to make sure were running as lean as possible. Instead of writing it all inline, I'll just be moving it out to separate assembly source files and using either WASM or TASM.