Compiler Bugs?

February 5, 2018

I've been fairly lazy with working on personal coding projects over the past month, but I can say at least that some progress has been made on things. Small progress. Some bits of optimization work and bug fixing with various drawing functions such as bit-blits. Upon discovering some of the bugs I then had to spend time fixing, I realized that I really needed to re-prioritize making some kind of test suite for libDGL... which was something I kept putting off.

Anyway, first off, to follow up on some unanswered questions I had from my last post: I've concluded that the semi-lacking code inlining behaviour of Watcom C 10.0 is just how that version works, and I suspect it's probably a bug. According to all the documentation I had read, the /oe compiler option should have been able to adjust the maximum size of functions that the compiler would consider for inlining. The default setting is fairly small, and upon bumping it up I noticed absolutely no difference. It didn't matter what I set it to. Hrm. Spending a bunch of time tweaking my code to see if it was just a matter of helping the compiler out by giving it code it "likes" better proved equally fruitless. It's just a limitation (or bug) of that particular version of the compiler.

Some time later I had the opportunity to pick up a brand-new-in-box copy of Watcom 11.0. I actually wasn't originally intending on getting this at all since I've read multiple comments that people seem to think 10.x is the "definitive" version in terms of features and stability. But since I happened across it for cheap, I figured "meh, why not." If nothing else, now I could make use of super easy inline assembly via _asm blocks. This is when I started rewriting a number of my drawing routines' inner loops and such into straight assembly. This wasn't even really required as I was actually fairly happy with the performance I'd been getting from the straight-C implementations, but I figured why not, it's now easy for me to do this.

One thing I noticed with Watcom 11.0 is that, by default, using just /oe the inlining behaviour worked basically identically to what I saw with 10.0. However, increasing the size using, say, /oe=40 (default is 20) actually made a difference. So definitely a bug in 10.0.

Well, just yesterday I was fixing up a bug in the way I calculated the width of bit-blits after clipping was taken into account, and in how I decided whether the blit can be done using rep movsds alone or needs both rep movsds and a single small rep movsb. (This particular bug was also what made me realize that I really needed some kind of test suite, like right now... I had made such a silly oversight in this code, heh.) Upon some more thorough testing once I had finished, I realized I had run into everyone's favourite type of bug: my code worked wonderfully when compiling with debug settings, but not with release optimizations!

Anyway, it took me a little bit to figure out what was going on, but it appears to be a bug with how the compiler handles inline assembly that really has absolutely shattered my confidence in using this feature with Watcom going forward. It seems like this probably is pretty uncommon (I don't have this issue in any of my other routines), but even so... I don't want to have to second guess the compiler.

Anyway, so here's how I had written my surface_blit_region_f routine. This routine does no clipping itself (it assumes the source/destination regions are pre-clipped). As well, it's a solid bit-blit (no transparency handling). I realized there were basically 3 different scenarios where this would be called:

  • The source region has a width that is an even multiple of 4. Only rep movsds are needed. This is probably the most common scenario since most graphics in games have dimensions that are powers of two like 16x16, 32x32, 64x64, etc.
  • The source region has a width > 4, with a remaining number of pixels <= 3. rep movsds and a single remaining rep movsb can be used. Probably the second most common scenario, especially when you have a partially clipped image.
  • The source region has a width < 4. A single rep movsb can be used. Probably the least common scenario, would likely occur only when an image is almost completely clipped off the screen as I don't think many games used image sizes of 3x3, 2x2, etc, but I guess once in a while it happens.
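
To make the three cases concrete, here's a small C sketch of how a pre-clipped width splits into dword and byte copies. The names blit_kind and classify_width are made up for this illustration and aren't part of libDGL; the arithmetic is just the usual width / 4 and width & 3 split.

```c
#include <assert.h>

// Hypothetical names for illustration only; not part of libDGL.
typedef enum {
    BLIT_DWORDS_ONLY,       // width is a multiple of 4: rep movsd only
    BLIT_DWORDS_AND_BYTES,  // width > 4 with 1-3 left over: movsd + movsb
    BLIT_BYTES_ONLY         // width < 4: rep movsb only
} blit_kind;

// Classify a pre-clipped region width into one of the three scenarios,
// also returning the dword and leftover byte counts per line.
static blit_kind classify_width(int width, int *dwords, int *bytes) {
    assert(width >= 0);
    *dwords = width / 4;  // number of 4-pixel runs
    *bytes = width & 3;   // leftover pixels, 0..3
    if (*dwords && !*bytes) return BLIT_DWORDS_ONLY;
    if (*dwords && *bytes) return BLIT_DWORDS_AND_BYTES;
    return BLIT_BYTES_ONLY;
}
```

For example, a width of 16 gives 4 dwords and no leftover bytes, 13 gives 3 dwords plus 1 byte, and 3 gives just 3 bytes.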

I originally had this all handled as a single loop that would intelligently issue as many rep movsds as were needed and then a rep movsb if needed. Performance was pretty good. As expected, splitting the code up into 3 different loops matching the above scenarios didn't improve performance by much, but I did get a little bit of a boost. Every bit is nice.

Anyway, here's the code I ended up with:

void surface_blit_region_f(const SURFACE *src,
                           SURFACE *dest,
                           int src_x,
                           int src_y,
                           int src_width,
                           int src_height,
                           int dest_x,
                           int dest_y) {
    const byte *psrc;
    byte *pdest;
    int lines;
    int src_y_inc = src->width - src_width;
    int dest_y_inc = dest->width - src_width;
    int width_4, width_remainder;

    psrc = (const byte*)surface_pointer(src, src_x, src_y);
    pdest = (byte*)surface_pointer(dest, dest_x, dest_y);
    lines = src_height;

    width_4 = src_width / 4;
    width_remainder = src_width & 3;

    if (width_4 && !width_remainder) {
        // width is a multiple of 4 (no remainder)
        _asm {
            mov esi, psrc
            mov edi, pdest

            mov ebx, width_4     // ebx = number of 4-pixel runs (dwords)

            mov edx, lines       // edx = line loop counter
            test edx, edx        // make sure there is >0 lines to draw
        draw_line:
            jz done              // if no more lines to draw, then we're done

            mov ecx, ebx         // draw all 4-pixel runs (dwords)
            rep movsd

            add esi, src_y_inc   // move to next line
            add edi, dest_y_inc
            dec edx              // decrease line loop counter
            jmp draw_line
        done:
        }

    } else if (width_4 && width_remainder) {
        // width is >= 4 and there is a remainder ( <= 3 )
        _asm {
            mov esi, psrc
            mov edi, pdest

            mov eax, width_4         // eax = number of 4-pixel runs (dwords)
            mov ebx, width_remainder // ebx = remaining number of pixels

            mov edx, lines       // edx = line loop counter
            test edx, edx        // make sure there is >0 lines to draw
        draw_line:
            jz done              // if no more lines to draw, then we're done

            mov ecx, eax         // draw all 4-pixel runs (dwords)
            rep movsd
            mov ecx, ebx         // draw remaining pixels ( <= 3 bytes )
            rep movsb

            add esi, src_y_inc   // move to next line
            add edi, dest_y_inc
            dec edx              // decrease line loop counter
            jmp draw_line
        done:
        }

    } else {
        // width is <= 3
        _asm {
            mov esi, psrc
            mov edi, pdest

            mov ebx, width_remainder // ebx = number of pixels to draw (bytes)

            mov edx, lines       // edx = line loop counter
            test edx, edx        // make sure there is >0 lines to draw
        draw_line:
            jz done              // if no more lines to draw, then we're done

            mov ecx, ebx         // draw pixels (bytes)
            rep movsb

            add esi, src_y_inc   // move to next line
            add edi, dest_y_inc
            dec edx              // decrease line loop counter
            jmp draw_line
        done:
        }
    }
}
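
As a sketch of what a test for this routine could compare against, here's a plain-C reference version of the same solid blit. The SURFACE layout (width, height, pixels) and the names ref_surface_pointer, ref_blit_region and surfaces_equal are assumptions made for this example, not libDGL's real definitions.

```c
#include <assert.h>
#include <string.h>

typedef unsigned char byte;

// Assumed minimal surface layout for this sketch; the real libDGL
// SURFACE struct may differ.
typedef struct {
    int width;
    int height;
    byte *pixels;
} SURFACE;

// Hypothetical stand-in for surface_pointer(): address of pixel (x, y).
static byte *ref_surface_pointer(const SURFACE *s, int x, int y) {
    return s->pixels + (y * s->width) + x;
}

// Reference solid blit: no clipping, no transparency, one memcpy per line.
static void ref_blit_region(const SURFACE *src, SURFACE *dest,
                            int src_x, int src_y,
                            int src_width, int src_height,
                            int dest_x, int dest_y) {
    const byte *psrc = ref_surface_pointer(src, src_x, src_y);
    byte *pdest = ref_surface_pointer(dest, dest_x, dest_y);
    int y;

    assert(src_width >= 0 && src_height >= 0);
    for (y = 0; y < src_height; ++y) {
        memcpy(pdest, psrc, (size_t)src_width);
        psrc += src->width;    // next source line
        pdest += dest->width;  // next destination line
    }
}

// Returns 1 if both surfaces have the same size and identical pixels.
static int surfaces_equal(const SURFACE *a, const SURFACE *b) {
    if (a->width != b->width || a->height != b->height) return 0;
    return memcmp(a->pixels, b->pixels, (size_t)(a->width * a->height)) == 0;
}
```

A test would then blit the same region with both routines onto two identically initialized destination surfaces, for widths covering all three scenarios (say 16, 13 and 3), and check surfaces_equal() on the results in both debug and release builds.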

Initial testing was good (when using debugging compiler options)! Then I switched to release optimizations and ran through more scenarios and noticed problems... eventually when I thought to look at the assembly output, I noticed this:

; void surface_blit_region_f(const SURFACE *src,
;                            SURFACE *dest,
;                            int src_x,
;                            int src_y,
;                            int src_width,
;                            int src_height,
;                            int dest_x,
;                            int dest_y) {
;     const byte *psrc;
;     byte *pdest;
;     int lines;
surface_blit_region_f_:
                push    esi
                push    edi
                push    ebp
                mov     ebp,esp
                sub     esp,0000001cH
L53:            mov     esi,ebx
                mov     ebx,ecx
                mov     ecx,dword ptr +10H[ebp]

;     int src_y_inc = src->width - src_width;
                mov     edi,dword ptr [eax]
                sub     edi,ecx
                mov     dword ptr -10H[ebp],edi

;     int dest_y_inc = dest->width - src_width;
;     int width_4, width_remainder;
; 
;     psrc = (const byte*)surface_pointer(src, src_x, src_y);
;     pdest = (byte*)surface_pointer(dest, dest_x, dest_y);
                mov     edi,dword ptr [edx]
                sub     edi,ecx
                mov     dword ptr -0cH[ebp],edi
L54:            imul    ebx,dword ptr [eax]
                mov     eax,dword ptr +8H[eax]
                add     esi,ebx
                add     eax,esi
                mov     dword ptr -1cH[ebp],eax
L55:            mov     ebx,dword ptr +1cH[ebp]
                mov     eax,dword ptr [edx]
                imul    eax,ebx
                mov     esi,dword ptr +18H[ebp]
                mov     edx,dword ptr +8H[edx]
                add     eax,esi
                add     eax,edx
                mov     dword ptr -18H[ebp],eax

;     lines = src_height;
; 
                mov     eax,dword ptr +14H[ebp]
                mov     dword ptr -14H[ebp],eax
                mov     edx,ecx

;     width_4 = src_width / 4;
                mov     eax,ecx
                sar     edx,1fH
                shl     edx,02H
                sbb     eax,edx
                sar     eax,02H
                mov     dword ptr -8H[ebp],eax

;     width_remainder = src_width & 3;
; 
                and     ecx,00000003H
                mov     dword ptr -4H[ebp],ecx

;     if (width_4 && !width_remainder) {
;         // width is a multiple of 4 (no remainder)
                mov     edi,dword ptr -8H[ebp]
L56:            test    edi,edi
                je      short L57
                cmp     dword ptr -4H[ebp],00000000H
                jne     short L57

;         _asm {
;             mov esi, psrc
;             mov edi, pdest
; 
;             mov eax, width_4     // eax = number of 4-pixel runs (dwords)
; 
;             mov edx, lines       // edx = line loop counter
;             test edx, edx        // make sure there is >0 lines to draw
;         draw_line:
;             jz done              // if no more lines to draw, then we're done
; 
;             mov ecx, eax         // draw all 4-pixel runs (dwords)
;             rep movsd
; 
;             add esi, src_y_inc   // move to next line
;             add edi, dest_y_inc
;             dec edx              // decrease line loop counter
;             jmp draw_line
;         done:
;         }
; 
                mov     esi,dword ptr -1cH[ebp]
                mov     edi,dword ptr -18H[ebp]
                mov     eax,dword ptr -8H[ebp]
                mov     edx,dword ptr -14H[ebp]
                test    edx,edx
                je      short L58
                mov     ecx,eax
                repe    movsd    
                add     esi,dword ptr -10H[ebp]

;     } else if (width_4 && width_remainder) {
;         // width is >= 4 and there is a remainder ( <= 3 )
                jmp     short L63
L57:            cmp     dword ptr -8H[ebp],00000000H
                je      short L61
                DB      83H,7dH,0fcH,00H
                je      short L61

;         _asm {
;             mov esi, psrc
;             mov edi, pdest
; 
;             mov eax, width_4         // eax = number of 4-pixel runs (dwords)
;             mov ebx, width_remainder // ebx = remaining number of pixels
; 
;             mov edx, lines       // edx = line loop counter
;             test edx, edx        // make sure there is >0 lines to draw
;         draw_line:
;             jz done              // if no more lines to draw, then we're done
; 
;             mov ecx, eax         // draw all 4-pixel runs (dwords)
;             rep movsd
;             mov ecx, ebx         // draw remaining pixels ( <= 3 bytes )
;             rep movsb
; 
;             add esi, src_y_inc   // move to next line
;             add edi, dest_y_inc
;             dec edx              // decrease line loop counter
;             jmp draw_line
;         done:
;         }
; 
                mov     esi,dword ptr -1cH[ebp]
                mov     edi,dword ptr -18H[ebp]
                mov     eax,dword ptr -8H[ebp]
                mov     ebx,dword ptr -4H[ebp]
                mov     edx,dword ptr -14H[ebp]
                test    edx,edx
L59:            je      short L60
                mov     ecx,eax
                repe    movsd    
                mov     ecx,ebx
                repe    movsb    
                add     esi,dword ptr -10H[ebp]
                add     edi,dword ptr -0cH[ebp]
                dec     edx
                jmp     short L59

;     } else {
;         // width is <= 3
L60:            jmp     short L64

;         _asm {
;             mov esi, psrc
;             mov edi, pdest
; 
;             mov eax, width_remainder // ebx = number of pixels to draw (bytes)
; 
;             mov edx, lines       // edx = line loop counter
;             test edx, edx        // make sure there is >0 lines to draw
;         draw_line:
;             jz done              // if no more lines to draw, then we're done
; 
;             mov ecx, ebx         // draw pixels (bytes)
;             rep movsb
; 
;             add esi, src_y_inc   // move to next line
;             add edi, dest_y_inc
;             dec edx              // decrease line loop counter
;             jmp draw_line
;         done:
;         }
;         }
L61:            mov     esi,dword ptr -1cH[ebp]
                mov     edi,dword ptr -18H[ebp]
                mov     eax,dword ptr -4H[ebp]
                mov     edx,dword ptr -14H[ebp]
                test    edx,edx
L62:            je      short L64
                mov     ecx,ebx
                repe    movsb    
                add     esi,dword ptr -10H[ebp]
L63:            add     edi,dword ptr -0cH[ebp]
                dec     edx
                jmp     short L62

; }
; 
L64:            mov     esp,ebp
                pop     ebp
                pop     edi
                pop     esi
                ret     0010H

At first glance, this may look fine. But look at the code for the first scenario blit within the width_4 && !width_remainder condition:

                mov     esi,dword ptr -1cH[ebp]
                mov     edi,dword ptr -18H[ebp]
                mov     eax,dword ptr -8H[ebp]
                mov     edx,dword ptr -14H[ebp]
                test    edx,edx
                je      short L58
                mov     ecx,eax
                repe    movsd    
                add     esi,dword ptr -10H[ebp]

;     } else if (width_4 && width_remainder) {
;         // width is >= 4 and there is a remainder ( <= 3 )
                jmp     short L63
L57:            cmp     dword ptr -8H[ebp],00000000H
                je      short L61
                DB      83H,7dH,0fcH,00H
                je      short L61

Uhh, what? The compiler just appeared to have chopped off the last bit of the blit loop and then headed on to the following else if. Well, what's at the L63 label that it jumps to...

;         _asm {
;             mov esi, psrc
;             mov edi, pdest
; 
;             mov eax, width_remainder // ebx = number of pixels to draw (bytes)
; 
;             mov edx, lines       // edx = line loop counter
;             test edx, edx        // make sure there is >0 lines to draw
;         draw_line:
;             jz done              // if no more lines to draw, then we're done
; 
;             mov ecx, ebx         // draw pixels (bytes)
;             rep movsb
; 
;             add esi, src_y_inc   // move to next line
;             add edi, dest_y_inc
;             dec edx              // decrease line loop counter
;             jmp draw_line
;         done:
;         }
;         }
L61:            mov     esi,dword ptr -1cH[ebp]
                mov     edi,dword ptr -18H[ebp]
                mov     eax,dword ptr -4H[ebp]
                mov     edx,dword ptr -14H[ebp]
                test    edx,edx
L62:            je      short L64
                mov     ecx,ebx
                repe    movsb    
                add     esi,dword ptr -10H[ebp]
L63:            add     edi,dword ptr -0cH[ebp]
                dec     edx
                jmp     short L62

Huh. It jumps into the end of the third blit scenario's loop. Of course, then it jumps back to L62 and continues running the wrong blit from that point on.

So at this point the obvious conclusion was that the compiler got mixed up because I had three _asm blocks with identical labels. Reusing labels like this was actually something I had done elsewhere with no problems, as Watcom seems able to make labels within any _asm block unique to that block only. But what I was seeing here seemed to indicate that this was maybe not a bulletproof feature. So I changed all the labels to be uniquely named and noticed no change whatsoever! Huh?

A short while later I just decided to try adding nops at random places. Confusingly enough, that seemed to do the trick:

;         _asm {
;             mov esi, psrc
;             mov edi, pdest
; 
;             mov eax, width_4     // eax = number of 4-pixel runs (dwords)
; 
;             mov edx, lines       // edx = line loop counter
;             test edx, edx        // make sure there is >0 lines to draw
;         draw_line:
;             jz done              // if no more lines to draw, then we're done
; 
;             mov ecx, eax         // draw all 4-pixel runs (dwords)
;             rep movsd
; 
;             nop
;             add esi, src_y_inc   // move to next line
;             add edi, dest_y_inc
;             dec edx              // decrease line loop counter
;             jmp draw_line
;         done:
;         }
; 
                mov     esi,dword ptr -1cH[ebp]
                mov     edi,dword ptr -18H[ebp]
                mov     eax,dword ptr -8H[ebp]
                mov     edx,dword ptr -14H[ebp]
                test    edx,edx
L57:            je      short L58
                mov     ecx,eax
                repe    movsd    
                nop     
                add     esi,dword ptr -10H[ebp]
                add     edi,dword ptr -0cH[ebp]
                dec     edx
                jmp     short L57

I experimented with the placement of the nop a little more and it can be pretty much anywhere after L57 as shown above (and before the final jmp of course) and it "fixes" the problem. What the heck?

Another thing I tried was rearranging where I do the jz (or je):

;         _asm {
;             mov esi, psrc
;             mov edi, pdest
; 
;             mov ebx, width_4     // eax = number of 4-pixel runs (dwords)
; 
;             mov edx, lines       // edx = line loop counter
;             test edx, edx        // make sure there is >0 lines to draw
;             jz done              // if no more lines to draw, then we're done
;         draw_line:
;             mov ecx, ebx         // draw all 4-pixel runs (dwords)
;             rep movsd
; 
;             add esi, src_y_inc   // move to next line
;             add edi, dest_y_inc
;             dec edx              // decrease line loop counter
;             jz done              // if no more lines to draw, then we're done
;             jmp draw_line
;         done:
;         }
; 
                mov     esi,dword ptr -1cH[ebp]
                mov     edi,dword ptr -18H[ebp]
                mov     ebx,dword ptr -8H[ebp]
                mov     edx,dword ptr -14H[ebp]
                test    edx,edx
                je      short L58
L57:            mov     ecx,ebx
                repe    movsd    
                add     esi,dword ptr -10H[ebp]
                add     edi,dword ptr -0cH[ebp]
                dec     edx
                je      short L58
                jmp     short L57

So that solves the problem too.
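
In C terms, the two loop shapes are a top-tested while loop (the original, where draw_line sits on the jz) and a guarded, bottom-tested do/while (the rearranged version). Here's a rough sketch of the equivalence, with made-up function names and a simple counter standing in for the actual copying:

```c
#include <assert.h>

// Original shape: the loop label sits on the conditional jump, so the
// test happens at the top of every iteration.
static int lines_drawn_top_tested(int lines) {
    int drawn = 0;
    assert(lines >= 0);
    while (lines != 0) {   // draw_line: jz done
        ++drawn;           // (copy one line here)
        --lines;           // dec edx / jmp draw_line
    }
    return drawn;
}

// Rearranged shape: test once up front, then test again after each
// decrement at the bottom of the loop.
static int lines_drawn_bottom_tested(int lines) {
    int drawn = 0;
    assert(lines >= 0);
    if (lines != 0) {            // test edx, edx / jz done
        do {                     // draw_line:
            ++drawn;             // (copy one line here)
        } while (--lines != 0);  // dec edx / jz done / jmp draw_line
    }
    return drawn;
}
```

Both visit exactly the same number of lines for any non-negative count, so the rearrangement shouldn't change behaviour. It only changes what the compiler sees around the labels.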

Of course, I don't like this at all. Why does this particular piece of code cause the compiler to mess up like this? I fear I may never know! One last thing I wanted to try was the Watcom-recommended approach to inline assembly: using #pragma aux instead of _asm. This was improved in 11.0 to also let you refer to your C variables, just as I was doing here with _asm. Of course, the syntax is far uglier, but it does have the added benefit of allowing the compiler to stitch together your assembly with the surrounding code a bit better:

void surface_blit_region_f(const SURFACE *src,
                           SURFACE *dest,
                           int src_x,
                           int src_y,
                           int src_width,
                           int src_height,
                           int dest_x,
                           int dest_y) {
    const byte *psrc;
    byte *pdest;
    int lines;
    int src_y_inc = src->width - src_width;
    int dest_y_inc = dest->width - src_width;
    int width_4, width_remainder;

    psrc = (const byte*)surface_pointer(src, src_x, src_y);
    pdest = (byte*)surface_pointer(dest, dest_x, dest_y);
    lines = src_height;

    width_4 = src_width / 4;
    width_remainder = src_width & 3;

    if (width_4 && !width_remainder) {
        // width is a multiple of 4 (no remainder)
        extern void _inner_blit4(byte *dest, const byte *src, int width4, int lines);
        #pragma aux _inner_blit4 =       \
            "    test edx, edx"          \
            "draw_line:"                 \
            "    jz done"                \
            ""                           \
            "    mov ecx, eax"           \
            "    rep movsd"              \
            ""                           \
            "    add esi, src_y_inc"     \
            "    add edi, dest_y_inc"    \
            "    dec edx"                \
            "    jmp draw_line"          \
            "done:"                      \
            parm [edi] [esi] [eax] [edx] \
            modify [ecx];

        _inner_blit4(pdest, psrc, width_4, lines);

    } else if (width_4 && width_remainder) {
        // width is >= 4 and there is a remainder ( <= 3 )
        extern void _inner_blit4r(byte *dest, const byte *src, int width4, int remainder, int lines);
        #pragma aux _inner_blit4r =            \
            "    test edx, edx"                \
            "draw_line:"                       \
            "    jz done"                      \
            ""                                 \
            "    mov ecx, eax"                 \
            "    rep movsd"                    \
            "    mov ecx, ebx"                 \
            "    rep movsb"                    \
            ""                                 \
            "    add esi, src_y_inc"           \
            "    add edi, dest_y_inc"          \
            "    dec edx"                      \
            "    jmp draw_line"                \
            "done:"                            \
            parm [edi] [esi] [eax] [ebx] [edx] \
            modify [ecx];

        _inner_blit4r(pdest, psrc, width_4, width_remainder, lines);

    } else {
        // width is <= 3
        extern void _inner_blitb(byte *dest, const byte *src, int width, int lines);
        #pragma aux _inner_blitb =       \
            "    test edx, edx"          \
            "draw_line:"                 \
            "    jz done"                \
            ""                           \
            "    mov ecx, ebx"           \
            "    rep movsb"              \
            ""                           \
            "    add esi, src_y_inc"     \
            "    add edi, dest_y_inc"    \
            "    dec edx"                \
            "    jmp draw_line"          \
            "done:"                      \
            parm [edi] [esi] [ebx] [edx] \
            modify [ecx];

        _inner_blitb(pdest, psrc, width_remainder, lines);
    }
}

And the relevant compiler generated output:

;     if (width_4 && !width_remainder) {
;         // width is a multiple of 4 (no remainder)
;         extern void _inner_blit4(byte *dest, const byte *src, int width4, int lines);
;         #pragma aux _inner_blit4 =       \
;             "    test edx, edx"          \
;             "draw_line:"                 \
;             "    jz done"                \
;             ""                           \
;             "    mov ecx, eax"           \
;             "    rep movsd"              \
;             ""                           \
;             "    add esi, src_y_inc"     \
;             "    add edi, dest_y_inc"    \
;             "    dec edx"                \
;             "    jmp draw_line"          \
;             "done:"                      \
;             parm [edi] [esi] [eax] [edx] \
;             modify [ecx];
; 
L56:            test    eax,eax
                je      short L57
                test    ebx,ebx
                jne     short L57

;         _inner_blit4(pdest, psrc, width_4, lines);
; 
                mov     edx,edi
                mov     edi,ecx
                test    edx,edx
                je      short L58
                mov     ecx,eax
                repe    movsd    
                add     esi,dword ptr -8H[ebp]

;     } else if (width_4 && width_remainder) {
;         // width is >= 4 and there is a remainder ( <= 3 )
;         extern void _inner_blit4r(byte *dest, const byte *src, int width4, int remainder, int lines);
;         #pragma aux _inner_blit4r =            \
;             "    test edx, edx"                \
;             "draw_line:"                       \
;             "    jz done"                      \
;             ""                                 \
;             "    mov ecx, eax"                 \
;             "    rep movsd"                    \
;             "    mov ecx, ebx"                 \
;             "    rep movsb"                    \
;             ""                                 \
;             "    add esi, src_y_inc"           \
;             "    add edi, dest_y_inc"          \
;             "    dec edx"                      \
;             "    jmp draw_line"                \
;             "done:"                            \
;             parm [edi] [esi] [eax] [ebx] [edx] \
;             modify [ecx];
; 
                jmp     short L63
L57:            test    eax,eax
                je      short L61
                test    ebx,ebx
                DB      74H,21H

So this still results in the same bug. Bleh.

Just to take another opportunity to dump a bunch more code in this post, here's my surface_blit_sprite_region_f function, which is basically the same idea as surface_blit_region_f, except that, as its name suggests, it deals with transparency and skips over source pixels that are colour zero. But the same idea of splitting it up into three separate blit loops for the different scenarios outlined above is still there, complete with three separate _asm blocks:

void surface_blit_sprite_region_f(const SURFACE *src,
                                  SURFACE *dest,
                                  int src_x,
                                  int src_y,
                                  int src_width,
                                  int src_height,
                                  int dest_x,
                                  int dest_y) {
    const byte *psrc;
    byte *pdest;
    byte pixel;
    int src_y_inc, dest_y_inc;
    int width, width_4, width_remainder;
    int lines_left;
    int x;

    psrc = (const byte*)surface_pointer(src, src_x, src_y);
    src_y_inc = src->width;
    pdest = (byte*)surface_pointer(dest, dest_x, dest_y);
    dest_y_inc = dest->width;
    width = src_width;
    lines_left = src_height;
    src_y_inc -= width;
    dest_y_inc -= width;

    width_4 = width / 4;
    width_remainder = width & 3;

    if (width_4 && !width_remainder) {
        // width is a multiple of 4 (no remainder)
        _asm {
            mov esi, psrc
            mov edi, pdest

            mov ebx, width_4      // get number of 4-pixel runs to be drawn
            mov ecx, lines_left
            test ecx, ecx         // make sure there is >0 lines to be drawn
draw_line:
            jz done

start_4_run:
            mov edx, ebx          // edx = counter of 4-pixel runs left to draw
draw_px_0:
            mov al, [esi]+0       // load src pixel
            test al, al
            jz draw_px_1          // if it is color 0, skip it
            mov [edi]+0, al       // otherwise, draw it onto dest
draw_px_1:
            mov al, [esi]+1
            test al, al
            jz draw_px_2
            mov [edi]+1, al
draw_px_2:
            mov al, [esi]+2
            test al, al
            jz draw_px_3
            mov [edi]+2, al
draw_px_3:
            mov al, [esi]+3
            test al, al
            jz end_4_run
            mov [edi]+3, al
end_4_run:
            add esi, 4            // move src and dest up 4 pixels
            add edi, 4
            dec edx               // decrease 4-pixel run loop counter
            jnz draw_px_0         // if there are still more runs, draw them

end_line:
            add esi, src_y_inc    // move src and dest to start of next line
            add edi, dest_y_inc
            dec ecx               // decrease line loop counter
            jmp draw_line
done:
        }


    } else if (width_4 && width_remainder) {
        // width is >= 4 and there is a remainder ( <= 3 )
        _asm {
            mov esi, psrc
            mov edi, pdest

            mov ebx, width_4      // get number of 4-pixel runs to be drawn
            mov ecx, lines_left
            test ecx, ecx         // make sure there is >0 lines to be drawn
draw_line:
            jz done

            test ebx, ebx
            jz start_remainder_run // if no 4-pixel runs, just draw remainder

start_4_run:                      // draw 4-pixel runs first
            mov edx, ebx          // edx = counter of 4-pixel runs left to draw
draw_px_0:
            mov al, [esi]+0       // load src pixel
            test al, al
            jz draw_px_1          // if it is color 0, skip it
            mov [edi]+0, al       // otherwise, draw it onto dest
draw_px_1:
            mov al, [esi]+1
            test al, al
            jz draw_px_2
            mov [edi]+1, al
draw_px_2:
            mov al, [esi]+2
            test al, al
            jz draw_px_3
            mov [edi]+2, al
draw_px_3:
            mov al, [esi]+3
            test al, al
            jz end_4_run
            mov [edi]+3, al
end_4_run:
            add esi, 4            // move src and dest up 4 pixels
            add edi, 4
            dec edx               // decrease 4-pixel run loop counter
            jnz draw_px_0         // if there are still more runs, draw them

start_remainder_run:              // now draw remaining pixels ( <= 3 pixels )
            mov edx, width_remainder // edx = counter of remaining pixels
            test edx, edx
            jz end_line           // if no remaining pixels, goto line end

draw_pixel:
            mov al, [esi]         // load pixel
            inc esi
            test al, al           // if zero, skip to next pixel
            jz end_pixel
            mov [edi], al         // else, draw pixel
end_pixel:
            inc edi
            dec edx
            jz end_line           // loop while (x)
            jmp draw_pixel

end_line:
            add esi, src_y_inc    // move src and dest to start of next line
            add edi, dest_y_inc
            dec ecx               // decrease line loop counter
            jmp draw_line
done:
        }

    } else {
        // width is <= 3
        _asm {
            mov esi, psrc
            mov edi, pdest

            mov ebx, width        // get number of pixels to be drawn
            mov ecx, lines_left
            test ecx, ecx         // make sure there is >0 lines to be drawn
draw_line:
            jz done

            mov edx, ebx          // dx = counter of remaining pixels
draw_pixel:
            mov al, [esi]         // load pixel
            inc esi
            test al, al           // if zero, skip to next pixel
            jz end_pixel
            mov [edi], al         // else, draw pixel
end_pixel:
            inc edi
            dec edx
            jz end_line           // loop while (x)
            jmp draw_pixel

end_line:
            add esi, src_y_inc    // move src and dest to start of next line
            add edi, dest_y_inc
            dec ecx               // decrease line loop counter
            jmp draw_line
done:
        }
    }
}

And the code that the compiler generated:

; void surface_blit_sprite_region_f(const SURFACE *src,
;                                   SURFACE *dest,
;                                   int src_x,
;                                   int src_y,
;                                   int src_width,
;                                   int src_height,
;                                   int dest_x,
;                                   int dest_y) {
;     const byte *psrc;
;     byte *pdest;
;     byte pixel;
;     int src_y_inc, dest_y_inc;
;     int width, width_4, width_remainder;
;     int lines_left;
;     int x;
; 
;     psrc = (const byte*)surface_pointer(src, src_x, src_y);
surface_blit_sprite_region_f_:
                push    esi
                push    edi
                push    ebp
                mov     ebp,esp
                sub     esp,00000020H
L69:            mov     esi,dword ptr [eax]
                imul    ecx,esi
                add     ebx,ecx
                mov     ecx,dword ptr +8H[eax]
                add     ecx,ebx
                mov     dword ptr -20H[ebp],ecx

;     src_y_inc = src->width;
;     pdest = (byte*)surface_pointer(dest, dest_x, dest_y);
                mov     dword ptr -18H[ebp],esi
L70:            mov     edi,dword ptr +1cH[ebp]
                mov     eax,dword ptr [edx]
                imul    eax,edi
                add     eax,dword ptr +18H[ebp]
                mov     ecx,dword ptr +8H[edx]
                add     eax,ecx
                mov     dword ptr -1cH[ebp],eax

;     dest_y_inc = dest->width;
                mov     eax,dword ptr [edx]
                mov     dword ptr -14H[ebp],eax

;     width = src_width;
                mov     eax,dword ptr +10H[ebp]
                mov     dword ptr -10H[ebp],eax

;     lines_left = src_height;
                mov     eax,dword ptr +14H[ebp]
                mov     dword ptr -4H[ebp],eax

;     src_y_inc -= width;
                mov     eax,dword ptr -10H[ebp]
                sub     dword ptr -18H[ebp],eax

;     dest_y_inc -= width;
; 
                mov     eax,dword ptr -10H[ebp]
                sub     dword ptr -14H[ebp],eax

;     width_4 = width / 4;
                mov     eax,dword ptr -10H[ebp]
                mov     edx,dword ptr -10H[ebp]
                sar     edx,1fH
                shl     edx,02H
                sbb     eax,edx
                sar     eax,02H
                mov     dword ptr -0cH[ebp],eax

;     width_remainder = width & 3;
; 
                mov     eax,dword ptr -10H[ebp]
                and     eax,00000003H
                mov     dword ptr -8H[ebp],eax

;     if (width_4 && !width_remainder) {
;         // width is a multiple of 4 (no remainder)
                mov     edi,dword ptr -0cH[ebp]
L71:            test    edi,edi
                je      short L79
                cmp     dword ptr -8H[ebp],00000000H
                jne     short L79

;         _asm {
;             mov esi, psrc
;             mov edi, pdest
; 
;             mov ebx, width_4      // get number of 4-pixel runs to be drawn
;             mov ecx, lines_left
;             test ecx, ecx         // make sure there is >0 lines to be drawn
; draw_line:
;             jz done
; 
; start_4_run:
;             mov edx, ebx          // dx = counter of 4-pixel runs left to draw
; draw_px_0:
;             mov al, [esi]+0       // load src pixel
;             test al, al
;             jz draw_px_1          // if it is color 0, skip it
;             mov [edi]+0, al       // otherwise, draw it onto dest
; draw_px_1:
;             mov al, [esi]+1
;             test al, al
;             jz draw_px_2
;             mov [edi]+1, al
; draw_px_2:
;             mov al, [esi]+2
;             test al, al
;             jz draw_px_3
;             mov [edi]+2, al
; draw_px_3:
;             mov al, [esi]+3
;             test al, al
;             jz end_4_run
;             mov [edi]+3, al
; end_4_run:
;             add esi, 4            // move src and dest up 4 pixels
;             add edi, 4
;             dec edx               // decrease 4-pixel run loop counter
;             jnz draw_px_0         // if there are still more runs, draw them
; 
; end_line:
;             add esi, src_y_inc    // move src and dest to start of next line
;             add edi, dest_y_inc
;             dec ecx               // decrease line loop counter
;             jmp draw_line
; done:
;         }
; 
; 
                mov     esi,dword ptr -20H[ebp]
                mov     edi,dword ptr -1cH[ebp]
                mov     ebx,dword ptr -0cH[ebp]
                mov     ecx,dword ptr -4H[ebp]
                test    ecx,ecx
L72:            je      short L78
                mov     edx,ebx
L73:            mov     al,byte ptr [esi]
                test    al,al
                je      short L74
                mov     byte ptr [edi],al
L74:            mov     al,byte ptr +1H[esi]
                test    al,al
                je      short L75
                mov     byte ptr +1H[edi],al
L75:            mov     al,byte ptr +2H[esi]
                test    al,al
                je      short L76
                mov     byte ptr +2H[edi],al
L76:            mov     al,byte ptr +3H[esi]
                test    al,al
                je      short L77
                mov     byte ptr +3H[edi],al
L77:            add     esi,00000004H
                add     edi,00000004H
                dec     edx
                jne     short L73
                add     esi,dword ptr -18H[ebp]
                add     edi,dword ptr -14H[ebp]
                dec     ecx
                jmp     short L72

;     } else if (width_4 && width_remainder) {
;         // width is >= 4 and there is a remainder ( <= 3 )
L78:            jmp     near ptr L96
L79:            cmp     dword ptr -0cH[ebp],00000000H
                je      near ptr L91
                cmp     dword ptr -8H[ebp],00000000H
                je      near ptr L91

;         _asm {
;             mov esi, psrc
;             mov edi, pdest
; 
;             mov ebx, width_4      // get number of 4-pixel runs to be drawn
;             mov ecx, lines_left
;             test ecx, ecx         // make sure there is >0 lines to be drawn
; draw_line:
;             jz done
; 
;             test ebx, ebx
;             jz start_remainder_run // if no 4-pixel runs, just draw remainder
; 
; start_4_run:                      // draw 4-pixel runs first
;             mov edx, ebx          // dx = counter of 4-pixel runs left to draw
; draw_px_0:
;             mov al, [esi]+0       // load src pixel
;             test al, al
;             jz draw_px_1          // if it is color 0, skip it
;             mov [edi]+0, al       // otherwise, draw it onto dest
; draw_px_1:
;             mov al, [esi]+1
;             test al, al
;             jz draw_px_2
;             mov [edi]+1, al
; draw_px_2:
;             mov al, [esi]+2
;             test al, al
;             jz draw_px_3
;             mov [edi]+2, al
; draw_px_3:
;             mov al, [esi]+3
;             test al, al
;             jz end_4_run
;             mov [edi]+3, al
; end_4_run:
;             add esi, 4            // move src and dest up 4 pixels
;             add edi, 4
;             dec edx               // decrease 4-pixel run loop counter
;             jnz draw_px_0         // if there are still more runs, draw them
; 
; start_remainder_run:              // now draw remaining pixels ( <= 3 pixels )
;             mov edx, width_remainder // dx = counter of remaining pixels
;             test edx, edx
;             jz end_line           // if no remaining pixels, goto line end
; 
; draw_pixel:
;             mov al, [esi]         // load pixel
;             inc esi
;             test al, al           // if zero, skip to next pixel
;             jz end_pixel
;             mov [edi], al         // else, draw pixel
; end_pixel:
;             inc edi
;             dec edx
;             jz end_line           // loop while (x)
;             jmp draw_pixel
; 
; end_line:
;             add esi, src_y_inc    // move src and dest to start of next line
;             add edi, dest_y_inc
;             dec ecx               // decrease line loop counter
;             jmp draw_line
; done:
;         }
; 
;     } else {
;         // width is <= 3
                mov     esi,dword ptr -20H[ebp]
                mov     edi,dword ptr -1cH[ebp]
                mov     ebx,dword ptr -0cH[ebp]
                mov     ecx,dword ptr -4H[ebp]
                test    ecx,ecx
L80:            je      short L90
                test    ebx,ebx
                je      short L86
                mov     edx,ebx
L81:            mov     al,byte ptr [esi]
                test    al,al
                je      short L82
                mov     byte ptr [edi],al
L82:            mov     al,byte ptr +1H[esi]
                test    al,al
                je      short L83
                mov     byte ptr +1H[edi],al
L83:            mov     al,byte ptr +2H[esi]
                test    al,al
                je      short L84
                mov     byte ptr +2H[edi],al
L84:            mov     al,byte ptr +3H[esi]
                test    al,al
                je      short L85
                mov     byte ptr +3H[edi],al
L85:            add     esi,00000004H
                add     edi,00000004H
                dec     edx
                jne     short L81
L86:            mov     edx,dword ptr -8H[ebp]
                test    edx,edx
                je      short L89
L87:            mov     al,byte ptr [esi]
                inc     esi
                test    al,al
                je      short L88
                mov     byte ptr [edi],al
L88:            inc     edi
                dec     edx
                je      short L89
                jmp     short L87
L89:            add     esi,dword ptr -18H[ebp]
                add     edi,dword ptr -14H[ebp]
                dec     ecx
                jmp     short L80
L90:            mov     esp,ebp
                pop     ebp
                pop     edi
                pop     esi
                ret     0010H

;         _asm {
;             mov esi, psrc
;             mov edi, pdest
; 
;             mov ebx, width        // get number of pixels to be drawn
;             mov ecx, lines_left
;             test ecx, ecx         // make sure there is >0 lines to be drawn
; draw_line:
;             jz done
; 
;             mov edx, ebx          // dx = counter of remaining pixels
; draw_pixel:
;             mov al, [esi]         // load pixel
;             inc esi
;             test al, al           // if zero, skip to next pixel
;             jz end_pixel
;             mov [edi], al         // else, draw pixel
; end_pixel:
;             inc edi
;             dec edx
;             jz end_line           // loop while (x)
;             jmp draw_pixel
; 
; end_line:
;             add esi, src_y_inc    // move src and dest to start of next line
;             add edi, dest_y_inc
;             dec ecx               // decrease line loop counter
;             jmp draw_line
; done:
;         }
;     }
L91:            mov     esi,dword ptr -20H[ebp]
                mov     edi,dword ptr -1cH[ebp]
                mov     ebx,dword ptr -10H[ebp]
                mov     ecx,dword ptr -4H[ebp]
                test    ecx,ecx
L92:            je      short L96
                mov     edx,ebx
L93:            mov     al,byte ptr [esi]
                inc     esi
                test    al,al
                je      short L94
                mov     byte ptr [edi],al
L94:            inc     edi
                dec     edx
                je      short L95
                jmp     short L93
L95:            add     esi,dword ptr -18H[ebp]
                add     edi,dword ptr -14H[ebp]
                dec     ecx
                jmp     short L92

; }
; 
L96:            mov     esp,ebp
                pop     ebp
                pop     edi
                pop     esi
                ret     0010H

surface_blit_sprite_region_f passes the exact same set of tests (under both debug and release optimization settings) that I ran surface_blit_region_f through, with no problems at all. Aarrghh!
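
To give a rough idea of the kind of check involved, here is a minimal sketch in plain C. It is not the actual libDGL test suite: the function names (ref_blit_sprite, unrolled_blit_sprite, count_blit_mismatches), the raw-buffer-plus-pitch parameters and the test pattern are all made up for illustration. The idea is to blit the same source with a dead-simple per-pixel reference and with a version structured like the assembly above (4-pixel runs, then the remainder), and compare the two destination buffers for every width, so that all three width branches get exercised.

```c
#include <assert.h>
#include <string.h>

typedef unsigned char byte;

/* Reference version (hypothetical helper): copy each non-zero source
   pixel one at a time. Color 0 is treated as transparent. */
static void ref_blit_sprite(const byte *src, int src_pitch,
                            byte *dest, int dest_pitch,
                            int width, int height) {
    int x, y;
    assert(width >= 0 && height >= 0);
    for (y = 0; y < height; ++y) {
        for (x = 0; x < width; ++x) {
            byte pixel = src[y * src_pitch + x];
            if (pixel)
                dest[y * dest_pitch + x] = pixel;
        }
    }
}

/* Version under test (hypothetical helper): same shape as the assembly,
   4-pixel runs first, then up to 3 remaining pixels, then skip to the
   start of the next line. */
static void unrolled_blit_sprite(const byte *src, int src_pitch,
                                 byte *dest, int dest_pitch,
                                 int width, int height) {
    int width_4 = width / 4;
    int width_remainder = width & 3;
    int src_y_inc = src_pitch - width;
    int dest_y_inc = dest_pitch - width;
    int runs, rest;
    assert(width >= 0 && height >= 0);
    while (height-- > 0) {
        for (runs = width_4; runs > 0; --runs) {
            if (src[0]) dest[0] = src[0];
            if (src[1]) dest[1] = src[1];
            if (src[2]) dest[2] = src[2];
            if (src[3]) dest[3] = src[3];
            src += 4;
            dest += 4;
        }
        for (rest = width_remainder; rest > 0; --rest) {
            if (*src) *dest = *src;
            ++src;
            ++dest;
        }
        src += src_y_inc;
        dest += dest_y_inc;
    }
}

/* Runs both versions for every width from 0 to PITCH and returns how
   many widths produced differing destination buffers. */
int count_blit_mismatches(void) {
    enum { PITCH = 16, ROWS = 5 };
    byte src[PITCH * ROWS];
    byte expected[PITCH * ROWS];
    byte actual[PITCH * ROWS];
    int i, width, mismatches = 0;

    /* every third source pixel is transparent (0), the rest are non-zero */
    for (i = 0; i < PITCH * ROWS; ++i)
        src[i] = (byte)((i % 3 == 0) ? 0 : (i % 250) + 1);

    for (width = 0; width <= PITCH; ++width) {
        /* 0xAA background makes skipped (transparent) pixels visible */
        memset(expected, 0xAA, sizeof expected);
        memset(actual, 0xAA, sizeof actual);
        ref_blit_sprite(src, PITCH, expected, PITCH, width, ROWS);
        unrolled_blit_sprite(src, PITCH, actual, PITCH, width, ROWS);
        if (memcmp(expected, actual, sizeof expected) != 0)
            ++mismatches;
    }
    return mismatches;
}
```

A test runner would call count_blit_mismatches() and treat any non-zero result as a failure. Filling the destination with a known background value before each blit matters here, since that is the only way to notice a transparent pixel being written when it should have been skipped.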

So what is the takeaway? Well, I won't be relying on inline assembly anymore. Realistically, I wasn't planning on having gobs of assembly in libDGL anyway, just some inner loops and such that I wanted to make sure were running as lean as possible. Instead of writing it all inline, I'll be moving it out to separate assembly source files and assembling those with either WASM or TASM.