Using Watcom's Register-based Calling Convention With TASM

I suppose I'm writing this post for my own benefit primarily. I'll likely forget many of these details in a month, and then go and try to write a bunch more assembly and run into problems. So I'll try to proactively solve that future problem for myself. Everything here is better documented in the compiler documentation. However, it is scattered around a bit and of course isn't written with specific examples for using TASM.

One of the performance benefits that Watcom brought with it that was a pretty big deal at the time was that it's default calling convention used registers for up to the first 4 arguments to called functions. Past that, and the stack would be used as per standard C calling conventions.

As mentioned this calling convention is the default, but it can be globally changed via the CPU instruction code generation compiler switch. For example, /3 and /3r both select 386 instructions with register-based calling convention, while /3s selects 386 instructions with stack-based calling convention.

Borland Turbo Assembler (TASM) does not natively support this register-based calling convention among it's varied support for programming-language specific calling conventions. However it does let you use it's "NOLANGUAGE" option (which is the default if no language is specified) and then you can handle all the details yourself.

ideal

p386  
model flat  
codeseg

locals

public add_numbers_

; int add_numbers(int a, int b)
; inputs:
;   eax = a
;   edx = b
; return:
;   eax
proc add_numbers_ near  
    push ebp
    mov ebp, esp

    add eax, edx

    pop ebp
    ret
    endp

end  

This is pretty normal looking TASM. Complete with normal looking assembly prologue and epilogue code. Note that we are intentionally not specifying a language modifier.

So, first off, add_numbers_ has a trailing underscore to match what Watcom expects by default. If you don't like this for whatever reason, you can change the name here to your liking, but the use of a #pragma in your C code is necessary to inform Watcom about the different naming convention for this function.

Second, via the magic of the register-based calling convention, Watcom will have our two number arguments all ready for us in eax and edx. Our return value is assumed to be in eax, and that is correct in our case so we're all good.

The great thing is, we don't actually need to do anything fancy to call this function from our C code.

// prototype
int add_numbers(int a, int b);

// usage
int result;  
result = add_numbers(10, 20);  

But that was the simple case.

This register-based calling convention actually places the burden on the called function to clean things up before returning. This includes preserving some register values as well. According to the documentation: "All used 80x86 registers must be saved on entry and restored on exit except those used to pass arguments and return values." So, in our add_numbers_ function if we had wanted to use ecx, we would need to push and pop it during the prologue and epilogue code. But we didn't need to do so for eax and edx because those were used to pass arguments and return a value.

As mentioned previously, the stack gets used for arguments once all the registers have been used for arguments (by default, eax, edx, ebx, ecx in that order). In this case, the called function is responsible for popping them off the stack when it returns. So, if there were two int arguments that were passed on the stack, we would need to do a ret 8 to return.

; For this function, using the default register calling convention, the first 4 arguments
; will be passed in registers eax, edx, ebx and ecx. The last two will be passed on the stack.

; void direct_blit_4(int width4,
;                    int lines,
;                    byte *dest,
;                    byte *src,
;                    int dest_y_inc,
;                    int src_y_inc);
proc direct_blit_4_ near  
arg @@dest_y_inc:dword, @@src_y_inc:dword  
    push ebp
    mov ebp, esp  ; don't try to be clever and move this elsewhere!
    push edi      ; likewise, don't try to group the push's all together!
    push esi

    ; code here (that also modifies edi and esi, thus the additional pushs/pops)

    pop esi
    pop edi
    pop ebp
    ret 8
    endp

Is this all too cumbersome to worry about? Well, I don't really think it's a big deal, but there is a way we can remove ourselves from this burden.

Let's say we didn't want to have to worry about preserving any of eax, ebx, ecx, edx, edi, or esi regardless of how many arguments our function has and what (if any) return value it uses. Also, maybe we don't want to have to worry about popping arguments off the stack ourselves when our assembly functions return.

// define our "asmcall" calling convention
#pragma aux asmcall parm caller \
                    modify [eax ebx ecx edx edi esi];

#pragma aux (asmcall) add_numbers;
int add_numbers(int a, int b);       // no change to the function prototype is necessary  

What if we actually wanted to use the normal C stack-based calling convention for our assembly functions and ignore this register argument nonsense? Maybe you're using an existing library and it was written for other compilers that don't use this register-based calling convention.

#pragma aux asmstackcall parm caller [] \
                         modify [eax ebx ecx edx edi esi];

Watcom also pre-defines the cdecl symbol for this same purpose, which you can and probably should use instead of defining your own.

The empty brackets [] denotes an empty register set to be used for parameter passing. That is, we are saying not to use any registers, so the stack is used instead for all of them. With that in mind, we could expand the set of default registers used for parameter passing:

#pragma aux asmcallmorereg parm caller [eax edx ebx ecx edi esi] \
                           modify [eax ebx ecx edx edi esi];

In this case the modify list is redundant and need not be specified.

Of course, saying that your function will use/modify more registers means that the compiler has to work around it before and after calls to your assembly function which may result in less optimal code being generated. There's always a trade off!

None of the above #pragmas remove the need for the standard prologue and epilogue code that you've seen a thousand times before:

push ebp  
mov ebp, esp  
; ...
pop ebp  

The only exception is if your assembly function isn't using the stack at all.

There are many details I've left out. For example, passing double values will mean two registers will get used for one argument because doubles are 8 bytes. But if you only have one register left (maybe you passed 3 ints first), then the double value will get passed on the stack instead. Additionally there are more details to know when passing/returning structs. But I'm not doing any of this right now, so I've not really looked into it beyond a passing glance.

Attempts At Optimizing VGA Mode 13h BitBlts and Tilemap Rendering

As my last post indicated (which has been a while now, whoops!), I've been working on optimizing the bitblt routines in libDGL. I'm definitely no master of optimization and am not expecting to come up with anything revolutionary. In fact I'm sure I will goof things up a fair bit in the process and miss obvious avenues of optimization, heh. Every little bit of speed counts for the hardware I'm targeting with this project (486's and maybe 386's later on). Figured I'd share the results of some of my latest attempts, including the parts that didn't result in improvements.

Recently, I thought it would be a fun idea to do a pretty much 1:1 conversion of some old projects of mine written around the turn of the century. The idea would be to convert them to C and get them up and running with libDGL. The code in these projects was pretty terrible and I've no long-term intentions of extending them. However, I figured it would be interesting mainly just to see how fast I could get them running.

Even back then, I liked using obsolete development tools like QuickBASIC 4.5 which by 2000/2001 was definitely obsolete. At that time I was writing code on a AMD Duron 800MHz PC, so I was never too concerned with performance, even with QuickBASIC. Running this old QuickBASIC code today as-is on my 486 DX2-66, I see that it barely maintains 30-35 FPS. This code was built using DirectQB 1.61 (a great library for the time, used by a lot of games) with some improved tile/sprite bitblt routines by Rel since the original routines in DirectQB were known to be kind of slow.

Of course, just by doing a straight-up conversion to C we will get significant performance gains. QuickBASIC isn't exactly an optimizing compiler, heh. But I wanted to see just how fast we could get the basic tilemap rendering going.

Here's the original QuickBASIC code for drawing the tilemap:

SUB DrawMap  
  'Get camera position
  CameraX = Engine.x - 160
  CameraY = Engine.y - 96

  'Make sure we aren't going to go off the map buffer
  F CameraX < 0 THEN CameraX = 0
  IF CameraY < 0 THEN CameraY = 0
  IF CameraX > Engine.MaxX - 320 THEN CameraX = Engine.MaxX - 320
  IF CameraY > Engine.MaxY - 200 THEN CameraY = Engine.MaxY - 200

  'Get the starting tile to draw at
  xTile = CameraX \ 16
  yTile = CameraY \ 16

  'Get the pixel offset to draw at
  xpos = CameraX MOD 16
  ypos = CameraY MOD 16

  'Now actually draw the map
  FOR x = 0 TO 21
    FOR y = 0 TO 14
      'Get the tile numbers to draw
      tile = map(x + xTile, y + yTile).tile
      tile2 = map(x + xTile, y + yTile).tile2

      xp = x * 16 - xpos
      yp = y * 16 - ypos

      'Draw the first layer
      RelSpriteSolid 1, xp, yp, VARSEG(tilearray(0, tile)), VARPTR(tilearray(0, tile))

      'Draw the second layer only if the tile isn't equal to 0
      '(Tile 0 should be a blank tile, in which case we don't want to draw
      'it because it'll slow the engine down)
      IF tile2 <> 0 THEN
        RelSprite 1, xp, yp, VARSEG(tilearray(0, tile2)), VARPTR(tilearray(0, tile2))
      END IF
    NEXT y
  NEXT x
END SUB  

So after doing a bunch of straightforward converting and getting the basic engine up and running, I ended up with the following C function. This is, for now, intended to be a 1:1 equivalent of the above code:

#define SCREEN_X_TILES  21
#define SCREEN_Y_TILES  14

#define TILE_WIDTH      16
#define TILE_HEIGHT     16

void draw_map(void) {  
    int x_tile, y_tile;
    int x_offs, y_offs;
    int x, y, index;
    int xp, yp;
    byte tile1, tile2;
    SURFACE **tileset;

    // get camera position
    engine->camera_x = engine->x - 160;
    engine->camera_y = engine->y - 96;

    // make sure we aren't going to go off the map buffer
    if (engine->camera_x < 0)
        engine->camera_x = 0;
    if (engine->camera_y < 0)
        engine->camera_y = 0;
    if (engine->camera_x > engine->width - 320)
        engine->camera_x = engine->width - 320;
    if (engine->camera_y > engine->height - 200)
        engine->camera_y = engine->height - 200;

    // get the starting tile to draw at
    x_tile = engine->camera_x / TILE_WIDTH;
    y_tile = engine->camera_y / TILE_HEIGHT;

    // get the pixel offset to draw at
    x_offs = engine->camera_x % TILE_WIDTH;
    y_offs = engine->camera_y % TILE_HEIGHT;

    // now actually draw the map
    tileset = map_tiles->tiles;
    for (y = 0; y < SCREEN_Y_TILES; ++y) {
        for (x = 0; x < SCREEN_X_TILES; ++x) {
            index = (((y + y_tile) * map->width) + (x + x_tile)) * 2;
            tile1 = map->tiledata[index];
            tile2 = map->tiledata[index + 1];

            xp = x * TILE_WIDTH - x_offs;
            yp = y * TILE_HEIGHT - y_offs;

            surface_blit(tileset[tile1], backbuffer, xp, yp);
            if (tile2)
                surface_blit_sprite(tileset[tile2], backbuffer, xp, yp);
        }
    }
}

Definitely room for improvement here.

So how fast is it? Well, I was impressed actually. Mostly. There are two scenarios that are important to benchmark:

Fast scenario
Slow scenario

Just ignore the obviously ripped graphics. Younger-me couldn't draw pixel art, but I sure knew how to rip graphics from SNES ROMs! There was some great DOS-based tool I remember I really liked for doing it, but unfortunately I cannot recall the name of it. Anyway, Today-me still doesn't know how to draw pixel art, so we'll just continue using these ripped graphics for now.

So the difference between these two scenarios is the amount of "layer 2" tiles being drawn. You can see in the above render loop code that a check is done for a non-zero tile2 value and then a call to surface_blit_sprite is done. This of course draws the second layer tile with transparency to overlay tiles on top of the lower layer. As you can see in the right screenshot, there are a ton of tree tiles being drawn and with the map that the engine is using for this test, these are all on layer two (with layer one just being a simple grass tile).

Let's get the blatantly obvious problem out of the way first: There is NO need for the map to be laid out this way. We don't even need to touch code to see significant improvements. We just need to adjust the tileset and map so that we can skip the vast majority of the layer 2 tiles. If we needed a bunch of variations of tree tiles with different ground underneath them, then we might very well be better served with a bunch of different tree tiles in the tileset we use, each with the different ground needed. However, this map was what I put together in 2001 or so. For now we'll just stick with it and see what else we can do.

Also, I should mention that these two screenshots show the frame-rates at double-word aligned memory offsets. We'll come back to this later though as memory alignment is obviously an important topic.

So, the first thing that came to mind when looking at my freshly converted draw_map() function was to eliminate the unnecessary clipping checks for 90% of the tiles being drawn in this render loop. Only tiles on the edge of the screen actually need to be checked for clipping. We can do this very simply for now:

if (x > 0 &&  
    y > 0 && 
    x < (SCREEN_X_TILES - 1) && 
    y < (SCREEN_Y_TILES - 2)) {
    surface_blit_f(tileset[tile1], backbuffer, xp, yp);
    if (tile2)
        surface_blit_sprite_f(tileset[tile2], backbuffer, xp, yp);
} else {
    surface_blit(tileset[tile1], backbuffer, xp, yp);
    if (tile2)
        surface_blit_sprite(tileset[tile2], backbuffer, xp, yp);
}

surface_blit_f and surface_blit_sprite_f are "fast" versions that skip clipping checks, but otherwise work exactly the same (in fact, surface_blit and surface_blit_sprite call the fast versions internally).

This gets us a little bit of an improvement. 147/148 FPS in the fast scenario, and 97/98 FPS in the slow scenario.

It's important to note here that in VGA mode 13h, the screen resolution is 320x200. We're using 16x16 tiles in our tilemap, so we end up needing to draw a minimum of 20 tiles horizontally, and 13 tiles vertically (where the last row at the bottom will only be half visible, since 200/16 = 12.5). In order to do pixel-by-pixel scrolling as in this tilemap rendering engine, we need to add one extra column and row to ensure we don't have any gaps anywhere along the edges at any point as the screen is scrolling. Because of the uneven vertical tile count (12.5) we actually need to do clipping for the bottom two rows.

We could of course add a check for y_offs >= 8 to determine if we actually need to draw that last row at all. Though this obviously won't always improve performance, it would depend on how the screen is currently scrolled.

Anyway, the next thought I had was to improve the way that the map data was being accessed inside the loop. I didn't figure that this would make a big difference, but let's see:

// now actually draw the map
index = (y_tile * map->width) + x_tile;  
tiledata = &map->tiledata[index * 2];  
tileset = map_tiles->tiles;

xp = -y_offs;  
yp = -x_offs;

for (y = 0; y < SCREEN_Y_TILES; ++y) {  
    for (x = 0; x < SCREEN_X_TILES; ++x) {
        tile1 = tiledata[0];
        tile2 = tiledata[1]);

        if (x > 0 && 
            y > 0 && 
            x < (SCREEN_X_TILES - 1) && 
            y < (SCREEN_Y_TILES - 2)) {
            surface_blit_f(tileset[tile1], backbuffer, xp, yp);
            if (tile2)
                surface_blit_sprite_f(tileset[tile2], backbuffer, xp, yp);
        } else {
            surface_blit(tileset[tile1], backbuffer, xp, yp);
            if (tile2)
                surface_blit_sprite(tileset[tile2], backbuffer, xp, yp);
        }

        tiledata += 2;
        xp += TILE_WIDTH;
    }

    tiledata += (map->width - SCREEN_X_TILES) * 2;
    yp += TILE_HEIGHT;
    xp = -x_offs;
}

We get a very minor boost from this. 151/152 FPS in the fast scenario, and 98/99 FPS in the slow scenario. But it's something!

Well, now we have that x/y coordinate check that runs every iteration of the loop to see if we need to use clipped blits or not. We know that 90% of the screen does not need clipped blits, so I decided to try spliting up the rendering loop on this basis so that we don't need to run that check all the time. I ended up with 3 separate loops:

  • The top and bottom row (for y = 0, y = 12 and y = 13 only, remember two bottom rows can get clipped)
  • The left and right columns (for x = 0 and x = 20 only)
  • Everything in-between.

In the end this gained me about 1 FPS, but it made the code significantly larger because there were three loops instead of one, and each of these included similar sets of calculations (but different enough that I did need to have three sets of them). Probably could have cleaned the code up a fair bit, but ultimately since this all made a tiny difference, I decided that the extra complexity just wasn't worth keeping. Perhaps I would want to revisit this later on once I see how this runs on a 386 CPU.

Next, I decided to take a closer look at what the actual surface_blit and surface_blit_sprite calls do. I've already spent a bunch of time trying to optimize them and I'm sure there's still some stuff that can be done, but I'll start off with them as they are today and not the much slower versions I had written a few months ago.

static void surface_blit(const SURFACE *src, SURFACE *dest, int x, int y) {  
    surface_blit_region(src, dest, 0, 0, src->width, src->height, x, y);
}

static void surface_blit_sprite(const SURFACE *src, SURFACE *dest, int x, int y) {  
    surface_blit_sprite_region(src, dest, 0, 0, src->width, src->height, x, y);
}

void surface_blit_region(const SURFACE *src,  
                         SURFACE *dest,
                         int src_x,
                         int src_y,
                         int src_width,
                         int src_height,
                         int dest_x,
                         int dest_y) {
    RECT src_region = rect(src_x, src_y, src_width, src_height);
    boolean on_screen = clip_blit(&dest->clip_region, &src_region, &dest_x, &dest_y);

    if (!on_screen)
        return;

    surface_blit_region_f(src, dest,
                          src_region.x, src_region.y,
                          src_region.width, src_region.height,
                          dest_x, dest_y);
}

void surface_blit_sprite_region(const SURFACE *src,  
                                SURFACE *dest,
                                int src_x,
                                int src_y,
                                int src_width,
                                int src_height,
                                int dest_x,
                                int dest_y) {
    RECT src_region = rect(src_x, src_y, src_width, src_height);
    boolean on_screen = clip_blit(&dest->clip_region, &src_region, &dest_x, &dest_y);

    if (!on_screen)
        return;

    surface_blit_sprite_region_f(src, dest,
                                 src_region.x, src_region.y,
                                 src_region.width, src_region.height,
                                 dest_x, dest_y);
}

Alright, so as we can see, these aren't super interesting and we might as well just look at surface_blit_region_f and surface_blit_sprite_region_f. We could likely optimize clip_blit. In fact, I don't even think I've tried to do this at all ever. The existing implementation is a copy of some code I had written over 10 years ago if I recall correctly, heh. However, let's just ignore it for now since the current tilemap function we have skips clipping for probably 90% of the tiles being drawn.

static int surface_offset(const SURFACE *surface, int x, int y) {  
    return (surface->width * y) + x;
}

static byte* surface_pointer(const SURFACE *surface, int x, int y) {  
    return surface->pixels + surface_offset(surface, x, y);
}

void surface_blit_region_f(const SURFACE *src,  
                           SURFACE *dest,
                           int src_x,
                           int src_y,
                           int src_width,
                           int src_height,
                           int dest_x,
                           int dest_y) {
    const byte *psrc;
    byte *pdest;
    int lines;
    int src_y_inc = src->width - src_width;
    int dest_y_inc = dest->width - src_width;
    int width_4, width_remainder;

    psrc = (const byte*)surface_pointer(src, src_x, src_y);
    pdest = (byte*)surface_pointer(dest, dest_x, dest_y);
    lines = src_height;

    width_4 = src_width / 4;
    width_remainder = src_width & 3;

    if (width_4 && !width_remainder) {
        // width is a multiple of 4 (no remainder)
        direct_blit_4(width_4, lines, pdest, psrc, dest_y_inc, src_y_inc);

    } else if (width_4 && width_remainder) {
        // width is >= 4 and there is a remainder ( <= 3 )
        direct_blit_4r(width_4, lines, width_remainder, pdest, psrc, dest_y_inc, src_y_inc);

    } else {
        // width is <= 3
        direct_blit_r(width_remainder, lines, pdest, psrc, dest_y_inc, src_y_inc);
    }
}

I talked about this in my last post, but to recap, the idea here is that I figured there were three main scenarios for bitblts (post-clipping of course):

  • The width of the blit is an even multiple of 4. In this case, we can simply do a rep movsd' for each row. Very nice and efficient.
  • The width is some value larger than 4, but it is not an even multiple of 4 so we can split each row into a rep movsd followed by a rep movsb.
  • The width is some value < 4. We can just do a rep movsb.

The most common scenario when dealing with "typical" game graphics would be the first scenario. In our case, we are using 16x16 tiles and sprites, so definitely this will be the case for us. The remaining two scenarios would primarily occur for partially clipped blits, so these two would not be what would get run for the vast majority of blits.

It's worth pointing out that I didn't start with a blit function implementation that had these three scenarios. I started with a simple one that just did rep movsb for each row of pixels. Once I saw how that performed I then thought about it and came up with the three scenario idea. I then saw that, indeed, this way performed much better. Having said that, I'm sure it's been implemented better by smarter people then me decades earlier.

The direct_blit_xxxx calls are implemented in assembly and are relatively simple:

void direct_blit_4(int width4,  
                   int lines,
                   byte *dest,
                   const byte *src,
                   int dest_y_inc,
                   int src_y_inc) {
    _asm {
        mov edi, ebx             // dest pixels
        mov esi, ecx             // source pixels

        // eax = number of 4-pixel runs (dwords)
        // edx = line loop counter

        test edx, edx            // make sure there is >0 lines to draw
        jz done

    draw_line:
        mov ecx, eax             // draw all 4-pixel runs (dwords)
        rep movsd

        add esi, src_y_inc       // move to next line
        add edi, dest_y_inc
        dec edx                  // decrease line loop counter
        jnz draw_line            // keep going if there's more lines to draw

    done:
    }
}

void direct_blit_4r(int width4,  
                    int lines,
                    int remainder,
                    byte *dest,
                    const byte *src,
                    int dest_y_inc,
                    int src_y_inc) {
    _asm {
        mov edi, ecx             // dest pixels
        mov esi, src             // source pixels

        // eax = number of 4-pixel runs (dwords)
        // ebx = remaining number of pixels
        // edx = line loop counter

        test edx, edx            // make sure there is >0 lines to draw
        jz done

    draw_line:
        mov ecx, eax             // draw all 4-pixel runs (dwords)
        rep movsd
        mov ecx, ebx             // draw remaining pixels ( <= 3 bytes )
        rep movsb

        add esi, src_y_inc       // move to next line
        add edi, dest_y_inc
        dec edx                  // decrease line loop counter
        jnz draw_line            // keep going if there's more lines to draw

    done:
    }
}

void direct_blit_r(int width,  
                   int lines,
                   byte *dest,
                   const byte *src,
                   int dest_y_inc,
                   int src_y_inc) {
    _asm {
        mov edi, ebx             // dest pixels
        mov esi, ecx             // source pixels

        // eax = number of pixels to draw (bytes)
        // edx = line loop counter

        test edx, edx            // make sure there is >0 lines to draw
        jz done

    draw_line:
        mov ecx, eax             // draw pixels (bytes)
        rep movsb

        add esi, src_y_inc       // move to next line
        add edi, dest_y_inc
        dec edx                  // decrease line loop counter
        jnz draw_line            // keep going if there's more lines to draw

    done:
    }
}

Some compiler/toolchain-related things to note here first before going on:

  • I'm using Watcom C 11.0 for this. Thus, I can make use of the nice _asm block inline assembly support. However, as I demonstrated in my previous post, I had run into what looked like a compiler bug with Watcom's _asm support. After playing with it some more, I noticed I got good results by just moving all my _asm blocks to their own functions. This is kinda-sorta-maybe in some ways like using externally linked assembly functions via something like TASM or MASM. At least, as far as it just being a function call instead of slapped somewhere else inline in some block of C code. Kind of a nice separation, especially for these blit functions and it both fixes the problem I had run into and means I don't have to worry much about setting up a proper calling convention in whatever assembler I would otherwise be using (which would involve some icky #pragma usage). Which leads me into the next point ...
  • Watcom's default calling convention uses registers for the first 4 arguments (well, at least for 32-bit values). eax, edx, ebx, and ecx in that order. For any remaining arguments, the stack is used as per normal C calling convention. Because I'm using a pattern of putting my larger _asm blocks in non-static functions all by themselves, I can "hijack" this calling convention easily enough and skip some additional stack copying that Watcom would generate if I used the argument variable names for those first 4 arguments. This is probably kind of a dirty hack, but it seems to work well, and that's why you'll notice in these assembly functions that I don't seem to ever reference the first 4 arguments. I do, but they are already in registers.

I'm going to go ahead and claim that these are good enough. I mean, they've basically been reduced to a loop of rep movsd's in the best and by far most common case. I don't think it gets much better than that. I'm sure there's some optimization guru reading this that is face-palming after having read that statement and noticed some dumb thing I did in the code, but hey, I did say above that I'm no expert!

I did play with using ebp as well in direct_blit_4 since that function is called the most out of the three. Using ebp as a general purpose register in that function allows me to not use any values from the stack at all once inside the loop, so I figured it would be worthwhile. But it barely made any noticeable difference on my 486. It would probably be worth testing on a 386 though to see the difference, but I'll wait until I have an actual 386 to test with. For now, I decided to leave this optimization out. Using ebp in this way is tricky, as once you change it like this you cannot access anything using your variable names (the compiler, or assembler, replaces them with addresses relative to ebp). You could still access the stack using addresses relative to esp, but I decided not to go down this road at this time.

Alright, well, since the slowdowns really seem to be regarding the surface_blit_sprite calls in our draw_map() function, let's look at it. The core of it (as indicated by code shown previously) is implemented in surface_blit_sprite_region_f:

void surface_blit_sprite_region_f(const SURFACE *src,  
                                  SURFACE *dest,
                                  int src_x,
                                  int src_y,
                                  int src_width,
                                  int src_height,
                                  int dest_x,
                                  int dest_y) {
    const byte *psrc;
    byte *pdest;
    byte pixel;
    int src_y_inc, dest_y_inc;
    int width, width_4, width_8, width_remainder;
    int lines_left;
    int x;

    psrc = (const byte*)surface_pointer(src, src_x, src_y);
    src_y_inc = src->width;
    pdest = (byte*)surface_pointer(dest, dest_x, dest_y);
    dest_y_inc = dest->width;
    width = src_width;
    lines_left = src_height;

    src_y_inc -= width;
    dest_y_inc -= width;

    width_4 = width / 4;
    width_remainder = width & 3;

    if (width_4 && !width_remainder) {
        if ((width_4 & 1) == 0) {
            // width is actually an even multiple of 8!
            direct_blit_sprite_8(width_4 / 2, lines_left, pdest, psrc, dest_y_inc, src_y_inc);
        } else {
            // width is a multiple of 4 (no remainder)
            direct_blit_sprite_4(width_4, lines_left, pdest, psrc, dest_y_inc, src_y_inc);
        }

    } else if (width_4 && width_remainder) {
        if ((width_4 & 1) == 0) {
            // width is _mostly_ made up of an even multiple of 8,
            // plus a small remainder
            direct_blit_sprite_8r(width_4 / 2, lines_left, pdest, psrc, width_remainder, dest_y_inc, src_y_inc);
        } else {
            // width is >= 4 and there is a remainder
            direct_blit_sprite_4r(width_4, lines_left, pdest, psrc, width_remainder, dest_y_inc, src_y_inc);
        }

    } else {
        // width is <= 3
        direct_blit_sprite_r(width_remainder, lines_left, pdest, psrc, dest_y_inc, src_y_inc);
    }
}

Immediately, we can see it's a bit different from surface_blit_region_f, but at its core it's the same basic three-scenario implementation. It's just been extended to also look for an even multiple of 8 and to call a slightly different assembly function for those cases.

Again, this was something that I initially didn't do for the first implementation of this function. I originally had a simple loop (written entirely in C code) that checked one pixel each iteration and if non-zero would draw it. Nice and simple. I then decided to try unrolling the loop and do 4 pixels per iteration, still only in C code. This gave a significant improvement, so I decided to try adding an extra 8-pixel-per-iteration version and saw an improvement again but not as significant this time. Still, it was enough that I thought it warranted keeping it.

Watcom's optimizer actually did a pretty damn good job with my C code version. I was able to tweak it slightly to help the optimizer out, but eventually decided that it was probably better to write it in assembly anyway. This is because it seemed rather easy to make what seemed like an extremely minor change to the code that would result in the optimizer generating some real inefficient block(s) of code.

Anyway, here are the direct_blit_sprite_xxxx functions:

void direct_blit_sprite_4(int width4,  
                          int lines,
                          byte *dest,
                          const byte *src,
                          int dest_y_inc,
                          int src_y_inc) {
    _asm {
        mov edi, ebx             // dest pixels
        mov esi, ecx             // source pixels

        // eax = number of 4-pixel runs (dwords)
        // edx = line loop counter

        test edx, edx            // make sure there is >0 lines to be drawn
        jz done

    draw_line:

    start_4_run:
        mov ecx, eax             // ecx = counter of 4-pixel runs left to draw
    draw_px_0:
        mov bl, [esi+0]          // load src pixel
        test bl, bl
        jz draw_px_1             // if it is color 0, skip it
        mov [edi+0], bl          // otherwise, draw it onto dest
    draw_px_1:
        mov bl, [esi+1]
        test bl, bl
        jz draw_px_2
        mov [edi+1], bl
    draw_px_2:
        mov bl, [esi+2]
        test bl, bl
        jz draw_px_3
        mov [edi+2], bl
    draw_px_3:
        mov bl, [esi+3]
        test bl, bl
        jz end_4_run
        mov [edi+3], bl
    end_4_run:
        add esi, 4               // move src and dest up 4 pixels
        add edi, 4
        dec ecx                  // decrease 4-pixel run loop counter
        jnz draw_px_0            // if there are still more runs, draw them

    end_line:
        add esi, src_y_inc       // move src and dest to start of next line
        add edi, dest_y_inc
        dec edx                  // decrease line loop counter
        jnz draw_line            // keep going if there's more lines to draw

    done:
    }
}

void direct_blit_sprite_4r(int width4,  
                           int lines,
                           byte *dest,
                           const byte *src,
                           int remainder,
                           int dest_y_inc,
                           int src_y_inc) {
    _asm {
        mov edi, ebx             // dest pixels
        mov esi, ecx             // source pixels

        // eax = number of 4-pixel runs (dwords)
        // edx = line loop counter

        test edx, edx            // make sure there is >0 lines to be drawn
        jz done

    draw_line:

    start_4_run:                 // draw 4-pixel runs first
        mov ecx, eax             // ecx = counter of 4-pixel runs left to draw
    draw_px_0:
        mov bl, [esi+0]          // load src pixel
        test bl, bl
        jz draw_px_1             // if it is color 0, skip it
        mov [edi+0], bl          // otherwise, draw it onto dest
    draw_px_1:
        mov bl, [esi+1]
        test bl, bl
        jz draw_px_2
        mov [edi+1], bl
    draw_px_2:
        mov bl, [esi+2]
        test bl, bl
        jz draw_px_3
        mov [edi+2], bl
    draw_px_3:
        mov bl, [esi+3]
        test bl, bl
        jz end_4_run
        mov [edi+3], bl
    end_4_run:
        add esi, 4               // move src and dest up 4 pixels
        add edi, 4
        dec ecx                  // decrease 4-pixel run loop counter
        jnz draw_px_0            // if there are still more runs, draw them

    start_remainder_run:         // now draw remaining pixels ( <= 3 pixels )
        mov ecx, remainder       // ecx = counter of remaining pixels

    draw_pixel:
        mov bl, [esi]            // load pixel
        inc esi
        test bl, bl              // if zero, skip to next pixel
        jz end_pixel
        mov [edi], bl            // else, draw pixel
    end_pixel:
        inc edi
        dec ecx
        jnz draw_pixel           // keep drawing pixels while there's still more

    end_line:
        add esi, src_y_inc       // move src and dest to start of next line
        add edi, dest_y_inc
        dec edx                  // decrease line loop counter
        jnz draw_line            // keep going if there's more lines to draw

    done:
    }
}

void direct_blit_sprite_r(int width,  
                          int lines,
                          byte *dest,
                          const byte *src,
                          int dest_y_inc,
                          int src_y_inc) {
    _asm {
        mov edi, ebx             // dest pixels
        mov esi, ecx             // source pixels

        // eax = number of 4-pixel runs (dwords)
        // edx = line loop counter

        test edx, edx            // make sure there is >0 lines to be drawn
        jz done

    draw_line:
        mov ecx, eax             // ecx = counter of remaining pixels

    draw_pixel:
        mov bl, [esi]            // load pixel
        inc esi
        test bl, bl              // if zero, skip to next pixel
        jz end_pixel
        mov [edi], bl            // else, draw pixel
    end_pixel:
        inc edi
        dec ecx
        jnz draw_pixel           // loop while there's still pixels left

    end_line:
        add esi, src_y_inc       // move src and dest to start of next line
        add edi, dest_y_inc
        dec edx                  // decrease line loop counter
        jnz draw_line            // keep going if there's more lines to draw

    done:
    }
}

void direct_blit_sprite_8(int width8,  
                          int lines,
                          byte *dest,
                          const byte *src,
                          int dest_y_inc,
                          int src_y_inc) {
    _asm {
        mov edi, ebx             // dest pixels
        mov esi, ecx             // source pixels

        // eax = number of 8-pixel runs
        // edx = line loop counter

        test edx, edx            // make sure there is >0 lines to be drawn
        jz done

    draw_line:
        mov ecx, eax             // ecx = counter of 8-pixel runs left to draw
    draw_px_0:
        mov bl, [esi+0]          // load src pixel
        test bl, bl
        jz draw_px_1             // if it is color 0, skip it
        mov [edi+0], bl          // otherwise, draw it onto dest
    draw_px_1:
        mov bl, [esi+1]
        test bl, bl
        jz draw_px_2
        mov [edi+1], bl
    draw_px_2:
        mov bl, [esi+2]
        test bl, bl
        jz draw_px_3
        mov [edi+2], bl
    draw_px_3:
        mov bl, [esi+3]
        test bl, bl
        jz draw_px_4
        mov [edi+3], bl
    draw_px_4:
        mov bl, [esi+4]
        test bl, bl
        jz draw_px_5
        mov [edi+4], bl
    draw_px_5:
        mov bl, [esi+5]
        test bl, bl
        jz draw_px_6
        mov [edi+5], bl
    draw_px_6:
        mov bl, [esi+6]
        test bl, bl
        jz draw_px_7
        mov [edi+6], bl
    draw_px_7:
        mov bl, [esi+7]
        test bl, bl
        jz end_8_run
        mov [edi+7], bl
    end_8_run:
        add esi, 8               // move src and dest up 8 pixels
        add edi, 8
        dec ecx                  // decrease 8-pixel run loop counter
        jnz draw_px_0            // if there are still more runs, draw them

    end_line:
        add esi, src_y_inc       // move src and dest to start of next line
        add edi, dest_y_inc
        dec edx                  // decrease line loop counter
        jnz draw_line            // keep going if there's more lines to draw

    done:
    }
}

void direct_blit_sprite_8r(int width8,  
                           int lines,
                           byte *dest,
                           const byte *src,
                           int remainder,
                           int dest_y_inc,
                           int src_y_inc) {
    _asm {
        mov edi, ebx             // dest pixels
        mov esi, ecx             // source pixels

        // eax = number of 8-pixel runs
        // edx = line loop counter

        test edx, edx            // make sure there is >0 lines to be drawn
        jz done

    draw_line:

    start_8_run:                 // draw 8-pixel runs first
        mov ecx, eax             // ecx = counter of 8-pixel runs left to draw
    draw_px_0:
        mov bl, [esi+0]          // load src pixel
        test bl, bl
        jz draw_px_1             // if it is color 0, skip it
        mov [edi+0], bl          // otherwise, draw it onto dest
    draw_px_1:
        mov bl, [esi+1]
        test bl, bl
        jz draw_px_2
        mov [edi+1], bl
    draw_px_2:
        mov bl, [esi+2]
        test bl, bl
        jz draw_px_3
        mov [edi+2], bl
    draw_px_3:
        mov bl, [esi+3]
        test bl, bl
        jz draw_px_4
        mov [edi+3], bl
    draw_px_4:
        mov bl, [esi+4]
        test bl, bl
        jz draw_px_5
        mov [edi+4], bl
    draw_px_5:
        mov bl, [esi+5]
        test bl, bl
        jz draw_px_6
        mov [edi+5], bl
    draw_px_6:
        mov bl, [esi+6]
        test bl, bl
        jz draw_px_7
        mov [edi+6], bl
    draw_px_7:
        mov bl, [esi+7]
        test bl, bl
        jz end_8_run
        mov [edi+7], bl
    end_8_run:
        add esi, 8               // move src and dest up 8 pixels
        add edi, 8
        dec ecx                  // decrease 8-pixel run loop counter
        jnz draw_px_0            // if there are still more runs, draw them

    start_remainder_run:         // now draw remaining pixels ( <= 7 pixels )
        mov ecx, remainder       // ecx = counter of remaining pixels

    draw_pixel:
        mov bl, [esi]            // load pixel
        inc esi
        test bl, bl              // if zero, skip to next pixel
        jz end_pixel
        mov [edi], bl            // else, draw pixel
    end_pixel:
        inc edi
        dec ecx
        jnz draw_pixel           // loop while there's still pixels left

    end_line:
        add esi, src_y_inc       // move src and dest to start of next line
        add edi, dest_y_inc
        dec edx                  // decrease line loop counter
        jnz draw_line            // keep going if there's more lines to draw

    done:
    }
}

As you may have been able to imagine before even seeing this code, these are very much implemented in the same general way as the non-transparent blits are. Instead of rep movsd and we have unrolled loops that must check each and every pixel for transparent pixels before drawing. Same for rep movsb, except the loops that replace these aren't unrolled.

After initially writing these assembly sprite blit routines, I started to look for ways that I might make the unrolled loop section faster. I began by trying to reduce the number of memory reads by reading 16-bits instead of 8-bits at a time. Then I would only need to read pixels every other time and could access two pixels by using bl and bh.

draw_px_0:  
    mov bx, [esi+0]          ; load two pixels at once
    test bl, bl
    jz draw_px_1             ; if this pixel is color 0, skip it
    mov [edi+0], bl          ; otherwise, draw it onto dest
draw_px_1:  
    test bh, bh              ; don't need to read, second pixel is already in bh
    jz draw_px_2
    mov [edi+1], bh

On my 486 this resulted in barely any noticeable difference. Maybe 1 FPS of an improvement for the slow scenario. I don't have a 386 to test with, but from looking at some instruction timing information, I'm guessing this should be an improvement on a 386 processor. mov reg, mem takes 4 clock cycles on a 386, versus 1 clock cycle on a 486, so by removing some of these operations we should save some time anyway.

The other thing I was curious about was using bswap. This instruction was added starting with 486 processors, so if I wanted to support 386's I couldn't use it anyway, but even still, I just wanted to try it. What bswap does is reverse the byte order of a 32-bit register. This can be used as a means to access the upper 16-bits of a 32-bit register, which you otherwise wouldn't be able to do independently since x86 architecture doesn't provide you with any 8/16 bit registers for the upper half. With this in mind I figured I would be able to do something like:

draw_px_0:  
    mov ebx, [esi+0]         ; load 4 src pixels
    test bl, bl
    jz draw_px_1             ; if it is color 0, skip it
    mov [edi+0], bl          ; otherwise, draw it onto dest
draw_px_1:  
    test bh, bh              ; don't need to read, second pixel is already in bh
    jz draw_px_2
    mov [edi+1], bh
draw_px_2:  
    bswap ebx                ; swap bytes. bh now has this pixel, bl is the next
    test bh, bh
    jz draw_px_3
    mov [edi+2], bh
draw_px_3:  
    test bl, bl
    jz draw_px_4
    mov [edi+3], bl

However to my surprise, this made no noticeable difference again. Anyway, doesn't really matter much since I could not use this on a 386. Using shr as an alternative method to access the upper 16 bits is no good. It's too expensive to use for something like this. shr reg, imm is 2 clock cycles on a 486 and 3 clock cycles on a 386, whereas bswap runs in only 1 cycle.

There might be some other improvements that can be made here, but nothing came to mind so I figured I'd move on.

Looking back at my draw_map() function, I figured why not call the direct_blit_xxxx and direct_blit_sprite_xxxx assembly functions directly? We can't do that for the tiles around the edges of the screen that need to be clipped, but we should absolutely be able to do this for the inner 90% of the tiles that are drawn. As an added benefit, since we're using 16x16 tiles, we know that we can always just call direct_blit_4 and direct_blit_sprite_8. All we need to do is manage all the source and destination memory parameters ourselves directly instead of x and y coordinates.

Probably won't be a large boost, but we'll see.

// now actually draw the map
index = (y_tile * map->width) + x_tile;  
tiledata = &map->tiledata[index * 2];  
tileset = map_tiles->tiles;

yp = -y_offs;  
xp = -x_offs;

pdest = surface_pointer(backbuffer, xp, yp);

for (y = 0; y < SCREEN_Y_TILES; ++y) {  
    for (x = 0; x < SCREEN_X_TILES; ++x) {
        tile1 = tiledata[0];
        tile2 = tiledata[1];

        if (x > 0 &&
            y > 0 &&
            x < (SCREEN_X_TILES - 1) &&
            y < (SCREEN_Y_TILES - 2)) {
            direct_blit_4(TILE_WIDTH / 4,
                          TILE_HEIGHT,
                          pdest,
                          tileset[tile1]->pixels,
                          320 - TILE_WIDTH,
                          0);
            if (tile2)
                direct_blit_sprite_8(TILE_WIDTH / 8,
                                     TILE_HEIGHT,
                                     pdest,
                                     tileset[tile2]->pixels,
                                     320 - TILE_WIDTH,
                                     0);
        } else {
            surface_blit(tileset[tile1], backbuffer, xp, yp);
            if (tile2)
                surface_blit_sprite(tileset[tile2], backbuffer, xp, yp);
        }

        tiledata += 2;
        xp += TILE_WIDTH;
        pdest += TILE_WIDTH;
    }

    tiledata += (map->width - SCREEN_X_TILES) * 2;
    yp += TILE_HEIGHT;
    pdest += (TILE_HEIGHT * 320) - (SCREEN_X_TILES * TILE_WIDTH);
    xp = -x_offs;
}

Obviously no need to worry about the multiplications and divisions you now see in the above code. They're all based entirely on constants (except for one * 2 which will become a shl anyway by the optimizer).

At this point, the draw_map() function implementation is starting to look kind of messy to me. However, the improvement this brings is small but noticeable. Up to 159/160 FPS in the fast scenario and 105 FPS in the slow scenario.

The last thing I wanted to look at was if we could help mitigate the performance drop that occurs when the screen is scrolled to some position that results in us drawing all of the visible map tiles at unaligned memory addresses. In 32-bit protected mode (as I am using), we want to be aligned to a 4-byte boundary. So that means that, right now, beginning a blit at three out of every four x coordinate values across the entire width of the screen will make the blit unaligned.

How big a deal is this? Well, if I adjust the x coordinate of each of our two scenarios by one pixel (in either direction), we end up getting 135 FPS in the fast scenario and 94 FPS in the slow scenario. So, it's a big enough deal that we should at least see if we can do something to lessen the blow. This is a pixel-by-pixel scrolling engine after all, so these unaligned offsets will occur frequently.

I was not really too optimistic that I would be able to improve things with my current knowledge/experience. Admittedly, I've never really written code that tries to deal with memory alignment. As I understand it, the main slowdown is on the writes and, indeed, this is where the source of unaligned memory accesses will be for us.

We need not worry about our source tile graphics in this case, as the tiles in a typical tileset (including the one we're using here) will all be some even number like 16x16 or 32x32 and will either have each tile in their own allocated block of memory, or all tiles will be arranged in a grid on some larger allocated block of memory (but in this case, accessing an arbitrary tile in such a grid will result in some 4-byte aligned address anyway). For memory allocations, malloc() is likely ensuring that the allocation is aligned, so no worries there. It's important to note that libDGL's blit routines allow using an arbitrary region as the blit source, and this region could be located at an unaligned address. I suspect that this wouldn't be a common use case, so I chose to ignore it.

So after thinking about it a bit I figured that I would start with trying to optimize surface_blit_region_f first and ignore sprite/transparent blits for now (unsure what I can really do there to be honest).

I decided that I would try calculating the number of bytes at the start of each line in the blit that are before the next 4-byte boundary. Then I could do a rep movsb up to this next boundary. Following this, I could proceed with the remainder of the line like normal (that is, with a single rep movsd or a combo of rep movsd and rep movsb as appropriate based on our existing method).

However, I suspected this would be slower then just not dealing with memory alignment at all. rep movsb takes 12+3n clock cycles on a 486, and we're talking about adding an extra one in some cases. But anyway, I went ahead with it just to see. Here are the modifications I made to surface_blit_region_f:

// ...

psrc = (const byte*)surface_pointer(src, src_x, src_y);  
pdest = (byte*)surface_pointer(dest, dest_x, dest_y);  
lines = src_height;  
bytes_from_boundary = (4 - ((unsigned int)pdest & 3)) & 3;

if (bytes_from_boundary && src_width > 3) {  
    aligned_width = src_width - bytes_from_boundary;

    width_4 = aligned_width / 4;
    width_remainder = aligned_width & 3;

    if (width_4 && !width_remainder) {
        // aligned_width is a multiple of 4 (no remainder)
        direct_blit_u4(bytes_from_boundary, width_4, lines, pdest, psrc, dest_y_inc, src_y_inc);

    } else if (width_4 && width_remainder) {
        // aligned_width is >= 4 and there is a remainder ( <= 3 )
        direct_blit_u4r(bytes_from_boundary, width_4, lines, pdest, psrc, dest_y_inc, src_y_inc, width_remainder);

    } else {
        // aligned_width is <= 3, just take the lazy way out and ignore the fact that
        // this is unaligned
        direct_blit_r(bytes_from_boundary + width_remainder, lines, pdest, psrc, dest_y_inc, src_y_inc);
    }

} else {
    // ...
    // previous code to handle the 3 blit scenarios
    // ...
}

// ...

Two new functions were added, direct_blit_4u and direct_blit_u4r:

void direct_blit_u4(int unaligned_width,  
                    int width4,
                    int lines,
                    byte *dest,
                    const byte *src,
                    int dest_y_inc,
                    int src_y_inc) {
    _asm {
        mov edi, ecx             // dest pixels
        mov esi, src             // source pixels

        // eax = unaligned width
        // edx = number of 4-pixel runs (dwords)
        // ebx = line loop counter

        test ebx, ebx            // make sure there is >0 lines to draw
        jz done

    draw_line:
        mov ecx, eax             // draw initial unaligned pixels ( <= 3 )
        rep movsb
        mov ecx, edx             // draw all 4-pixel runs (dwords)
        rep movsd

        add esi, src_y_inc       // move to next line
        add edi, dest_y_inc
        dec ebx                  // decrease line loop counter
        jnz draw_line            // keep going if there's more lines to draw

    done:
    }
}

void direct_blit_u4r(int unaligned_width,  
                     int width4,
                     int lines,
                     byte *dest,
                     const byte *src,
                     int dest_y_inc,
                     int src_y_inc,
                     int remainder) {
    _asm {
        mov edi, ecx             // dest pixels
        mov esi, src             // source pixels

        // eax = unaligned width
        // edx = number of 4-pixel runs (dwords)
        // ebx = line loop counter

        test ebx, ebx            // make sure there is >0 lines to draw
        jz done

    draw_line:
        mov ecx, eax             // draw initial unaligned pixels ( <= 3 )
        rep movsb
        mov ecx, edx             // draw all 4-pixel runs (dwords)
        rep movsd
        mov ecx, remainder       // draw remaining pixels ( <= 3 bytes )
        rep movsb

        add esi, src_y_inc       // move to next line
        add edi, dest_y_inc
        dec ebx                  // decrease line loop counter
        jnz draw_line            // keep going if there's more lines to draw

    done:
    }
}

In order to test this out in the draw_map() function, I decided that for now it would be simpler to revert back to just calling the normal surface_blit_xxxx functions. That way I don't need to clutter up that code even more then it already is with calculations to determine what direct_blit_xxxx function needs to be called based on the x coordinate. It'll run a little bit slower this way, but it doesn't matter, I just want to see if it's faster or not.

And as expected, it ended up being slower! By about 20 FPS in the fast scenario and 10/11 FPS in the slow scenario. I would guess that this would get better if I was doing larger blits, but probably with the tile size I am using the cost of an extra rep movsb is just not worth it.

Ultimately what I learned most from this experience is that I need to do more reading up on the subject, heh. I would not be surprised at all if I was missing something obvious here.

On the whole, this entire optimization exercise was useful and even though trying to address memory alignment didn't produce any results, I was able to improve overall performance. It would be interesting to compare results on a 386 at some point too, but that will have to wait.

Compiler Bugs?

I've been fairly lazy with working on personal coding projects over the past month, but I can say at least that some progress has been made on things. Small progress. Some bits of optimization work and bug fixing with various drawing functions such as bit-blits. Upon discovering some of the bugs I then had to spend time fixing, I realized that I really needed to re-prioritize making some kind of test suite for libDGL... which was something I kept putting off.

Anyway, first off, to follow up on some unanswered questions I had from my last post, I realized that the semi-lacking code inlining behaviour of Watcom C 10.0 was just how it worked. I suspect it's probably a bug. According to all the documentation I had read, the /oe compiler option should have been able to adjust the size of inline functions that the compiler would consider for inlining. The default setting is fairly small, and upon bumping it up I noticed absolutely no difference. Didn't matter what I set it to. Hrm. Spending a bunch of time tweaking my code to see if it was just a matter of helping the compiler out by giving it code it "likes" better proved equally fruitless. Just a limitation (or bug) of that particular version of the compiler.

Some time later I had the opportunity to pick up a brand-new-in-box copy of Watcom 11.0. I actually wasn't originally intending on getting this at all since I've read multiple comments that people seem to think 10.x is the "definitive" version in terms of features and stability. But since I happened across it for cheap, I figured "meh, why not." If nothing else, now I could make use of super easy inline assembly via _asm blocks. This is when I started rewriting a number of my drawing routines' inner loops and such into straight assembly. This wasn't even really required as I was actually fairly happy with the performance I'd been getting from the straight-C implementations, but I figured why not, it's now easy for me to do this.

One thing I noticed with Watcom 11.0 is that by default using just /oe that the inlining behaviour worked basically identically to what I saw with 10.0 However, increasing the size using say, /oe=40 (default is 20), actually made a difference. So definitely a bug in 10.0.

Well, just yesterday I was fixing up a bug in the way I calculated the width of bit-blits after clipping was taken into account and whether the blit can be done using rep movsds alone or using both rep movsds and a single small rep movsb (this particular bug was also what made me realize that I really needed some kind of test suite, like right now... I had made such a silly oversight in this code, heh). Upon some more thorough testing once I had finished, I realized I had run into everyone's favourite type of bug: my code worked wonderfully when compiling with debug settings, but not with release optimizations!

Anyway, it took me a little bit to figure out what was going on, but it appears to be a bug with how the compiler handles inline assembly that really has absolutely shattered my confidence in using this feature with Watcom going forward. It seems like this probably is pretty uncommon (I don't have this issue in any of my other routines), but even so... I don't want to have to second guess the compiler.

Anyway, so here's how I had written my surface_blit_region_f routine. This routine does no clipping itself (it assumes the source/destination regions are pre-clipped). As well, it's a solid bit-blit (no transparency handling). I realized there were basically 3 different scenarios where this would be called:

  • The source region has a width that is an even multiple of 4. Only rep movsds are needed. This is probably the most common scenario since most graphics in games have dimensions that are powers of two like 16x16, 32x32, 64x64, etc.
  • The source region has a width > 4, with a remaining number of pixels <= 3. rep movsds and a single remaining rep movsb can be used. Probably the second most common scenario, especially when you have a partially clipped image.
  • The source region has a width < 4. A single rep movsb can be used. Probably the least common scenario, would likely occur only when an image is almost completely clipped off the screen as I don't think many games used image sizes of 3x3, 2x2, etc, but I guess once in a while it happens.

I originally had this all handled as a single loop that would intelligently call as many rep movsds that were needed and then call rep movsb if needed. Performance was pretty good. Splitting the code up into 3 different loops matching the above scenarios didn't improve performance by much as expected, but I did get a little bit of a boost. Every bit is nice.

Anyway, here's the code I ended up with:

void surface_blit_region_f(const SURFACE *src,  
                           SURFACE *dest,
                           int src_x,
                           int src_y,
                           int src_width,
                           int src_height,
                           int dest_x,
                           int dest_y) {
    const byte *psrc;
    byte *pdest;
    int lines;
    int src_y_inc = src->width - src_width;
    int dest_y_inc = dest->width - src_width;
    int width_4, width_remainder;

    psrc = (const byte*)surface_pointer(src, src_x, src_y);
    pdest = (byte*)surface_pointer(dest, dest_x, dest_y);
    lines = src_height;

    width_4 = src_width / 4;
    width_remainder = src_width & 3;

    if (width_4 && !width_remainder) {
        // width is a multiple of 4 (no remainder)
        _asm {
            mov esi, psrc
            mov edi, pdest

            mov ebx, width_4     // eax = number of 4-pixel runs (dwords)

            mov edx, lines       // edx = line loop counter
            test edx, edx        // make sure there is >0 lines to draw
        draw_line:
            jz done              // if no more lines to draw, then we're done

            mov ecx, ebx         // draw all 4-pixel runs (dwords)
            rep movsd

            add esi, src_y_inc   // move to next line
            add edi, dest_y_inc
            dec edx              // decrease line loop counter
            jmp draw_line
        done:
        }

    } else if (width_4 && width_remainder) {
        // width is >= 4 and there is a remainder ( <= 3 )
        _asm {
            mov esi, psrc
            mov edi, pdest

            mov eax, width_4         // eax = number of 4-pixel runs (dwords)
            mov ebx, width_remainder // ebx = remaining number of pixels

            mov edx, lines       // edx = line loop counter
            test edx, edx        // make sure there is >0 lines to draw
        draw_line:
            jz done              // if no more lines to draw, then we're done

            mov ecx, eax         // draw all 4-pixel runs (dwords)
            rep movsd
            mov ecx, ebx         // draw remaining pixels ( <= 3 bytes )
            rep movsb

            add esi, src_y_inc   // move to next line
            add edi, dest_y_inc
            dec edx              // decrease line loop counter
            jmp draw_line
        done:
        }

    } else {
        // width is <= 3
        _asm {
            mov esi, psrc
            mov edi, pdest

            mov ebx, width_remainder // ebx = number of pixels to draw (bytes)

            mov edx, lines       // edx = line loop counter
            test edx, edx        // make sure there is >0 lines to draw
        draw_line:
            jz done              // if no more lines to draw, then we're done

            mov ecx, ebx         // draw pixels (bytes)
            rep movsb

            add esi, src_y_inc   // move to next line
            add edi, dest_y_inc
            dec edx              // decrease line loop counter
            jmp draw_line
        done:
        }
    }
}

Initial testing was good (when using debugging compiler options)! Then I switched to release optimizations and ran through more scenarios and noticed problems... eventually when I thought to look at the assembly output, I noticed this:

; void surface_blit_region_f(const SURFACE *src,
;                            SURFACE *dest,
;                            int src_x,
;                            int src_y,
;                            int src_width,
;                            int src_height,
;                            int dest_x,
;                            int dest_y) {
;     const byte *psrc;
;     byte *pdest;
;     int lines;
surface_blit_region_f_:  
                push    esi
                push    edi
                push    ebp
                mov     ebp,esp
                sub     esp,0000001cH
L53:            mov     esi,ebx  
                mov     ebx,ecx
                mov     ecx,dword ptr +10H[ebp]

;     int src_y_inc = src->width - src_width;
                mov     edi,dword ptr [eax]
                sub     edi,ecx
                mov     dword ptr -10H[ebp],edi

;     int dest_y_inc = dest->width - src_width;
;     int width_4, width_remainder;
; 
;     psrc = (const byte*)surface_pointer(src, src_x, src_y);
;     pdest = (byte*)surface_pointer(dest, dest_x, dest_y);
                mov     edi,dword ptr [edx]
                sub     edi,ecx
                mov     dword ptr -0cH[ebp],edi
L54:            imul    ebx,dword ptr [eax]  
                mov     eax,dword ptr +8H[eax]
                add     esi,ebx
                add     eax,esi
                mov     dword ptr -1cH[ebp],eax
L55:            mov     ebx,dword ptr +1cH[ebp]  
                mov     eax,dword ptr [edx]
                imul    eax,ebx
                mov     esi,dword ptr +18H[ebp]
                mov     edx,dword ptr +8H[edx]
                add     eax,esi
                add     eax,edx
                mov     dword ptr -18H[ebp],eax

;     lines = src_height;
; 
                mov     eax,dword ptr +14H[ebp]
                mov     dword ptr -14H[ebp],eax
                mov     edx,ecx

;     width_4 = src_width / 4;
                mov     eax,ecx
                sar     edx,1fH
                shl     edx,02H
                sbb     eax,edx
                sar     eax,02H
                mov     dword ptr -8H[ebp],eax

;     width_remainder = src_width & 3;
; 
                and     ecx,00000003H
                mov     dword ptr -4H[ebp],ecx

;     if (width_4 && !width_remainder) {
;         // width is a multiple of 4 (no remainder)
                mov     edi,dword ptr -8H[ebp]
L56:            test    edi,edi  
                je      short L57
                cmp     dword ptr -4H[ebp],00000000H
                jne     short L57

;         _asm {
;             mov esi, psrc
;             mov edi, pdest
; 
;             mov eax, width_4     // eax = number of 4-pixel runs (dwords)
; 
;             mov edx, lines       // edx = line loop counter
;             test edx, edx        // make sure there is >0 lines to draw
;         draw_line:
;             jz done              // if no more lines to draw, then we're done
; 
;             mov ecx, eax         // draw all 4-pixel runs (dwords)
;             rep movsd
; 
;             add esi, src_y_inc   // move to next line
;             add edi, dest_y_inc
;             dec edx              // decrease line loop counter
;             jmp draw_line
;         done:
;         }
; 
                mov     esi,dword ptr -1cH[ebp]
                mov     edi,dword ptr -18H[ebp]
                mov     eax,dword ptr -8H[ebp]
                mov     edx,dword ptr -14H[ebp]
                test    edx,edx
                je      short L58
                mov     ecx,eax
                repe    movsd    
                add     esi,dword ptr -10H[ebp]

;     } else if (width_4 && width_remainder) {
;         // width is >= 4 and there is a remainder ( <= 3 )
                jmp     short L63
L57:            cmp     dword ptr -8H[ebp],00000000H  
                je      short L61
                DB      83H,7dH,0fcH,00H
                je      short L61

;         _asm {
;             mov esi, psrc
;             mov edi, pdest
; 
;             mov eax, width_4         // eax = number of 4-pixel runs (dwords)
;             mov ebx, width_remainder // ebx = remaining number of pixels
; 
;             mov edx, lines       // edx = line loop counter
;             test edx, edx        // make sure there is >0 lines to draw
;         draw_line:
;             jz done              // if no more lines to draw, then we're done
; 
;             mov ecx, eax         // draw all 4-pixel runs (dwords)
;             rep movsd
;             mov ecx, ebx         // draw remaining pixels ( <= 3 bytes )
;             rep movsb
; 
;             add esi, src_y_inc   // move to next line
;             add edi, dest_y_inc
;             dec edx              // decrease line loop counter
;             jmp draw_line
;         done:
;         }
; 
                mov     esi,dword ptr -1cH[ebp]
                mov     edi,dword ptr -18H[ebp]
                mov     eax,dword ptr -8H[ebp]
                mov     ebx,dword ptr -4H[ebp]
                mov     edx,dword ptr -14H[ebp]
                test    edx,edx
L59:            je      short L60  
                mov     ecx,eax
                repe    movsd    
                mov     ecx,ebx
                repe    movsb    
                add     esi,dword ptr -10H[ebp]
                add     edi,dword ptr -0cH[ebp]
                dec     edx
                jmp     short L59

;     } else {
;         // width is <= 3
L60:            jmp     short L64

;         _asm {
;             mov esi, psrc
;             mov edi, pdest
; 
;             mov eax, width_remainder // ebx = number of pixels to draw (bytes)
; 
;             mov edx, lines       // edx = line loop counter
;             test edx, edx        // make sure there is >0 lines to draw
;         draw_line:
;             jz done              // if no more lines to draw, then we're done
; 
;             mov ecx, ebx         // draw pixels (bytes)
;             rep movsb
; 
;             add esi, src_y_inc   // move to next line
;             add edi, dest_y_inc
;             dec edx              // decrease line loop counter
;             jmp draw_line
;         done:
;         }
;         }
L61:            mov     esi,dword ptr -1cH[ebp]  
                mov     edi,dword ptr -18H[ebp]
                mov     eax,dword ptr -4H[ebp]
                mov     edx,dword ptr -14H[ebp]
                test    edx,edx
L62:            je      short L64  
                mov     ecx,ebx
                repe    movsb    
                add     esi,dword ptr -10H[ebp]
L63:            add     edi,dword ptr -0cH[ebp]  
                dec     edx
                jmp     short L62

; }
; 
L64:            mov     esp,ebp  
                pop     ebp
                pop     edi
                pop     esi
                ret     0010H

At first glance, this may look fine. But look at the code for the first scenario blit within the width_4 && !width_remainder condition:

                mov     esi,dword ptr -1cH[ebp]
                mov     edi,dword ptr -18H[ebp]
                mov     eax,dword ptr -8H[ebp]
                mov     edx,dword ptr -14H[ebp]
                test    edx,edx
                je      short L58
                mov     ecx,eax
                repe    movsd    
                add     esi,dword ptr -10H[ebp]

;     } else if (width_4 && width_remainder) {
;         // width is >= 4 and there is a remainder ( <= 3 )
                jmp     short L63
L57:            cmp     dword ptr -8H[ebp],00000000H  
                je      short L61
                DB      83H,7dH,0fcH,00H
                je      short L61

Uhh, what? The compiler just appeared to have chopped off the last bit of the blit loop and then headed on to the following else if. Well, what's at the L63 label that it jumps to...

;         _asm {
;             mov esi, psrc
;             mov edi, pdest
; 
;             mov eax, width_remainder // ebx = number of pixels to draw (bytes)
; 
;             mov edx, lines       // edx = line loop counter
;             test edx, edx        // make sure there is >0 lines to draw
;         draw_line:
;             jz done              // if no more lines to draw, then we're done
; 
;             mov ecx, ebx         // draw pixels (bytes)
;             rep movsb
; 
;             add esi, src_y_inc   // move to next line
;             add edi, dest_y_inc
;             dec edx              // decrease line loop counter
;             jmp draw_line
;         done:
;         }
;         }
L61:            mov     esi,dword ptr -1cH[ebp]  
                mov     edi,dword ptr -18H[ebp]
                mov     eax,dword ptr -4H[ebp]
                mov     edx,dword ptr -14H[ebp]
                test    edx,edx
L62:            je      short L64  
                mov     ecx,ebx
                repe    movsb    
                add     esi,dword ptr -10H[ebp]
L63:            add     edi,dword ptr -0cH[ebp]  
                dec     edx
                jmp     short L62

Huh. It jumps into the end of the third blit scenario's loop. Of course, then it jumps back to L62 and continues running the wrong blit from that point on.

So obviously at this point the logical conclusion is that the compiler was mixed up because I had three _asm blocks with identical labels. This was actually something I had used elsewhere with no problems, as Watcom seems able to make labels within any _asm block unique to that block only. But what I was seeing here seemed to indicate that this was maybe not a bullet proof feature. So I changed all the labels to be uniquely named and noticed no change whatsoever! Huh?

A short while later I just decided to try adding nops at random places. Confusingly enough, that seemed to do the trick:

;         _asm {
;             mov esi, psrc
;             mov edi, pdest
; 
;             mov eax, width_4     // eax = number of 4-pixel runs (dwords)
; 
;             mov edx, lines       // edx = line loop counter
;             test edx, edx        // make sure there is >0 lines to draw
;         draw_line:
;             jz done              // if no more lines to draw, then we're done
; 
;             mov ecx, eax         // draw all 4-pixel runs (dwords)
;             rep movsd
; 
;             nop
;             add esi, src_y_inc   // move to next line
;             add edi, dest_y_inc
;             dec edx              // decrease line loop counter
;             jmp draw_line
;         done:
;         }
; 
                mov     esi,dword ptr -1cH[ebp]
                mov     edi,dword ptr -18H[ebp]
                mov     eax,dword ptr -8H[ebp]
                mov     edx,dword ptr -14H[ebp]
                test    edx,edx
L57:            je      short L58  
                mov     ecx,eax
                repe    movsd    
                nop     
                add     esi,dword ptr -10H[ebp]
                add     edi,dword ptr -0cH[ebp]
                dec     edx
                jmp     short L57

I experimented with the placement of the nop a little more and it can be pretty much anywhere after L57 as shown above (and before the final jmp of course) and it "fixes" the problem. What the heck?

Another thing I tried was rearranging where I do the jz (or je):

;         _asm {
;             mov esi, psrc
;             mov edi, pdest
; 
;             mov ebx, width_4     // eax = number of 4-pixel runs (dwords)
; 
;             mov edx, lines       // edx = line loop counter
;             test edx, edx        // make sure there is >0 lines to draw
;             jz done              // if no more lines to draw, then we're done
;         draw_line:
;             mov ecx, ebx         // draw all 4-pixel runs (dwords)
;             rep movsd
; 
;             add esi, src_y_inc   // move to next line
;             add edi, dest_y_inc
;             dec edx              // decrease line loop counter
;             jz done              // if no more lines to draw, then we're done
;             jmp draw_line
;         done:
;         }
; 
                mov     esi,dword ptr -1cH[ebp]
                mov     edi,dword ptr -18H[ebp]
                mov     ebx,dword ptr -8H[ebp]
                mov     edx,dword ptr -14H[ebp]
                test    edx,edx
                je      short L58
L57:            mov     ecx,ebx  
                repe    movsd    
                add     esi,dword ptr -10H[ebp]
                add     edi,dword ptr -0cH[ebp]
                dec     edx
                je      short L58
                jmp     short L57

So that solves the problem too.

Of course, I don't like this at all. Why does this particular piece of code cause the compiler to mess up like this? I fear I may never know! One last thing I wanted to try was using the Watcom-recommended approach to inline assembly, and to use #pragma aux instead of _asm. This was also improved in 11.0 allowing you to also refer to your C variables just as I was doing here with _asm. Of course, the syntax is far uglier, but it does have the added benefit of allowing the compiler to stitch together your assembly with the surrounding code a bit better:

void surface_blit_region_f(const SURFACE *src,  
                           SURFACE *dest,
                           int src_x,
                           int src_y,
                           int src_width,
                           int src_height,
                           int dest_x,
                           int dest_y) {
    const byte *psrc;
    byte *pdest;
    int lines;
    int src_y_inc = src->width - src_width;
    int dest_y_inc = dest->width - src_width;
    int width_4, width_remainder;

    psrc = (const byte*)surface_pointer(src, src_x, src_y);
    pdest = (byte*)surface_pointer(dest, dest_x, dest_y);
    lines = src_height;

    width_4 = src_width / 4;
    width_remainder = src_width & 3;

    if (width_4 && !width_remainder) {
        // width is a multiple of 4 (no remainder)
        extern void _inner_blit4(byte *dest, const byte *src, int width4, int lines);
        #pragma aux _inner_blit4 =       \
            "    test edx, edx"          \
            "draw_line:"                 \
            "    jz done"                \
            ""                           \
            "    mov ecx, eax"           \
            "    rep movsd"              \
            ""                           \
            "    add esi, src_y_inc"     \
            "    add edi, dest_y_inc"    \
            "    dec edx"                \
            "    jmp draw_line"          \
            "done:"                      \
            parm [edi] [esi] [eax] [edx] \
            modify [ecx];

        _inner_blit4(pdest, psrc, width_4, lines);

    } else if (width_4 && width_remainder) {
        // width is >= 4 and there is a remainder ( <= 3 )
        extern void _inner_blit4r(byte *dest, const byte *src, int width4, int remainder, int lines);
        #pragma aux _inner_blit4r =            \
            "    test edx, edx"                \
            "draw_line:"                       \
            "    jz done"                      \
            ""                                 \
            "    mov ecx, eax"                 \
            "    rep movsd"                    \
            "    mov ecx, ebx"                 \
            "    rep movsb"                    \
            ""                                 \
            "    add esi, src_y_inc"           \
            "    add edi, dest_y_inc"          \
            "    dec edx"                      \
            "    jmp draw_line"                \
            "done:"                            \
            parm [edi] [esi] [eax] [ebx] [edx] \
            modify [ecx];

        _inner_blit4r(pdest, psrc, width_4, width_remainder, lines);

    } else {
        // width is <= 3
        extern void _inner_blitb(byte *dest, const byte *src, int width, int lines);
        #pragma aux _inner_blitb =       \
            "    test edx, edx"          \
            "draw_line:"                 \
            "    jz done"                \
            ""                           \
            "    mov ecx, ebx"           \
            "    rep movsb"              \
            ""                           \
            "    add esi, src_y_inc"     \
            "    add edi, dest_y_inc"    \
            "    dec edx"                \
            "    jmp draw_line"          \
            "done:"                      \
            parm [edi] [esi] [ebx] [edx] \
            modify [ecx];

        _inner_blitb(pdest, psrc, width_remainder, lines);
    }
}

And the relevant compiler generated output:

;     if (width_4 && !width_remainder) {
;         // width is a multiple of 4 (no remainder)
;         extern void _inner_blit4(byte *dest, const byte *src, int width4, int lines);
;         #pragma aux _inner_blit4 =       \
;             "    test edx, edx"          \
;             "draw_line:"                 \
;             "    jz done"                \
;             ""                           \
;             "    mov ecx, eax"           \
;             "    rep movsd"              \
;             ""                           \
;             "    add esi, src_y_inc"     \
;             "    add edi, dest_y_inc"    \
;             "    dec edx"                \
;             "    jmp draw_line"          \
;             "done:"                      \
;             parm [edi] [esi] [eax] [edx] \
;             modify [ecx];
; 
L56:            test    eax,eax  
                je      short L57
                test    ebx,ebx
                jne     short L57

;         _inner_blit4(pdest, psrc, width_4, lines);
; 
                mov     edx,edi
                mov     edi,ecx
                test    edx,edx
                je      short L58
                mov     ecx,eax
                repe    movsd    
                add     esi,dword ptr -8H[ebp]

;     } else if (width_4 && width_remainder) {
;         // width is >= 4 and there is a remainder ( <= 3 )
;         extern void _inner_blit4r(byte *dest, const byte *src, int width4, int remainder, int lines);
;         #pragma aux _inner_blit4r =            \
;             "    test edx, edx"                \
;             "draw_line:"                       \
;             "    jz done"                      \
;             ""                                 \
;             "    mov ecx, eax"                 \
;             "    rep movsd"                    \
;             "    mov ecx, ebx"                 \
;             "    rep movsb"                    \
;             ""                                 \
;             "    add esi, src_y_inc"           \
;             "    add edi, dest_y_inc"          \
;             "    dec edx"                      \
;             "    jmp draw_line"                \
;             "done:"                            \
;             parm [edi] [esi] [eax] [ebx] [edx] \
;             modify [ecx];
; 
                jmp     short L63
L57:            test    eax,eax  
                je      short L61
                test    ebx,ebx
                DB      74H,21H

So, still results in the same bug. Bleh.

Just to take another opportunity to dump a bunch more code in this post, here's my surface_blit_sprite_region_f function which is basically the same idea as surface_blit_region_f, except that as it's name suggests, it deals with transparency and skips over source pixels that are colour zero. But the same idea of splitting it up into three separate blit loops for the different scenarios outlined above is still there, complete with three separate _asm blocks:

void surface_blit_sprite_region_f(const SURFACE *src,  
                                  SURFACE *dest,
                                  int src_x,
                                  int src_y,
                                  int src_width,
                                  int src_height,
                                  int dest_x,
                                  int dest_y) {
    const byte *psrc;
    byte *pdest;
    byte pixel;
    int src_y_inc, dest_y_inc;
    int width, width_4, width_remainder;
    int lines_left;
    int x;

    psrc = (const byte*)surface_pointer(src, src_x, src_y);
    src_y_inc = src->width;
    pdest = (byte*)surface_pointer(dest, dest_x, dest_y);
    dest_y_inc = dest->width;
    width = src_width;
    lines_left = src_height;
    src_y_inc -= width;
    dest_y_inc -= width;

    width_4 = width / 4;
    width_remainder = width & 3;

    if (width_4 && !width_remainder) {
        // width is a multiple of 4 (no remainder)
        _asm {
            mov esi, psrc
            mov edi, pdest

            mov ebx, width_4      // get number of 4-pixel runs to be drawn
            mov ecx, lines_left
            test ecx, ecx         // make sure there is >0 lines to be drawn
draw_line:  
            jz done

start_4_run:  
            mov edx, ebx          // dx = counter of 4-pixel runs left to draw
draw_px_0:  
            mov al, [esi]+0       // load src pixel
            test al, al
            jz draw_px_1          // if it is color 0, skip it
            mov [edi]+0, al       // otherwise, draw it onto dest
draw_px_1:  
            mov al, [esi]+1
            test al, al
            jz draw_px_2
            mov [edi]+1, al
draw_px_2:  
            mov al, [esi]+2
            test al, al
            jz draw_px_3
            mov [edi]+2, al
draw_px_3:  
            mov al, [esi]+3
            test al, al
            jz end_4_run
            mov [edi]+3, al
end_4_run:  
            add esi, 4            // move src and dest up 4 pixels
            add edi, 4
            dec edx               // decrease 4-pixel run loop counter
            jnz draw_px_0         // if there are still more runs, draw them

end_line:  
            add esi, src_y_inc    // move src and dest to start of next line
            add edi, dest_y_inc
            dec ecx               // decrease line loop counter
            jmp draw_line
done:  
        }


    } else if (width_4 && width_remainder) {
        // width is >= 4 and there is a remainder ( <= 3 )
        _asm {
            mov esi, psrc
            mov edi, pdest

            mov ebx, width_4      // get number of 4-pixel runs to be drawn
            mov ecx, lines_left
            test ecx, ecx         // make sure there is >0 lines to be drawn
draw_line:  
            jz done

            test ebx, ebx
            jz start_remainder_run // if no 4-pixel runs, just draw remainder

start_4_run:                      // draw 4-pixel runs first  
            mov edx, ebx          // dx = counter of 4-pixel runs left to draw
draw_px_0:  
            mov al, [esi]+0       // load src pixel
            test al, al
            jz draw_px_1          // if it is color 0, skip it
            mov [edi]+0, al       // otherwise, draw it onto dest
draw_px_1:  
            mov al, [esi]+1
            test al, al
            jz draw_px_2
            mov [edi]+1, al
draw_px_2:  
            mov al, [esi]+2
            test al, al
            jz draw_px_3
            mov [edi]+2, al
draw_px_3:  
            mov al, [esi]+3
            test al, al
            jz end_4_run
            mov [edi]+3, al
end_4_run:  
            add esi, 4            // move src and dest up 4 pixels
            add edi, 4
            dec edx               // decrease 4-pixel run loop counter
            jnz draw_px_0         // if there are still more runs, draw them

start_remainder_run:              // now draw remaining pixels ( <= 3 pixels )  
            mov edx, width_remainder // dx = counter of remaining pixels
            test edx, edx
            jz end_line           // if no remaining pixels, goto line end

draw_pixel:  
            mov al, [esi]         // load pixel
            inc esi
            test al, al           // if zero, skip to next pixel
            jz end_pixel
            mov [edi], al         // else, draw pixel
end_pixel:  
            inc edi
            dec edx
            jz end_line           // loop while (x)
            jmp draw_pixel

end_line:  
            add esi, src_y_inc    // move src and dest to start of next line
            add edi, dest_y_inc
            dec ecx               // decrease line loop counter
            jmp draw_line
done:  
        }

    } else {
        // width is <= 3
        _asm {
            mov esi, psrc
            mov edi, pdest

            mov ebx, width        // get number of pixels to be drawn
            mov ecx, lines_left
            test ecx, ecx         // make sure there is >0 lines to be drawn
draw_line:  
            jz done

            mov edx, ebx          // dx = counter of remaining pixels
draw_pixel:  
            mov al, [esi]         // load pixel
            inc esi
            test al, al           // if zero, skip to next pixel
            jz end_pixel
            mov [edi], al         // else, draw pixel
end_pixel:  
            inc edi
            dec edx
            jz end_line           // loop while (x)
            jmp draw_pixel

end_line:  
            add esi, src_y_inc    // move src and dest to start of next line
            add edi, dest_y_inc
            dec ecx               // decrease line loop counter
            jmp draw_line
done:  
        }
    }
}

And the code that the compiler generated:

; void surface_blit_sprite_region_f(const SURFACE *src,
;                                   SURFACE *dest,
;                                   int src_x,
;                                   int src_y,
;                                   int src_width,
;                                   int src_height,
;                                   int dest_x,
;                                   int dest_y) {
;     const byte *psrc;
;     byte *pdest;
;     byte pixel;
;     int src_y_inc, dest_y_inc;
;     int width, width_4, width_remainder;
;     int lines_left;
;     int x;
; 
;     psrc = (const byte*)surface_pointer(src, src_x, src_y);
surface_blit_sprite_region_f_:  
                push    esi
                push    edi
                push    ebp
                mov     ebp,esp
                sub     esp,00000020H
L69:            mov     esi,dword ptr [eax]  
                imul    ecx,esi
                add     ebx,ecx
                mov     ecx,dword ptr +8H[eax]
                add     ecx,ebx
                mov     dword ptr -20H[ebp],ecx

;     src_y_inc = src->width;
;     pdest = (byte*)surface_pointer(dest, dest_x, dest_y);
                mov     dword ptr -18H[ebp],esi
L70:            mov     edi,dword ptr +1cH[ebp]  
                mov     eax,dword ptr [edx]
                imul    eax,edi
                add     eax,dword ptr +18H[ebp]
                mov     ecx,dword ptr +8H[edx]
                add     eax,ecx
                mov     dword ptr -1cH[ebp],eax

;     dest_y_inc = dest->width;
                mov     eax,dword ptr [edx]
                mov     dword ptr -14H[ebp],eax

;     width = src_width;
                mov     eax,dword ptr +10H[ebp]
                mov     dword ptr -10H[ebp],eax

;     lines_left = src_height;
                mov     eax,dword ptr +14H[ebp]
                mov     dword ptr -4H[ebp],eax

;     src_y_inc -= width;
                mov     eax,dword ptr -10H[ebp]
                sub     dword ptr -18H[ebp],eax

;     dest_y_inc -= width;
; 
                mov     eax,dword ptr -10H[ebp]
                sub     dword ptr -14H[ebp],eax

;     width_4 = width / 4;
                mov     eax,dword ptr -10H[ebp]
                mov     edx,dword ptr -10H[ebp]
                sar     edx,1fH
                shl     edx,02H
                sbb     eax,edx
                sar     eax,02H
                mov     dword ptr -0cH[ebp],eax

;     width_remainder = width & 3;
; 
                mov     eax,dword ptr -10H[ebp]
                and     eax,00000003H
                mov     dword ptr -8H[ebp],eax

;     if (width_4 && !width_remainder) {
;         // width is a multiple of 4 (no remainder)
                mov     edi,dword ptr -0cH[ebp]
L71:            test    edi,edi  
                je      short L79
                cmp     dword ptr -8H[ebp],00000000H
                jne     short L79

;         _asm {
;             mov esi, psrc
;             mov edi, pdest
; 
;             mov ebx, width_4      // get number of 4-pixel runs to be drawn
;             mov ecx, lines_left
;             test ecx, ecx         // make sure there is >0 lines to be drawn
; draw_line:
;             jz done
; 
; start_4_run:
;             mov edx, ebx          // dx = counter of 4-pixel runs left to draw
; draw_px_0:
;             mov al, [esi]+0       // load src pixel
;             test al, al
;             jz draw_px_1          // if it is color 0, skip it
;             mov [edi]+0, al       // otherwise, draw it onto dest
; draw_px_1:
;             mov al, [esi]+1
;             test al, al
;             jz draw_px_2
;             mov [edi]+1, al
; draw_px_2:
;             mov al, [esi]+2
;             test al, al
;             jz draw_px_3
;             mov [edi]+2, al
; draw_px_3:
;             mov al, [esi]+3
;             test al, al
;             jz end_4_run
;             mov [edi]+3, al
; end_4_run:
;             add esi, 4            // move src and dest up 4 pixels
;             add edi, 4
;             dec edx               // decrease 4-pixel run loop counter
;             jnz draw_px_0         // if there are still more runs, draw them
; 
; end_line:
;             add esi, src_y_inc    // move src and dest to start of next line
;             add edi, dest_y_inc
;             dec ecx               // decrease line loop counter
;             jmp draw_line
; done:
;         }
; 
; 
                mov     esi,dword ptr -20H[ebp]
                mov     edi,dword ptr -1cH[ebp]
                mov     ebx,dword ptr -0cH[ebp]
                mov     ecx,dword ptr -4H[ebp]
                test    ecx,ecx
L72:            je      short L78  
                mov     edx,ebx
L73:            mov     al,byte ptr [esi]  
                test    al,al
                je      short L74
                mov     byte ptr [edi],al
L74:            mov     al,byte ptr +1H[esi]  
                test    al,al
                je      short L75
                mov     byte ptr +1H[edi],al
L75:            mov     al,byte ptr +2H[esi]  
                test    al,al
                je      short L76
                mov     byte ptr +2H[edi],al
L76:            mov     al,byte ptr +3H[esi]  
                test    al,al
                je      short L77
                mov     byte ptr +3H[edi],al
L77:            add     esi,00000004H  
                add     edi,00000004H
                dec     edx
                jne     short L73
                add     esi,dword ptr -18H[ebp]
                add     edi,dword ptr -14H[ebp]
                dec     ecx
                jmp     short L72

;     } else if (width_4 && width_remainder) {
;         // width is >= 4 and there is a remainder ( <= 3 )
L78:            jmp     near ptr L96  
L79:            cmp     dword ptr -0cH[ebp],00000000H  
                je      near ptr L91
                cmp     dword ptr -8H[ebp],00000000H
                je      near ptr L91

;         _asm {
;             mov esi, psrc
;             mov edi, pdest
; 
;             mov ebx, width_4      // get number of 4-pixel runs to be drawn
;             mov ecx, lines_left
;             test ecx, ecx         // make sure there is >0 lines to be drawn
; draw_line:
;             jz done
; 
;             test ebx, ebx
;             jz start_remainder_run // if no 4-pixel runs, just draw remainder
; 
; start_4_run:                      // draw 4-pixel runs first
;             mov edx, ebx          // dx = counter of 4-pixel runs left to draw
; draw_px_0:
;             mov al, [esi]+0       // load src pixel
;             test al, al
;             jz draw_px_1          // if it is color 0, skip it
;             mov [edi]+0, al       // otherwise, draw it onto dest
; draw_px_1:
;             mov al, [esi]+1
;             test al, al
;             jz draw_px_2
;             mov [edi]+1, al
; draw_px_2:
;             mov al, [esi]+2
;             test al, al
;             jz draw_px_3
;             mov [edi]+2, al
; draw_px_3:
;             mov al, [esi]+3
;             test al, al
;             jz end_4_run
;             mov [edi]+3, al
; end_4_run:
;             add esi, 4            // move src and dest up 4 pixels
;             add edi, 4
;             dec edx               // decrease 4-pixel run loop counter
;             jnz draw_px_0         // if there are still more runs, draw them
; 
; start_remainder_run:              // now draw remaining pixels ( <= 3 pixels )
;             mov edx, width_remainder // dx = counter of remaining pixels
;             test edx, edx
;             jz end_line           // if no remaining pixels, goto line end
; 
; draw_pixel:
;             mov al, [esi]         // load pixel
;             inc esi
;             test al, al           // if zero, skip to next pixel
;             jz end_pixel
;             mov [edi], al         // else, draw pixel
; end_pixel:
;             inc edi
;             dec edx
;             jz end_line           // loop while (x)
;             jmp draw_pixel
; 
; end_line:
;             add esi, src_y_inc    // move src and dest to start of next line
;             add edi, dest_y_inc
;             dec ecx               // decrease line loop counter
;             jmp draw_line
; done:
;         }
; 
;     } else {
;         // width is <= 3
                mov     esi,dword ptr -20H[ebp]
                mov     edi,dword ptr -1cH[ebp]
                mov     ebx,dword ptr -0cH[ebp]
                mov     ecx,dword ptr -4H[ebp]
                test    ecx,ecx
L80:            je      short L90  
                test    ebx,ebx
                je      short L86
                mov     edx,ebx
L81:            mov     al,byte ptr [esi]  
                test    al,al
                je      short L82
                mov     byte ptr [edi],al
L82:            mov     al,byte ptr +1H[esi]  
                test    al,al
                je      short L83
                mov     byte ptr +1H[edi],al
L83:            mov     al,byte ptr +2H[esi]  
                test    al,al
                je      short L84
                mov     byte ptr +2H[edi],al
L84:            mov     al,byte ptr +3H[esi]  
                test    al,al
                je      short L85
                mov     byte ptr +3H[edi],al
L85:            add     esi,00000004H  
                add     edi,00000004H
                dec     edx
                jne     short L81
L86:            mov     edx,dword ptr -8H[ebp]  
                test    edx,edx
                je      short L89
L87:            mov     al,byte ptr [esi]  
                inc     esi
                test    al,al
                je      short L88
                mov     byte ptr [edi],al
L88:            inc     edi  
                dec     edx
                je      short L89
                jmp     short L87
L89:            add     esi,dword ptr -18H[ebp]  
                add     edi,dword ptr -14H[ebp]
                dec     ecx
                jmp     short L80
L90:            mov     esp,ebp  
                pop     ebp
                pop     edi
                pop     esi
                ret     0010H

;         _asm {
;             mov esi, psrc
;             mov edi, pdest
; 
;             mov ebx, width        // get number of pixels to be drawn
;             mov ecx, lines_left
;             test ecx, ecx         // make sure there is >0 lines to be drawn
; draw_line:
;             jz done
; 
;             mov edx, ebx          // dx = counter of remaining pixels
; draw_pixel:
;             mov al, [esi]         // load pixel
;             inc esi
;             test al, al           // if zero, skip to next pixel
;             jz end_pixel
;             mov [edi], al         // else, draw pixel
; end_pixel:
;             inc edi
;             dec edx
;             jz end_line           // loop while (x)
;             jmp draw_pixel
; 
; end_line:
;             add esi, src_y_inc    // move src and dest to start of next line
;             add edi, dest_y_inc
;             dec ecx               // decrease line loop counter
;             jmp draw_line
; done:
;         }
;     }
L91:            mov     esi,dword ptr -20H[ebp]  
                mov     edi,dword ptr -1cH[ebp]
                mov     ebx,dword ptr -10H[ebp]
                mov     ecx,dword ptr -4H[ebp]
                test    ecx,ecx
L92:            je      short L96  
                mov     edx,ebx
L93:            mov     al,byte ptr [esi]  
                inc     esi
                test    al,al
                je      short L94
                mov     byte ptr [edi],al
L94:            inc     edi  
                dec     edx
                je      short L95
                jmp     short L93
L95:            add     esi,dword ptr -18H[ebp]  
                add     edi,dword ptr -14H[ebp]
                dec     ecx
                jmp     short L92

; }
; 
L96:            mov     esp,ebp  
                pop     ebp
                pop     edi
                pop     esi
                ret     0010H

surface_blit_sprite_region_f passes the exact same set of tests (under both, debug and release optimizations) that I ran surface_blit_region_f through with no problems at all. Aarrghh!

So what is the takeaway? Well, I won't be relying on inline assembly anymore. Realistically, I wasn't planning on having gobs of assembly in libDGL anyway. Just for some inner loops and such that I wanted to make sure were running as lean as possible. Instead of writing it all inline, I'll just be moving it out to separate assembly source files and using either WASM or TASM.

Exploring Watcom C/C++

A couple weeks ago I picked up a complete boxed copy of Watcom 10.0a. I was also able to find an almost complete set of the printed documentation/manuals for Watcom 9.5 from a somewhat-local seller.

I do love having physical copies of books, especially reference material like this. Most people will probably think I'm weird, but I find this more convenient to refer to in many (but not all) cases. Being able to physically flip through pages at will and back and forth between several different pages as needed just feels a ton more convenient to me then doing the same with a PDF. And I've always preferred reading text like this on real paper. The obvious downsides versus an electronic copy is the lack of ability to do a Ctrl-F to find something specific and also not being able to copy+paste something from it.

So what's so special about Watcom anyway? Well, for me personally, it was the first C compiler I ever used. Not this version though. In 1996 I remember my dad bought me a book, "C DiskTutor" by L. John Ribar (published in 1992) that came with a disk containing sample code and a stripped down copy of the Watcom C 8.5 compiler. The disk actually had a few bad spots on it and I remember he called the publisher and they sent out another disk. I didn't realize it at the time, but the version of the compiler on the disk lacked the ability to compile code with any memory model other then the small model, so it wasn't suitable for really any "real world" use. Certainly good enough for introductory exercises though, which was obviously the intended purpose.

I didn't spend a whole lot of time using it though. I remember getting frustrated with fixing my own bugs as there was no included debugger with this stripped down version. At least I don't remember there being one. But also I remember trying to make sense of the poor compiler error messages was tough for me being completely new to C at the time. Finally, I also remember really hating how I had to quit out of EDIT.COM to compile and run my code. The disk included some simple text editor which seemed to be written specifically for this book. It did have the ability to run the compiler from within the editor, but I vaguely remember arriving at the conclusion that this editor was not very good and a bit buggy. EDIT.COM ended up being a less-frustrating option for writing code. Keep in mind my only other programming experience at this time was a little over a year with QBasic which had a much faster and easier edit/compile/run cycle. I didn't understand why anyone would want to use something else when it was so much easier! Of course years later I learnt that many people at that time used better editors then EDIT.COM and actually really liked writing code in their editor of choice and using external build scripts / Makefiles. Heh.

Beyond my nostalgia though, Watcom was used by all sorts of game developers who cared about code efficiency in the early-to-mid 90's, such as id Software with Doom and 3D Realms with Duke Nukem 3D amongst many others. Plus it shipped with a royalty-free version of the well known DOS extender DOS/4GW. However, while it was better at code generation then the competition (Borland, Microsoft, etc), it lacked in the end-user experience. Borland and Microsoft had better tooling in the form of IDE's (Watcom didn't ship an IDE with their compiler until 10.0) and easier to use command line tools.

Eventually other compilers caught up and Watcom converted to an open-source project when Sybase (who owned Watcom) decided to end-of-life the compiler.

So far I've been using DJGPP (which is just a DOS port of GCC) for all my DOS coding projects. I'm using an old version that corresponds roughly with what was current when I first discovered it for no other reason then that it feels "true" to the whole spirit of these DOS coding projects.

DJGPP is pretty awesome honestly. Since it's just an older version of GCC for DOS, it feels very familiar. Lots of the same tools are present, including GDB and MAKE. And the code generation is really quite excellent. I think my only gripes with DJGPP are that it uses the AT&T syntax for inline assembly and that I kind of have a tiny bit of a love/hate relationship with the CWSDPMI DOS extender and DPMI server it uses. It's nothing major really, just minor gripes relating around how it seems more "complex" to use on the surface as compared to something like DOS/4G. But I realize that there are good reasons for it. And personally I believe that CWSDPMI is probably a bit more robust then DOS/4GW, but I could be wrong. Feels that way sometimes though.

RHIDE is an IDE for DJGPP that seems to be largely inspired by Borland's DOS IDEs. It's also pretty great and is very quick to get up and running. You don't even need to do any fiddling about with GCC or MAKE directly when using RHIDE as it will handle that all for you. Awesome. Though it's not without its quirks and limitations. The editor has some weird (to me) ideas about indentation that I've only been able to "resolve" by using a very specific combination of options that involves mixing tabs and spaces in a way that makes me sad.

Additionally, while the GDB integration is good, it definitely feels like it's missing some extra polish. There's no automatic locals watching (have to explicitly add anything you want to watch the value of). The watch viewer itself is quite simplistic and watching the values of structs is ... a poor experience. Well, not until you start wanting to watch more complex values anyway. The whole watch window really needed to be implemented as an expanding/collapsing node tree type of viewer, but it's not sadly. And there's no memory viewer. The disassembler viewer really lacks a source code + assembly code combined mode. It's not the end of the world, as when you're stepping through code you can see both and the current line is highlighted in both places. But still... it makes browsing the assembly view somewhat more annoying than it needed to be. Bleh. Oh, and there really isn't any better GDB frontend for DJGPP/DOS that I'm aware of. This is pretty much the best option. And to be fair, for the time, it wasn't bad at all. It's just that my nostalgia only goes so far.

So anyway, back to Watcom. The version I picked up (10.0a) is 4 years older then the version of DJGPP I've been using so far. So I didn't have any expectations about getting a "better" compiler or anything like that. But I was pretty excited to try it out because it really feels much more "true" to the spirit of using a 486 to do DOS programming as Watcom was the compiler that the majority of my favourite DOS games were written with. A silly reason, for sure, heh.

I'd heard the "new" IDE that Watcom introduced with 10.0 was kind of not good, and certainly that was my reaction to trying it out as well. It requires Windows 3.x and it does work. But it's pretty weird. The "IDE" is actually comprised of a series of separate executables that are all invoked from the main "IDE" (what is actually called the "IDE" by Watcom). It's more of a "project manager and GUI build tool" then anything else. You can see all the source files in your project and configure various compile and link options. Clicking on a source file causes a separate text editor application (and hence, a separate window) to open. Same goes with things like Windows resource files, etc. All separate tools. It's pretty weird. No other IDEs even at that time were like that. Certainly not any of the well known ones anyway. Kind of makes me curious what their reasons were for going with this design. Anyway, it's certainly not something I want to use. Additionally, I found that debugging graphical DOS programs launched from within Windows was problematic and resulted in crashes sometimes while stepping through code. So the Watcom IDE is out.

I had to admit at this point that I was really quite unfamiliar with what DOS text editors were out there. Turns out there was quite a lot. Many even had fancy features like syntax highlighting. Some also had fancy macro languages for additional configurability. Nice. Should be easy to find something suitable.

Heh.

It would take too long to write fully about this experience, but suffice it to say that while I didn't try everything on that list, I did try quite a lot. I realized that I am quite picky about the text editor I use. In weird ways anyway, probably mostly born out of habit and fear of change. Also anything based on VIM or Emacs I disregard because that's just how I roll.

I basically wanted the following features at an absolute minimum:

  • Syntax highlighting.
  • Multiple files open at once and able to easily switch between them.
  • External tools launching (to launch batch scripts / makefiles to run/compile/debug).
  • Ability to run in 80x50 text mode.

Not too much to ask for I think. Multiple editors had support for all of these in fact. The first one I decided to look at was SETEDIT because it's what RHIDE was made from (or is that the other way around?). I quickly discovered that launching DOS/4G programs from within CWSDPMI programs can be problematic. The DOS version of SETEDIT is built with DJGPP which means it uses the CWSDPMI DOS extender. Programs that I was going to be building with Watcom would make use of DOS/4GW. Additionally some of the Watcom tools, such as the DOS debugger WD, also use DOS/4GW. I was unable to launch the Watcom WD debugger from within SETEDIT without causing crashes. While some test programs I built which also used DOS/4GW could be launched fine, I didn't like the idea that down the road there might be problems. Ugh.

As it turns out, a lot of the better text editors were also built with DJGPP. All of these that I tried which supported launching external tools/programs also exhibited the same crashing problems when launching some DOS/4GW programs like WD. Ergh.

I do want to specifically say that I really liked FTE. Just a great, simple, minimalistic, easy to use editor which for some reason really resonated with me. However, it is built with DJGPP so it was sadly a no-go. That being said, I have a "conversion" project on my backlog list now, heh. If I decide to use Watcom over the long term, I intend to fully investigate whether it's possible to convert FTE to compile with Watcom so as to hopefully remove the crashing issues with DOS/4G programs.

Another editor I wanted to point out is something called "Power View IDE" which is actually a full DOS IDE for Watcom C/C++ that some guy wrote! I was ecstatic when I found this linked on some random forum post, seemingly as an afterthought by the poster. I don't think there is any existing project website for it, but it can be downloaded here. I couldn't believe that there was so little information out there about this and basically no one ever mentioned it anywhere.

This editor I think could have been exactly what I was looking for. Except it wasn't. It had some show-stopping bugs which made me oh so very sad. For one, when scrolling through text using the arrow keys on the keyboard (as I sometimes do), the editor would periodically insert random numbers ... !? What the heck? Maybe a bug in the keyboard handler? I took a look at the code and nothing stood out to me, so not sure what to make of that one. But yeah, an editor that randomly inserts random characters into your code is instantly disregarded. Another issue was that it didn't seem to correctly handle restoring the video mode when returning from executed programs. This became problematic when running the editor in 80x50 text modes which was a requirement of mine. Oh well. I guess the buggy nature of it is why no one ever really seemed to talk about it from what I could see.

To make this discussion of text editors not go on for many more paragraphs I'll cut to the chase and say that I eventually settled on Aurora. It was actually my last resort option, heh. While it has quite a lot of weird quirks that were initially off-putting for me, it is an amazingly customizable editor due to it's weird macro language "AML." Everything in the editor itself is largely implemented in AML, from how text editor behaviours work, to the menus, to the keyboard shortcuts, to file open/close dialogs, etc, etc. The DOS executable is basically just a runtime environment for AML and the Aurora text editor just comes with a pre-built set of compiled scripts that give you a text editor. Neat. It's reasonably well documented and the macro language has support for quite a lot. It appears to be a 16-bit real-mode DOS executable so no problems executing DOS/4GW programs from within the editor.

However, it had no real built in support for some nice "quality of life" things such as capturing build errors and displaying them nicely in the editor. "Well, with all the things AML supports, how hard can it be to implement this myself in exactly the way I want it?" This line of thinking kind of spiraled out of control quickly. Suffice it to say, there were way more quirks with AML then I had expected and the debugging facilities within Aurora were barebones at best. I think the hardest part was actually managing the lifecycle of a plugin that is meant to be called again and again from a menu item. It wasn't straightforward and was complicated by the fact that the description of how AML's "objects" work made me think more of traditional OOP style programming ... and it kind of isn't like that in a lot of ways. The extensive documentation unfortunately doesn't really tell you anything about actually building plugins for the editor specifically. Nothing in the way of concrete examples with explanations for how to do common things or why it's done this way, etc. Instead, it tends to just focus on the language basics and some API stuff... nothing to tie it all together. Oh well. I should be happy that it was as well documented as it was really. After suffering through some bizarre and hard to debug editor crashes from poor lifecycle management, I got something working!

At this point, I had put off "fixing" all the subjectively weird behaviours and keybindings that Aurora has out of the box (even when selecting a keybinding scheme during installation that was supposedly based on an editor I was mostly familiar with). Finally got that all mostly sorted out and at this point was thinking to myself "wow, this REALLY all had better have been worth it!" Meaning of course, that I really hoped I wouldn't grow to hate Watcom, heh.

The "WMAKE" menu I added contains options for Build, Rebuild, Clean, Run, Debug and Show Errors. Build errors will show up automatically after any action has finished, but it's a quick key-bound menu item to easily bring back up the error window if it gets hidden. These WMAKE actions basically require a makefile that contains targets for each menu item. Which is totally fine, as that was basically how I envisioned the makefile structure I was going to use anyway.

target_config = debug

target_name = dgl_test

object_files = &  
    blit.obj &
    clipping.obj &
    dgl.obj &
    draw.obj &
    gfx.obj &
    internal.obj &
    keyboard.obj &
    mathext.obj &
    mouse.obj &
    pcx.obj &
    util.obj &
    test.obj

cc_flags_debug = /d2 /zp4 /5r /fp3  
cc_flags_release = /d1+ /zp4 /5r /fp3 /oneatx  
cc_flags = /mf $(cc_flags_$(target_config))

link_flags_debug = debug all  
link_flags_release = debug all  
link_flags = $(link_flags_$(target_config))

.c.obj: .AUTODEPEND
    wcc386 $[. /zq $(cc_flags)

$(target_name).lnk: $(object_files)
    %create $^@
    %append $^@ NAME $(target_name).exe
    %append $^@ SYSTEM DOS4G
    %append $^@ OPTION QUIET
    @for %i in ($(object_files)) do %append $^@ FILE %i

$(target_name).exe: $(object_files) $(target_name).lnk
    wlink $(link_flags) @$(target_name).lnk

clean : .SYMBOLIC  
    del *.obj
    del *.err
    del $(target_name).exe
    del $(target_name).lnk

.NOCHECK
build : $(target_name).exe

.NOCHECK
run : $(target_name).exe  
    $(target_name).exe

.NOCHECK
debug : $(target_name).exe  
    wd /swap /trap=rsi $(target_name).exe

Basically relying on WMAKE to launch what I need once builds complete.

If that one other weirdo like me who wants to use Watcom C/C++ with the Aurora Text Editor comes by this site and reads this, here's my WMAKE.AML plugin. The two main things that are currently missing are "click on error to goto sourcefile/line" support and some kind of better window management (like, e.g. automatically resizing existing editor windows to have the errors window always visible on screen). The former will likely be added soon as it should be simple to add.

So anyway ... surely I would not have to all of this effort without having some idea of whether it was going to be worth it at the end right? Of course. Heh.

I'm quite a big fan of the Watcom debugger WD (previously called WVIDEO... not sure where the "video" part of that name comes from honestly).

It's kind of got a bit of a Visual Studio C++ debugger thing going on that I like (and this was even from before Visual Studio C++'s debugger got good I think). I think the one thing I dislike about it so far is that abruptly terminating a running program can leave your system in a weird state. For example, it doesn't seem to always restore overridden interrupt handlers. So not letting your running code run to the point where it removes it's custom keyboard interrupt handler can leave you unable to do anything once the debugger exits and returns to your text editor. Oops. DJGPP and GDB (and maybe CWSDPMI?) seemed to do a much better job of cleaning things up in these cases so your system didn't crash very often (though it wasn't perfect).

As for Watcom in general, I like that DOS/4GW seems to get out of your way a little bit more than CWSDPMI does. For example, the first 1MB of memory is directly accessible out of the box, so I can write into 0xA0000 immediately, or read directly from some memory in the BIOS to e.g. read the RTC tick count, or the BIOS fonts, etc. Just as DOS should be! Additionally, I can just call _dos_setvect and _dos_getvect directly as I would in a 16-bit realmode program and it will basically "just work" when I want to implement my own interrupt handler. Of course, it's not quite that simple in all cases, like when you want to use some bit of shared DOS memory and communicate with some process or piece of hardware that is running in 16-bit mode. But for the simple cases, it is simple. Don't need to have a bunch of _go32_dpmi_blahblah calls around. This will come back to bite me in the ass though, as DJGPP provides a relatively nice API for interfacing with CWSDPMI when you (inevitably) need to for the more complicated scenarios. DOS/4GW doesn't provide you with as nice an API... you just call the standard DPMI server interrupt. However, this is mostly documented in my stack of Watcom manuals, so that's a nice plus! Even some examples such as how to implement bimodal interrupt handlers.

On the topic of code generation though, the version of GCC from DJGPP that I was using is definitely better (as expected, it is 4 years newer).

We can take a look at the following very simplistic example which shows how something simple like function inlining is handled between these two compilers. In this example, I was using GCC flags -O2 -m486 -ffast-math and WCC386 flags /zp4 /5r /fp3 /oneatx both of which are recommended flags for optimum performance (could do -O3 with GCC of course though).

This example is something I noticed while running some simplistic benchmarks to see how much performance I would lose with Watcom. This snippet isn't even what I was benchmarking... it's just some setup code to generate a random 16x16 sprite for the actual benchmark that appeared later in the code. However I noticed it wasn't inlining a call the way I was expecting when looking at the assembly output.

// in a header file
static int surface_offset(const SURFACE *surface, int x, int y) {  
    return surface->width * y + x;
}

static byte* surface_pointer(const SURFACE *surface, int x, int y) {  
    return surface->pixels + surface_offset(surface, x, y);
}

static void surface_pset_f(SURFACE *surface, int x, int y, int color) {  
    *(surface_pointer(surface, x, y)) = (byte)color;
}

// elsewhere, in main()

bmp = surface_create(16, 16);  
for (x = 0; x < 16; ++x) {  
    for (y = 0; y < 16; ++y) {
        surface_pset_f(bmp, x, y, rand()%255);
    }
}

As a quick aside, before anyone looks at the above code and says anything like "Wow, that would be slow anyway! You really don't want to be using anything like an explicit pset type of function call anywhere! Look at that slow multiplication for god's sake!" I know. For any code that's running in an inner loop in a performance-sensitive scenario, I would always use, e.g. the above surface_pointer function to get a direct pointer to the memory before the loop and then use that pointer in the inner loop itself. And if it was really important to save every last CPU cycle, I would probably write out the surface_offset calculation directly without the multiplication anyway. These are all just convenience functions to be used as appropriate. The above for loop code is just an example!

GCC inlines the call to surface_pset_f as I would expect:

  21:test.c        ****     bmp = surface_create(16, 16);
 185 007e 6A10             pushl $16
 186 0080 6A10             pushl $16
 187 0082 E879FFFF         call _surface_create
 187      FF
 188 0087 8945E8           movl %eax,-24(%ebp)
  22:test.c        ****     for (x = 0; x < 16; ++x) {
 190 008a 31F6             xorl %esi,%esi
 191 008c 83C410           addl $16,%esp
 192 008f 90               .align 2,0x90
 193                  L238:
  23:test.c        ****         for (y = 0; y < 16; ++y) {
 195 0090 31DB             xorl %ebx,%ebx
 196 0092 89F6             .align 2,0x90
 197                  L242:
  24:test.c        ****             surface_pset_f(bmp, x, y, rand()%255);
 218 0094 E867FFFF         call _rand
 218      FF
 219 0099 89C2             movl %eax,%edx
 220 009b BFFF0000         movl $255,%edi
 220      00
 221 00a0 99               cltd
 222 00a1 F7FF             idivl %edi
 223 00a3 8B45E8           movl -24(%ebp),%eax
 224 00a6 8B38             movl (%eax),%edi
 225 00a8 0FAFFB           imull %ebx,%edi
 226 00ab 01F7             addl %esi,%edi
 227 00ad 037808           addl 8(%eax),%edi
 228 00b0 8817             movb %dl,(%edi)
 230 00b2 43               incl %ebx
 231 00b3 83FB0F           cmpl $15,%ebx
 232 00b6 7EDC             jle L242
 234 00b8 46               incl %esi
 235 00b9 83FE0F           cmpl $15,%esi
 236 00bc 7ED2             jle L238
  25:test.c        ****         }
  26:test.c        ****     }

However, I noticed wcc386 did not. Not only that, no amount of tweaking of compiler settings and/or #pragmas would get it to. It only partly inlines the call to surface_pset_f:

    bmp = surface_create(16, 16);
 142a  ba 10 00 00 00                    mov     EDX,00000010H
 142f  89 44 24 04                       mov     +4H[ESP],EAX
 1433  89 d0                             mov     EAX,EDX

    for (x = 0; x < 16; ++x) {
 1435  31 f6                             xor     ESI,ESI
 1437  e8 00 00 00 00                    call    surface_create_
 143c  89 c7                             mov     EDI,EAX

        for (y = 0; y < 16; ++y) {
 143e  31 c9             L27             xor     ECX,ECX

            surface_pset_f(bmp, x, y, rand()%255);
 1440  e8 00 00 00 00    L28             call    rand_
 1445  89 c2                             mov     EDX,EAX
 1447  bb ff 00 00 00                    mov     EBX,000000ffH
 144c  c1 fa 1f                          sar     EDX,1fH
 144f  f7 fb                             idiv    EBX
 1451  89 14 24                          mov     [ESP],EDX
 1454  89 cb                             mov     EBX,ECX
 1456  89 f8                             mov     EAX,EDI
 1458  89 f2                             mov     EDX,ESI
 145a  e8 00 00 00 00                    call    surface_pointer_
 145f  8a 14 24                          mov     DL,[ESP]

        }
 1462  41                                inc     ECX
 1463  88 10                             mov     [EAX],DL
 1465  83 f9 10                          cmp     ECX,00000010H
 1468  7c d6                             jl      L28

    }

 146a  46                                inc     ESI
 146b  83 fe 10                          cmp     ESI,00000010H
 146e  7c ce                             jl      L27

Elsewhere in that same module, surface_pointer_ can be found:

0070                    surface_pointer_:  
 0070  0f af 18                          imul    EBX,[EAX]
 0073  8b 40 08                          mov     EAX,+8H[EAX]
 0076  01 da                             add     EDX,EBX
 0078  01 d0                             add     EAX,EDX
 007a  c3                                ret     
 007b  90                                nop     

So, Watcom partially inlined the call to surface_pset_f but left a call to surface_pointer as a fairly lightweight function call. Note how Watcom uses registers to pass parameters to functions pretty much everywhere it can by default, thusly the surface_pointer_ function has none of the typical stack management boilerplate we might expect. From what I've read, this method of passing function parameters is one of the reasons it kept ahead of the pack regarding performance compared to other compilers at the time.

Whilst playing with compiler flags and other things, I decided to see how wpp386, the Watcom C++ compiler, would handle this:

    bmp = surface_create(16, 16);
 0063  ba 10 00 00 00                    mov     EDX,00000010H
 0068  89 44 24 08                       mov     +8H[ESP],EAX
 006c  89 d0                             mov     EAX,EDX

    for (x = 0; x < 16; ++x) {
 006e  31 f6                             xor     ESI,ESI
 0070  e8 00 00 00 00                    call    __16hkttSURFACE near * near surface_create( int, int )
 0075  89 c1                             mov     ECX,EAX
 0077  eb 06                             jmp     L3
 0079  46                L2              inc     ESI
 007a  83 fe 10                          cmp     ESI,00000010H
 007d  7d 2a                             jge     L5

        for (y = 0; y < 16; ++y) {
 007f  31 db             L3              xor     EBX,EBX

            surface_pset_f(bmp, x, y, rand()%255);
 0081  e8 00 00 00 00    L4              call    rand_
 0086  89 c2                             mov     EDX,EAX
 0088  c1 fa 1f                          sar     EDX,1fH
 008b  f7 fd                             idiv    EBP
 008d  89 14 24                          mov     [ESP],EDX
 0090  8b 11                             mov     EDX,[ECX]
 0092  0f af d3                          imul    EDX,EBX
 0095  8b 41 08                          mov     EAX,+8H[ECX]
 0098  01 f2                             add     EDX,ESI
 009a  01 c2                             add     EDX,EAX
 009c  8a 04 24                          mov     AL,[ESP]

        }
    }

 009f  43                                inc     EBX
 00a0  88 02                             mov     [EDX],AL
 00a2  83 fb 10                          cmp     EBX,00000010H
 00a5  7d d2                             jge     L2
 00a7  eb d8                             jmp     L4

Interesting! wpp386 fully inlines surface_pset_f. The exact same compiler flags were used in both cases.

After discovering this difference, I decided to compare wpp386 vs wcc386 with some other pieces of code and saw that there are definitely some other code efficiency differences. I've not investigated fully yet, but it does seem like wpp386 generates slower code in some cases too. I'll probably do a follow up post once I've dug into it a little bit more. One of the things I want to play with is re-working the code for surface_pset_f, surface_pointer and surface_offset to see if it's just a problem caused by the code optimizer "giving up" at some point because of how the code is written. It definitely was a thing that you'd have to at times help the compiler out by simplifying or otherwise breaking up complex expressions into separate chunks so it could do a better job.

Once nice thing about using wpp386 for even 100% C code is that it at least gives you more warnings and the support for toggling off warnings that you don't care about (like "W202" with wcc386 ... ugh) actually works. Turning up the warning level with wcc386 is kind of useless for this reason. I'm assuming this is just a bug and was probably fixed with 10.5 or 11.0.

Inline assembly is similar in some ways to GCC, but no need to deal with the AT&T syntax which I dislike. The way you do it with Watcom 10.0 is unfortunately more complicated then using asm as you might with other C compilers of the time (and modern compilers too), but it has it's benefits too:

void REP_MOVSD(void *src, void *dest, int num_dwords);  
#pragma aux REP_MOVSD =     \
    "cld"                   \
    "rep movsd"             \
    parm [esi] [edi] [ecx]  \
    modify [esi edi ecx];

Then you could simply call REP_MOVSD in your C code in the exact same way you would call any other C function. Inline assembly written in this manner would be inserted into your code at compile time in the same way as an C function that is being inlined by the compiler would be.

Compare to how you'd do it with DJGPP/GCC:

#define REP_MOVSL(src, dest, num_dwords)            \
    __asm__ __volatile__ (                          \
        "cld\n\t"                                   \
        "rep\n\t"                                   \
        "movsl"                                     \
        : : "S" (src), "D" (dest), "c" (num_dwords) \
        : "%ecx", "%esi", "%edi" )

Works the same way when calling from your C code (REP_MOVSL can be called as any other function, and is inlined, etc.). Also the GCC __asm__ is more flexible like what you saw with other compilers and the asm keyword, so you could just use it wherever in your code with ease (so you don't have to use it in a macro as I'm doing here).

If I'm being honest, both methods are super ugly to me. I will never be able to commit either form to memory and so, every time I need to write inline assembly I'll need to be referring to some documentation to copy it out. Watcom 11.0 apparently added support for the asm keyword, but apparently even with that, Watcom recommended the #pragma method because of its ability to indicate to the compiler what registers are used to pass values in and what registers were modified by the inline assembly code, something that the compiler couldn't do with the asm keyword I guess.

My Custom 486 DX2 Build

Since it's basically complete now, I figured I'd post something detailing it. Actually, it's been mostly complete for a while now, but the last couple small touches came together this week.

Despite the fact that I got a perfectly capable mini 486 PC earlier this year, I still decided I wanted to build my own similar system in a baby AT form-factor case. This way I could customize it more to my liking. I also really wanted to use a specific style baby AT case that matched the old 386 PC my family used to have. It's nothing special at all, but it's kind of stuck in my head over the years and become synonymous to me with "old DOS PC", so I pretty much was unwilling to let this "requirement" of mine go.

I began picking up parts during the summer and piece by piece I've finally got everything assembled. Strangely enough, getting the particular AT case I wanted was hardest, even though from what I gather it was actually a very common style case back then. Maybe just a coincidence that I had a hard time finding one.

Actually this ended up being the second case of this style that I received. The first one was shipped and arrived broken (the plastic front was cracked and broken in multiple places). Got pretty lucky that I was able to get a second one now that I think of it.

So, what's inside? Let's take a look:

  • CPU: Intel 80486 DX2 SX911 66Mhz
  • Motherboard: FIC 486-PVT, 256KB L2 Cache
  • RAM: 16MB
  • I/O Controller: Promise Technology 560C PDC20630, VLB
  • Graphics: S3 Trio32 2MB, VLB
  • Network: 3Com 3C509B-TPC
  • Sound: Sound Blaster Pro 2.0 CT2600 and Gravis Ultrasound Classic 3.74 1MB
  • Hard Disk: Quantum Maverick ProDrive 540MB 3600RPM MV54A011
  • Floppy Drive: 3.5" 1.44MB Sony MPF920-E and 5.25" 1.2MB Teac FD-55GFR
  • CD-ROM: Torisan 6x CDR-S16
  • Power: Athena Power AP-AT30 300W

Since I ended up getting basically every piece separately, it almost certainly ended up costing a bunch more then it would have if I had bought some pre-assembled PC from eBay or some other place (which people do sell, tested and working, even today). However, given all the trouble I had to go through to get this working, it really feels a lot more satisfying having gone this route.

As one may expect given the era, contrary to building a PC today, a lot of things did not "just work" together out of the box with this PC. This is especially the case when talking about VESA local bus (VLB) hardware where stability was a common problem. Having to adjust jumpers on the motherboard and the different ISA cards to get the settings just right is quite a different problem from today's PC hardware. This can be even more difficult a task today because for some motherboards the documentation is sparse or even outright missing, leaving you to either have to figure it out by trial and error, or hope the (often limited) labels printed on the motherboard or ISA cards themselves are complete enough and self-explanatory.

Luckily for me, with this particular motherboard I was able to find PDF scans of the original manual. The other card that you'll usually want a manual for due to all the jumpers on it is the I/O controller. In my case, I was unable to find a manual, but the labels printed on the card itself were sufficient enough to get it working properly.

Well, actually, I'm not sure on that last bit to be honest! It did seem to be working perfectly fine up until a week or so ago where I started having problems with the BIOS detecting the hard disk. I thought at first that it might have been an indication that the hard disk I was using was dieing or dead, but after swapping it out for another, using a totally different I/O controller card and also trying different IDE cables, I was unable to eliminate the issue. And to make matters worse the issue only manifested itself sometimes. The only thing I've found so far that solved this problem was to connect all of the IDE devices to the IDE channel that runs on the ISA bus instead of VLB. This isn't too big a deal, as I believe that since I'm running DOS 6.22 primarily that I don't see any benefit whatsoever to having the hard disk running over VLB.

It is worrisome though that the problem only started to occur just recently and seemingly not a result of any other hardware changes (since it had been a while prior that I had made any). Either some other component in this build is starting to become a little flakey, or this is just the fabled VLB stability problems I've heard about rearing it's ugly head. At any rate, something I need to keep an eye on.

Two sound cards probably seems strange at first but there is a reason for both. Sound Blaster is a pretty typical card for the era. The Yamaha OPL chip on this one (and many other Sound Blaster cards) has a pretty recognizable sound that I quite like. And it's compatible with a great many DOS games, so it was an obvious choice.

The Gravis Ultrasound however is something I had never seen before, only heard of, and certainly was not nearly as common a choice at the time. It's an interesting card however, as it was one of the first (if not the first?) sound cards to support hardware mixing of multiple sound sources. Sound Blaster cards couldn't do this, requiring programs to implement their own software mixer which took up valuable CPU resources. In practice, this ended up not really helping many games that supported the Gravis Ultrasound though as almost all of these games supported Sound Blaster too and in an effort to keep the code simpler, would implement their sound engine to the lowest common denominator. This meant not utilizing the Gravis Ultrasounds hardware mixing support since no other card had it.

However, what I really wanted a Gravis Ultrasound for was for it's subjectively better sounding FM synthesis / MIDI playback. So far every game I've compared that supports both Sound Blaster and Gravis Ultrasound sounds better (to me) when configured to use the latter. In some cases, such as with One Must Fall 2097, the difference is objectively better.

For a 486 computer, VESA local bus graphics are pretty unique, and until PCI came around (which was quite soon after VLB was introduced), it provided the fastest graphics you could get. To me, it feels almost necessary to use VLB with a 486 machine since you cannot use VLB with anything else. Everything beyond this hardware level would be PCI and eventually AGP. Everything before, ISA. Using an S3 card is a pretty solid choice. All benchmarks I've seen and done personally have them as top-tier cards of the era. Compatibility is also excellent. Probably a more common choice would've been a Cirrus Logic card though.

Nothing much else to say really. A 6x CD-ROM feels mostly right, maybe a little bit fast for the era? I remember we had a double-speed CD-ROM in my family's 386. A 500MB hard disk is also quite large for a DOS PC really. It's funny to think of that, but then again, I remember my family's 386 had a 20MB hard disk and at the time I thought that was large! 16MB of RAM is plenty for a 486. Most DOS games would need less then 8MB of RAM, so 16MB is nice to have the option for running SMARTDRV, or using a RAM drive, etc. My DJGPP boot configuration sets up both and it makes a significant difference with compile times, still leaving 8MB of RAM free for everything else.