"Old School" Magic: The Gathering

Over the past two years I've got back into collecting and playing Magic: The Gathering a little bit.

I was first introduced to the game by some friends in my fifth grade class in 1995 (actually, it was a grade five/six split class, I was in fifth grade though). I remember we used to actually sneak in games during class. I was really intrigued by all the cards and the artwork especially. Thinking back now, I recall that the majority of these cards were from the Fourth Edition, Chronicles and Ice Age sets, which all would have been current at that time. One card that one of my friends had sticks out the most in my mind, as I thought it was just the coolest creature card ever:

I mean, a 7/7 flying dragon that makes your opponent discard their entire hand when they take damage from him. Wow! I really wanted one! Of course, only years later did I realize that it was really a rather poor card from a playability perspective. 8 mana total casting cost with a 3 mana upkeep? Both of these costs comprised of 3 different colours? Yeah, no thanks! By the time you had enough mana out on the table available for use, the game would probably be almost over... if it even got that far.

But then again, thinking back to how I remember our games going at the time... a lot of them really did go on for a long time! I certainly don't remember anyone at the time having optimized decks. Everyone I knew who played was younger (11-12 years old) and it was their parents who were buying them cards. As a result, assuming you even had enough cards to build a deck (60 cards or more), you were playing with what you had. It might have even been with all that you had. Which probably meant you were playing with some really "janky" stuff. Maybe even (*shudder*) a four or five colour deck! So, the idea of playing something like Nicol Bolas in your deck at the time didn't really seem so crazy as it does to me now. And no, none of us had dual lands, and certainly not any of the power nine cards!

Regardless, even though I really wanted a Nicol Bolas card for myself, it wouldn't be until 23 years later that I got one, heh.

Getting a specific card wasn't even the first problem for me at that time. Getting any cards was the problem. I didn't have any money. Heck, I didn't even know where to go to buy Magic cards in the first place. I lived on a farm near a small town out in the "middle of nowhere." There were no stores around that sold such things (so far as I knew). Darn! As luck would have it, later on during that school year, one of my other friends who had moved away the previous year (but who I still would occasionally go and visit, spending the weekend at his house) gave me his collection of Magic cards! I didn't even know he had any, but I showed up one weekend, and noticed he had a bunch in his room, carelessly strewn about. I asked about them and he replied "Do you want them?" I was absolutely thrilled. I think it ended up being a little over 100 cards all told. Again, they ended up being all from current (at the time) sets. A majority of Fourth Edition, and a smattering of Fallen Empires, Ice Age and Chronicles. A lot of the cards were in poor condition, clearly having been played many times over asphalt at school during recess or lunch before my friend ultimately got bored of the game. Finally, I could play with my own cards!

My younger brother was also interested in the game after he saw these cards and I remember we would play at home. I don't think we had quite enough cards to build a deck for each of us (I seem to recall we were somewhat short on lands) so what we ended up doing was sharing the same deck. We would play as normal, but both would draw our cards from the same deck. At first, we didn't have a rulebook so we were playing from my recollection of the rules that I learnt from my friends at school... and even that wasn't so perfect (plus I don't think we were 100% correct in following the rules during our games at school anyway). A few things I recall us doing incorrectly: we allowed attacking one (or more) of your opponent's creatures directly, there was no distinction between sorceries, instants or interrupts (and I don't even think we ever played them during the opponent's turn... not sure we understood that aspect of interrupts and instants), we allowed attacking with walls, and regeneration could be played from cards long since put into the graveyard. There's probably more I'm forgetting, but if you're familiar with the actual rules of the game, that should give you an idea of our games. Also, I do distinctly remember that, in an effort to not upset the other, we wouldn't attack at all until we had run out of cards in our shared deck. At which point it would turn into a real battle-royale! When we first started playing, we did attack early in the game (as soon as a creature was in play), but due to the hodge-podge of cards in our shared deck, games would often be very one-sided, and the early attacks ended up just upsetting whoever was taking a beating, so we stopped doing that. Hey, we were both young after all!

The summer after school ended that year, I remember being in the mall with my Grandmother (during a visit to her place, there was no mall within an hour's drive of my house) and her buying me a Fourth Edition starter deck (60 cards!) and a couple booster packs of Alliances at some kiosk near the food court that sold Magic cards. My brother also got some cards at some point (if I recall correctly, a Mirage starter box later that year). Our games started taking better shape, which was good because I was also playing with friends at school less and less as time went on.

I ended up staying somewhat into Magic cards into 2001/2002 or so and kept collecting here and there (in particular, I remember getting quite a lot of Mercadian Masques in 1999 and early 2000). I was in high school at the time and in 2001 I remember I discovered one of the history teachers left his room open for students to come and hang out in during lunch. There was a group of -- well, I can only describe most of them in one way -- "comic book store"-type nerds who played Magic there during lunch. I hadn't played the game with anyone other than my brother in a few years at that point, so I was excited to play against some new opponents. I "endured" it for a while, but ultimately got put off playing the game by this group. Really, just a rude bunch of players that clearly weren't there to have fun but rather, seemed to have their fun by insulting people like me who were not playing with well optimized decks and had less overall experience playing the game. So, I put the game aside for a number of years.

After finishing college in 2007, I was contacted by my friend who had originally introduced me to the game in 1995 inviting me to come to a "draft" he was organizing. I hadn't played in quite a while and the game, as I discovered when I went to this draft, had changed quite a bit in the look and feel of it. The core rules were of course mostly the same, but the look had changed to what I always felt was a more generic or even "sterile" look and the artwork on the cards generally felt a lot less inspired and seemed to lack the character or "charm" that a lot of the original cards that I remembered had.

But more than that, after attending a few such drafts over the next year or two I began to realize that the sets now were largely designed with drafting in mind. There weren't really any of the fun, imbalanced, and/or just plain weird cards that you would see in the older sets. This never really felt that fun to me, it just added to the generic/sterile feeling I was getting about the game. At some point in 2008 or 2009 I declined going to further drafts, just saying that I'd basically lost interest in the game.

Fast forward to 2016. I don't really remember what made me want to look up Magic again. Probably I was looking at the box in my closet that had all my old cards in it. But I started thinking about how I enjoyed the older cards much more and wouldn't it be cool if there was some group of people out there who played strictly with these older cards? After googling a bit, I discovered that, indeed, this was actually a thing! Not an official format mind you. But heck, that's probably a good thing anyway given my dislike of where the game has gone over the past 15-20 years.

Most people interested in this format stuck with cards from the original sets released in 1993/94, which was a bit before I started playing, but since Fourth Edition and Chronicles consisted entirely of reprints of cards from the original sets, it was all the same cards to me anyway. Great!

Looking up some cards on eBay and such ... wow, Magic cards sure are expensive! Especially the older cards! However, I proceeded forward and ultimately, much of my disposable income in 2016 went into buying older Magic cards. Eventually, I was able to piece together quite a collection of cards that fit into the much more strict Swedish rules for "Old School 93/94" Magic, which is the ruleset that I had decided at the time that I was going to build for (mainly because I largely consider the much cheaper Revised edition cards to look somewhat ugly).

Currently, I am happy that I've been able to build three separate decks for this format:

They are most certainly not the best, most optimized decks out there, and my ability to actually play the game is still rather limited due to not getting much experience at it, but even still, when I do get to play I enjoy it. This era of Magic was just a lot more fun to me and I attribute that (and only that) to my regaining interest in the game over the past two years.

Unfortunately for me, while there seems to have been a much bigger "Old School" Magic community here in Toronto years back (see here and here), it has diminished drastically since then. I currently know of only one other person in Toronto who plays and one other in Hamilton. I meet up with the person from Toronto every other month or so for some games and we have fun, but even still, it doesn't scratch my itch to play a bit more.

Thankfully this "lack of players" problem seems to be common, which you can imagine for what is a very niche (and especially now in 2018, prohibitively expensive) format of the game. So people in the online community from around the world have started playing games over Skype and other webcam-enabled methods of communication. I've only played one game this way so far, and wasn't sure what to expect exactly (I was imagining lots of connectivity and audio issues, as that's how almost every Google Hangouts session I've ever been in has gone), but it went better than I expected and I'm looking forward to playing more this way! It's nice that in 2018 there exists another way to connect all these people who enjoy this particular format of the game.

Recently, I figured that since my introduction to the game was largely via the Fourth Edition set, originally released 23 years ago in 1995, I would treat myself to some sealed old stock of this set. Buying any sealed old stock of Magic and opening it is guaranteed to lose you money, so one cannot treat it as an investment, but rather as an indulgence in nostalgia.

I suppose it's important to point out that, with my goal of building my "old school" decks within the boundaries of the stricter Swedish rules, using Fourth Edition cards in my decks is not possible. Unless I decide to follow the more lenient Eternal Central rules, which I may do at some point given the way that prices are going up recently... At any rate, I bought this just for fun, nothing else!

One of the two player "gift boxes" for Fourth Edition, containing two 60 card decks, a rule book (a slightly bigger one than you'd get ordinarily with a starter deck box), and a (in my opinion) kind of nice flannel bag with glass counters. There was also a mostly equivalent gift box set for Revised Edition, but it's far more expensive due to the possibility of it having dual lands (which are expensive cards).

The bag holding the glass counters had broken open on its own somehow over the past 23 years and they were scattered in the box when I opened it, but no harm was done. It's kind of funny seeing the mail-in response cards. A part of me wants to try sending one in, but I think I'll just leave them here in the box for completeness. The little black flannel bag to hold the counters is a bit nicer than I expected. Not super great quality or anything, but after seeing this now, I think I'm definitely going to use it with these glass counters for all my games going forward. As I understand it, the counters were intended to be used to track the player's life during a game. However, nowadays there are mobile apps that do this task just as well (if not better). Instead, using these to track tokens and counters as needed for cards during a game seems a much better use to me.

But let's get on to the sealed decks. As per the description on the back of the box shown above, "this box contains everything two people need to play", so what will these decks actually look like?

Heh. So, if you've ever opened a Magic starter deck box (60 cards), then you'll instantly know that despite the fact that these two "decks" were packaged differently in a "two player" set ... this gift box just contains two normal starter decks. That is, 60 randomly assorted cards with lands, and the usual amount of rares, uncommons and commons. They're not specially prepared into anything even remotely resembling what one could consider to be a playable deck.

Well, actually now let me think about that, heh. If I think back to how I used to play this game with my brother back in the mid-90's (described above in this post) then actually for me, these two "decks" here are actually quite reminiscent of how we played! Lots of random cards with no theme or strategy. Just a "play with what you have" kind of feel to it. Now, if you were a player who had money and was not a young kid just getting into the game and had the knowledge to construct an optimized deck, then no, these two decks most certainly are not "playable" to you.

Aside from all of that, it brought a big smile to my face shuffling through these cards. It's always fun to see a Craw Wurm. Who doesn't like big creatures? And 6/4 was pretty damn big! With the way my brother and I played back in the day, Howl from Beyond was a very strong card given that we typically left our attacks until the end of the game (at which point you had a lot of lands out and could hugely buff one of your attackers... and at that point, we only had one Howl from Beyond, so it often was a game winner). I remember we both never really thought much of Erg Raiders ("why would you want to play a card that hurts you every turn you don't attack with it!?"). Always nice to see a Lightning Bolt of course. Terror was a card that we also had only one or two of back then, and it was something we loved drawing and using to instantly kill one of the other's creatures. Cards like Holy Armor and Firebreathing were also quite sought-after for us during games, as anything that buffed your creatures was good for our typical end-of-game battle-royale-style attacks. For me now in 2018, seeing Greed, Hypnotic Specter, Fellwar Stone, Strip Mine, Millstone and Power Surge, amongst others, is also quite nice.

Fourth Edition, and to a little bit of a lesser extent, Ice Age will always remain my favourite sets of Magic. Not just because it's what I started playing with, but also because I always thought that the cards looked their best (highly subjective opinion of course) at this early point in the game's life while still retaining the majority of the original cards and artwork (in the case of Fourth Edition anyway). Chronicles also continues that, though with a smaller pool of cards. Revised edition, as I mentioned earlier, looked rather ugly to me. Alpha, Beta and Unlimited edition cards look rather nice (I prefer Beta to the more rounded corners of Alpha), but to me there was always a certain ... crudeness (?)... to them. I'm not sure if that's really the right word honestly. Perhaps it is, but I feel someone will read that and get an exaggerated impression of what I mean. They were of course the earliest editions (as evidenced by the names "alpha" and "beta", heh), but they did always have a little tiny bit of an unpolished feel to me. However, that feeling could largely be because I didn't see these editions until after I was already accustomed to Fourth Edition, Chronicles, and beyond. I can see how these simpler looking cards from the earliest editions would be more appealing to some. And to be clear, I am not saying I dislike them by any means. Quite the contrary as I covet my existing collection of Alpha/Beta/Unlimited cards!

(Just ignore the blatantly obvious centering issue on the Fourth Edition card on the bottom right, heh. That kind of thing happens in any edition.)

I suppose this all just helps to demonstrate how our early encounters with things colour our perceptions of them later on.

Updated libDGL Code

A quick post just to point out that I updated the libDGL GitHub repository with the most up to date working code that I currently have.

Since I originally pushed libDGL code to GitHub last November, not much new functionality has been added. Kind of disappointing for me to think about actually, heh. That being said, over all that time, I do feel like I fixed up a bunch of bugs and generally improved the performance of what was there. However, looking at the to-do list that is left for libDGL, I still really have my work cut out for me:

  • Scaled/rotated blitting support
  • Blending
  • "Mode 7" like support
  • Custom font loading (BIOS-like format?)
  • Joystick / Gravis GamePad support
  • Input device (keyboard/mouse/joystick) events
  • PC speaker sounds
  • Sound Blaster compatible sound/music
  • Gravis Ultrasound compatible sound/music
  • Sine/cosine lookup table optimizations
  • BMP, LBM, GIF image loading (and saving?)
  • Simple immediate mode GUI

This list is definitely not in any particular order. I want to start building a simple 2D map editor tool (since the old QBasic one I have sitting here is missing source code, so I cannot even just extend it as a quick alternative), so the last item about a "simple immediate mode GUI" is probably going to be my next task.
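
For anyone unfamiliar with the term, "immediate mode" means the GUI keeps no widget objects around: each widget is a plain function call made every frame, which draws itself and reports whether it was just used. Below is a minimal sketch of that idea for a single button. It is written in Python purely for brevity and is not libDGL code; every name in it (`UIState`, `begin_frame`, `button`, `end_frame`, `simulate`) is hypothetical, and the drawing step is left as a comment.

```python
from dataclasses import dataclass
from typing import Optional

BLOCKED = -1  # hypothetical sentinel: the mouse press started on empty space


@dataclass
class UIState:
    """Per-frame mouse input plus the two ids an immediate mode GUI tracks."""
    mouse_x: int = 0
    mouse_y: int = 0
    mouse_down: bool = False
    hot: Optional[int] = None     # widget currently under the cursor
    active: Optional[int] = None  # widget the mouse button went down on


def begin_frame(ui, mouse_x, mouse_y, mouse_down):
    """Record this frame's input; 'hot' is recomputed by the widgets."""
    ui.mouse_x, ui.mouse_y, ui.mouse_down = mouse_x, mouse_y, mouse_down
    ui.hot = None


def button(ui, widget_id, x, y, width, height):
    """Handle one button for this frame; return True on the frame it is clicked."""
    inside = x <= ui.mouse_x < x + width and y <= ui.mouse_y < y + height
    if inside:
        ui.hot = widget_id
        if ui.active is None and ui.mouse_down:
            ui.active = widget_id  # the press began on this button
    # A real implementation would draw the button here, picking its
    # appearance from whether it is hot and/or active.
    return inside and not ui.mouse_down and ui.active == widget_id


def end_frame(ui):
    """Release the active widget, or block widgets if the press hit nothing."""
    if not ui.mouse_down:
        ui.active = None
    elif ui.active is None:
        ui.active = BLOCKED


def simulate(frames, rect=(10, 10, 40, 12)):
    """Feed (x, y, down) tuples through one button; return per-frame results."""
    ui = UIState()
    results = []
    for mouse_x, mouse_y, mouse_down in frames:
        begin_frame(ui, mouse_x, mouse_y, mouse_down)
        results.append(button(ui, 1, *rect))
        end_frame(ui)
    return results
```

Feeding it a hover, a press and a release over the button, `simulate([(15, 15, False), (15, 15, True), (15, 15, False)])`, reports a click only on the last frame. A press that starts outside the button and is dragged onto it never counts as a click.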

Following that, I kind of want to do something with audio. I've been focusing a lot on graphics lately and feel like a change would be nice. More specifically, I think starting with some MIDI playback might be fun. I just recently picked up a Roland Sound Canvas SC-88VL (through which MIDI songs sound absolutely exquisite) and this is most probably influencing that decision, heh. However, I think I'd likely want to start with writing MIDI playback code for a Yamaha OPL as that was far more commonplace, but supporting general MIDI devices also sounds like a nice second step.
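
As a rough sketch of the first step such playback code would need: turning a MIDI note number into the frequency number ("F-number") and octave block that an OPL2-style chip takes. This assumes equal temperament with A4 (MIDI note 69) at 440 Hz and the commonly cited OPL2 sample rate of about 49716 Hz. The function names are made up for illustration, it is Python rather than anything libDGL-specific, and it does not touch any hardware registers.

```python
OPL_SAMPLE_RATE_HZ = 49716  # assumed: roughly 3579545 Hz / 72, the usual OPL2 figure
MAX_FNUM = 1023             # the F-number is a 10-bit value
MAX_BLOCK = 7               # the block (octave) is a 3-bit value


def midi_note_to_hz(note):
    """Equal-tempered frequency of a MIDI note, with note 69 (A4) = 440 Hz."""
    return 440.0 * 2.0 ** ((note - 69) / 12)


def hz_to_opl_fnum_block(freq_hz):
    """Return (fnum, block) for a frequency, using the lowest block that fits.

    A lower block leaves a larger F-number, which gives finer pitch resolution.
    """
    if freq_hz <= 0:
        raise ValueError("frequency must be positive")
    for block in range(MAX_BLOCK + 1):
        fnum = round(freq_hz * 2 ** (20 - block) / OPL_SAMPLE_RATE_HZ)
        if fnum <= MAX_FNUM:
            return fnum, block
    raise ValueError("frequency too high for the F-number/block range")
```

With these assumptions, A4 comes out as F-number 580 in block 4, and the A one octave lower keeps the same F-number in block 3. The very highest MIDI notes do not fit in the range at all, so a real player would need to decide how to handle them.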

Fixing Up an IBM Model M2 Keyboard

A few weeks ago I picked up a Model M2 keyboard from eBay. I'm not a raving fan of mechanical keyboards but I definitely agree that they are quite nice to type on. I actually got a Das Keyboard Model S Pro years ago but haven't used it recently since it's a Windows keyboard layout and my modern computer is a Mac and I've just grown to hate using Windows keyboard layouts on Mac. But otherwise, it's quite nice and I wouldn't hesitate to recommend a Das Keyboard to anyone. Unicomp also makes apparently very nice mechanical keyboards along the style of the original IBM keyboards but I've not tried these personally.

At any rate, I picked up this Model M2 to use with my "retro" computers. Not really for any particular reason other than it feels era-appropriate and is a nice, relatively compact buckling spring keyboard. Many more people would prefer the original Model M over this, but I've always been put off getting one of those due to the bulky size.

The Model M2 is infamous for having bad capacitors. The two capacitors on the controller inside will apparently go bad (dry out) quicker if unused for long periods of time, so even a brand-new-in-box M2 keyboard isn't guaranteed to work.

And as if bad capacitors weren't bad enough, the M2 is also infamous for being difficult to take apart and to reassemble (probably harder to reassemble than to disassemble I think). Apart from two screws on the bottom, the majority of the keyboard is held together by somewhat easy-to-break plastic clips internally that need to be very carefully opened. Oh boy.

It should be noted that not all Model M2 keyboards are mechanical. Certain ones made by Lexmark with a model number beginning with '7' are rubber dome. But otherwise from the outside they look identical.

This one I got arrived and initially didn't work when I plugged it in to give it a try. It gave the tell-tale sign of bad capacitors where only two LED lights flashed on and stayed on and no key-presses were ever registered. However, I noticed that by unplugging it and plugging it back in, it worked perfectly. I continued using it for a couple weeks like this and all was good, but I knew that this wasn't a long-term solution and that I really did need to go and replace the capacitors.

Onwards to disassembling!

First things first. Take a picture of the keyboard before you take anything apart. This is so you can use it as a reference for where the keys all go when you're putting it back together.

To begin, the keycaps all need to be removed. This is because the aforementioned plastic clips that hold the keyboard together are underneath the keycaps. And you'll probably want to clean the keycaps anyway. Mine actually weren't that dirty as you can see from the above photo, but I still cleaned them anyway.

Removing the keycaps is really simple. You can use any thin flat tool to pop them off. I began by popping off all of the square keys. A number of the longer keys such as Backspace, Enter and the space bar have additional little brackets that attach to the bottom that need some extra care, so it's best to save these for last.

With keys like Enter shown above, I found it easiest to pop up the keycap first in the exact same way as I'd done for every other square key, but before trying to lift it off completely, take a dull, flat, thin tool and press down the bar so you can easily slide it out. It's very easy, but you do need to be careful as the plastic that holds the whole bracket to the keycap is very thin and easy to break!

The space bar is a little bit different than all the other keys. I started by popping it up in the same way as the other keys, but again, before trying to lift it off, you need to release the bracket. This one is different than the other brackets and is held down by two little bars that need to be pushed out. You push the left one out to the left and the right one gets pushed out to the right. Use a dull, flat, thin tool again to push them out. They are a little tough to push out, but once you get the first one the second one is easy. Again, be very careful as the plastic is thin and easy to break!

Now the keycaps are all removed.

At this point, you'll want to take another picture. This is important because as you can see, not all of the holes have springs in them! All of the missing springs are the extra holes that are covered by the longer keys which all only need one spring each just like the smaller square keys.

If you're cleaning the keycaps, get some soapy water ready and let them soak for a good hour before doing any scrubbing. That'll give you plenty of time to do the rest of the disassembly and maybe even get the capacitors replaced too depending on how things go.

As you'll be able to see in the above picture of the keyboard without the keycaps on, there are 13 small plastic clips that need to be separated to remove the top plastic half of the keyboard. You again use your dull, flat and thin tool to separate the two plastic parts of the clip, but I found that they would not stay separated and trying to push down to move the other half of the clip lower so it would not reattach was tricky and not guaranteed. The whole plastic of the keyboard is somewhat flexible so if I got one clip to stay separated, once I picked up the keyboard there was a very good chance the whole thing would flex a tiny bit and the clip would somehow find its way back and snap together again. Super frustrating!

So, I figured I needed something to wedge many of the clips apart while I pried off the top plastic cover on the keyboard. Not having much in my apartment to use for wedging, I turned to my little stack of spare computer expansion slot/bay covers. This actually worked much better than expected and I was able to turn the keyboard on its side to begin prying it apart with the confidence that the clips would stay apart.

If you decide to go this route, whatever you use as a wedge should really be thin and hard (so as not to bend/flex while wedged in the clip). A very common complaint about these keyboards is the ease with which these plastic clips break, so you really don't want to flex them more than you have to!

As you can see in the photo on the right, you need to pry apart the top and bottom halves of the keyboard from the side. I started from the bottom and once I got it apart enough, used my fingernails to keep it apart while I worked on the top half. Eventually I got it open enough to fit the tool I was using in. By this point, the expansion slot cover wedges that were closest to where I was opening the keyboard from were falling out, as expected, since there was nothing to hold them there once I started opening it.

I only had enough wedges for half the keyboard, so once I had it open enough that the first 5 of the wedges had fallen out, I used my tool as a bigger wedge and left it placed between the two halves of the keyboard, set the whole keyboard back down and re-used those expansion slot covers as wedges, placing them into the remaining plastic clips on the opposite side of the keyboard. At this point, all of the plastic clips were either already apart or had a wedge in them and I was able to gently but firmly pull apart the top and bottom plastic halves of the keyboard as if I were slowly and carefully opening a book.

Luckily for me, I did not end up breaking a single plastic clip in the process! Hooray! Even if you broke a couple clips, it's not the end of the world. Hopefully though you don't break too many. If that does happen I imagine you could probably use a bit of hot glue to put them back in place.

I should point out that all the while you will probably notice and hear/see the buckling springs inside falling out of their place. Don't worry about this, but definitely do not try to close the keyboard again at this point else you'll probably squish and ruin some of these springs after they've been freely moving about inside. Just keep going with opening the keyboard and it'll all be fine.

Take all of the springs and carefully place them someplace safe for now, out of the way. We won't need them until we begin reassembly. You should also carefully peel up the thin black sheet/mat that the springs were sitting on. You can clean this if you like. I just brushed it off lightly with a dry cloth and then set it aside.

Some people report finding that the traces on the membranes corrode and go black or dark brown or whatever. As you can see, I didn't have that problem. Not sure what people do to fix that problem so unfortunately I cannot advise there. I would not recommend removing any of these membranes if you don't see any problems on any of the traces. It looks like it would be tricky to get them all back perfectly aligned and I've read comments from people saying as much. Just leave them as they are if you can.

And now the problem capacitors. You can kind of see in this photo that some of the contact pads underneath the capacitors have gone all dark brown, due to the capacitors starting to leak. After seeing this, I was kind of surprised that the keyboard had worked for the couple weeks I'd been using it so far. I scraped off as much of the dark brown gunk from the solder as I could using my pocket knife, being very, very careful to not scrape any of the surface of the PCB surrounding it. Once I'd got enough of it off that I could see mostly solder, I got out my soldering iron.

You could of course try removing the controller PCB from its position in the keyboard. As you can see there are several plastic clips holding it in place. It seemed to me that it would be quite tricky to remove, so I decided that I didn't want to risk breaking these clips. These capacitors are surface-mount, not through-hole, so technically there is no actual need to remove the PCB anyway in order to remove them. Plus it's only two capacitors and there is enough of the contact pads visible on each of them that I figured I wouldn't need to apply much heat anyway, so there wasn't likely to be any harm caused by doing the whole re-capping with the controller left where it was.

Post-removal, I was left with more of a mess to clean up from all the leaking that had gone on over the years. Unfortunately I also pulled up a bit of the bottom contact pad while removing the smaller capacitor on the left. Whoops. With that in mind, I'm not sure I'm the best person to explain the process of using your soldering iron to remove these capacitors. But if you do still want to know what I did, basically, I just heated up the exposed area of one of the contact pads and gripped the capacitor with a pair of pliers and twisted the side being heated up away after I could see the solder had melted. Then repeated the same process for the other side. I think my problem was that I tried twisting the capacitor too soon when the solder wasn't melted yet, so twisting the capacitor away just ended up ripping up the contact pad in the process.

I used 99% alcohol and Q-tips to clean up the remaining gunk from the leaking old capacitors. I initially used my pocket knife to scrape up some hard bits without really thinking about it and ended up scratching a bit of the PCB. Dumb, dumb, dumb! Thankfully I didn't end up cutting a trace or anything. After this close-call I decided to just use my fingernail to scrape off the remaining bits of hardened gunk.

The replacement capacitors needed are a 2.2µF 50V and a 47µF 16V. You can of course go higher with the voltage, but should definitely keep the capacitance the same in any replacements you decide to use. Specifically, I used these two capacitors that I got from DigiKey.

I placed little squares of electrical tape down as a precaution. I wasn't sure if the capacitors would get pushed down and by how much (potentially putting the side of them in contact with the PCB) once the top of the keyboard was placed back on. Maybe it wasn't needed. Also, certainly anyone could do a better job of soldering than I did here, heh.

Before reassembling, it's important to test this out and see if the new capacitors are doing the trick. I took the keyboard over to my computer, plugged it in and powered it on and voila, no more stuck LEDs after a cold boot (without my previous unplugging and replugging it in trick)!

You can simply tap your finger on the membrane to test keys. I tested a bunch this way to make sure that everything was working fine.

At this point, it had been over an hour because I am kind of slow with these things. So it was about time to clean up all the keycaps and the top plastic cover of the keyboard. Once this was done, I set them all out, face up, on a towel and let them dry for a few hours. I actually used my DataVac Electric Blower Duster to dry off the top cover quicker as I wanted to get on with the reassembly sooner. But I did leave the keycaps to dry on their own for a few hours in the meantime.

To replace the buckling springs, you need to take the top plastic half of the keyboard and set it upside down, but with something to mount it up a bit higher. This is because when the springs are placed inside, the top of the spring will dangle slightly past the top of the plastic cover. So if you had it just resting flat on some surface you would not be able to correctly and fully insert each spring into its little bracket. As you can see here, I'm using two hard disks on either side (because they were the only really suitable thing within reach as I sat down to do this, heh) to mount it a bit higher.

Then, using your previously taken picture of the top of the keyboard before you took it all apart, re-insert each spring into its bracket, leaving the correct few spaces empty. It is absolutely important that each spring fits snugly into its bracket on the keyboard cover. However, there's nothing to hold them in place other than gravity, so just be careful. When you're done with this process, do a quick once-over to ensure that they are all snugly in place. Trust me on this!

Now take the thin black sheet/mat that we removed and set aside before. Place it over top of the springs. Each of the holes in the sheet should line up with the various holes for the clips and two screws on the plastic cover. Unfortunately there is nothing to hold this in place.

And now is probably the worst part. We need to take the bottom half of the keyboard (that has the membranes and controller PCB in it) and place it on top of the top half of the keyboard with the springs in it. AND we need to do it while it's mounted slightly off the ground as we have had it thus far. This is incredibly important as otherwise the springs will pop out of place. In my case, holding the bottom half of the keyboard upside down did not result in the membranes falling out, but I would guess if that happens to you that you could use some small bits of tape to hold it in place. In my case, the black sheet would not stay in the bottom half of the keyboard while held upside-down so pre-placing it on the top half as shown in the above picture worked best for me.

Carefully hold the bottom half of the keyboard over top of the top half, lining it up while being careful not to accidentally shift the top half off of its two supporting mounts and then set it down, pushing it together. DO NOT pick up the whole thing to attach the two plastic halves together. You definitely want to leave the top half with the springs in it resting on your two mounts throughout the entire process. Go around all the edges and use your hands to squeeze all the edges together and you should hear all the plastic clips clip into place. If you pick it up (even slightly) to do this, you risk the springs falling out of place!

Apparently this exact problem happened to me with exactly one spring. Once I had reattached all the keycaps I was testing all the keys and noticed that the 'W' key didn't work unless pressed "just so." Taking off the keycaps again, I took a flashlight and looked down at the feet of the buckling springs.

It's maybe a bit hard to see in this picture, but the black feet of the spring for the 'W' key are very slightly crooked. The foot on the left side had somehow shifted out of place during reassembly and was outside of the plastic bracket that it should be sitting in. This was resulting in the key not pressing correctly (even though the sound of it pressing was just the same as every other key that worked fine). The fix for this was to take it all apart again and reassemble it. Not fun. So, be very, very careful when reattaching the bottom half of the keyboard to the top half with the springs in it! Take your time with it.

Once you've got that done, reattaching the keycaps is easy. Start with the spacebar and then do all the longer keys. Leave the simple square keys to the end as they are the most straightforward.

With every keycap, the goal is to have the top of the spring resting in the middle of the underside of the keycap. As you can see in the photo on the left, there is a small round slightly raised piece of plastic inside the bottom of the groove in the middle of the keycap. The top of the spring when inserted correctly will rest perfectly around that small round piece of plastic.

What is somewhat likely to happen when you're replacing the keycaps is that the spring gets caught on the open flat area at the top of the groove, or it ends up resting somewhere on the little plastic ramp thingy on the other side. If this happens you need to pop off the keycap and try it again. You'll know you've got it correct when you're able to press the key down and it makes the very same clicky sound as it did before you took it off in the first place. If it feels too mushy and, most importantly, does not make that clicky sound then the spring is not in the correct position. If you're not sure if it's making the correct clicky sound, assume that it's not correct and try again. If you're still not sure, try replacing a few other keycaps and compare the sounds.

Most of the longer keys have more than one groove. The groove that the spring goes in is always the one that has the top/bottom of the plastic cut away, as you can see in the photo on the right. I had a lot of trouble getting the number pad '+' and Enter keys on correctly. The springs just kept not sitting right when I popped the keycap back on. What ended up working for me was to tip the keyboard up, so it was resting on the top edge (IBM logo down), forcing the spring to be naturally a little lower (due to gravity) as I was inserting the keycap.

As you're replacing the longer keys with the bar/bracket thingy, use a tool to push the bar down slightly (and very carefully, you don't want to push it too much and break it!) so it fits under the clamps.

Finally, remember to replace the two screws on the bottom.

Heh, you probably can't even tell at a casual glance that that is a different photo than the first one I posted because the keyboard was relatively clean to begin with. This photo is definitely post-cleaning-and-fixing!

And that's pretty much it! I hope this helps someone out there. There are a number of guides to repairing and disassembling/reassembling the Model M2 keyboard that other people have written over the years but I always felt like there were some details missing, particularly with regard to disassembling. I wrote this post thinking about what details I would have loved to have going into this. It ended up being quite wordy, but well, sometimes (often) more details are better!

Using Watcom's Register-based Calling Convention With TASM

I suppose I'm writing this post for my own benefit primarily. I'll likely forget many of these details in a month, and then go and try to write a bunch more assembly and run into problems. So I'll try to proactively solve that future problem for myself. Everything here is better documented in the compiler documentation. However, it is scattered around a bit and of course isn't written with specific examples for using TASM.

One of the performance benefits that Watcom brought with it that was a pretty big deal at the time was that its default calling convention used registers for up to the first 4 arguments to called functions. Past that, and the stack would be used as per standard C calling conventions.

As mentioned this calling convention is the default, but it can be globally changed via the CPU instruction code generation compiler switch. For example, /3 and /3r both select 386 instructions with register-based calling convention, while /3s selects 386 instructions with stack-based calling convention.

Borland Turbo Assembler (TASM) does not natively support this register-based calling convention among its varied support for programming-language specific calling conventions. However it does let you use its "NOLANGUAGE" option (which is the default if no language is specified) and then you can handle all the details yourself.

ideal

p386  
model flat  
codeseg

locals

public add_numbers_

; int add_numbers(int a, int b)
; inputs:
;   eax = a
;   edx = b
; return:
;   eax
proc add_numbers_ near  
    push ebp
    mov ebp, esp

    add eax, edx

    pop ebp
    ret
    endp

end  

This is pretty normal looking TASM. Complete with normal looking assembly prologue and epilogue code. Note that we are intentionally not specifying a language modifier.

So, first off, add_numbers_ has a trailing underscore to match what Watcom expects by default. If you don't like this for whatever reason, you can change the name here to your liking, but the use of a #pragma in your C code is necessary to inform Watcom about the different naming convention for this function.

Second, via the magic of the register-based calling convention, Watcom will have our two number arguments all ready for us in eax and edx. Our return value is assumed to be in eax, and that is correct in our case so we're all good.

The great thing is, we don't actually need to do anything fancy to call this function from our C code.

// prototype
int add_numbers(int a, int b);

// usage
int result;  
result = add_numbers(10, 20);  
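
As for the #pragma mentioned earlier for a differently named symbol, here is a minimal sketch. It assumes (purely for illustration) that the assembly routine was exported as plain add_numbers with no trailing underscore. The "*" in the string stands for the C symbol name exactly as written. Treat this as a starting point and double-check the pattern syntax against the pragma documentation for your compiler version.

```c
/* Assumption: the TASM side declares "public add_numbers" (no trailing underscore). */
#pragma aux add_numbers "*";
int add_numbers(int a, int b);
```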

But that was the simple case.

This register-based calling convention actually places the burden on the called function to clean things up before returning. This includes preserving some register values as well. According to the documentation: "All used 80x86 registers must be saved on entry and restored on exit except those used to pass arguments and return values." So, in our add_numbers_ function if we had wanted to use ecx, we would need to push and pop it during the prologue and epilogue code. But we didn't need to do so for eax and edx because those were used to pass arguments and return a value.

As mentioned previously, the stack gets used for arguments once all the registers have been used for arguments (by default, eax, edx, ebx, ecx in that order). In this case, the called function is responsible for popping them off the stack when it returns. So, if there were two int arguments that were passed on the stack, we would need to do a ret 8 to return.

; For this function, using the default register calling convention, the first 4 arguments
; will be passed in registers eax, edx, ebx and ecx. The last two will be passed on the stack.

; void direct_blit_4(int width4,
;                    int lines,
;                    byte *dest,
;                    byte *src,
;                    int dest_y_inc,
;                    int src_y_inc);
proc direct_blit_4_ near  
arg @@dest_y_inc:dword, @@src_y_inc:dword  
    push ebp
    mov ebp, esp  ; don't try to be clever and move this elsewhere!
    push edi      ; likewise, don't try to group the pushes all together!
    push esi

    ; code here (that also modifies edi and esi, thus the additional pushes/pops)

    pop esi
    pop edi
    pop ebp
    ret 8
    endp
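
To make the ret 8 arithmetic concrete, here is a small C sketch. The helper name watcom_ret_bytes is made up for illustration, and it assumes every argument is a plain 32-bit value (so no doubles or structs, which follow extra rules).

```c
#include <assert.h>

/* Number of 32-bit arguments the default convention passes in registers
   (eax, edx, ebx, ecx). */
#define REGISTER_ARGS 4

/* Hypothetical helper: how many bytes the called function must pop with
   "ret N" when every argument is a 32-bit value. */
static int watcom_ret_bytes(int num_args) {
    assert(num_args >= 0);
    if (num_args <= REGISTER_ARGS)
        return 0;                            /* a plain "ret" is enough */
    return (num_args - REGISTER_ARGS) * 4;   /* 4 bytes per stack argument */
}
```

For direct_blit_4_ above, six arguments give (6 - 4) * 4 = 8, which matches the ret 8.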

Is this all too cumbersome to worry about? Well, I don't really think it's a big deal, but there is a way we can remove ourselves from this burden.

Let's say we didn't want to have to worry about preserving any of eax, ebx, ecx, edx, edi, or esi regardless of how many arguments our function has and what (if any) return value it uses. Also, maybe we don't want to have to worry about popping arguments off the stack ourselves when our assembly functions return.

// define our "asmcall" calling convention
#pragma aux asmcall parm caller \
                    modify [eax ebx ecx edx edi esi];

#pragma aux (asmcall) add_numbers;
int add_numbers(int a, int b);       // no change to the function prototype is necessary  

What if we actually wanted to use the normal C stack-based calling convention for our assembly functions and ignore this register argument nonsense? Maybe you're using an existing library and it was written for other compilers that don't use this register-based calling convention.

#pragma aux asmstackcall parm caller [] \
                         modify [eax ebx ecx edx edi esi];

Watcom also pre-defines the cdecl symbol for this same purpose, which you can and probably should use instead of defining your own.

The empty brackets [] denote an empty register set to be used for parameter passing. That is, we are saying not to use any registers, so the stack is used instead for all of them. With that in mind, we could expand the set of default registers used for parameter passing:

#pragma aux asmcallmorereg parm caller [eax edx ebx ecx edi esi] \
                           modify [eax ebx ecx edx edi esi];

In this case the modify list is redundant and need not be specified.

Of course, saying that your function will use/modify more registers means that the compiler has to work around it before and after calls to your assembly function which may result in less optimal code being generated. There's always a trade off!

None of the above #pragmas remove the need for the standard prologue and epilogue code that you've seen a thousand times before:

push ebp  
mov ebp, esp  
; ...
pop ebp  

The only exception is if your assembly function isn't using the stack at all.

There are many details I've left out. For example, passing double values will mean two registers will get used for one argument because doubles are 8 bytes. But if you only have one register left (maybe you passed 3 ints first), then the double value will get passed on the stack instead. Additionally there are more details to know when passing/returning structs. But I'm not doing any of this right now, so I've not really looked into it beyond a passing glance.

Attempts At Optimizing VGA Mode 13h BitBlts and Tilemap Rendering

As my last post indicated (which has been a while now, whoops!), I've been working on optimizing the bitblt routines in libDGL. I'm definitely no master of optimization and am not expecting to come up with anything revolutionary. In fact I'm sure I will goof things up a fair bit in the process and miss obvious avenues of optimization, heh. Every little bit of speed counts for the hardware I'm targeting with this project (486's and maybe 386's later on). Figured I'd share the results of some of my latest attempts, including the parts that didn't result in improvements.

Recently, I thought it would be a fun idea to do a pretty much 1:1 conversion of some old projects of mine written around the turn of the century. The idea would be to convert them to C and get them up and running with libDGL. The code in these projects was pretty terrible and I've no long-term intentions of extending them. However, I figured it would be interesting mainly just to see how fast I could get them running.

Even back then, I liked using obsolete development tools like QuickBASIC 4.5 which by 2000/2001 was definitely obsolete. At that time I was writing code on an AMD Duron 800MHz PC, so I was never too concerned with performance, even with QuickBASIC. Running this old QuickBASIC code today as-is on my 486 DX2-66, I see that it barely maintains 30-35 FPS. This code was built using DirectQB 1.61 (a great library for the time, used by a lot of games) with some improved tile/sprite bitblt routines by Rel since the original routines in DirectQB were known to be kind of slow.

Of course, just by doing a straight-up conversion to C we will get significant performance gains. QuickBASIC isn't exactly an optimizing compiler, heh. But I wanted to see just how fast we could get the basic tilemap rendering going.

Here's the original QuickBASIC code for drawing the tilemap:

SUB DrawMap  
  'Get camera position
  CameraX = Engine.x - 160
  CameraY = Engine.y - 96

  'Make sure we aren't going to go off the map buffer
  IF CameraX < 0 THEN CameraX = 0
  IF CameraY < 0 THEN CameraY = 0
  IF CameraX > Engine.MaxX - 320 THEN CameraX = Engine.MaxX - 320
  IF CameraY > Engine.MaxY - 200 THEN CameraY = Engine.MaxY - 200

  'Get the starting tile to draw at
  xTile = CameraX \ 16
  yTile = CameraY \ 16

  'Get the pixel offset to draw at
  xpos = CameraX MOD 16
  ypos = CameraY MOD 16

  'Now actually draw the map
  FOR x = 0 TO 21
    FOR y = 0 TO 14
      'Get the tile numbers to draw
      tile = map(x + xTile, y + yTile).tile
      tile2 = map(x + xTile, y + yTile).tile2

      xp = x * 16 - xpos
      yp = y * 16 - ypos

      'Draw the first layer
      RelSpriteSolid 1, xp, yp, VARSEG(tilearray(0, tile)), VARPTR(tilearray(0, tile))

      'Draw the second layer only if the tile isn't equal to 0
      '(Tile 0 should be a blank tile, in which case we don't want to draw
      'it because it'll slow the engine down)
      IF tile2 <> 0 THEN
        RelSprite 1, xp, yp, VARSEG(tilearray(0, tile2)), VARPTR(tilearray(0, tile2))
      END IF
    NEXT y
  NEXT x
END SUB  

So after doing a bunch of straightforward converting and getting the basic engine up and running, I ended up with the following C function. This is, for now, intended to be a 1:1 equivalent of the above code:

#define SCREEN_X_TILES  21
#define SCREEN_Y_TILES  14

#define TILE_WIDTH      16
#define TILE_HEIGHT     16

void draw_map(void) {  
    int x_tile, y_tile;
    int x_offs, y_offs;
    int x, y, index;
    int xp, yp;
    byte tile1, tile2;
    SURFACE **tileset;

    // get camera position
    engine->camera_x = engine->x - 160;
    engine->camera_y = engine->y - 96;

    // make sure we aren't going to go off the map buffer
    if (engine->camera_x < 0)
        engine->camera_x = 0;
    if (engine->camera_y < 0)
        engine->camera_y = 0;
    if (engine->camera_x > engine->width - 320)
        engine->camera_x = engine->width - 320;
    if (engine->camera_y > engine->height - 200)
        engine->camera_y = engine->height - 200;

    // get the starting tile to draw at
    x_tile = engine->camera_x / TILE_WIDTH;
    y_tile = engine->camera_y / TILE_HEIGHT;

    // get the pixel offset to draw at
    x_offs = engine->camera_x % TILE_WIDTH;
    y_offs = engine->camera_y % TILE_HEIGHT;

    // now actually draw the map
    tileset = map_tiles->tiles;
    for (y = 0; y < SCREEN_Y_TILES; ++y) {
        for (x = 0; x < SCREEN_X_TILES; ++x) {
            index = (((y + y_tile) * map->width) + (x + x_tile)) * 2;
            tile1 = map->tiledata[index];
            tile2 = map->tiledata[index + 1];

            xp = x * TILE_WIDTH - x_offs;
            yp = y * TILE_HEIGHT - y_offs;

            surface_blit(tileset[tile1], backbuffer, xp, yp);
            if (tile2)
                surface_blit_sprite(tileset[tile2], backbuffer, xp, yp);
        }
    }
}

Definitely room for improvement here.

So how fast is it? Well, I was impressed actually. Mostly. There are two scenarios that are important to benchmark:

Fast scenario
Slow scenario

Just ignore the obviously ripped graphics. Younger-me couldn't draw pixel art, but I sure knew how to rip graphics from SNES ROMs! There was some great DOS-based tool I remember I really liked for doing it, but unfortunately I cannot recall the name of it. Anyway, Today-me still doesn't know how to draw pixel art, so we'll just continue using these ripped graphics for now.

So the difference between these two scenarios is the amount of "layer 2" tiles being drawn. You can see in the above render loop code that a check is done for a non-zero tile2 value and then a call to surface_blit_sprite is done. This of course draws the second layer tile with transparency to overlay tiles on top of the lower layer. As you can see in the right screenshot, there are a ton of tree tiles being drawn and with the map that the engine is using for this test, these are all on layer two (with layer one just being a simple grass tile).

Let's get the blatantly obvious problem out of the way first: There is NO need for the map to be laid out this way. We don't even need to touch code to see significant improvements. We just need to adjust the tileset and map so that we can skip the vast majority of the layer 2 tiles. If we needed a bunch of variations of tree tiles with different ground underneath them, then we might very well be better served with a bunch of different tree tiles in the tileset we use, each with the different ground needed. However, this map was what I put together in 2001 or so. For now we'll just stick with it and see what else we can do.

Also, I should mention that these two screenshots show the frame-rates at double-word aligned memory offsets. We'll come back to this later though as memory alignment is obviously an important topic.

So, the first thing that came to mind when looking at my freshly converted draw_map() function was to eliminate the unnecessary clipping checks for most of the tiles being drawn in this render loop (with the check used here, that is 19 x 11 = 209 of the 21 x 14 = 294 tiles, or about 71%). Only tiles on the edge of the screen actually need to be checked for clipping. We can do this very simply for now:

if (x > 0 &&  
    y > 0 && 
    x < (SCREEN_X_TILES - 1) && 
    y < (SCREEN_Y_TILES - 2)) {
    surface_blit_f(tileset[tile1], backbuffer, xp, yp);
    if (tile2)
        surface_blit_sprite_f(tileset[tile2], backbuffer, xp, yp);
} else {
    surface_blit(tileset[tile1], backbuffer, xp, yp);
    if (tile2)
        surface_blit_sprite(tileset[tile2], backbuffer, xp, yp);
}

surface_blit_f and surface_blit_sprite_f are "fast" versions that skip clipping checks, but otherwise work exactly the same (in fact, surface_blit and surface_blit_sprite call the fast versions internally).

This gets us a little bit of an improvement. 147/148 FPS in the fast scenario, and 97/98 FPS in the slow scenario.
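
As a sanity check on how many tiles actually take the fast path, here is a small standalone sketch. The helper names tile_needs_clipping and count_unclipped_tiles are made up for this example. The condition is the same one used in the loop, just inverted.

```c
#include <assert.h>

#define SCREEN_X_TILES  21
#define SCREEN_Y_TILES  14

/* Returns 1 when the tile at loop position (x, y) can touch a screen edge
   and therefore needs the clipped blit. */
static int tile_needs_clipping(int x, int y) {
    assert(x >= 0 && x < SCREEN_X_TILES);
    assert(y >= 0 && y < SCREEN_Y_TILES);
    return !(x > 0 &&
             y > 0 &&
             x < (SCREEN_X_TILES - 1) &&
             y < (SCREEN_Y_TILES - 2));
}

/* Counts how many of the tiles drawn each frame can use the unclipped blits. */
static int count_unclipped_tiles(void) {
    int x, y, count = 0;
    for (y = 0; y < SCREEN_Y_TILES; ++y)
        for (x = 0; x < SCREEN_X_TILES; ++x)
            if (!tile_needs_clipping(x, y))
                ++count;
    return count;
}
```

With these constants, 209 of the 294 tile positions (about 71%) skip the clipping checks.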

It's important to note here that in VGA mode 13h, the screen resolution is 320x200. We're using 16x16 tiles in our tilemap, so we end up needing to draw a minimum of 20 tiles horizontally, and 13 tiles vertically (where the last row at the bottom will only be half visible, since 200/16 = 12.5). In order to do pixel-by-pixel scrolling as in this tilemap rendering engine, we need to add one extra column and row to ensure we don't have any gaps anywhere along the edges at any point as the screen is scrolling. Because of the uneven vertical tile count (12.5) we actually need to do clipping for the bottom two rows.

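We could of course add a check for y_offs > 8 to determine if we actually need to draw that last row at all (the last row starts at yp = 208 - y_offs, so it only reaches onto the 200 pixel high screen once y_offs is greater than 8). Though this obviously won't always improve performance, it would depend on how the screen is currently scrolled.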
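
The tile-count arithmetic can be written down as a quick sketch. The function names here are invented for the example, and it assumes the first column/row starts x_offs/y_offs pixels off the top-left of the screen, as in the render loop.

```c
#include <assert.h>

#define SCREEN_WIDTH   320
#define SCREEN_HEIGHT  200
#define TILE_WIDTH     16
#define TILE_HEIGHT    16

/* Integer division that rounds up; both values must be positive. */
static int div_round_up(int value, int divisor) {
    assert(value > 0 && divisor > 0);
    return (value + divisor - 1) / divisor;
}

/* Columns that are at least partly visible for a given scroll offset. */
static int columns_needed(int x_offs) {
    assert(x_offs >= 0 && x_offs < TILE_WIDTH);
    return div_round_up(SCREEN_WIDTH + x_offs, TILE_WIDTH);
}

/* Rows that are at least partly visible for a given scroll offset. */
static int rows_needed(int y_offs) {
    assert(y_offs >= 0 && y_offs < TILE_HEIGHT);
    return div_round_up(SCREEN_HEIGHT + y_offs, TILE_HEIGHT);
}
```

columns_needed() gives 20 when x_offs is 0 and 21 otherwise. rows_needed() gives 13 while y_offs is 8 or less and 14 once it is greater than 8, which is the case where the extra bottom row is really needed.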

Anyway, the next thought I had was to improve the way that the map data was being accessed inside the loop. I didn't figure that this would make a big difference, but let's see:

// now actually draw the map
index = (y_tile * map->width) + x_tile;  
tiledata = &map->tiledata[index * 2];  
tileset = map_tiles->tiles;

xp = -x_offs;  
yp = -y_offs;

for (y = 0; y < SCREEN_Y_TILES; ++y) {  
    for (x = 0; x < SCREEN_X_TILES; ++x) {
        tile1 = tiledata[0];
        tile2 = tiledata[1];

        if (x > 0 && 
            y > 0 && 
            x < (SCREEN_X_TILES - 1) && 
            y < (SCREEN_Y_TILES - 2)) {
            surface_blit_f(tileset[tile1], backbuffer, xp, yp);
            if (tile2)
                surface_blit_sprite_f(tileset[tile2], backbuffer, xp, yp);
        } else {
            surface_blit(tileset[tile1], backbuffer, xp, yp);
            if (tile2)
                surface_blit_sprite(tileset[tile2], backbuffer, xp, yp);
        }

        tiledata += 2;
        xp += TILE_WIDTH;
    }

    tiledata += (map->width - SCREEN_X_TILES) * 2;
    yp += TILE_HEIGHT;
    xp = -x_offs;
}

We get a very minor boost from this. 151/152 FPS in the fast scenario, and 98/99 FPS in the slow scenario. But it's something!
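
If you want to convince yourself that stepping a pointer visits exactly the same map entries as recomputing the index every time, a tiny standalone sketch like this does the job. The 3x2 viewport and the collect_* helper names are assumptions made just for this example. The map layout (two bytes per tile, layer 1 then layer 2) is the same as above.

```c
#include <assert.h>

typedef unsigned char byte;

#define VIEW_W 3   /* tiny stand-in for SCREEN_X_TILES */
#define VIEW_H 2   /* tiny stand-in for SCREEN_Y_TILES */

/* Collects both layers of every visible tile by recomputing the index. */
static void collect_by_index(const byte *tiledata, int map_width,
                             int x_tile, int y_tile, byte *out) {
    int x, y, index;
    assert(x_tile >= 0 && x_tile + VIEW_W <= map_width);
    for (y = 0; y < VIEW_H; ++y) {
        for (x = 0; x < VIEW_W; ++x) {
            index = (((y + y_tile) * map_width) + (x + x_tile)) * 2;
            *out++ = tiledata[index];
            *out++ = tiledata[index + 1];
        }
    }
}

/* Collects the same tiles by stepping a pointer through the map data. */
static void collect_by_pointer(const byte *tiledata, int map_width,
                               int x_tile, int y_tile, byte *out) {
    const byte *p = &tiledata[((y_tile * map_width) + x_tile) * 2];
    int x, y;
    assert(x_tile >= 0 && x_tile + VIEW_W <= map_width);
    for (y = 0; y < VIEW_H; ++y) {
        for (x = 0; x < VIEW_W; ++x) {
            *out++ = p[0];
            *out++ = p[1];
            p += 2;                         /* next tile in this row */
        }
        p += (map_width - VIEW_W) * 2;      /* jump to the start of the next row */
    }
}
```

Both functions fill out with the same VIEW_W * VIEW_H * 2 bytes in the same order. The pointer version just trades the multiply per tile for a couple of additions.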

Well, now we have that x/y coordinate check that runs every iteration of the loop to see if we need to use clipped blits or not. We know that most of the screen (about 71% of the tiles) does not need clipped blits, so I decided to try splitting up the rendering loop on this basis so that we don't need to run that check all the time. I ended up with 3 separate loops:

  • The top and bottom row (for y = 0, y = 12 and y = 13 only, remember two bottom rows can get clipped)
  • The left and right columns (for x = 0 and x = 20 only)
  • Everything in-between.

In the end this gained me about 1 FPS, but it made the code significantly larger because there were three loops instead of one, and each of these included similar sets of calculations (but different enough that I did need to have three sets of them). Probably could have cleaned the code up a fair bit, but ultimately since this all made a tiny difference, I decided that the extra complexity just wasn't worth keeping. Perhaps I would want to revisit this later on once I see how this runs on a 386 CPU.
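
For the curious, the shape of that split looks roughly like the sketch below. This is not the code I actually had. The mark() helper and the drawn grid are stand-ins invented for the example so that it can check each tile position gets handled exactly once. The real loops would call the clipped and unclipped blits instead.

```c
#include <assert.h>

#define SCREEN_X_TILES  21
#define SCREEN_Y_TILES  14

enum { NOT_DRAWN = 0, CLIPPED = 1, UNCLIPPED = 2 };

/* Hypothetical stand-in for a blit: records which kind of blit a tile got. */
static void mark(int drawn[SCREEN_Y_TILES][SCREEN_X_TILES], int x, int y, int kind) {
    assert(drawn[y][x] == NOT_DRAWN);   /* every tile must be drawn exactly once */
    drawn[y][x] = kind;
}

static void draw_split(int drawn[SCREEN_Y_TILES][SCREEN_X_TILES]) {
    int x, y;

    /* 1. full-width rows that may be clipped: the top row and the bottom two */
    for (x = 0; x < SCREEN_X_TILES; ++x) {
        mark(drawn, x, 0, CLIPPED);
        mark(drawn, x, SCREEN_Y_TILES - 2, CLIPPED);
        mark(drawn, x, SCREEN_Y_TILES - 1, CLIPPED);
    }

    /* 2. left and right columns of the remaining rows */
    for (y = 1; y < SCREEN_Y_TILES - 2; ++y) {
        mark(drawn, 0, y, CLIPPED);
        mark(drawn, SCREEN_X_TILES - 1, y, CLIPPED);
    }

    /* 3. everything in between never needs clipping */
    for (y = 1; y < SCREEN_Y_TILES - 2; ++y)
        for (x = 1; x < SCREEN_X_TILES - 1; ++x)
            mark(drawn, x, y, UNCLIPPED);
}
```

Since the tiles never overlap each other, drawing them in this different order gives the same picture.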

Next, I decided to take a closer look at what the actual surface_blit and surface_blit_sprite calls do. I've already spent a bunch of time trying to optimize them and I'm sure there's still some stuff that can be done, but I'll start off with them as they are today and not the much slower versions I had written a few months ago.

static void surface_blit(const SURFACE *src, SURFACE *dest, int x, int y) {  
    surface_blit_region(src, dest, 0, 0, src->width, src->height, x, y);
}

static void surface_blit_sprite(const SURFACE *src, SURFACE *dest, int x, int y) {  
    surface_blit_sprite_region(src, dest, 0, 0, src->width, src->height, x, y);
}

void surface_blit_region(const SURFACE *src,  
                         SURFACE *dest,
                         int src_x,
                         int src_y,
                         int src_width,
                         int src_height,
                         int dest_x,
                         int dest_y) {
    RECT src_region = rect(src_x, src_y, src_width, src_height);
    boolean on_screen = clip_blit(&dest->clip_region, &src_region, &dest_x, &dest_y);

    if (!on_screen)
        return;

    surface_blit_region_f(src, dest,
                          src_region.x, src_region.y,
                          src_region.width, src_region.height,
                          dest_x, dest_y);
}

void surface_blit_sprite_region(const SURFACE *src,  
                                SURFACE *dest,
                                int src_x,
                                int src_y,
                                int src_width,
                                int src_height,
                                int dest_x,
                                int dest_y) {
    RECT src_region = rect(src_x, src_y, src_width, src_height);
    boolean on_screen = clip_blit(&dest->clip_region, &src_region, &dest_x, &dest_y);

    if (!on_screen)
        return;

    surface_blit_sprite_region_f(src, dest,
                                 src_region.x, src_region.y,
                                 src_region.width, src_region.height,
                                 dest_x, dest_y);
}

Alright, so as we can see, these aren't super interesting and we might as well just look at surface_blit_region_f and surface_blit_sprite_region_f. We could likely optimize clip_blit. In fact, I don't even think I've tried to do this at all ever. The existing implementation is a copy of some code I had written over 10 years ago if I recall correctly, heh. However, let's just ignore it for now since the current tilemap function we have skips clipping for the majority (about 71%) of the tiles being drawn.

static int surface_offset(const SURFACE *surface, int x, int y) {  
    return (surface->width * y) + x;
}

static byte* surface_pointer(const SURFACE *surface, int x, int y) {  
    return surface->pixels + surface_offset(surface, x, y);
}

void surface_blit_region_f(const SURFACE *src,  
                           SURFACE *dest,
                           int src_x,
                           int src_y,
                           int src_width,
                           int src_height,
                           int dest_x,
                           int dest_y) {
    const byte *psrc;
    byte *pdest;
    int lines;
    int src_y_inc = src->width - src_width;
    int dest_y_inc = dest->width - src_width;
    int width_4, width_remainder;

    psrc = (const byte*)surface_pointer(src, src_x, src_y);
    pdest = (byte*)surface_pointer(dest, dest_x, dest_y);
    lines = src_height;

    width_4 = src_width / 4;
    width_remainder = src_width & 3;

    if (width_4 && !width_remainder) {
        // width is a multiple of 4 (no remainder)
        direct_blit_4(width_4, lines, pdest, psrc, dest_y_inc, src_y_inc);

    } else if (width_4 && width_remainder) {
        // width is >= 4 and there is a remainder ( <= 3 )
        direct_blit_4r(width_4, lines, width_remainder, pdest, psrc, dest_y_inc, src_y_inc);

    } else {
        // width is <= 3
        direct_blit_r(width_remainder, lines, pdest, psrc, dest_y_inc, src_y_inc);
    }
}

I talked about this in my last post, but to recap, the idea here is that I figured there were three main scenarios for bitblts (post-clipping of course):

  • The width of the blit is an even multiple of 4. In this case, we can simply do a rep movsd for each row. Very nice and efficient.
  • The width is some value larger than 4, but it is not an even multiple of 4 so we can split each row into a rep movsd followed by a rep movsb.
  • The width is some value < 4. We can just do a rep movsb.

The most common scenario when dealing with "typical" game graphics would be the first scenario. In our case, we are using 16x16 tiles and sprites, so definitely this will be the case for us. The remaining two scenarios would primarily occur for partially clipped blits, so these two would not be what would get run for the vast majority of blits.

It's worth pointing out that I didn't start with a blit function implementation that had these three scenarios. I started with a simple one that just did rep movsb for each row of pixels. Once I saw how that performed I then thought about it and came up with the three scenario idea. I then saw that, indeed, this way performed much better. Having said that, I'm sure it's been implemented better by smarter people than me decades earlier.
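
To show the idea without any assembly, here is a portable C sketch of the same row copy. It is only an illustration, not the libDGL code: blit_rows is a made-up name, the 4-byte memcpy() stands in for rep movsd and the byte loop stands in for rep movsb. The two y_inc parameters mean the same as in surface_blit_region_f (surface width minus blit width).

```c
#include <assert.h>
#include <string.h>

typedef unsigned char byte;

/* Copies "lines" rows of "width" bytes each. After each row both pointers
   are advanced by their y_inc value to reach the start of the next row. */
static void blit_rows(byte *dest, const byte *src, int width, int lines,
                      int dest_y_inc, int src_y_inc) {
    int width_4 = width / 4;          /* whole dwords per row */
    int width_remainder = width & 3;  /* leftover bytes per row (0..3) */
    int i;

    assert(width >= 0 && lines >= 0);

    while (lines-- > 0) {
        for (i = 0; i < width_4; ++i) {         /* the "rep movsd" part */
            memcpy(dest, src, 4);
            dest += 4;
            src += 4;
        }
        for (i = 0; i < width_remainder; ++i)   /* the "rep movsb" part */
            *dest++ = *src++;

        dest += dest_y_inc;                     /* move to next line */
        src += src_y_inc;
    }
}
```

When width is a multiple of 4 the byte loop never runs (scenario one), and when width is less than 4 the dword loop never runs (scenario three). The assembly versions simply split these cases into separate functions so the inner loop carries no dead weight.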

The direct_blit_xxxx calls are implemented in assembly and are relatively simple:

void direct_blit_4(int width4,  
                   int lines,
                   byte *dest,
                   const byte *src,
                   int dest_y_inc,
                   int src_y_inc) {
    _asm {
        mov edi, ebx             // dest pixels
        mov esi, ecx             // source pixels

        // eax = number of 4-pixel runs (dwords)
        // edx = line loop counter

        test edx, edx            // make sure there is >0 lines to draw
        jz done

    draw_line:
        mov ecx, eax             // draw all 4-pixel runs (dwords)
        rep movsd

        add esi, src_y_inc       // move to next line
        add edi, dest_y_inc
        dec edx                  // decrease line loop counter
        jnz draw_line            // keep going if there's more lines to draw

    done:
    }
}

void direct_blit_4r(int width4,  
                    int lines,
                    int remainder,
                    byte *dest,
                    const byte *src,
                    int dest_y_inc,
                    int src_y_inc) {
    _asm {
        mov edi, ecx             // dest pixels
        mov esi, src             // source pixels

        // eax = number of 4-pixel runs (dwords)
        // ebx = remaining number of pixels
        // edx = line loop counter

        test edx, edx            // make sure there is >0 lines to draw
        jz done

    draw_line:
        mov ecx, eax             // draw all 4-pixel runs (dwords)
        rep movsd
        mov ecx, ebx             // draw remaining pixels ( <= 3 bytes )
        rep movsb

        add esi, src_y_inc       // move to next line
        add edi, dest_y_inc
        dec edx                  // decrease line loop counter
        jnz draw_line            // keep going if there's more lines to draw

    done:
    }
}

void direct_blit_r(int width,  
                   int lines,
                   byte *dest,
                   const byte *src,
                   int dest_y_inc,
                   int src_y_inc) {
    _asm {
        mov edi, ebx             // dest pixels
        mov esi, ecx             // source pixels

        // eax = number of pixels to draw (bytes)
        // edx = line loop counter

        test edx, edx            // make sure there is >0 lines to draw
        jz done

    draw_line:
        mov ecx, eax             // draw pixels (bytes)
        rep movsb

        add esi, src_y_inc       // move to next line
        add edi, dest_y_inc
        dec edx                  // decrease line loop counter
        jnz draw_line            // keep going if there's more lines to draw

    done:
    }
}
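
For anyone following along without Watcom, here is a rough portable C sketch of what direct_blit_4r does on each line. It is only a stand-in for illustration: blit_4r_reference is a made-up name, and memcpy takes the place of the rep movsd / rep movsb pair, so it says nothing about how fast the real assembly runs.

```c
#include <stddef.h>
#include <string.h>

typedef unsigned char byte;

/* Hypothetical portable stand-in for direct_blit_4r: copy each line as a
   run of whole dwords followed by the leftover bytes, then skip ahead to
   the start of the next line in both surfaces. */
void blit_4r_reference(int width4, int lines, int remainder,
                       byte *dest, const byte *src,
                       int dest_y_inc, int src_y_inc) {
    while (lines-- > 0) {
        memcpy(dest, src, (size_t)width4 * 4);   /* stands in for rep movsd */
        dest += width4 * 4;
        src  += width4 * 4;

        memcpy(dest, src, (size_t)remainder);    /* stands in for rep movsb */
        dest += remainder;
        src  += remainder;

        dest += dest_y_inc;                      /* move to next line */
        src  += src_y_inc;
    }
}
```

As in the assembly, dest_y_inc and src_y_inc are the surface width minus the blit width, so copying a 6-pixel-wide region out of an 8-pixel-wide source into a 10-pixel-wide destination would pass 4 and 2 respectively.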

Some compiler/toolchain-related things to note here first before going on:

  • I'm using Watcom C 11.0 for this. Thus, I can make use of the nice _asm block inline assembly support. However, as I demonstrated in my previous post, I had run into what looked like a compiler bug with Watcom's _asm support. After playing with it some more, I noticed I got good results by just moving all my _asm blocks to their own functions. This is kinda-sorta-maybe in some ways like using externally linked assembly functions via something like TASM or MASM. At least, as far as it just being a function call instead of slapped somewhere else inline in some block of C code. Kind of a nice separation, especially for these blit functions and it both fixes the problem I had run into and means I don't have to worry much about setting up a proper calling convention in whatever assembler I would otherwise be using (which would involve some icky #pragma usage). Which leads me into the next point ...
  • Watcom's default calling convention uses registers for the first 4 arguments (well, at least for 32-bit values). eax, edx, ebx, and ecx in that order. For any remaining arguments, the stack is used as per normal C calling convention. Because I'm using a pattern of putting my larger _asm blocks in non-static functions all by themselves, I can "hijack" this calling convention easily enough and skip some additional stack copying that Watcom would generate if I used the argument variable names for those first 4 arguments. This is probably kind of a dirty hack, but it seems to work well, and that's why you'll notice in these assembly functions that I don't seem to ever reference the first 4 arguments. I do, but they are already in registers.

I'm going to go ahead and claim that these are good enough. I mean, they've basically been reduced to a loop of rep movsd's in the best and by far most common case. I don't think it gets much better than that. I'm sure there's some optimization guru reading this that is face-palming after having read that statement and noticed some dumb thing I did in the code, but hey, I did say above that I'm no expert!

I did play with using ebp as well in direct_blit_4 since that function is called the most out of the three. Using ebp as a general purpose register in that function allows me to not use any values from the stack at all once inside the loop, so I figured it would be worthwhile. But it barely made any noticeable difference on my 486. It would probably be worth testing on a 386 though to see the difference, but I'll wait until I have an actual 386 to test with. For now, I decided to leave this optimization out. Using ebp in this way is tricky, as once you change it like this you cannot access anything using your variable names (the compiler, or assembler, replaces them with addresses relative to ebp). You could still access the stack using addresses relative to esp, but I decided not to go down this road at this time.

Alright, well, since the slowdowns really seem to be regarding the surface_blit_sprite calls in our draw_map() function, let's look at it. The core of it (as indicated by code shown previously) is implemented in surface_blit_sprite_region_f:

void surface_blit_sprite_region_f(const SURFACE *src,  
                                  SURFACE *dest,
                                  int src_x,
                                  int src_y,
                                  int src_width,
                                  int src_height,
                                  int dest_x,
                                  int dest_y) {
    const byte *psrc;
    byte *pdest;
    byte pixel;
    int src_y_inc, dest_y_inc;
    int width, width_4, width_8, width_remainder;
    int lines_left;
    int x;

    psrc = (const byte*)surface_pointer(src, src_x, src_y);
    src_y_inc = src->width;
    pdest = (byte*)surface_pointer(dest, dest_x, dest_y);
    dest_y_inc = dest->width;
    width = src_width;
    lines_left = src_height;

    src_y_inc -= width;
    dest_y_inc -= width;

    width_4 = width / 4;
    width_remainder = width & 3;

    if (width_4 && !width_remainder) {
        if ((width_4 & 1) == 0) {
            // width is actually an even multiple of 8!
            direct_blit_sprite_8(width_4 / 2, lines_left, pdest, psrc, dest_y_inc, src_y_inc);
        } else {
            // width is a multiple of 4 (no remainder)
            direct_blit_sprite_4(width_4, lines_left, pdest, psrc, dest_y_inc, src_y_inc);
        }

    } else if (width_4 && width_remainder) {
        if ((width_4 & 1) == 0) {
            // width is _mostly_ made up of an even multiple of 8,
            // plus a small remainder
            direct_blit_sprite_8r(width_4 / 2, lines_left, pdest, psrc, width_remainder, dest_y_inc, src_y_inc);
        } else {
            // width is >= 4 and there is a remainder
            direct_blit_sprite_4r(width_4, lines_left, pdest, psrc, width_remainder, dest_y_inc, src_y_inc);
        }

    } else {
        // width is <= 3
        direct_blit_sprite_r(width_remainder, lines_left, pdest, psrc, dest_y_inc, src_y_inc);
    }
}

Immediately, we can see it's a bit different from surface_blit_region_f, but at its core it's the same basic three-scenario implementation. It's just been extended to also look for an even multiple of 8 and to call a slightly different assembly function for those cases.
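
To make the dispatch easier to follow, here is a small sketch that only mirrors the branching in surface_blit_sprite_region_f and reports which variant a given width would end up in. The enum and the pick_sprite_variant name are hypothetical and exist just for this illustration.

```c
/* Hypothetical labels for the five sprite blit variants. */
enum sprite_blit_variant {
    VARIANT_8,   /* direct_blit_sprite_8  */
    VARIANT_4,   /* direct_blit_sprite_4  */
    VARIANT_8R,  /* direct_blit_sprite_8r */
    VARIANT_4R,  /* direct_blit_sprite_4r */
    VARIANT_R    /* direct_blit_sprite_r  */
};

/* Same tests as surface_blit_sprite_region_f, minus the actual blitting. */
enum sprite_blit_variant pick_sprite_variant(int width) {
    int width_4 = width / 4;
    int width_remainder = width & 3;

    if (width_4 && !width_remainder)
        return ((width_4 & 1) == 0) ? VARIANT_8 : VARIANT_4;
    if (width_4 && width_remainder)
        return ((width_4 & 1) == 0) ? VARIANT_8R : VARIANT_4R;
    return VARIANT_R;
}
```

So a 16-pixel-wide sprite lands in the 8-pixel version, a 12-pixel-wide one in the 4-pixel version (three runs of 4), and an 18-pixel-wide one in the 8-pixel version with a remainder of 2.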

Again, this was something that I initially didn't do for the first implementation of this function. I originally had a simple loop (written entirely in C code) that checked one pixel each iteration and if non-zero would draw it. Nice and simple. I then decided to try unrolling the loop and do 4 pixels per iteration, still only in C code. This gave a significant improvement, so I decided to try adding an extra 8-pixel-per-iteration version and saw an improvement again but not as significant this time. Still, it was enough that I thought it warranted keeping it.
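
The original C versions aren't shown in this post, so the following is only a guess at their general shape: one line of a sprite drawn a pixel at a time, and the same thing unrolled to 4 pixels per iteration. The function names are made up, and the unrolled version assumes the width is a multiple of 4.

```c
typedef unsigned char byte;

/* One pixel per iteration: copy only non-zero (non-transparent) pixels. */
void sprite_line_simple(byte *dest, const byte *src, int width) {
    int x;
    for (x = 0; x < width; ++x) {
        if (src[x])
            dest[x] = src[x];
    }
}

/* Four pixels per iteration. Assumes width is a multiple of 4; any
   leftover pixels would need a separate one-at-a-time loop. */
void sprite_line_unrolled4(byte *dest, const byte *src, int width) {
    int runs = width / 4;
    while (runs-- > 0) {
        if (src[0]) dest[0] = src[0];
        if (src[1]) dest[1] = src[1];
        if (src[2]) dest[2] = src[2];
        if (src[3]) dest[3] = src[3];
        src  += 4;
        dest += 4;
    }
}
```

Both do the same per-pixel test. The unrolled one just pays for the loop counter and branch once per 4 pixels instead of once per pixel, which is the same trade the assembly versions make.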

Watcom's optimizer actually did a pretty damn good job with my C code version. I was able to tweak it slightly to help the optimizer out, but eventually decided that it was probably better to write it in assembly anyway. This is because it seemed rather easy to make what seemed like an extremely minor change to the code that would result in the optimizer generating some real inefficient block(s) of code.

Anyway, here are the direct_blit_sprite_xxxx functions:

void direct_blit_sprite_4(int width4,  
                          int lines,
                          byte *dest,
                          const byte *src,
                          int dest_y_inc,
                          int src_y_inc) {
    _asm {
        mov edi, ebx             // dest pixels
        mov esi, ecx             // source pixels

        // eax = number of 4-pixel runs (dwords)
        // edx = line loop counter

        test edx, edx            // make sure there is >0 lines to be drawn
        jz done

    draw_line:

    start_4_run:
        mov ecx, eax             // ecx = counter of 4-pixel runs left to draw
    draw_px_0:
        mov bl, [esi+0]          // load src pixel
        test bl, bl
        jz draw_px_1             // if it is color 0, skip it
        mov [edi+0], bl          // otherwise, draw it onto dest
    draw_px_1:
        mov bl, [esi+1]
        test bl, bl
        jz draw_px_2
        mov [edi+1], bl
    draw_px_2:
        mov bl, [esi+2]
        test bl, bl
        jz draw_px_3
        mov [edi+2], bl
    draw_px_3:
        mov bl, [esi+3]
        test bl, bl
        jz end_4_run
        mov [edi+3], bl
    end_4_run:
        add esi, 4               // move src and dest up 4 pixels
        add edi, 4
        dec ecx                  // decrease 4-pixel run loop counter
        jnz draw_px_0            // if there are still more runs, draw them

    end_line:
        add esi, src_y_inc       // move src and dest to start of next line
        add edi, dest_y_inc
        dec edx                  // decrease line loop counter
        jnz draw_line            // keep going if there's more lines to draw

    done:
    }
}

void direct_blit_sprite_4r(int width4,  
                           int lines,
                           byte *dest,
                           const byte *src,
                           int remainder,
                           int dest_y_inc,
                           int src_y_inc) {
    _asm {
        mov edi, ebx             // dest pixels
        mov esi, ecx             // source pixels

        // eax = number of 4-pixel runs (dwords)
        // edx = line loop counter

        test edx, edx            // make sure there is >0 lines to be drawn
        jz done

    draw_line:

    start_4_run:                 // draw 4-pixel runs first
        mov ecx, eax             // ecx = counter of 4-pixel runs left to draw
    draw_px_0:
        mov bl, [esi+0]          // load src pixel
        test bl, bl
        jz draw_px_1             // if it is color 0, skip it
        mov [edi+0], bl          // otherwise, draw it onto dest
    draw_px_1:
        mov bl, [esi+1]
        test bl, bl
        jz draw_px_2
        mov [edi+1], bl
    draw_px_2:
        mov bl, [esi+2]
        test bl, bl
        jz draw_px_3
        mov [edi+2], bl
    draw_px_3:
        mov bl, [esi+3]
        test bl, bl
        jz end_4_run
        mov [edi+3], bl
    end_4_run:
        add esi, 4               // move src and dest up 4 pixels
        add edi, 4
        dec ecx                  // decrease 4-pixel run loop counter
        jnz draw_px_0            // if there are still more runs, draw them

    start_remainder_run:         // now draw remaining pixels ( <= 3 pixels )
        mov ecx, remainder       // ecx = counter of remaining pixels

    draw_pixel:
        mov bl, [esi]            // load pixel
        inc esi
        test bl, bl              // if zero, skip to next pixel
        jz end_pixel
        mov [edi], bl            // else, draw pixel
    end_pixel:
        inc edi
        dec ecx
        jnz draw_pixel           // keep drawing pixels while there's still more

    end_line:
        add esi, src_y_inc       // move src and dest to start of next line
        add edi, dest_y_inc
        dec edx                  // decrease line loop counter
        jnz draw_line            // keep going if there's more lines to draw

    done:
    }
}

void direct_blit_sprite_r(int width,  
                          int lines,
                          byte *dest,
                          const byte *src,
                          int dest_y_inc,
                          int src_y_inc) {
    _asm {
        mov edi, ebx             // dest pixels
        mov esi, ecx             // source pixels

        // eax = number of pixels to draw
        // edx = line loop counter

        test edx, edx            // make sure there is >0 lines to be drawn
        jz done

    draw_line:
        mov ecx, eax             // ecx = counter of remaining pixels

    draw_pixel:
        mov bl, [esi]            // load pixel
        inc esi
        test bl, bl              // if zero, skip to next pixel
        jz end_pixel
        mov [edi], bl            // else, draw pixel
    end_pixel:
        inc edi
        dec ecx
        jnz draw_pixel           // loop while there's still pixels left

    end_line:
        add esi, src_y_inc       // move src and dest to start of next line
        add edi, dest_y_inc
        dec edx                  // decrease line loop counter
        jnz draw_line            // keep going if there's more lines to draw

    done:
    }
}

void direct_blit_sprite_8(int width8,  
                          int lines,
                          byte *dest,
                          const byte *src,
                          int dest_y_inc,
                          int src_y_inc) {
    _asm {
        mov edi, ebx             // dest pixels
        mov esi, ecx             // source pixels

        // eax = number of 8-pixel runs
        // edx = line loop counter

        test edx, edx            // make sure there is >0 lines to be drawn
        jz done

    draw_line:
        mov ecx, eax             // ecx = counter of 8-pixel runs left to draw
    draw_px_0:
        mov bl, [esi+0]          // load src pixel
        test bl, bl
        jz draw_px_1             // if it is color 0, skip it
        mov [edi+0], bl          // otherwise, draw it onto dest
    draw_px_1:
        mov bl, [esi+1]
        test bl, bl
        jz draw_px_2
        mov [edi+1], bl
    draw_px_2:
        mov bl, [esi+2]
        test bl, bl
        jz draw_px_3
        mov [edi+2], bl
    draw_px_3:
        mov bl, [esi+3]
        test bl, bl
        jz draw_px_4
        mov [edi+3], bl
    draw_px_4:
        mov bl, [esi+4]
        test bl, bl
        jz draw_px_5
        mov [edi+4], bl
    draw_px_5:
        mov bl, [esi+5]
        test bl, bl
        jz draw_px_6
        mov [edi+5], bl
    draw_px_6:
        mov bl, [esi+6]
        test bl, bl
        jz draw_px_7
        mov [edi+6], bl
    draw_px_7:
        mov bl, [esi+7]
        test bl, bl
        jz end_8_run
        mov [edi+7], bl
    end_8_run:
        add esi, 8               // move src and dest up 8 pixels
        add edi, 8
        dec ecx                  // decrease 8-pixel run loop counter
        jnz draw_px_0            // if there are still more runs, draw them

    end_line:
        add esi, src_y_inc       // move src and dest to start of next line
        add edi, dest_y_inc
        dec edx                  // decrease line loop counter
        jnz draw_line            // keep going if there's more lines to draw

    done:
    }
}

void direct_blit_sprite_8r(int width8,  
                           int lines,
                           byte *dest,
                           const byte *src,
                           int remainder,
                           int dest_y_inc,
                           int src_y_inc) {
    _asm {
        mov edi, ebx             // dest pixels
        mov esi, ecx             // source pixels

        // eax = number of 8-pixel runs
        // edx = line loop counter

        test edx, edx            // make sure there is >0 lines to be drawn
        jz done

    draw_line:

    start_8_run:                 // draw 8-pixel runs first
        mov ecx, eax             // ecx = counter of 8-pixel runs left to draw
    draw_px_0:
        mov bl, [esi+0]          // load src pixel
        test bl, bl
        jz draw_px_1             // if it is color 0, skip it
        mov [edi+0], bl          // otherwise, draw it onto dest
    draw_px_1:
        mov bl, [esi+1]
        test bl, bl
        jz draw_px_2
        mov [edi+1], bl
    draw_px_2:
        mov bl, [esi+2]
        test bl, bl
        jz draw_px_3
        mov [edi+2], bl
    draw_px_3:
        mov bl, [esi+3]
        test bl, bl
        jz draw_px_4
        mov [edi+3], bl
    draw_px_4:
        mov bl, [esi+4]
        test bl, bl
        jz draw_px_5
        mov [edi+4], bl
    draw_px_5:
        mov bl, [esi+5]
        test bl, bl
        jz draw_px_6
        mov [edi+5], bl
    draw_px_6:
        mov bl, [esi+6]
        test bl, bl
        jz draw_px_7
        mov [edi+6], bl
    draw_px_7:
        mov bl, [esi+7]
        test bl, bl
        jz end_8_run
        mov [edi+7], bl
    end_8_run:
        add esi, 8               // move src and dest up 8 pixels
        add edi, 8
        dec ecx                  // decrease 8-pixel run loop counter
        jnz draw_px_0            // if there are still more runs, draw them

    start_remainder_run:         // now draw remaining pixels ( <= 7 pixels )
        mov ecx, remainder       // ecx = counter of remaining pixels

    draw_pixel:
        mov bl, [esi]            // load pixel
        inc esi
        test bl, bl              // if zero, skip to next pixel
        jz end_pixel
        mov [edi], bl            // else, draw pixel
    end_pixel:
        inc edi
        dec ecx
        jnz draw_pixel           // loop while there's still pixels left

    end_line:
        add esi, src_y_inc       // move src and dest to start of next line
        add edi, dest_y_inc
        dec edx                  // decrease line loop counter
        jnz draw_line            // keep going if there's more lines to draw

    done:
    }
}

As you may have been able to imagine before even seeing this code, these are implemented in very much the same general way as the non-transparent blits. Instead of rep movsd, we have unrolled loops that must check each and every pixel for transparency before drawing. The same goes for rep movsb, except that the loops that replace it aren't unrolled.

After initially writing these assembly sprite blit routines, I started to look for ways that I might make the unrolled loop section faster. I began by trying to reduce the number of memory reads by reading 16 bits instead of 8 bits at a time. Then I would only need to read from memory for every other pixel and could access the two pixels by using bl and bh.

draw_px_0:  
    mov bx, [esi+0]          ; load two pixels at once
    test bl, bl
    jz draw_px_1             ; if this pixel is color 0, skip it
    mov [edi+0], bl          ; otherwise, draw it onto dest
draw_px_1:  
    test bh, bh              ; don't need to read, second pixel is already in bh
    jz draw_px_2
    mov [edi+1], bh

On my 486 this resulted in barely any noticeable difference. Maybe 1 FPS of an improvement for the slow scenario. I don't have a 386 to test with, but from looking at some instruction timing information, I'm guessing this should be an improvement on a 386 processor. mov reg, mem takes 4 clock cycles on a 386, versus 1 clock cycle on a 486, so by removing some of these operations we should save some time anyway.
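
The same idea can be sketched in C. This is illustrative only: sprite_draw_pair is a hypothetical name, and it assumes a little-endian CPU (which x86 is), so that the low byte of the 16-bit value is the first pixel, just like bl in the snippet above.

```c
#include <stdint.h>
#include <string.h>

typedef unsigned char byte;

/* Draw two sprite pixels using a single 16-bit read from the source.
   Assumes little-endian byte order: low byte = first pixel. */
void sprite_draw_pair(byte *dest, const byte *src) {
    uint16_t pair;
    byte lo, hi;

    memcpy(&pair, src, sizeof pair);   /* one 16-bit load, like mov bx, [esi] */
    lo = (byte)(pair & 0xFF);          /* first pixel  (bl) */
    hi = (byte)(pair >> 8);            /* second pixel (bh) */

    if (lo) dest[0] = lo;              /* skip color 0 */
    if (hi) dest[1] = hi;
}
```

Whether a compiler actually turns this into one 16-bit load is up to its optimizer, which is part of why writing the assembly by hand is attractive here.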

The other thing I was curious about was using bswap. This instruction was added starting with 486 processors, so if I wanted to support 386's I couldn't use it anyway, but even still, I just wanted to try it. What bswap does is reverse the byte order of a 32-bit register. This can be used as a means to access the upper 16 bits of a 32-bit register, which you otherwise wouldn't be able to do independently since the x86 architecture doesn't provide any 8/16-bit registers for the upper half. With this in mind I figured I would be able to do something like:

draw_px_0:  
    mov ebx, [esi+0]         ; load 4 src pixels
    test bl, bl
    jz draw_px_1             ; if it is color 0, skip it
    mov [edi+0], bl          ; otherwise, draw it onto dest
draw_px_1:  
    test bh, bh              ; don't need to read, second pixel is already in bh
    jz draw_px_2
    mov [edi+1], bh
draw_px_2:  
    bswap ebx                ; swap bytes. bh now has this pixel, bl is the next
    test bh, bh
    jz draw_px_3
    mov [edi+2], bh
draw_px_3:  
    test bl, bl
    jz draw_px_4
    mov [edi+3], bl

However, to my surprise, this again made no noticeable difference. It doesn't really matter much anyway, since I couldn't use this on a 386. Using shr as an alternative way to get at the upper 16 bits is no good either, as it's too expensive for something like this. shr reg, imm is 2 clock cycles on a 486 and 3 clock cycles on a 386, whereas bswap runs in only 1 cycle.
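
If the byte shuffling is hard to picture, here is a tiny C model of it. The helper names are made up for this sketch. It treats a 32-bit value as if it were sitting in ebx, with low_byte standing in for bl and high_byte for bh.

```c
#include <stdint.h>

/* Reverse the four bytes of a 32-bit value, as the bswap instruction does. */
uint32_t bswap32(uint32_t v) {
    return (v >> 24)
         | ((v >> 8) & 0x0000FF00u)
         | ((v << 8) & 0x00FF0000u)
         | (v << 24);
}

/* The byte "bl" would hold if v were in ebx (bits 0-7). */
uint8_t low_byte(uint32_t v) {
    return (uint8_t)(v & 0xFFu);
}

/* The byte "bh" would hold if v were in ebx (bits 8-15). */
uint8_t high_byte(uint32_t v) {
    return (uint8_t)((v >> 8) & 0xFFu);
}
```

Loading four pixels 0x11, 0x22, 0x33, 0x44 from memory on a little-endian machine gives 0x44332211, so pixels 0 and 1 are in the low and high byte. After the swap the value is 0x11223344, which puts pixel 2 in the high byte and pixel 3 in the low byte. That is why the snippet above tests bh before bl once it has done the bswap.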

There might be some other improvements that can be made here, but nothing came to mind so I figured I'd move on.

Looking back at my draw_map() function, I figured why not call the direct_blit_xxxx and direct_blit_sprite_xxxx assembly functions directly? We can't do that for the tiles around the edges of the screen that need to be clipped, but we should absolutely be able to do this for the inner 90% of the tiles that are drawn. As an added benefit, since we're using 16x16 tiles, we know that we can always just call direct_blit_4 and direct_blit_sprite_8. All we need to do is manage all the source and destination memory parameters ourselves directly instead of x and y coordinates.

Probably won't be a large boost, but we'll see.

// now actually draw the map
index = (y_tile * map->width) + x_tile;  
tiledata = &map->tiledata[index * 2];  
tileset = map_tiles->tiles;

yp = -y_offs;  
xp = -x_offs;

pdest = surface_pointer(backbuffer, xp, yp);

for (y = 0; y < SCREEN_Y_TILES; ++y) {  
    for (x = 0; x < SCREEN_X_TILES; ++x) {
        tile1 = tiledata[0];
        tile2 = tiledata[1];

        if (x > 0 &&
            y > 0 &&
            x < (SCREEN_X_TILES - 1) &&
            y < (SCREEN_Y_TILES - 2)) {
            direct_blit_4(TILE_WIDTH / 4,
                          TILE_HEIGHT,
                          pdest,
                          tileset[tile1]->pixels,
                          320 - TILE_WIDTH,
                          0);
            if (tile2)
                direct_blit_sprite_8(TILE_WIDTH / 8,
                                     TILE_HEIGHT,
                                     pdest,
                                     tileset[tile2]->pixels,
                                     320 - TILE_WIDTH,
                                     0);
        } else {
            surface_blit(tileset[tile1], backbuffer, xp, yp);
            if (tile2)
                surface_blit_sprite(tileset[tile2], backbuffer, xp, yp);
        }

        tiledata += 2;
        xp += TILE_WIDTH;
        pdest += TILE_WIDTH;
    }

    tiledata += (map->width - SCREEN_X_TILES) * 2;
    yp += TILE_HEIGHT;
    pdest += (TILE_HEIGHT * 320) - (SCREEN_X_TILES * TILE_WIDTH);
    xp = -x_offs;
}

Obviously no need to worry about the multiplications and divisions you now see in the above code. They're all based entirely on constants (except for one * 2 which will become a shl anyway by the optimizer).
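
As a sanity check on the pointer bookkeeping, the sketch below computes a tile's destination offset two ways: directly from its tile coordinates, and by stepping the way the loop above steps pdest. Both function names are hypothetical, the 320 and 16x16 values come from the code above, and the number of tiles per row is passed in as a parameter because the real value of SCREEN_X_TILES isn't shown here.

```c
enum { SCREEN_WIDTH = 320, TILE_WIDTH = 16, TILE_HEIGHT = 16 };

/* Byte offset (from the start of the backbuffer) of the top-left pixel of
   the tile at column tx, row ty, with the map scrolled by x_offs/y_offs. */
long tile_dest_offset(int tx, int ty, int x_offs, int y_offs) {
    long xp = (long)tx * TILE_WIDTH - x_offs;
    long yp = (long)ty * TILE_HEIGHT - y_offs;
    return yp * SCREEN_WIDTH + xp;
}

/* The same offset, reached incrementally: += TILE_WIDTH per tile, plus a
   jump to the next row of tiles at the end of each row. */
long tile_dest_offset_stepped(int tx, int ty, int x_offs, int y_offs,
                              int screen_x_tiles) {
    long offset = (long)(-y_offs) * SCREEN_WIDTH - x_offs;
    int x, y;

    for (y = 0; y <= ty; ++y) {
        for (x = 0; x < screen_x_tiles; ++x) {
            if (x == tx && y == ty)
                return offset;
            offset += TILE_WIDTH;
        }
        offset += (long)TILE_HEIGHT * SCREEN_WIDTH
                - (long)screen_x_tiles * TILE_WIDTH;
    }
    return offset; /* only reached if tx is not a valid column */
}
```

The starting offset is negative whenever the map is scrolled part-way into a tile. That is harmless here only because the direct calls are never made for the edge tiles.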

At this point, the draw_map() function implementation is starting to look kind of messy to me. However, the improvement this brings is small but noticeable. Up to 159/160 FPS in the fast scenario and 105 FPS in the slow scenario.

The last thing I wanted to look at was if we could help mitigate the performance drop that occurs when the screen is scrolled to some position that results in us drawing all of the visible map tiles at unaligned memory addresses. In 32-bit protected mode (as I am using), we want to be aligned to a 4-byte boundary. So that means that, right now, beginning a blit at three out of every four x coordinate values across the entire width of the screen will make the blit unaligned.
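
The "three out of every four" figure is easy to check with a few lines of C. This is just a counting sketch with made-up helper names, and it assumes the row itself starts on a 4-byte boundary.

```c
#include <stdint.h>

/* Non-zero when the address sits on a 4-byte boundary. */
int is_dword_aligned(uintptr_t address) {
    return (address & 3u) == 0;
}

/* Count how many starting x positions on a row of `width` pixels would
   begin a blit at an unaligned address, given the row's base address. */
int count_unaligned_starts(uintptr_t row_base, int width) {
    int x, count = 0;
    for (x = 0; x < width; ++x) {
        if (!is_dword_aligned(row_base + (uintptr_t)x))
            ++count;
    }
    return count;
}
```

For a 320-pixel-wide row with an aligned base this reports 240 unaligned starting positions, leaving only 80 aligned ones.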

How big a deal is this? Well, if I adjust the x coordinate of each of our two scenarios by one pixel (in either direction), we end up getting 135 FPS in the fast scenario and 94 FPS in the slow scenario. So, it's a big enough deal that we should at least see if we can do something to lessen the blow. This is a pixel-by-pixel scrolling engine after all, so these unaligned offsets will occur frequently.

I was not really too optimistic that I would be able to improve things with my current knowledge/experience. Admittedly, I've never really written code that tries to deal with memory alignment. As I understand it, the main slowdown is on the writes and, indeed, this is where our unaligned memory accesses will come from.

We need not worry about our source tile graphics in this case. The tiles in a typical tileset (including the one we're using here) will all have dimensions that are a multiple of 4, like 16x16 or 32x32, and will either each live in their own allocated block of memory or all be arranged in a grid on some larger allocated block of memory (in which case accessing an arbitrary tile in the grid still results in a 4-byte aligned address). For memory allocations, malloc() is likely ensuring that the allocation is aligned, so no worries there. It's important to note that libDGL's blit routines allow using an arbitrary region as the blit source, and this region could be located at an unaligned address. I suspect that this wouldn't be a common use case, so I chose to ignore it.

So after thinking about it a bit I figured that I would start with trying to optimize surface_blit_region_f first and ignore sprite/transparent blits for now (unsure what I can really do there to be honest).

I decided that I would try calculating the number of bytes at the start of each line in the blit that are before the next 4-byte boundary. Then I could do a rep movsb up to this next boundary. Following this, I could proceed with the remainder of the line like normal (that is, with a single rep movsd or a combo of rep movsd and rep movsb as appropriate based on our existing method).
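
Here is a small sketch of that calculation on its own. The struct and function names are hypothetical. It splits one line of a blit into the bytes before the next 4-byte boundary, the whole dwords after that, and whatever is left at the end. The clamp for very narrow lines is an addition for this sketch so that it stands alone.

```c
#include <stdint.h>

/* How one line of a blit gets divided up (hypothetical helper type). */
struct line_split {
    int head;    /* bytes before the next 4-byte boundary (0..3) */
    int dwords;  /* whole dwords that can then be copied aligned */
    int tail;    /* leftover bytes after the dwords (0..3)       */
};

struct line_split split_line(uintptr_t dest_address, int width) {
    struct line_split s;

    s.head = (int)((4 - (dest_address & 3)) & 3);
    if (s.head > width)
        s.head = width;          /* line ends before the boundary */

    s.dwords = (width - s.head) / 4;
    s.tail   = (width - s.head) & 3;
    return s;
}
```

For a 16-pixel-wide line starting one byte past a boundary this gives 3 head bytes, 3 dwords and 1 tail byte, so a single rep movsd of 4 dwords turns into three separate rep instructions.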

However, I suspected this would be slower than just not dealing with memory alignment at all. rep movsb takes 12+3n clock cycles on a 486, and we're talking about adding an extra one in some cases. But anyway, I went ahead with it just to see. Here are the modifications I made to surface_blit_region_f:

// ...

psrc = (const byte*)surface_pointer(src, src_x, src_y);  
pdest = (byte*)surface_pointer(dest, dest_x, dest_y);  
lines = src_height;  
bytes_from_boundary = (4 - ((unsigned int)pdest & 3)) & 3;

if (bytes_from_boundary && src_width > 3) {  
    aligned_width = src_width - bytes_from_boundary;

    width_4 = aligned_width / 4;
    width_remainder = aligned_width & 3;

    if (width_4 && !width_remainder) {
        // aligned_width is a multiple of 4 (no remainder)
        direct_blit_u4(bytes_from_boundary, width_4, lines, pdest, psrc, dest_y_inc, src_y_inc);

    } else if (width_4 && width_remainder) {
        // aligned_width is >= 4 and there is a remainder ( <= 3 )
        direct_blit_u4r(bytes_from_boundary, width_4, lines, pdest, psrc, dest_y_inc, src_y_inc, width_remainder);

    } else {
        // aligned_width is <= 3, just take the lazy way out and ignore the fact that
        // this is unaligned
        direct_blit_r(bytes_from_boundary + width_remainder, lines, pdest, psrc, dest_y_inc, src_y_inc);
    }

} else {
    // ...
    // previous code to handle the 3 blit scenarios
    // ...
}

// ...

Two new functions were added, direct_blit_u4 and direct_blit_u4r:

void direct_blit_u4(int unaligned_width,  
                    int width4,
                    int lines,
                    byte *dest,
                    const byte *src,
                    int dest_y_inc,
                    int src_y_inc) {
    _asm {
        mov edi, ecx             // dest pixels
        mov esi, src             // source pixels

        // eax = unaligned width
        // edx = number of 4-pixel runs (dwords)
        // ebx = line loop counter

        test ebx, ebx            // make sure there is >0 lines to draw
        jz done

    draw_line:
        mov ecx, eax             // draw initial unaligned pixels ( <= 3 )
        rep movsb
        mov ecx, edx             // draw all 4-pixel runs (dwords)
        rep movsd

        add esi, src_y_inc       // move to next line
        add edi, dest_y_inc
        dec ebx                  // decrease line loop counter
        jnz draw_line            // keep going if there's more lines to draw

    done:
    }
}

void direct_blit_u4r(int unaligned_width,  
                     int width4,
                     int lines,
                     byte *dest,
                     const byte *src,
                     int dest_y_inc,
                     int src_y_inc,
                     int remainder) {
    _asm {
        mov edi, ecx             // dest pixels
        mov esi, src             // source pixels

        // eax = unaligned width
        // edx = number of 4-pixel runs (dwords)
        // ebx = line loop counter

        test ebx, ebx            // make sure there is >0 lines to draw
        jz done

    draw_line:
        mov ecx, eax             // draw initial unaligned pixels ( <= 3 )
        rep movsb
        mov ecx, edx             // draw all 4-pixel runs (dwords)
        rep movsd
        mov ecx, remainder       // draw remaining pixels ( <= 3 bytes )
        rep movsb

        add esi, src_y_inc       // move to next line
        add edi, dest_y_inc
        dec ebx                  // decrease line loop counter
        jnz draw_line            // keep going if there's more lines to draw

    done:
    }
}

In order to test this out in the draw_map() function, I decided that for now it would be simpler to revert back to just calling the normal surface_blit_xxxx functions. That way I don't need to clutter up that code even more than it already is with calculations to determine what direct_blit_xxxx function needs to be called based on the x coordinate. It'll run a little bit slower this way, but it doesn't matter, I just want to see if it's faster or not.

And as expected, it ended up being slower! By about 20 FPS in the fast scenario and 10/11 FPS in the slow scenario. I would guess that this would get better if I was doing larger blits, but probably with the tile size I am using the cost of an extra rep movsb is just not worth it.
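
A very rough back-of-the-envelope estimate hints at why. The sketch below takes the 12+3n figure quoted earlier at face value and assumes, purely for illustration, that the same formula applies to rep movsd. It deliberately ignores any penalty for misaligned accesses, which is the very thing the split is trying to avoid, so all it shows is how much fixed overhead the split would have to win back per line. The function names are made up.

```c
/* Estimated cycles for one rep movs of n elements on a 486, using the
   12+3n figure quoted in the text (assumed here for movsd as well). */
int rep_movs_cycles(int n) {
    return 12 + 3 * n;
}

/* One line copied with a single rep movsd. */
int line_cycles_single(int dwords) {
    return rep_movs_cycles(dwords);
}

/* One line copied as head bytes + aligned dwords + tail bytes. */
int line_cycles_split(int head, int dwords, int tail) {
    int total = 0;
    if (head)   total += rep_movs_cycles(head);
    if (dwords) total += rep_movs_cycles(dwords);
    if (tail)   total += rep_movs_cycles(tail);
    return total;
}
```

Under those assumptions a 16-pixel line costs 24 cycles as a single rep movsd of 4 dwords, versus 57 cycles when split into 3 bytes + 3 dwords + 1 byte. The alignment savings would need to be worth more than 33 cycles per line just to break even, which seems like a lot to ask of a 16-pixel-wide tile.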

Ultimately what I learned most from this experience is that I need to do more reading up on the subject, heh. I would not be surprised at all if I was missing something obvious here.

On the whole, this entire optimization exercise was useful and even though trying to address memory alignment didn't produce any results, I was able to improve overall performance. It would be interesting to compare results on a 386 at some point too, but that will have to wait.