Fixed Point Math

October 11, 2020 —

So I had actually written a bunch of fixed-point math and support stuff for myself a couple years ago and had figured that I was all done with it. There are many articles and books out there written on the topic, so I do feel like it is very well covered by this point. But I was working on something a couple months ago that highlighted a bit of a gap in my fixed-point math code, and I noticed that none of the books that I personally own actually discussed this particular issue, nor did any of the articles on the topic that I had bookmarked. Maybe the authors thought that it was "obvious" and didn't need explanation? I dunno.

At any rate, encountering that problem made me decide that I would just go ahead and write an article on fixed-point math with a specific focus on retro-programming (because of course I would do this) and include the solution for the specific issue I had found. Maybe someone else will find this useful.

Floating-Point and x86

Normally in your code, when you want to work with numbers with decimals, such as 3.14159, you would use a floating-point type. For example, in C/C++, this would be float or double. These types use the IEEE-754 representation for holding numbers. Using floating-point data types is of course very common today. You don't even think twice about it when you need to work with non-integer numbers. You just use these types and go on your merry way.

Way back when, CPUs didn't have built-in hardware support for floating-point data types. For example, to get hardware support on x86 CPUs, you needed to have a separate math co-processor, like the Intel 80x87. Having one of these installed would mean that you could use your programming language's built-in floating-point data types, and they would execute floating-point instructions using the math co-processor, which was significantly faster than the alternative of using an emulation library. Many compilers made switching between use of an emulation library or using native 80x87 instructions easy (usually just a compiler switch or something similar... no code changes needed). Of course, if you didn't have a math co-processor installed or you were distributing programs to users who might not have a math co-processor, you had no choice but to rely on some sort of emulation.

The 486 was the first Intel x86 CPU with the math co-processor functionality built in. Well, unless you had an SX, heh. Ok, well at least from the Pentium onwards, it was all built in without any Intel-manually-disabling-the-built-in-support-purely-for-money-grubbing-purposes nonsense. Before the 486/Pentium were commonplace, as a software developer, you really couldn't just assume that everyone running your code had a math co-processor available.

As an aside, even today you could still come across modern hardware, such as embedded systems, with no hardware floating-point support. And even once Pentium processors were commonplace in desktop computers, you still had video game consoles such as the original PlayStation, which had no hardware floating-point support, despite the fact that games for that console heavily featured impressive (for the time) 3D graphics. Another somewhat later example is the original Nintendo DS.

Enter Fixed-Point

Fixed-point is an alternative to using floating-point that is useful when you need to write code that will run on machines where either no math co-processor will be available, or even if it is, it would be too slow to run the number of floating-point calculations that are going to be needed. Fixed-point will not give you exactly the same levels of precision as floating-point would, but for many purposes, it is "good enough."

Fixed-point works by representing floating-point numbers using otherwise normal integer data types (e.g. in C/C++, using an int or long) and arbitrarily dividing the bits into whole number (integer) and decimal (fractional) parts.

For example, if we imagine the bits in a 32-bit integer:

0000000000101101 0010100011110101
  whole number       decimal

The above shows the 16:16 fixed-point representation of the number 45.16. The high word (16 bits) represents the whole number portion (45), while the lower word represents the decimal (0.16). As you can see, with this "16:16" representation we've decided to place an imaginary decimal point between the high and low 16 bits of a 32-bit integer. This means that the decimal point doesn't move around as it can with real floating-point. Instead, it is fixed in place.

Nothing stops you from placing the decimal point elsewhere in your own fixed-point math routines. You might prefer to use, say, 24 bits for the whole number portion and only 8 bits for the decimal. It depends on how much precision you need. You could even define multiple different fixed-point types which specify different levels of precision. But the point is that with fixed-point, we're essentially defining a convention for ourselves on where the decimal point will be.

Of course, in order to use some other custom decimal placement, you would need to customize your fixed-point math routines from what I will show in this article.

This 16:16 representation obviously imposes some limits on the values we can hold in such a fixed-point data type. With 16 bits for the whole number (integer) portion, we can only hold values of roughly +/- 32767. And with 16 bits for the decimal (fractional) portion, we have a smallest representable step of about 0.000015 (that is, 1/65536). Note that we're going to be using signed values (because our programming language's normal floating-point data types all allow us to use signed values, and so we want to do the same because that's the whole point of this ... to get a fast equivalent to floating-point, right?). Because we're using 32-bit integers, the sign bit is going to live in the upper 16 bits.

Now, before we go any further ... why exactly is it better to use fixed-point on these older x86 CPUs?

As mentioned previously, if your code is to run on a system that does not have 80x87 instruction support at all, either via a math co-processor or by having built-in support (as in 486DX or later x86 CPUs), then you need to rely on some sort of emulation library that your compiler probably provides for you which is definitely going to run slower than the equivalent hardware support would.

But, just because your 286/386/486-equipped computer has 80x87 support, it doesn't mean that using those instructions is necessarily the best idea if you absolutely need raw speed. Until the Pentium rolled in, 80x87 instructions could be somewhat slow even when the hardware was present. Consider the following instruction timing listings (numbers shown are the number of cycles needed):

80x87 Instruction Timings

Operation    387      486      Pentium
fadd         23-34    8-20     3/1*
fsub         26-37    5-17     3/1*
fmul         29-57    16       3/1*
fdiv         88-91    73       39

* = latency/throughput

x86 Instruction Timings (32-bit register operands)

Operation    386      486      Pentium
add          2        1        1
sub          2        1        1
imul         9-38     12-42    10
mul          9-38     12-42    10
idiv         43       43       46
div          38       40       41

Since fixed-point uses normal integer data types, the fixed-point math we will be using just uses x86 instructions like add, sub, imul, idiv, etc. That is where we will gain speed over using 80x87 instructions or (especially) using emulation libraries.

Note that I've omitted 286 or earlier timings from the above tables. Obviously things get even slower the further back you go. I will not be discussing fixed-point math in detail for x86 CPUs prior to the 386 in this article. On those CPUs you lack 32-bit registers (as well as some of the specific assembly instructions I will be using here), which means that you end up either having to settle for something like 8:8 fixed-point, or to use two 16-bit registers in order to provide support for something like 16:16 fixed-point. This obviously complicates the fixed-point math code, but it is certainly doable.

A Fixed-Point Data Type

Let's start by defining our fixed-point data type. For this article, I'm going to go with a 16:16 fixed-point data type, because that's what I use in my own projects that use fixed-point. Obviously it makes sense to use a 32-bit integer:

typedef int fixed;       // if you're using a 32-bit compiler anyway. 
                         // maybe use int32_t if you have stdint.h

There's our 32-bit fixed-point data type, fixed. Simple. Note that we want to be able to use signed values, as that is usually very useful in applications using floating-point calculations, especially so in games.

Of course, if you were using a 16-bit compiler like Borland Turbo C++, or one of the old DOS Microsoft C/C++ compilers, and you wanted to have a 16:16 fixed-point data type, then you would need to define your fixed type like so:

typedef long fixed;

I use Watcom C/C++ myself, so the rest of this article is going to be oriented towards that compiler.

Converting To And From Fixed-Point

So, obviously, we cannot just do something like this to assign a value to our fixed-point type:

fixed f = 3.14159;

Because fixed is actually an int, you won't get the right value, as the decimal would be dropped. Plus, the integer value that your compiler would assign here would be "wrong" as far as our fixed-point representation is concerned: the bits for the integer 3 would be located in the lower 16 bits, which in our 16:16 representation we've designated as the decimal (fractional) portion.

What we need to do instead is shift the value so the whole number and decimal bits get stored in the right places. This needs to be done differently when assigning integer values to a fixed-point variable than it does when assigning actual floating-point values to a fixed-point variable. This is because you can actually bit-shift an integer value to move the bits to the right spot. But you cannot (meaningfully, at least) bit-shift a floating-point value due to its IEEE representation. Instead, for floating-point values, we just multiply them by the equivalent of the integer bit-shift, and then cast the resulting value to our 32-bit integer type.

Let's start with integer to fixed-point conversions first:

fixed a = 42 << 16;                      // a is now "42.0f" in fixed-point representation

We shift 42 by 16 bits to the left, to place the integer value 42 in the upper 16 bits of the fixed-point value because we've designated the upper 16 bits as the whole number (integer) portion of our fixed-point representation. There is no decimal portion to assign here obviously, so this is all that's needed. The decimal portion that we will be left with here will just be zero, thus the resulting fixed-point equivalent value 42.0f.

The bits of our fixed-point variable a will look like this:

0000000000101010 0000000000000000

Now for floating-point to fixed-point conversions:

fixed b = (fixed)(3.14159f * 65536.0f);  // b is now "3.14159f" in fixed-point representation

Here we multiply 3.14159f by 65536.0f (65536 = 2^16, in other words, the equivalent multiplication to the bit-shift we used for the integer previously) and cast the result to our fixed type (which is just a 32-bit integer). Casting a floating-point value in this way would normally mean dropping the decimal portion, but since we've already shifted the decimal portion up with the multiplication, we don't actually lose it during the cast.

The bits of our fixed-point variable b will look like this:

0000000000000011 0010010000111111

To convert fixed-point numbers back to either integer or floating-point values, we just do the opposite. Here is how you could do a fixed-point to integer conversion:

int i = a >> 16;                     // i is now "42" (converted from "42.0f" fixed-point)

Looking at the previously shown bits for the fixed-point variable a that should be pretty self-explanatory. In this type of conversion, obviously any decimal portion (if there was any) is discarded.

As a little something extra, if we wanted, we could do a fixed-point to integer conversion that rounds the decimal portion to the nearest whole number before doing the right-shift to turn it back into a normal integer. We could do this by adding the equivalent of 0.5f in fixed-point, which for our 16:16 format is 0x8000.

int i = (a + 0x8000) >> 16;          // if a was "3.5f" fixed-point, i would now be 4
                                     // however, if a was "3.4f" fixed-point, i would now be 3

Now for fixed-point to floating-point conversions. Again, we cannot just do a right-shift. The fixed-point value we have is represented as an integer and to turn it into a real floating-point value we need to get it back into that IEEE representation.

float f = (float)(b / 65536.0f);     // f is now "3.14159f" (converted from "3.14159f" fixed-point)

It is obviously useful to define some helper functions or macros for doing all of these conversions. These are the macros that I use:

#define FTOFIX(f)      ((fixed)((f) * 65536.0f))
#define ITOFIX(i)      ((fixed)((i) << 16))
#define FIXTOF(x)      ((float)((x) / 65536.0f))
#define FIXTOI(x)      ((int)((x) >> 16))

Doing Fixed-Point Math

Addition and Subtraction

Luckily, addition and subtraction can be done as normal. Add and subtract away!

fixed a, b, c;

a = FTOFIX(1.2f);
b = FTOFIX(7.7f);

c = a + b;               // c will be "8.9f" fixed-point
c = a - b;               // c will be "-6.5f" fixed-point

One Little Gotcha

One very important consideration when doing fixed-point math is that you need to make sure that you are not unintentionally mixing fixed-point and non-fixed-point values together within the same calculations.

Convert as necessary to make sure you're not mixing them together! If you want a fixed-point result, make sure all the values being used in the calculation are fixed-point. If you want an integer result, make sure any fixed-point values used in the calculation are converted to integers. Same thing goes if you want a floating-point result.

Here's an example to illustrate this point.

int a;
fixed b, c;

a = 10;
b = ITOFIX(20);      // b = "20.0f" in fixed-point

c = b + a;

What is the result left in c? It's not going to be fixed-point 30.0f. It's actually going to be the fixed-point equivalent of 20.000153f.

Why is that? Because b is 20.0f in fixed-point. Meaning that the whole number portion 20 is shifted to the left so that it's in the high 16 bits of b. Then we add a to it. But a is just the plain integer value 10. Because it is just a normal integer value, that means its bits are located in the lower 16 bits of that variable. Adding a and b together has the probably-unintended consequence of treating a as if it were a really, really small fixed-point value (equivalent to a really small fraction in floating-point). When the addition is performed, the value 10 ends up in the resulting fixed-point value c's lower 16 bits, which is where the decimal portion of the fixed-point value lives in our 16:16 representation, while the 20 will be left in c's higher 16-bits.

To hopefully more clearly illustrate, let's look at the bits for a and b after their initial values are assigned:

a = 0000000000000000 0000000000001010
b = 0000000000010100 0000000000000000

As you can probably imagine after seeing this, after the addition is performed the bits in c will look like this:

c = 0000000000010100 0000000000001010

Of course, there are a couple of ways to fix this. If a really did need to be defined as a normal int, and we really did want to add that integer value to the fixed-point value b, we would need to rewrite the addition so that all of the values are being added as fixed-point values, by just doing an integer-to-fixed-point conversion on the spot for a:

c = b + ITOFIX(a);

Which would give us the value we expected, 30.0f in fixed-point.

Multiplication and Division

Multiplication and division can be done using normal means, but some extra tricks are needed.


Let's start with a simple multiplication and see what we're dealing with.

fixed a, b, c;

a = ITOFIX(2);
b = ITOFIX(4);
c = a * b;         // will this be "8.0f" in fixed-point?

What ends up in c? Zero! What!? Let's take a look at the bits of our fixed-point values a and b.

a = 0000000000000010 0000000000000000
b = 0000000000000100 0000000000000000

These two values should make sense to you now. The actual underlying 32-bit integer values being multiplied here are 131072 and 262144, the result of which is 34359738368, which is huge. Too big for a 32-bit integer to hold. Our C compiler is going to compile this multiplication using an imul instruction which places the resulting 64-bit value in two registers, the high double-word in edx and the low double-word in eax. 34359738368 shown as 64-bits looks like this:

0000000000000000 0000000000001000 0000000000000000 0000000000000000
|---- high double-word (edx) ---| |---- low double-word (eax) ----|

The compiler just ended up returning the low double-word value (which is 0) because our code only specified a 32-bit variable to hold the result.

But hold on, why is fixed-point 2.0 multiplied by fixed-point 4.0 such a gigantic result? It's because in order to represent our fixed-point values, we are effectively multiplying them individually by 2^16 = 65536 (left-shift of 16) before using them in the multiplication. When we then multiply them together, this "scaling" of each of the numbers gets factored into the result. The result we got, 34359738368, is actually 8 * 2^32 (essentially this result is in 32:32 fixed-point representation). But we expected to get 8 * 2^16 (16:16 representation) for our fixed-point result.

So you can probably see what the fix here is. We just need to shift the resulting 64-bit value right by the same amount as what our fixed-point shift amount is, 16 in our case for our 16:16 representation. But we have a little wrinkle in our plan still. The result was a 64-bit value spread across two 32-bit registers.

Let's test a little multiplication tweak using our fixed-point values from above and see what we can do:

mov eax, 131072          ; "2.0f" fixed-point represented as a 32-bit int
mov ebx, 262144          ; "4.0f" fixed-point represented as a 32-bit int
imul ebx
shrd eax, edx, 16        ; right-shift the entire 64-bit value edx:eax back by 16 bits into eax

After the shrd instruction, we will have the following bits in eax:

0000000000001000 0000000000000000

Which is 524288, which is 8 * 2^16 ... which is the exact fixed-point value (8.0f) we expected to get from our multiplication. Phew!

Now that we know how to deal with this, we can turn this bit of assembly into a function. With Watcom C/C++:

fixed fix_mul(fixed a, fixed b);
#pragma aux fix_mul =      \
    "imul ebx"             \
    "shrd eax, edx, 16"    \
    parm [eax] [ebx]       \
    modify [eax ebx edx]   \
    value [eax];

// let's try this again ...

fixed a, b, c;

a = ITOFIX(2);
b = ITOFIX(4);
c = fix_mul(a, b);         // this will now be "8.0f" in fixed-point!

// negative numbers work too of course

a = FTOFIX(2.5f);
b = FTOFIX(-6.3f);
c = fix_mul(a, b);         // will be "-15.75f" in fixed-point


Division is similarly tricky, but let's walk through it so that we understand exactly what's going on.

Let's start with a simple division and see what we're dealing with.

fixed a, b, c;

a = FTOFIX(4.8f);
b = FTOFIX(2.4f);
c = a / b;         // will this be "2.0f" in fixed-point?

What ends up in c this time? 2! Yay, that was easy! Wait... no, that's wrong! We expected to get 2.0f in fixed-point, not 2 as a plain ol' integer!

The problem here is basically the reverse of what it was with multiplication. Those of you reading this with math degrees probably already expected this. But alas, I am a bit of a math dummy, so all these simple concepts I learnt back in school I had to re-learn again here while exploring fixed-point math.

Our two fixed-point values a and b have been pre-multiplied (left-shifted) in order to turn them into 16:16 fixed-point values. When dividing, this multiplication factor present in both of the values gets cancelled out, and so the divide basically un-shifts our fixed-point values back into integers and divides those integers all at once.

Probably the easiest fix here is to convert the dividend, a, into 32:32 fixed-point using a 64-bit value across edx:eax before the division. This is convenient, and not coincidentally so: when using idiv with a 32-bit divisor, as we are, it actually divides the 64-bit value in edx:eax anyway. The result will be a 32-bit value in eax.

Now, we cannot just left-shift our a value by 16 so that it sits across edx:eax because our fixed-point values are signed. We need to sign-extend our 32-bit value a across 64-bits before we do the left-shift that is necessary to convert our 16:16 fixed-point representation value in a so that it is in proper 32:32 format. And of course, that left-shift needs to occur across eax and into edx. The cdq instruction will do the sign-extension into 64-bits for us, while shld will handle the left-shift into edx from eax.

Let's test this out quickly with a bit of assembly:

mov eax, 0x4cccc      ; our dividend value "4.8f" fixed-point represented as a 32-bit int in hex
mov ebx, 0x26666      ; the divisor value "2.4f" fixed-point represented as a 32-bit int in hex
cdq                   ; sign extend eax into edx:eax
shld edx, eax, 16     ; shift edx left by 16, using bits from eax to fill in on the right of 
                      ; edx on each shift
shl eax, 16           ; shld leaves eax unmodified though. so shift eax left by 16 also to 
                      ; finish fixing edx:eax
idiv ebx

To understand this a bit better, lets look at the bits of our dividend value 4.8f fixed-point as we're transforming it here before the idiv is finally executed.

After the value is initially set, eax has:

eax = 0000000000000100 1100110011001100

After the cdq is executed, we end up with the following 64-bit value across edx:eax:

edx = 0000000000000000 0000000000000000
eax = 0000000000000100 1100110011001100

Because our dividend value in eax is positive, the sign bit value of 0 is extended all the way through edx. If the dividend value was negative, we would expect to see a bunch of 1s extended all the way through edx. Next we begin shifting our dividend value into 32:32 fixed-point representation with the shld:

edx = 0000000000000000 0000000000000100
eax = 0000000000000100 1100110011001100

shld leaves the second register operand, in this case eax, unaltered. So, we have our edx value shifted in from the high 16 bits of eax now. So edx is now all good for 32:32 fixed-point representation. The whole number portion 4 is there. Now we need to fix eax by shifting it left 16 as well using shl.

edx = 0000000000000000 0000000000000100
eax = 1100110011001100 0000000000000000

Great! Now edx has the correct high 32-bits representing the whole number portion of our original 16:16 fixed-point value, and eax has the correct low 32-bits representing the decimal portion. Sure, the left-shift on eax ended up leaving a bunch of extra zeros in the low 16-bits, but we didn't have any additional bits from our original value anyway, so this is fine and won't affect our division at all.

After the idiv executes, we get the following 32-bit result in eax:

eax = 0000000000000010 0000000000000000

Which is exactly what we want, 2.0f as 16:16 fixed-point representation! No more shifting or futzing about with values is needed. We can return this into our fixed result variable as-is.

Phew! Oh my! Now we can package this assembly up into a function too. For Watcom C/C++:

fixed fix_div(fixed a, fixed b);
#pragma aux fix_div =    \
    "cdq"                \
    "shld edx, eax, 16"  \
    "shl eax, 16"        \
    "idiv ebx"           \
    parm [eax] [ebx]     \
    modify [edx]         \
    value [eax];

// let's give it a try ...

fixed a, b, c;

a = FTOFIX(4.8f);
b = FTOFIX(2.4f);
c = fix_div(a, b);         // this will now be "2.0f" in fixed-point!

// and again, negative numbers work too

a = FTOFIX(16.0f);
b = FTOFIX(-2.0f);
c = fix_div(a, b);         // will be "-8.0f" in fixed-point

A Big Division Gotcha

These methods are the usual ones that many people have written about (better than me) over the past decades. I cannot take credit for any of this. However, there is at least one big gotcha with doing division this way with idiv that I've yet to see mentioned in any book or article that discusses fixed-point math.

Imagine for a moment that you are fiddling around with a ray caster just for fun, and also as a means to give your fixed-point math routines a good work out (as I was at some point). Also imagine that at some point as your ray caster test program is running, it ends up executing a division (as mine was). And finally, let us imagine that the values being divided were, say, 2.0f and 0.000061f fixed-point (as eventually happened to me).

fixed a, b, c;

a = FTOFIX(2.0f);
b = FTOFIX(0.000061f);
c = fix_div(a, b);            // will be "32786.8852f" in fixed-point .... right?

And then imagine that your code fails at the idiv operation run by fix_div with a "divide by zero" error.

First off, what the heck? We aren't dividing by zero here! Our divisor value 0.000061f fixed-point is greater than what we figured our smallest precision was (mentioned near the beginning of this article), so it's probably not anything like the value being rounded to zero or something like that. So why would it fail with a "divide by zero" error? Let's trace through our fix_div assembly routine with these inputs and see what's going on.

mov eax, 0x00020000      ; our dividend value "2.0f" fixed-point represented as a 
                         ; 32-bit int in hex
mov ebx, 0x00000003      ; the divisor value "0.000061f" fixed-point represented as a 
                         ; 32-bit int in hex
cdq                      ; sign extend eax into edx:eax
shld edx, eax, 16
shl eax, 16
idiv ebx

Now, let's take a look at our register values when execution reaches the idiv. ebx will of course just have the value 3 in it. But here is how our dividend value looks after it has been sign-extended and shifted into 64-bits across edx:eax:

edx = 0000000000000000 0000000000000010
eax = 0000000000000000 0000000000000000

How does this result in a "divide by zero" error? Let's just manually calculate the result of this division ourselves and see if we can spot the issue. We're dividing the 64-bit value 8589934592 by 3, which gives 2863311530. This result does fit into 32 bits, not a problem. So what is wrong? I'll freely admit that I didn't see the problem with this particular result at all and was quite confused until I finally decided to look at the bits for it:

2,863,311,530 = 1010101010101010 1010101010101010

Oooohh, pretty! It's a nice repeating pattern. Ok, yeah it's pretty I guess, but the real problem is that this result value has the most significant bit set to 1 and we're performing signed division with idiv. The result of this division cannot be represented in just 31 bits (32 bits minus the sign bit), so idiv throws a "divide by zero" error, which is actually just a "divide error", but most debuggers will present it as a "divide by zero" error, which is kinda confusing.

Ok, so ... fine ... idiv is working as advertised. No problem!

Except that this can be annoying when you don't have a good way to trap this or otherwise handle it in some way that doesn't result in your ray caster crashing, just because you were turning around in a full circle and these problematic values ended up in a division by chance (as happened to me). How to handle this properly is going to vary from application to application, but regardless of how you want to handle it, you will need a way to catch this error in the first place and return some kind of error code or value that you can detect.

Eventually I discovered that the fixed-point division routine in Allegro has a way of working around this and returning an error result if the division would overflow. After seeing how Allegro was doing it, I had an "oh, duh!" moment because it seemed so obvious in hindsight, heh. I will not pretend that I came up with this solution on my own, and am happy to give full credit to them!

The solution is to just perform the division with div instead of idiv, but handle the manipulation of the sign bits yourself, as well as checking for the overflow conditions yourself, and when they arise, return some sort of error code/value of your choosing. This is simple to implement, as we know from math class way back when that dividing a positive value by a positive value results in a positive value. Dividing a negative by a negative also results in a positive. But dividing a negative by a positive (or vice versa) results in a negative. So we can manipulate the sign bits ourselves based on this understanding. This does end up being less performant than our existing fix_div routine, which is unfortunate. If you know that you will never be doing divisions with values that will overflow like this, then you need not worry and can continue using the smaller, faster fix_div already shown! I'm not a math genius, so I do not really know off the top of my head if there is even a straight-forward way to just "know" this in advance. That would surely be nice though.

Let's test this solution out:

    mov eax, 0x00020000      ; our dividend value "2.0f" fixed-point represented as a 
                             ; 32-bit int in hex
    mov ebx, 0x00000003      ; the divisor value "0.000061f" fixed-point represented as a 
                             ; 32-bit int in hex

    xor ecx, ecx
    or eax, eax              ; check if the dividend is negative
    jns checkDivisorSign     ; move on to checking the divisor if the dividend is 
                             ; already positive
    neg eax                  ; dividend is negative. make it positive
    inc ecx                  ; increment counter, indicating that we switched the sign of 
                             ; one of the values
checkDivisorSign:
    or ebx, ebx              ; check if the divisor is negative
    jns divide               ; move on to performing the division if the divisor is 
                             ; already positive
    neg ebx                  ; divisor is negative. make it positive
    inc ecx                  ; increment counter, indicating that we switched the sign of 
                             ; one of the values
divide:
    xor edx, edx
    shld edx, eax, 16        ; convert dividend value to 32:32 fixed point. no need to 
                             ; sign extend this time
    shl eax, 16
    cmp edx, ebx             ; one last check for possible division result overflow
    jae error                ; if found, skip right to returning an error result
    div ebx                  ; divide! using unsigned division this time. result will 
                             ; be positive!
    or eax, eax              ; test division result's sign bit. if it is set, the division 
                             ; result is too big
    jns restoreSignBit       ; however, if the result's sign bit was not set, the result is 
                             ; fine. skip to return
error:
    mov eax, 0x7fffffff      ; this is our error code result. because why not?
    jmp done                 ; and skip to the end. we just want to return this error 
                             ; value as-is
restoreSignBit:
    cmp ecx, 1               ; if ecx=1, only one of the values was negative and we should 
                             ; negate the result
    jne done                 ; if ecx=0 or ecx=2, then leave the result positive (neither 
                             ; or both were negative)
    neg eax                  ; negate result, returning a negative value
done:
We check the sign bits of the dividend and divisor separately, and if either is negative, we flag it using ecx as a counter for how many negative values we found, and we also force both values to be positive going into the div.

There is an additional last-minute check before the div runs, but after the conversion of our dividend to 32:32 fixed-point representation: we check the high double-word of the dividend by itself against our divisor to see if it is larger or equal. If it is, the division result is always going to be too large to fit in a 32-bit result, and the div operation would end up throwing a "divide error", so we definitely want to catch this scenario and, if found, return an error value instead of letting the CPU crash our application (that's the whole reason that we're writing this monstrosity of a division routine after all!).

Immediately after the div is performed, we need to check the sign bit of the result. If it is set, then the division result was too big to fit in 31 bits plus a sign bit (remember, div is doing unsigned division, so its result could legitimately use all 32 bits), and we need to return our error result in this case too.

Finally, we need to check whether we need to manually set the sign bit on the division result. This is only needed if ecx is 1. The only way that ecx would be 1 is if exactly one of the values (either the dividend or divisor) was negative. And since a division involving one positive and one negative value results in a negative, we need to make sure our result is negative in this case (which it will never be unless we set this here).

Phew! For completeness-sake, here is the final Watcom C/C++ routine:

fixed fix_div_safe(fixed a, fixed b);
#pragma aux fix_div_safe =      \
    "    xor ecx, ecx"          \
    "    or eax, eax"           \
    "    jns checkDivisorSign"  \
    "    neg eax"               \
    "    inc ecx"               \
    "checkDivisorSign:"         \
    "    or ebx, ebx"           \
    "    jns divide"            \
    "    neg ebx"               \
    "    inc ecx"               \
    "divide:"                   \
    "    xor edx, edx"          \
    "    shld edx, eax, 16"     \
    "    shl eax, 16"           \
    "    cmp edx, ebx"          \
    "    jae error"             \
    "    div ebx"               \
    "    or eax, eax"           \
    "    jns restoreSignBit"    \
    "error:"                    \
    "    mov eax, 0x7fffffff"   \
    "    jmp done"              \
    "restoreSignBit:"           \
    "    cmp ecx, 1"            \
    "    jne done"              \
    "    neg eax"               \
    "done:"                     \
    parm [eax] [ebx]            \
    modify [ecx edx]            \
    value [eax];

// let's try it out!

fixed a, b, c;

a = FTOFIX(2.0f);
b = FTOFIX(0.000061f);
c = fix_div_safe(a, b);            // will be 0x7fffffff, signalling an error!

// now lets see how it handles with successful divisions ...

a = ITOFIX(8);
b = ITOFIX(2);
c = fix_div_safe(a, b);            // will be "4.0f" in fixed-point

a = ITOFIX(-8);
b = ITOFIX(2);
c = fix_div_safe(a, b);            // will be "-4.0f" in fixed-point

a = ITOFIX(8);
b = ITOFIX(-2);
c = fix_div_safe(a, b);            // will be "-4.0f" in fixed-point

a = ITOFIX(-8);
b = ITOFIX(-2);
c = fix_div_safe(a, b);            // will be "4.0f" in fixed-point


This ended up being much longer than I had originally imagined. And I still have some more things to say about fixed-point math for old retro-computers. There are useful trigonometry functions and other useful math functions, like a sqrt() function, that we will want fixed-point equivalents for. But that will be covered in a future post.