Fixed Point Math
So I had actually written a bunch of fixed-point math and support stuff for myself a couple years ago and had figured that I was all done with it. There are many articles and books out there written on the topic, so I do feel like it is very well covered by this point. But I was working on something a couple months ago that highlighted a bit of a gap in my fixed-point math code and noticed that none of the books that I personally owned actually discussed this particular issue, 'nor did any of the articles on the topic that I had bookmarked. Maybe the authors thought that it was "obvious" and didn't need explanation? I dunno.
At any rate, encountering that problem made me decide that I would just go ahead and write an article on fixed-point math with a specific focus on retro-programming (because of course I would do this) and include the solution for the specific issue I had found. Maybe someone else will find this useful.
Floating-Point and x86
Normally in your code, when you want to work with numbers with decimals, such as 3.14159, you would use a
floating-point type. For example, in C/C++, this would be float
or double
. These types use the IEEE-754
representation for holding numbers. Using floating-point data types is of course very common today. You don't even
think twice about it when you need to work with non-integer numbers. You just use these types and go on your merry way.
Way back when, CPUs didn't have built-in hardware support for floating-point data types. For example, to get hardware support on x86 CPUs, you needed to have a separate math co-processor, like the Intel 80x87. Having one of these installed would mean that you could use your programming language's built-in floating-point data types, and they would execute floating-point instructions using the math co-processor, which was significantly faster than the alternative of using an emulation library. Many compilers made switching between use of an emulation library or using native 80x87 instructions easy (usually just a compiler switch or something similar... no code changes needed). Of course, if you didn't have a math co-processor installed or you were distributing programs to users who might not have a math co-processor, you had no choice but to rely on some sort of emulation.
The 486 was the first Intel x86 CPU with the math co-processor functionality built in. Well, unless you had an SX, heh. Ok, well at least from the Pentium-onwards, it was all built-in without any Intel-manually-disabling-the-built-in-support-purely-for-money-grubbing-purposes nonsense. Before the 486/Pentium were commonplace, as a software developer, you really couldn't just assume that everyone running your code had a math co-processor available.
As an aside, even today you could still come across modern hardware, such as embedded systems, with no hardware floating-point support. Even back then, once Pentium processors were much more common place in desktop computers, you still had stuff like video game consoles such as the original Playstation, which had no hardware floating-point support, despite the fact that games for that console heavily featured impressive (for the time) 3D graphics. Another somewhat later example of this is the original Nintendo DS.
Enter Fixed-Point
Fixed-point is an alternative to using floating-point that is useful when you need to write code that will run on machines where either no math co-processor will be available, or even if it is, it would be too slow to run the number of floating-point calculations that are going to be needed. Fixed-point will not give you exactly the same levels of precision as floating-point would, but for many purposes, it is "good enough."
Fixed-point works by representing floating-point numbers using otherwise normal integer data types (e.g. in C/C++,
using an int
or long
) and arbitrarily dividing the bits into whole number (integer) and decimal (fractional) parts.
For example, if we imagine the bits in a 32-bit integer:
0000000000101101 0010100011110101
whole number decimal
The above shows the 16:16 fixed-point representation of the number 45.16
. The high word (16 bits) represents the whole
number portion (45
), while the lower word represents the decimal (0.16
). As you can see, with this "16:16"
representation we've decided to place an imaginary decimal point between the high and low 16 bits of a 32-bit integer.
This means that the decimal point doesn't move around as it can with real floating-point. Instead, it is fixed in place.
Nothing stops you from shifting the decimal either in your own fixed-point math routines. You might prefer to use say, 24 bits for the whole number portion and only 8 bits for the decimal. It depends on how much precision you need. You could even define multiple different fixed-point types which specify different levels of precision. But the point is that with fixed-point, we're essentially defining a convention for ourselves on where the decimal point will be.
Of course, in order to use some other custom decimal placement, you would need to customize your fixed-point math routines from what I will show in this article.
This 16:16 representation obviously imposes some limits on the maximum values we can hold in such a fixed-point data type. With 16 bits for the whole number (integer) portion, we can only hold values +/- 32767. And with 16 bits for the decimal (fractional) portion, we have a maximum precision of about 0.000015. Note that we're going to be using signed values (because our programming language's normal floating-point data types all allow us to use signed values, and so we want to do the same because that's the whole point of this ... to get a fast equivalent to floating-point, right?). Because we're using 32-bit integers, the sign bit is going to live in the upper 16-bits.
Now, before we go any further ... why exactly is it better to use fixed-point on these older x86 CPUs?
As mentioned previously, if your code is to run on a system that does not have 80x87 instruction support at all, either via a math co-processor or by having built-in support (as in 486DX or later x86 CPUs), then you need to rely on some sort of emulation library that your compiler probably provides for you which is definitely going to run slower than the equivalent hardware support would.
But, just because your 286/386/486-equipped computer has 80x87 support, it doesn't mean that using those instructions is necessarily the best idea if you absolutely need raw speed. Until the Pentium rolled in, 80x87 instructions could be somewhat slow even when the hardware was present. Consider the following instruction timing listings (numbers shown are the number of cycles needed):
80x87 Instruction Timings | |||
---|---|---|---|
Operation | 387 | 486 | Pentium |
fadd | 23-34 | 8-20 | 3/1* |
fsub | 26-37 | 5-17 | 3/1* |
fmul | 29-57 | 16 | 3/1* |
fdiv | 88-91 | 73 | 39 |
* = latency/throughput
x86 Instruction Timings (32-bit register operands) | |||
---|---|---|---|
Operation | 386 | 486 | Pentium |
add | 2 | 1 | 1 |
sub | 2 | 1 | 1 |
imul | 9-38 | 12-42 | 10 |
mul | 9-38 | 12-42 | 10 |
idiv | 43 | 43 | 46 |
div | 38 | 40 | 41 |
Since fixed-point uses normal integer data types, the fixed-point math we will be using just uses x86 instructions like
add
, sub
, imul
, idiv
, etc. That is where we will gain speed over using 80x87 instructions or (especially)
using emulation libraries.
Note that I've omitted 286 or earlier timings from the above tables. Obviously things get even slower the further back you go. I will not be discussing fixed-point math in detail for x86 CPUs prior to the 386 in this article. On those CPUs you lack 32-bit registers (as well as some of the specific assembly instructions I will be using here), which means that you end up either having to settle for something like 8:8 fixed-point, or to use two 16-bit registers in order to provide support for something like 16:16 fixed-point. This obviously complicates the fixed-point math code, but it is certainly doable.
A Fixed-Point Data Type
Let's start by defining our fixed-point data type. For this article, I'm going to go with a 16:16 fixed-point data type, because that's what I use in my own projects that use fixed-point. Obviously it makes sense to use a 32-bit integer:
typedef int fixed; // if you're using a 32-bit compiler anyway.
// maybe use int32_t if you have stdint.h
There's our 32-bit fixed-point data type, fixed
. Simple. Note that we want to be able to use signed values, as that
is usually very useful in applications using floating-point calculations, especially so in games.
Of course, if you were using a 16-bit compiler like Borland Turbo C++, or one of the old DOS Microsoft C/C++ compilers,
and you wanted to have a 16:16 fixed-point data type, then you would need to define your fixed
type like so:
typedef long fixed;
I use Watcom C/C++ myself, so the rest of this article is going to be oriented towards that compiler.
Converting To And From Fixed-Point
So, obviously, we cannot just do something like this to assign a value to our fixed-point type:
fixed f = 3.14159;
Because fixed
is actually an int
, so you won't get the right value as the decimal would be dropped. Plus, the
integer value that your compiler would assign here will be "wrong" as far as our fixed-point representation is
concerned. The bits for the integer 3
would be located in the lower 16 bits which in our 16:16 representation, we've
designated as the decimal (fractional) portion of our fixed-point representation.
What we need to do instead is shift the value so the whole number and decimal bits get stored in the right places. This needs to be done differently when assigning integer values to a fixed-point variable then it does when assigning actual floating-point values to a fixed-point variable. This is because you can actually bit-shift an integer value to move the bits to the right spot. But you cannot (meaningfully, at least) bit-shift a floating-point value due to its IEEE representation. Instead, for floating-point values, we just multiply them by the equivalent of the integer bit-shift, and then cast the resulting value to our 32-bit integer type.
Let's start with integer to fixed-point conversions first:
fixed a = 42 << 16; // a is now "42.0f" in fixed-point representation
We shift 42
by 16 bits to the left, to place the integer value 42
in the upper 16 bits of the fixed-point value
because we've designated the upper 16 bits as the whole number (integer) portion of our fixed-point representation.
There is no decimal portion to assign here obviously, so this is all that's needed. The decimal portion that we will be
left with here will just be zero, thus the resulting fixed-point equivalent value 42.0f
.
The bits of our fixed-point variable a
will look like this:
0000000000101010 0000000000000000
Now for floating-point to fixed-point conversions:
fixed b = ; // b is now "3.14159f" in fixed-point representation
Here we multiply 3.14159f
by 65536.0f
(65536 = 2^16, in other words, the equivalent multiplication to the bit-shift
we used for the integer previously) and cast the result to our fixed
type (which is just a 32-bit integer). Casting a
floating-point value in this way would normally mean dropping the decimal portion, but since we've already shifted the
decimal portion up with the multiplication, we don't actually lose it during the cast.
The bits of our fixed-point variable b
will look like this:
0000000000000011 0010010000111111
To convert fixed-point numbers back to either integer or floating-point values, we just do the opposite. Here is how you could do a fixed-point to integer conversion:
int i = a >> 16; // i is now "42" (converted from "42.0f" fixed-point)
Looking at the previously shown bits for the fixed-point variable a
that should be pretty self-explanatory. In this
type of conversion, obviously any decimal portion (if there was any) is discarded.
As a little something extra, if we wanted we could do a fixed-point to integer conversion that rounds up any decimal
portion to the next whole number before the doing the right-shift to turn it back into a normal integer. We could do
this by adding the equivalent of 0.5f
in fixed-point, which for our 16:16 format is 0x8000
.
int i = >> 16; // if a was "3.5f" fixed-point, i would now be 4
// however, if a was "3.4f" fixed-point, i would now be 3
Now for fixed-point to floating-point conversions. Again, we cannot just do a right-shift. The fixed-point value we have is represented as an integer and to turn it into a real floating-point value we need to get it back into that IEEE representation.
float f = ; // f is now "3.14159f" (converted from "3.14159f" fixed-point)
It is obviously useful to define some helper functions or macros for doing all of these conversions. These are the macros that I use:
Doing Fixed-Point Math
Addition and Subtraction
Luckily, addition and subtraction can be done as normal. Add and subtract away!
fixed a, b, c;
a = ;
b = ;
c = a + b; // c will be "8.9f" fixed-point
c = a - b; // c will be "-6.5f" fixed-point
One Little Gotcha
One very important consideration when doing fixed-point math is that you need to make sure that you are not unintentionally mixing fixed-point and non-fixed-point values together within the same calculations.
Convert as necessary to make sure you're not mixing them together! If you want a fixed-point result, make sure all the values being used in the calculation are fixed-point. If you want an integer result, make sure any fixed-point values used in the calculation are converted to integers. Same thing goes if you want a floating-point result.
Here's an example to illustrate this point.
int a;
fixed b, c;
a = 10;
b = ; // b = "20.0f" in fixed-point
c = b + a;
What is the result left in c
? It's not going to be fixed-point 30.0f
. It's actually going to be the fixed-point
equivalent of 20.000153f
.
Why is that? Because b
is 20.0f
in fixed-point. Meaning that the whole number portion 20
is shifted to the left
so that it's in the high 16 bits of b
. Then we add a
to it. But a
is just the plain integer value 10
. Because
it is just a normal integer value, that means its bits are located in the lower 16 bits of that variable. Adding a
and b
together has the probably-unintended consequence of treating a
as if it were a really, really small
fixed-point value (equivalent to a really small fraction in floating-point). When the addition is performed, the value
10
ends up in the resulting fixed-point value c
's lower 16 bits, which is where the decimal portion of the
fixed-point value lives in our 16:16 representation, while the 20
will be left in c
's higher 16-bits.
To hopefully more clearly illustrate, let's look at the bits for a
and b
after their initial values are assigned:
a = 0000000000000000 0000000000001010
b = 0000000000010100 0000000000000000
As you can probably imagine after seeing this, after the addition is performed the bits in c
will look like this:
c = 0000000000010100 0000000000001010
Of course there are a couple ways to fix this. The way to fix this if a
really did need to be defined as a normal
int
, and we really did want to add that integer value to the fixed-point value b
, we would need to rewrite the
addition so that all of the values being added together are being added as fixed-point values by just doing an
integer-to-fixed-point conversion on the spot for a
:
c = b + ;
Which would give us the value we expected, 30.0f
in fixed-point.
Multiplication and Division
Multiplication and division can be done using normal means, but with some extra tricks being needed.
Multiplication
Let's start with a simple multiplication and see what we're dealing with.
fixed a, b, c;
a = ;
b = ;
c = a * b; // will this be "8.0f" in fixed-point?
What ends up in c
? Zero! What!? Let's take a look at the bits of our fixed-point values a
and b
.
a = 0000000000000010 0000000000000000
b = 0000000000000100 0000000000000000
These two values should make sense to you now. The actual underlying 32-bit integer values being multiplied here are
131072
and 262144
, the result of which is 34359738368
, which is huge. Too big for a 32-bit integer to hold. Our C
compiler is going to compile this multiplication using an imul
instruction which places the resulting 64-bit value in
two registers, the high double-word in edx
and the low double-word in eax
. 34359738368
shown as 64-bits looks like
this:
0000000000000000 0000000000001000 0000000000000000 0000000000000000
|---- high double-word (edx) ---| |---- low double-word (eax) ----|
The compiler just ended up returning the low double-word value (which is 0
) because our code only specified a 32-bit
variable to hold the result.
But hold on, why is fixed-point 2.0
multiplied by fixed-point 4.0
such a gigantic result? It's because in order to
represent our fixed-point values, we are effectively multiplying them individually by 2^16 = 65536 (left-shift of 16)
before using them in the multiplication. When we then multiply them together, this "scaling" of each of the numbers gets
factored into the result. The result we got, 34359738368
, is actually 8 * 2^32 (essentially this result is in 32:32
fixed-point representation). But we expected to get 8 * 2^16 (16:16 representation) for our fixed-point result.
So you can probably see what the fix here is. We just need to shift the resulting 64-bit value right by the same amount as what our fixed-point shift amount is, 16 in our case for our 16:16 representation. But we have a little wrinkle in our plan still. The result was a 64-bit value spread across two 32-bit registers.
Let's test a little multiplication tweak using our fixed-point values from above and see what we can do:
mov eax, 131072 ; "2.0f" fixed-point represented as a 32-bit int
mov ebx, 262144 ; "4.0f" fixed-point represented as a 32-bit int
imul ebx
shrd eax, edx, 16 ; right-shift the entire 64-bit value edx:eax back by 16 bits into eax
After the shrd
instruction, we will have the following bits in eax
:
0000000000001000 0000000000000000
Which is 524288
, which is 8 * 2^16 ... which is the exact fixed-point value (8.0f
) we expected to get from our
multiplication. Phew!
Now that we know how to deal with this, we can turn this bit of assembly into a function. With Watcom C/C++:
fixed ;
// let's try this again ...
fixed a, b, c;
a = ;
b = ;
c = ; // this will now be "8.0f" in fixed-point!
// negative numbers work too of course
a = ;
b = ;
c = ; // will be "-15.75f" in fixed-point
Division
Division is similarly tricky, but let's walk through it so that we understand exactly what's going on.
Let's start with a simple multiplication and see what we're dealing with.
fixed a, b, c;
a = ;
b = ;
c = a / b; // will this be "2.0f" in fixed-point?
What ends up in c
this time? 2
! Yay, that was easy! Wait... no, that's wrong! We expected to get 2.0f
in
fixed-point, not 2
as a plain ol' integer!
The problem here is basically the reverse of what it was with multiplication. Those of you reading this with math degrees probably already expected this. But alas, I am a bit of a math dummy, so all these simple concepts I learnt back in school I had to re-learn again here while exploring fixed-point math.
Our two fixed-point values a
and b
have been pre-multiplied (left-shifted) in order to turn them into 16:16
fixed-point values. When dividing, this multiplication factor present in both of the values gets cancelled out, and so
the divide basically un-shifts our fixed-point values back into integers and divides those integers all at once.
Probably the easiest fix here is to convert the dividend, a
, into 32:32 fixed-point using a 64-bit value across
edx:eax
before the division. This would be un-coincidentally convenient as when using idiv
with a 32-bit divisor as
we are, it is actually dividing the 64-bit value in edx:eax
anyway. The result will be a 32-bit value in eax
.
Now, we cannot just left-shift our a
value by 16 so that it sits across edx:eax
because our fixed-point values are
signed. We need to sign-extend our 32-bit value a
across 64-bits before we do the left-shift that is necessary to
convert our 16:16 fixed-point representation value in a
so that it is in proper 32:32 format. And of course, that
left-shift needs to occur across eax
and into edx
. The cdq
instruction will do the sign-extension into 64-bits
for us, while shld
will handle the left-shift into edx
from eax
.
Let's test this out quickly with a bit of assembly:
mov eax, 0x4cccc ; our dividend value "4.8f" fixed-point represented as a 32-bit int in hex
mov ebx, 0x26666 ; the divisor value "2.4f" fixed-point represented as a 32-bit int in hex
cdq ; sign extend eax into edx:eax
shld edx, eax, 16 ; shift edx left by 16, using bits from eax to fill in on the right of
; edx on each shift
shl eax, 16 ; shld leaves eax unmodified though. so shift eax left by 16 also to
; finish fixing edx:eax
idiv ebx
To understand this a bit better, lets look at the bits of our dividend value 4.8f
fixed-point as we're transforming
it here before the idiv
is finally executed.
After the value is initially set, eax
has:
eax = 0000000000000100 1100110011001100
After the cdq
is executed, we end up with the following 64-bit value across edx:eax
:
edx = 0000000000000000 0000000000000000
eax = 0000000000000100 1100110011001100
Because our dividend value in eax
is positive, the sign bit value of 0
is extended all the way through edx
. If the
dividend value was negative, we would expect to see a bunch of 1
s extended all the way through edx
. Next we begin
shifting our dividend value into 32:32 fixed-point representation with the shld
:
edx = 0000000000000000 0000000000000100
eax = 0000000000000100 1100110011001100
shld
leaves the second register operand, in this case eax
, unaltered. So, we have our edx
value shifted in from
the high 16 bits of eax
now. So edx
is now all good for 32:32 fixed-point representation. The whole number portion
4
is there. Now we need to fix eax
by shifting it left 16 as well using shl
.
edx = 0000000000000000 0000000000000100
eax = 1100110011001100 0000000000000000
Great! Now edx
has the correct high 32-bits representing the whole number portion of our original 16:16 fixed-point
value, and eax
has the correct low 32-bits representing the decimal portion. Sure, the left-shift on eax
ended up
leaving a bunch of extra zeros in the low 16-bits, but we didn't have any additional bits from our original value
anyway, so this is fine and won't affect our division at all.
After the idiv
executes, we get the following 32-bit result in eax
:
eax = 0000000000000010 0000000000000000
Which is exactly what we want, 2.0f
as 16:16 fixed-point representation! No more shifting of futzing about with
values is needed. We can return this into our fixed
result variable as-is.
Phew! Oh my! Now we can package this assembly up into a function too. For Watcom C/C++:
fixed ;
// let's give it a try ...
fixed a, b, c;
a = ;
b = ;
c = ; // this will now be "2.0f" in fixed-point!
// and again, negative numbers work too
a = ;
b = ;
c = ; // will be "-8.0f" in fixed-point
A Big Division Gotcha
These methods are the usual ones that many people have better written about before me over the past decades. I cannot
take credit for any of this. However, there is at least one big gotcha with doing division this way with idiv
that
I've yet to see a book or article mention that discusses fixed-point math.
Imagine for a moment that you are fiddling around with a ray caster just for fun, and also as a means to give your
fixed-point math routines a good work out (as I was at some point). Also imagine that at some point as your ray caster
test program is running, it ends up executing a division (as mine was). And finally, let us imagine that the values
being divided were, say, 2.0f
and 0.000061f
fixed-point (as eventually happened to me).
fixed a, b, c;
a = ;
b = ;
c = ; // will be "32786.8852f" in fixed-point .... right?
And then imagine that your code fails at the idiv
operation run by fix_div
with a "divide by zero" error.
First off, what the heck? We aren't dividing by zero here! Our divisor value 0.000061f
fixed-point is greater than
what we figured our smallest precision was (mentioned near the beginning of this article), so it's probably not anything
like the value being rounded to zero or something like that. So why would it fail with a "divide by zero" error? Let's
trace through our fix_div
assembly routine with these inputs and see what's going on.
mov eax, 0x00020000 ; our dividend value "2.0f" fixed-point represented as a
; 32-bit int in hex
mov ebx, 0x00000003 ; the divisor value "0.000061f" fixed-point represented as a
; 32-bit int in hex
cdq
shld edx, eax, 16
shl eax, 16
idiv ebx
Now, let's take a look at our register values when execution reaches the idiv
. ebx
will of course just have the
value 3
in it. But here is how our dividend value looks after it has been sign-extended and shifted into 64-bits
across edx:eax
:
edx = 0000000000000000 0000000000000010
eax = 0000000000000000 0000000000000000
How does this result in a "divide by zero" error? Let's just manually calculate the result of this division ourselves
and see if we can spot the issue. We're dividing the 64-bit value 8589934592
by 3
which is 2863311530
. This
result does fit into 32-bits, not a problem. So what is wrong? I'll freely admit that I didn't see the problem with
this particular result at all and was quite confused until I finally decided to look at the bits for it:
2,863,311,530 = 1010101010101010 1010101010101010
Oooohh, pretty it's a nice repeating pattern. Ok, yeah it's pretty I guess, but the real problem is actually that this
result value has the most significant bit set to 1
and we're performing signed division with idiv
. The result of
this division cannot be represented in just 31 bits (32 bits minus the sign bit), so idiv
throws a "divide by zero"
error, which is actually just a "divide error", but most debuggers will present this as a "divide by zero" error which
is kinda confusing.
Ok, so ... fine ... idiv
is working as advertised. No problem!
Except that this can be annoying when you don't have a good way to trap this or otherwise handle it in some way that doesn't result in your ray caster crashing just because you were turning around in a full circle and these problematic values ended up in a division by chance (as happened to me). How to handle this properly is going to vary from application to application, but regardless of how you want to handle it you will need a way to catch this error in the first place and return some kind of error code or value that we can detect.
Eventually I discovered that the fixed-point division routine in Allegro has a way of working around this and returning an error result if the division would overflow. After seeing how Allegro was doing it I had an "oh, duh!" moment because it seemed so obvious in hindsight, heh. However, I will not try to hide my oversight on coming up with this solution on my own and am happy to give full credit to them!
The solution is to just perform the signed division with div
instead of idiv
, but handle the manipulation of the
sign bits yourself, as well as checking for the overflow conditions yourself, and when they arise, return some sort of
error code/value of your choosing. This is simple to implement as we know from math classes in school from way back that
dividing two positive values together results in a positive value. Dividing two negative values together also results in
a positive. But dividing a negative and a positive results in a negative. So we can manipulate the sign bits ourselves
based on this understanding. This does end up being less performant than our existing fix_div
routine which is
unfortunate. If you know that you will never ever be doing divisions with values that will overflow like this, then
you need not worry and can continue using the smaller, faster fix_div
already shown! I'm not a math genius, so I do
not really know off the top of my head if there is even a straight-forward way to just "know" this in advance. That
would surely be nice though.
Let's test this solution out:
mov eax, 0x00020000 ; our dividend value "2.0f" fixed-point represented as a
; 32-bit int in hex
mov ebx, 0x00000003 ; the divisor value "0.000061f" fixed-point represented as a
; 32-bit int in hex
xor ecx, ecx
or eax, eax ; check if the dividend is negative
jns checkDivisorSign ; move on to checking the divisor if the dividend is
; already positive
neg eax ; dividend is negative. make it positive
inc ecx ; increment counter, indicating that we switched the sign of
; one of the values
checkDivisorSign:
or ebx, ebx ; check if the divisor is negative
jns divide ; move on to performing the division if the divisor is
; already positive
neg ebx ; divisor is negative. make it positive
inc ecx ; increment counter, indicating that we switched the sign of
; one of the values
divide:
xor edx, edx
shld edx, eax, 16 ; convert dividend value to 32:32 fixed point. no need to
; sign extend this time
shl eax, 16
cmp edx, ebx ; one last check for possible division result overflow
jae error ; if found, skip right to returning an error result
div ebx ; divide! using unsigned division this time. result will
; be positive!
or eax, eax ; test division result's sign bit. if it is set, the division
; result is too big
jns restoreSignBit ; however, if the result's sign bit was not set, the result is
; fine. skip to return
error:
mov eax, 0x7fffffff ; this is our error code result. because why not?
jmp done ; and skip to the end. we just want to return this error
; value as-is
restoreSignBit:
cmp ecx, 1 ; if ecx=1, only one of the values was negative and we should
; negate the result
jne done ; if ecx=0 or ecx=2, then leave the result positive (neither
; or both were negative)
neg eax ; negate result, returning a negative value
done:
We check the sign bits of the divisor and dividend separately and if either is negative, we flag it using ecx
as a
counter for how many negative values we found, and we also force both the divisor and dividend to be positive going
into the div
. There is an additional last-minute check before the div
runs, but after the conversion of our
dividend to 32:32 fixed-point representation. We check the high double-word of the dividend by itself against our
divisor to see if it is larger or equal. This would end up with a division result that is always going to be too large
to fit in a 32-bit signed result and the div
operation would end up throwing a "divide error", so we definitely want
to catch this scenario and if found, return an error value instead of letting the CPU crash our application (that's the
whole reason that we're writing this monstrosity of a division routine after all!). Immediately after the div
is
performed we need to check the sign bit of the result. If it is set, then the division result was too big to fit in a
32-bit unsigned value (remember div
is doing unsigned division) and we need to set our error result in this case.
Finally, we need to check if we need to manually set the sign bit on the division result. This is only needed if ecx
is 1
. The only way that ecx
would be 1
is if only one of the values (either the dividend or divisor) was
negative. And since a division with a positive and negative value results in a negative, we need to make sure our
result is negative in this case (which it will never be unless we set this here).
Phew! For completeness-sake, here is the final Watcom C/C++ routine:
fixed ;
// let's try it out!
fixed a, b, c;
a = ;
b = ;
c = ; // will be 0x7fffffff, signalling an error!
// now lets see how it handles with successful divisions ...
a = ;
b = ;
c = ; // will be "4.0f" in fixed-point
a = ;
b = ;
c = ; // will be "-4.0f" in fixed-point
a = ;
b = ;
c = ; // will be "-4.0f" in fixed-point
a = ;
b = ;
c = ; // will be "4.0f" in fixed-point
Done?
This ended up being much longer then I had originally imagined. And I have some more things to say about fixed-point
math for old retro-computers. There are useful trigonometry functions and other useful math functions, like a sqrt()
function, that we will want fixed-point equivalents for. But that will be covered in a future post.