fixed point integer division ("fractional division") algorithm - emulation

The Honeywell DPS8 computer (and others) have/had a "divide fractional" instruction:
"This instruction divides a 71-bit fractional dividend (including sign) by a 36-bit
fractional divisor (including sign) to form a 36-bit fractional quotient (including
sign) and a 36-bit fractional remainder (including sign). Bit 35 of the remainder
corresponds to bit 70 of the dividend. The remainder sign is equal to the dividend
sign unless the remainder is zero."
So, as I understand it, this is integer division with the decimal point way over on the left.
.qqqqq / .ddddd
(I did scaled integer math in FORTH back in the day, but my memories of the techniques are lost in fog of time.)
To implement this instruction in a DPS8 emulator, I believe I need to start by creating two 70 bit numbers: the 71 bit dividend less it's sign bit, and the the 36 bit divisor less its sign bit and shifted 35 bits to the left so that the decimal points line up.
I think I can then form the remainder and quotient (in C) with '%' and '/', but I am unsure if those results need to be normalized (i.e. shifted).
I found an example of a "shift and subtract" algorithm "Computer Arithmetic", slide 10), but I would prefer a more straight forward implementation.
Am I on the right track, or is the solution more nuanced (fixing up the signs and detection of errors have been elided from here; those stages are well documented. The actual division is the issue.). Any pointers to C implementations of this kind of hardware emulation would be particularly helpful.

I do not have the definitive answer, but as a division is a division, you might find it helpful to look at some basic division routines.
Imagine that you have a 32-bit variable and you want an 8-bit fractional part.
You then have an integer part between 0 and 16777215, and a fractional part which is between 0 and 255.
0xiiiiiiff (where i is the integer part, f is the fractional part).
Imagine you have a 24-bit dividend (numerator), say the value 3, and a 24-bit divisor (denominator), say the value 13.
As we quickly will see, 3/13 is greater than zero and less than one. That means our fractional part is nonzero, but our integer part is filled completely with zeros.
So to do the above division using a standard divide function, we'll just bit-shift the dividend by N, thus we will get N bits of precision in our fractional part.
quotient_fp = (dividend_ip << 8) / divisor_ip
So far, so good.
But what if we want the divisor to have a fractional part, then ?
If we just shift the divisor up by 8, then we'll have a problem:
(dividend_ip << 8) / (divisor_ip << 8)
- because we'll obviously lose our fractional part of the quotient (result).
Instead, we'll need to shift the dividend up by as many bits as we shift the fractional part up...
((dividend_ip << 8) << 8) / (divisor_ip << 8)
...That makes it...
(dividend_ip << (dividend_precision + divisor_precision) / (divisor_ip << divisor_precision)
Now, let's put our fractional part math into the picture...
(((dividend_ip << dividend_precision) | dividend_fp) << divisor_precision) / ((divisor_ip << divisor_precision) | divisor_fp)
Our quotient's precision will be the same as dividend_precision, which is 8 bits.
Unfortunately, this eats a lot of bits.
Fortunately, in your case, the integer part is not important, so you'll have a lot of room for the fractional part.
Let's increase the precision to 15 bits; this can be tested using normal 32-bit integers...
(((dividend_ip << 15) | dividend_fp) << 15) / ((divisor_ip << 15) | divisor_fp)
Our quotient will now have a 15-bit precision.
OK, but since you're supplying only the fractional parts and the integer part is always zero anyway, you should be able to just toss the integer part. That makes it....
(((dividend_ip << 16) | dividend_fp) << 16) / ((divisor_ip << 16) | divisor_fp)
... reduced to ...
(dividend_fp << 16) / divisor_fp
... now let's use a 64-bit integer instead, we can get 32 bits of precision in the quotient...
(dividend_fp << 32) / divisor_fp
... some compilers have support for a int128_t (it can be enabled on some platforms for GCC), so you might be able to use that type, in order to get 128 bits easily. I have not tried it, but I've come across info on the Web earlier; search for int128_t, and you might find out how.
If you get the int128_t to work, you could make the dividend 128 bit, the divisor 64 bit and the quotient 64 bit...
quotient_fp = ((dividend_fp << 36) / divisor) >> (64 - 36)
... in order to get 36 bits precision.
Notice that since the result is in the top 36 bits of the quotient, the quotient needs to be shifted down (64 - 36) = 28 bits.
You could even go as high as (128 - 36) = 92 bits precision:
(dividend_fp << 92) / divisor
Now, that you probably (hopefully) have a solution, I would like to recommend that you get familiar with low-level binary divide (again; since you've been there a while ago).
The best sources seem to be how hardware divides binary numbers; such as microcontrollers, CPUs and the like. Assembly language dividers are also good for getting to know the inner workings. Often 32-bit divide routines that use bit-shifting are very good sources.
Through the time, I've come across a very clever implementation for ARM in ARM assembly language. Normally I wouldn't post references or assembly language examples, but considering that the code is very small, I think it would be alright.
Taken from A Fast Hi Precision Fixed Point Divide
r0 is the numerator (dividend)
r2 is the denominator (divisor)
mov r1,#0
adds r0,r0,r0
.rept 32
adcs r1,r2,r1,lsl#1
subcc r1,r1,r2
adcs r0,r0,r0
.endr
r0 is the quotient (result)
r1 is the remainder (rest, modulo result)
The above routine contains the basics for an unsigned divide.
I hope this information will be useful. It may contain errors, as I have not tested any code or example mentioned. I'm confident, though, that it's not all wrong. ;)

Related

Strange result from Summation of numbers in Excel and Matlab [duplicate]

I am writing a program where I need to delete duplicate points stored in a matrix. The problem is that when it comes to check whether those points are in the matrix, MATLAB can't recognize them in the matrix although they exist.
In the following code, intersections function gets the intersection points:
[points(:,1), points(:,2)] = intersections(...
obj.modifiedVGVertices(1,:), obj.modifiedVGVertices(2,:), ...
[vertex1(1) vertex2(1)], [vertex1(2) vertex2(2)]);
The result:
>> points
points =
12.0000 15.0000
33.0000 24.0000
33.0000 24.0000
>> vertex1
vertex1 =
12
15
>> vertex2
vertex2 =
33
24
Two points (vertex1 and vertex2) should be eliminated from the result. It should be done by the below commands:
points = points((points(:,1) ~= vertex1(1)) | (points(:,2) ~= vertex1(2)), :);
points = points((points(:,1) ~= vertex2(1)) | (points(:,2) ~= vertex2(2)), :);
After doing that, we have this unexpected outcome:
>> points
points =
33.0000 24.0000
The outcome should be an empty matrix. As you can see, the first (or second?) pair of [33.0000 24.0000] has been eliminated, but not the second one.
Then I checked these two expressions:
>> points(1) ~= vertex2(1)
ans =
0
>> points(2) ~= vertex2(2)
ans =
1 % <-- It means 24.0000 is not equal to 24.0000?
What is the problem?
More surprisingly, I made a new script that has only these commands:
points = [12.0000 15.0000
33.0000 24.0000
33.0000 24.0000];
vertex1 = [12 ; 15];
vertex2 = [33 ; 24];
points = points((points(:,1) ~= vertex1(1)) | (points(:,2) ~= vertex1(2)), :);
points = points((points(:,1) ~= vertex2(1)) | (points(:,2) ~= vertex2(2)), :);
The result as expected:
>> points
points =
Empty matrix: 0-by-2
The problem you're having relates to how floating-point numbers are represented on a computer. A more detailed discussion of floating-point representations appears towards the end of my answer (The "Floating-point representation" section). The TL;DR version: because computers have finite amounts of memory, numbers can only be represented with finite precision. Thus, the accuracy of floating-point numbers is limited to a certain number of decimal places (about 16 significant digits for double-precision values, the default used in MATLAB).
Actual vs. displayed precision
Now to address the specific example in the question... while 24.0000 and 24.0000 are displayed in the same manner, it turns out that they actually differ by very small decimal amounts in this case. You don't see it because MATLAB only displays 4 significant digits by default, keeping the overall display neat and tidy. If you want to see the full precision, you should either issue the format long command or view a hexadecimal representation of the number:
>> pi
ans =
3.1416
>> format long
>> pi
ans =
3.141592653589793
>> num2hex(pi)
ans =
400921fb54442d18
Initialized values vs. computed values
Since there are only a finite number of values that can be represented for a floating-point number, it's possible for a computation to result in a value that falls between two of these representations. In such a case, the result has to be rounded off to one of them. This introduces a small machine-precision error. This also means that initializing a value directly or by some computation can give slightly different results. For example, the value 0.1 doesn't have an exact floating-point representation (i.e. it gets slightly rounded off), and so you end up with counter-intuitive results like this due to the way round-off errors accumulate:
>> a=sum([0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1]); % Sum 10 0.1s
>> b=1; % Initialize to 1
>> a == b
ans =
logical
0 % They are unequal!
>> num2hex(a) % Let's check their hex representation to confirm
ans =
3fefffffffffffff
>> num2hex(b)
ans =
3ff0000000000000
How to correctly handle floating-point comparisons
Since floating-point values can differ by very small amounts, any comparisons should be done by checking that the values are within some range (i.e. tolerance) of one another, as opposed to exactly equal to each other. For example:
a = 24;
b = 24.000001;
tolerance = 0.001;
if abs(a-b) < tolerance, disp('Equal!'); end
will display "Equal!".
You could then change your code to something like:
points = points((abs(points(:,1)-vertex1(1)) > tolerance) | ...
(abs(points(:,2)-vertex1(2)) > tolerance),:)
Floating-point representation
A good overview of floating-point numbers (and specifically the IEEE 754 standard for floating-point arithmetic) is What Every Computer Scientist Should Know About Floating-Point Arithmetic by David Goldberg.
A binary floating-point number is actually represented by three integers: a sign bit s, a significand (or coefficient/fraction) b, and an exponent e. For double-precision floating-point format, each number is represented by 64 bits laid out in memory as follows:
The real value can then be found with the following formula:
This format allows for number representations in the range 10^-308 to 10^308. For MATLAB you can get these limits from realmin and realmax:
>> realmin
ans =
2.225073858507201e-308
>> realmax
ans =
1.797693134862316e+308
Since there are a finite number of bits used to represent a floating-point number, there are only so many finite numbers that can be represented within the above given range. Computations will often result in a value that doesn't exactly match one of these finite representations, so the values must be rounded off. These machine-precision errors make themselves evident in different ways, as discussed in the above examples.
In order to better understand these round-off errors it's useful to look at the relative floating-point accuracy provided by the function eps, which quantifies the distance from a given number to the next largest floating-point representation:
>> eps(1)
ans =
2.220446049250313e-16
>> eps(1000)
ans =
1.136868377216160e-13
Notice that the precision is relative to the size of a given number being represented; larger numbers will have larger distances between floating-point representations, and will thus have fewer digits of precision following the decimal point. This can be an important consideration with some calculations. Consider the following example:
>> format long % Display full precision
>> x = rand(1, 10); % Get 10 random values between 0 and 1
>> a = mean(x) % Take the mean
a =
0.587307428244141
>> b = mean(x+10000)-10000 % Take the mean at a different scale, then shift back
b =
0.587307428244458
Note that when we shift the values of x from the range [0 1] to the range [10000 10001], compute a mean, then subtract the mean offset for comparison, we get a value that differs for the last 3 significant digits. This illustrates how an offset or scaling of data can change the accuracy of calculations performed on it, which is something that has to be accounted for with certain problems.
Look at this article: The Perils of Floating Point. Though its examples are in FORTRAN it has sense for virtually any modern programming language, including MATLAB. Your problem (and solution for it) is described in "Safe Comparisons" section.
type
format long g
This command will show the FULL value of the number. It's likely to be something like 24.00000021321 != 24.00000123124
Try writing
0.1 + 0.1 + 0.1 == 0.3.
Warning: You might be surprised about the result!
Maybe the two numbers are really 24.0 and 24.000000001 but you're not seeing all the decimal places.
Check out the Matlab EPS function.
Matlab uses floating point math up to 16 digits of precision (only 5 are displayed).

Floating point addition with LSB error

I'm implementing a hardware double precision adder with Verilog. During the verification phase when I compare my hardware output to MATLAB (or C) double precision addition outputs I found some weird cases where the LSB is not matching, taking into account that I'm using the same rounding mode (round to nearest even). My question is about the accuracy of the C calculation, is it truly accurate in doing the rounding or it's limited to some CPU architecture (32 or 64 bits)?
Here's an example,
A = 0x62a5a1c59bd10037 = 1.5944933396238637e+167
B = 0x62724bc40659bf0c = 1.685748657333889e+166 = 0.1685748657333889e+167
The correct output (just by doing the addition of the above real numbers manually)
= 1.7630682053572526e+167 = 0x62a7eb3e1c9c3819 (this matches my hardware)
When I try doing A+B in C, the result is equal to
= 1.7630682053572525e+167 = 0x62a7eb3e1c9c3818
When I try this application to check the intermediate operations
http://www.ecs.umass.edu/ece/koren/arith/simulator/FPAdd/
I can see from mantissa addition that C is not doing the rounding correctly (round to nearest even). In this case the mantissa should be rounded by adding one. Any idea why this is happening?
The operation of http://www.ecs.umass.edu/ece/koren/arith/simulator/FPAdd/ is correct. The last round to nearest even peforms a downward rounding:
A+B + 1.0111111010110011111000011100100111000011100000011000|10 *2^555
^
|
to forget the |10 part (exactly in the middle), the result chooses 0 (even) instead of 1

Verilog operation unexpected result

I am studying verilog language and faced problems.
integer intA;
...
intA = - 4'd12 / 3; // expression result is 1431655761.
// -4’d12 is effectively a 32-bit reg data type
This snippet from standard and it blew our minds. The standard says that 4d12 - is a 4 bit number 1100.
Then -4d12 = 0100. It's okay now.
To perform the division, we need to bring the number to the same size. 4 to 32 bit. The number of bits -4'd12 - is unsigned, then it should be equal to 32'b0000...0100, but it equal to 32'b1111...10100. Not ok, but next step.
My version of division: -4d12 / 3 = 32'b0000...0100 / 32'b0000...0011 = 1
Standart version: - 4'd12 / 3 = 1431655761
Can anyone tell why? Why 4 bit number keeps extra bits?
You need to read section 11.8.2 Steps for evaluating an expression of the 1800-2012 LRM. They key piece you are missing is that the operand is 4'd12 and that it is sized to 32 bits as an unsigned value before the unary - operator is applied.
If you want the 4-bit value treated as a signed -3, then you need to write
intA = - 4'sd12 / 3 // result is 1
here the parser interprets -'d12 as 32 bits number which is unsigned initially and the negative sign would result in the negation of bits. so the result would be
negation of ('d12)= negation of (28 zeros + 1100)= 28ones+2zeros+2ones =
11111111111111111111111111110011. gives output to 4294967283 . if you divide this number (4294967283) by 3 the answer would be 1,431,655,761.
keep smiling :)

why does _mm_mulhrs_epi16() always do biased rounding to positive infinity?

Does anyone know why the pmulhrsw instruction or
_mm_mulhrs_epi16(x) := RoundDown((x * y + 16384) / 32768)
always rounds towards positive infinity? To me, this is terribly biased for negative numbers, because then a sequence like -0.6, 0.6, -0.6, 0.6, ... won't add up to 0 on average.
Is this behavior intentional or unintentional? If it's intentional, what could be the use? Is there an easy way to make it less biased?
Lucky for me, I can just change the order of my operations to get a less biased result (my function is a signed geometric mean):
__m128i ChooseSign(x, sign)
{
return _mm_sign_epi16(x, sign)
}
signsDifferent = _mm_srai_epi16(_mm_xor_si128(a, b), 15) // (a ^ b) >> 15
sign = _mm_andnot_si128(signsDifferent, a) // !signsDifferent & a
//result = ChooseSign(sqrt(a * b), sign) * fraction // biased
result = ChooseSign(sqrt(a * b) * fraction, sign)
A most serious mistake. I asked the same question on the Intel developer forums and andysem corrected me, pointing out the behavior is to round to the nearest integer.
I was mistaken into thinking it was biased because the formula from MSDN, https://msdn.microsoft.com/en-us/library/bb513995.aspx
was (x * y + 16384) >> 15. This looked very similar to the int(x + 0.5) method for rounding, which I know is biased for negative #s and cringe at. But I didn't realize right shift for negative numbers isn't the same as dividing and casting to an int.
Plus, it wasn't matching my non-SIMD reference implementation, which turns out to be biased since I was calculating int(sum / 9.0f), which rounds towards zero.
I should've had more doubt before questioning the behavior of something implemented in hardware, which would've been rigorously vetted, since int(x + 0.5) would be a very expensive mistake.
_mm_mulhrs_epi16() still has some bias of always rounding x.5 towards + infinity. But that's not a big deal for my application.

Verilog shift extending result?

We have the following line of code and we know that regF is 16 bits long, regD is 8 bits long and regE is 8 bits long, regC is 3 bits long and assumed unsigned:
regF <= regF + ( ( regD << regC ) & { 16{ regE [ regC ]} }) ;
My question is : will the shift regD << regC assume that the result is 8 bits or will it extended to 16 bits because of the bitwise & with the 16 bit vector?
The shift sub-expression itself has a width of 8 bits; the bit width of a shift is always the bit width of the left operand (see table 5-22 in the 2005 LRM).
However, things get more complicated after that. The shift sub-expression appears as an operand of the & operator. The bit length of the & expression is the bit-length of the largest of the 2 operands; in this case, 16 bits.
This sub-expression now appears as an operand of the + expression; the result width of this expression is again the maximum width of the two operands of the +, which is again 16.
We now have an assignment. This is not technically an operand, but the same rules are used; in this case, the LHS is also 16 bits, so the size of the RHS is unaffected.
We now know that the overall expression size is 16 bits; this size is propagated back down to the operands, except the 'self-determined' operands. The only self-determined operand here is the RHS of the shift expression (regC), which isn't extended.
The signedness of the expressions is now determined. Propagation happens in the same way. The overall effect here, since we have at least one unsigned operand, is that the expression is unsigned, and all operands are coerced to unsigned. So, all (non-self-determined) operands are coerced to unsigned 16-bit before any operation is actually carried out.
So, in other words, the shift sub-expression actually ends up as a 16-bit shift, even though it appears to be 8-bit at first sight. Note that it's not 16-bit because the RHS of the & is 16-bit, but because the entire sizing process - the width propagation up the expression - came up with an answer of 16. If you'd assigned to an 18-bit reg, instead of the 16-bit regF, then your shift would have been extended to 18 bits.
This is all very complicated and non-intuitive, at least if you have any experience of mainstream languages. It's explained (more or less) in sections 5.4 and 5.5 of the 2005 LRM. If you want any advice, then never write expressions like this. Write defensively - break everything down to individual sub-expressions, and then combine the sub-expressions.

Resources