I'm trying to implement COS X function in Verilog using Taylor series. The problem statement presented to me is as below
"Write a Verilog code to compute cosX using Taylor series approximation. Please attach the source and test bench code of the 8-bit outputs in signed decimal radix format for X = 0° to 360° at the increment of 10° "
I need to understand a couple of things before i proceed. Please correct me if i am wrong someplace
Resolution calculation : 10° increments to cover 0° to 360° => 36 positions
36 in decimal can be represented by 6 bits. Since we can use 6 bits, the resolution is slightly better by using 64 words. The 64 words represent 0° to 360° hence each word represents a resolution of 5.625° ie all values of Cos from 0° to 360° in increments of 5.625°. Thus resolution is 5.625°
Taylor series calculation
Taylor series for cos is given by Cos x approximation by Taylor series
COS X = 1 − (X^2/2!) + (X^4/4!) − (X^6/6!) ..... (using only 3~4 terms)
I have a couple of queries
1) While it is easy to generate X*X (X square) or X cube terms using a multiplier, i am not sure how to deal with the extra bits generated during calculation of X square or X cube terms . Output is 8 bits only
eg X=6 bits ; X square =12 bits ; X cube = 18 bits.
Do i generate them anyways and later ignore them by considering just the MSB 8 bits of the entire result ? ... such a cos wave would suck right ?
2) I am not sure how to handle the +1 addition at start of Taylor series ...COS X = 1 − (X^2/2!) + (X^4/4!) .... Do i add binary 1 directly or do i have to scale the 1 as 2^8 = 255 or 2^6 = 64 since i am using 6 bits at input and 8 bits at output ?
I think this number series normally gives a number in the range +1 to -1. SO you have to decide how you are going to use your 8 bits.
I think a signed number with 1 integer bit and 7 fractional bits, you will not be able to represent 1, but very close.
I have a previous answer explaining how to use fixed-point with verilog. Once your comfortable with that you need to look at how bit growth occurs during multiply.
Just because you are outputting 1 bit int, 7 bit frac internally you could (should) use more to compute the answer.
With 7 fractional bits a 1 integer would look like 9'b0_1_0000000 or 1*2**7.
Related
I don't understand how integer_decode in num_traits works. For instance: we have
use num_traits::Float;
let num = 2.0f32;
// (8388608, -22, 1)
let (mantissa, exponent, sign) = Float::integer_decode(num);
But how we get those integers?
Binary representation of 2.0f32 has 0 sign bit, 1 bit as leading bit in exponent and mantissa consisting of zeros. How to get integer decode and why we choose this particular decomposition and not 8388608*2 as mantissa and -23 as exponent?
I didn't write the function, so take this answer with a grain of salt, as it's more of a gut feeling than knowledge. The rationale behind it is not explained in the comments in the function implementation, so unless the author of the code speaks up, we can't deliver more than educational guesses.
f32 is based on IEEE-754, which specifies that a 2.0 shall be represented as the following three parts:
the sign bit 0
indicates that 2.0 is positive
the exponent 128
it's one byte that indicates the exponent, with 127 representing 0. 128 means 1
the mantissa 0
the mantissa consists of 23 bits and has an implicit 1. in front of it. So 0 means 1.0.
To get the actual number, you need to do 0 * (-1) + 2 ^ (128 - 127) * 1.0, which is 2 ^ 1 * 1 = 2.
Now this is not the only way to compute that. You could also do:
map the sign bit to 1 and -1
instead of prefixing the mantissa with 1., add a 1 in front of it, making it an integer. (this avoids having to use a float to decode a float, which is nonsense for obvious reasons)
subtract 127 from the exponent, making it signed. Then, remove 23 from it to compensate that our mantissa is now shifted by 23 bits (because the mantissa is 23 bits long and we moved the comma all the way back to make it an integer).
This would, for 2.0 give us:
sign -1
mantissa 0b100000000000000000000000 = 8388608
exponent 128 - 127 - 23 = -22
Now we can do sign * mantissa * 2 ^ exponent, as specified in the documentation to get our value back.
Note how fast calculating those integers was: a binary decision for the sign, a binary or operation for the mantissa and a single u8 subtraction for the exponent (a single one because one can combine - 127 - 23 to - 150 beforehand).
why we choose this particular decomposition and not 8388608*2 as mantissa and -23 as exponent
The short version is that this guarantees that all possible mantissas can be treated the same way. It's 23 bits long and a 1 with the entire mantissa attached to it is always a valid integer. In the case of 0 this is a 1 with 23 0s, 0b100000000000000000000000, which is 8388608.
integer_decode() documentation is quite clear:
Returns the mantissa, base 2 exponent, and sign as integers, respectively. The original number can be recovered by sign * mantissa * 2 ^ exponent.
1 * 8388608 * 2^-22 == 2
Mostly IEEE 754 uses this description of finite floating-point numbers: A floating-point number has the form (−1)s×be×m, where:
s is 0 or 1.
e is any integer emin ≤ e ≤ emax.
m is a number represented by a digit string of the form d0.d1d2…dp−1 where di is an integer digit 0 ≤ di < b.
This form is described in IEEE-754 2019 clause 3.3. b is the base, p is the precision (number of digits in base b), and emin and emax are bounds on the exponent. This form is useful for certain things, such as describing the normalized form as starting with a leading “1.” or “0.” when the base is two. However, the standard also says, in the same clause:
It is also convenient for some purposes to view the significand as an 10 integer; in which case the finite floating-point numbers are described thus:
Signed zero and non-zero floating-point numbers of the form (−1)s×bp×c, where
s is 0 or 1.
q is any integer emin ≤ q+p−1 ≤ emax.
c is a number represented by a digit string of the form d0d1d2…dp−1 where di is an integer digit 0 ≤ di ≤ b (c is therefore an integer with 0 ≤ c < bp).
(My reproduction of the IEEE-754 text changes some of the typography slightly.) Note two things. First, this matches the results Float::integer_decode gives you. In the Float format, p is 24, so m should have 24 bits, not 25, so it can be 8,388,608 (223) and cannot be 16,777,216 (224).
Second, what makes this form useful is m is always an integer and can be any integer in that range—the low digit of m is immediately to the left of the radix point, so the consecutive values representable in this range of the floating-point format are consecutive integers, and we can analyze them and write proofs using number theory.
You could use alternate forms that are mathematically equivalent but let the significand be 224, but then the low digit in such a form would have to be zero (because there is no bit in the format to represent it having any other value), so it is not particularly useful.
For a 64-bit processor, the size of registers are 64 bits. So the maximum bits that can be multiplied together at a time are: a 32 bit number by a 32 bit number.
Is this true or are there any other factors that determine the number of bits multiplied by each other?
No - you can not say that.
That depends on how the algorithm of multiplication is designed.
Think about that:
x * y = x + x + x + ... + x // y times
You can think of every multiplication as a sum of additions.
Adding a number to another one is not a thing of registers because you add always two figurs, save the result and the carry-over and go on to the next two figurs till you added the two numbers.
This way you can multiply very long numbers with a very small register but a large memory to save the result.
Now, floating and double-precision numbers, although they can approximate any sort of number (although the same could be said integers, floats are just more precise), they are represented as binary decimals internally. For example, one tenth would be approximated
0.00011001100110011... (... only goes to computers precision, not infinity)
Now, any number in binary with finite bits as something called a dyadic fraction representation in mathematics (has nothing to do with p-adic). This means you represent it as a fraction, where the denominator is a power of 2. For example, let's say our computer approximates one tenth as 0.00011. The dyadic fraction for that is 3/32 or 3/(2^5), which is close to one tenth. Now for my technical question. What would be the simplest way to extract the dyadic fraction from a floating number.
Irrelevant Note: If you are wondering why I would want to do this, it is because I am creating a surreal number library in Haskell. Dyadic fractions are easily translated into Surreal numbers, which is why it is convenient that binary is easily translated into dyadic, (I'll sure have trouble with the rational numbers though.)
The decodeFloat function seems useful for this. Technically, you should also check that floatRadix is 2, but as far I can see this is always the case in GHC.
Just be careful since it does not simplify mantissa and exponent. Here, if I evaluate decodeFloat (1.0 :: Double) I get an exponent of -52 and a mantissa of 2^52 which is not what I expected.
Also, toRational seems to generate a dyadic fraction. I am not sure this is always the case, though.
Hold your numbers in binary and convert to decimal for display.
Binary numbers are all dyatic. The numbers after the decimal place represent the number of powers of two for the denominator and the number evaluated without a decimal place is the numerator. That's binary numbers for you.
There is an ideal representation for surreal numbers in binary. I call them "sinary". It's this:
0s is Not a number
1s is zero
10s is neg one
11s is one
100s is neg two
101s is neg half
110s is half
111s is two
... etc...
so you see that the standard binary count matches the surreal birth order of numeric values when evaluated in sinary. The way to determine the numeric value of sinary is that the 1's are rights and the 0's are lefts. We start with +/-1's and then 1/2, 1/4, 1/8, etc. With sign equal to + for 1 and - for 0.
ex: evaluating sinary
1011011s
-> is the 91st surreal number (because 64+16+8+2+1 = 91)
-> with a value of −0.28125, because...
1011011
NLRRLRR
+-++-++
+ 0 − 1 + 1/2 + 1/4 − 1/8 + 1/16 + 1/32
= 0 − 32/32 + 16/32 + 8/32 − 4/32 + 2/32 + 1/32
= − 9/32
The surreal numbers form a binary tree, so there is an ideal binary format matching their location on the tree according to the Left/Right pattern to reach the number. Assign 1 to right and 0 to left. Then the birth order of surreal number is equal to the binary count of this representation. ie: the 15th surreal number value represented in sinary is the 15th number representation in the standard binary count. The value of a sinary is the surreal label value. Strip the leading bit from the representation, and start adding +1's or -1's depending on if the number starts with 1 or 0 after the first one. Then once the bit flips, begin adding and subtracting halved values (1/2, 1/4, 1/8, etc) using + or - values according to the bit value 1/0.
I have tested this format and it seems to work well. And there are some other secrets... such as the left and right of any sinary representation is the same binary format with the tail clipped to the last 0 and last 1 respectively. Conversion to decimal into a dyatic is NOT required in order to preform the recursive functions requested by Conway.
I am doing a divider circuit in verilog and using the non-restoring division algorithm.
I am having trouble representing the remainder as a fractional binary number.
For example if I do 0111/0011 (7/3) I get the quotient as 0010 and remainder as 0001 which is correct but I want to represent it as 0010.0101.
Can Someone help ??
Suppose, as in your example, you are dividing 4 bit numbers, but want an extra 4 bits of fractional precision in the result.
One approach is to simply multiply the numerator by 2^4 before doing the division.
i.e.
instead of
0111/0011 = 0010 (+remainder)
do
01110000/0011 = 00100101 (+remainder)
hi just do mathematics !!!
you have already got the Q(quotient) and R(remainder) , now with the remainder you multiply that with 10(decimal) in binary 1010 that for example
7/3 gives 2 as Q and 1 as remainder than just multiply this 1 with 10 then again apply your logic which gives 10/3 gives 3 as Q so your answer will be
3(Q(first_division)).3(second division-Q)
try it it is working . and very easy to implement in verilog .
I've written a small function in C, which almost do the same work as standart function `fcvt'. As you may know, this function takes a float/double and make a string, representing this number in ANSI characters. Everything works ;-)
For example, for number 1.33334, my function gives me string: "133334" and set up special integer variable `decimal_part', in this example will be 1, which means in decimal part only 1 symbol, everything else is a fraction.
Now I'm curious about what to do standart C function `printf'. It can take %a or %e as format string. Let me cite for %e (link junked):
"double" argument is output in scientific notation
[-]m.nnnnnne+xx
... The exponent always contains two digits.
It said: "The exponent always contains two digits". But what is an Exponent? This is the main question. And also, how to get this 'exponent' from my function above or from `fcvt'.
The notation might be better explained if we expand the e:
[-]m.nnnnnn * (10^xx)
So you have one digit of m (from 0 to 9, but it will only ever be 0 if the entire value is 0), and several digits of n. I guess it might be best to show with examples:
1 = 1.0000 * 10^0 = 1e0
10 = 1.0000 * 10^1 = 1e1
10000 = 1.0000 * 10^4 = 1e4
0.1 = 1.0000 * 10^-1 = 1e-1
1,419 = 1.419 * 10^3 = 1.419e3
0.00000123 = 1.23 * 10^-5 = 1.23e-5
You can look up scientific notation off Google, but it is useful for expressing very large or small numbers like 1232100000000000000 would be 1.2321e24 (I didn't actually count, exponent may be inaccurate).
In C, I think you can actually extract the exponent from the top 12 bits (the first being the sign which you will have to ignore). See: IEEE758-1985 Floating Point
The exponent is the power 10 is raised to then multiplied by the base.
SI is explained at wikipeida. http://en.wikipedia.org/wiki/Scientific_notation
m.nnnnnne+xx is logically equal to m.nnnnnn * 10 ^ +xx
In scientific notation, the exponent is the ten to the XX power, so 1234.5678 can be represented as 1.2345678E03 where the normalized form is multiplied by 10^3 to get the "real" answer.
400 = 4 * 10 ^ 2
2 is the exponent.
If you write a number in scientific notation then the exponent is part of that notation.
You can see a full description here http://en.wikipedia.org/wiki/Scientific_notation, but basically its just another way to write a number, typically used for very large or very small numbers.
Say you have the number 300, that is equal to 3 * 100, or 3 * 10^2 in scientific notation.
If you use %e it will be printed as 3.0e+02