Maximum bit-width to store a summation of M n-bit binary numbers - verilog

I am trying to find the formula to calculate the maximum bit-width required to contain a sum of M n-bit unsigned binary numbers. Thanks!

The maximum bit-width needed should be ceil(log_2(M * (2^n - 1))).
Edit: Thanks to @MBurnham I realize now that it should be floor(log_2(M * (2^n - 1))) + 1 instead.

Assuming positive integers, you need floor(log2(x)) + 1 bits to store x, and the largest value the sum of m n-bit numbers can produce is m * (2^n - 1).
So I believe the formula should be
floor(log2(m * (2^n - 1))) + 1
bits.

If I add 2 numbers, I need 1 bit more than the wider of the 2 numbers to store the result. So, if I add 2 n-bit numbers, I need n+1 bits to store the result.
If I add another n-bit number, I need (n+1)+1 bits to store the result (that's 3 n-bit numbers added so far).
If I add another n-bit number, I need ((n+1)+1)+1 bits to store the result (that's 4 n-bit numbers added so far).
If I add another n-bit number, I need (((n+1)+1)+1)+1 bits to store the result (that's 5 n-bit numbers added so far).
So, I think your formula is
n + M - 1
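A quick brute-force sweep (a Python sketch for checking, even though the question itself is about Verilog) confirms that floor(log2(M * (2^n - 1))) + 1 is exact, that n + ceil(log2(M)) is a tighter upper bound (exact when M is a power of two), and that n + M - 1 above is correct but much looser:

from math import ceil, floor, log2

for n in range(1, 8):
    for M in range(1, 17):
        max_sum = M * (2**n - 1)            # M copies of the largest n-bit value
        exact = max_sum.bit_length()        # bits actually needed for the worst case
        formula = floor(log2(max_sum)) + 1  # the corrected formula above
        tighter = n + ceil(log2(M))         # one extra bit per doubling of M
        loose = n + M - 1                   # one extra bit per addition
        assert exact == formula <= tighter <= loose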

Related

Why does integer_decode in Rust work that way?

I don't understand how integer_decode in num_traits works. For instance: we have
use num_traits::Float;
let num = 2.0f32;
// (8388608, -22, 1)
let (mantissa, exponent, sign) = Float::integer_decode(num);
But how do we get those integers?
The binary representation of 2.0f32 has a 0 sign bit, an exponent whose leading bit is the only one set, and a mantissa consisting of zeros. How do we get this integer decomposition, and why do we choose this particular decomposition and not 8388608*2 as the mantissa and -23 as the exponent?
I didn't write the function, so take this answer with a grain of salt; it's more gut feeling than knowledge. The rationale behind it is not explained in the comments in the function implementation, so unless the author of the code speaks up, we can't deliver more than educated guesses.
f32 is based on IEEE-754, which specifies that a 2.0 shall be represented as the following three parts:
the sign bit 0
indicates that 2.0 is positive
the exponent 128
it's one byte that stores the exponent with a bias of 127, so 127 represents 0 and 128 means 1
the mantissa 0
the mantissa consists of 23 bits and has an implicit 1. in front of it. So 0 means 1.0.
To get the actual number, you compute (-1)^0 * 2^(128 - 127) * 1.0, which is 1 * 2^1 * 1.0 = 2.
Now this is not the only way to compute that. You could also do:
map the sign bit to 1 and -1
instead of prefixing the mantissa with an implicit 1., prepend an actual 1 bit to its 23 bits, making it a 24-bit integer (this avoids having to use a float to decode a float, which would defeat the purpose)
subtract 127 from the exponent, making it signed; then subtract another 23 from it to compensate for the mantissa now being shifted left by 23 bits (because the mantissa is 23 bits long and we moved the radix point all the way to the right to make it an integer).
This would, for 2.0 give us:
sign 1 (the sign bit is 0, which maps to 1)
mantissa 0b100000000000000000000000 = 8388608
exponent 128 - 127 - 23 = -22
Now we can do sign * mantissa * 2 ^ exponent, as specified in the documentation to get our value back.
Note how fast calculating those integers is: a binary decision for the sign, a binary OR operation for the mantissa, and a single u8 subtraction for the exponent (a single one, because one can combine - 127 - 23 into - 150 beforehand).
why we choose this particular decomposition and not 8388608*2 as mantissa and -23 as exponent
The short version is that this guarantees that all possible mantissas can be treated the same way: the mantissa is 23 bits long, and a 1 with the entire mantissa attached to it is always a valid integer. In the case of 0 this is a 1 followed by 23 zeros, 0b100000000000000000000000, which is 8388608.
integer_decode() documentation is quite clear:
Returns the mantissa, base 2 exponent, and sign as integers, respectively. The original number can be recovered by sign * mantissa * 2 ^ exponent.
1 * 8388608 * 2^-22 == 2
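If it helps to see those steps outside Rust, here is a minimal Python sketch of the same decomposition (an illustration following the steps described above, not the num_traits source):

import struct

def integer_decode_f32(x):
    # Returns (mantissa, exponent, sign) such that x == sign * mantissa * 2**exponent,
    # mirroring the decomposition described above.
    bits, = struct.unpack('>I', struct.pack('>f', x))  # the raw IEEE-754 bits
    sign = 1 if bits >> 31 == 0 else -1
    exponent = (bits >> 23) & 0xFF
    if exponent == 0:                         # subnormal: no implicit leading 1
        mantissa = (bits & 0x7FFFFF) << 1
    else:                                     # normal: OR the implicit 1 in front
        mantissa = (bits & 0x7FFFFF) | 0x800000
    return mantissa, exponent - 127 - 23, sign

print(integer_decode_f32(2.0))  # (8388608, -22, 1)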
Mostly, IEEE 754 uses this description of finite floating-point numbers: a floating-point number has the form (−1)^s × b^e × m, where:
s is 0 or 1.
e is any integer with e_min ≤ e ≤ e_max.
m is a number represented by a digit string of the form d_0 . d_1 d_2 … d_(p−1), where each d_i is an integer digit with 0 ≤ d_i < b.
This form is described in IEEE-754 2019 clause 3.3. b is the base, p is the precision (the number of digits in base b), and e_min and e_max are bounds on the exponent. This form is useful for certain things, such as describing the normalized form as starting with a leading "1." or "0." when the base is two. However, the standard also says, in the same clause:
It is also convenient for some purposes to view the significand as an integer, in which case the finite floating-point numbers are described thus:
Signed zero and non-zero floating-point numbers of the form (−1)^s × b^q × c, where
s is 0 or 1.
q is any integer with e_min ≤ q + p − 1 ≤ e_max.
c is a number represented by a digit string of the form d_0 d_1 d_2 … d_(p−1), where each d_i is an integer digit with 0 ≤ d_i < b (c is therefore an integer with 0 ≤ c < b^p).
(My reproduction of the IEEE-754 text changes some of the typography slightly.) Note two things. First, this matches the results Float::integer_decode gives you. In the f32 format, p is 24, so the integer significand c has 24 bits: it can be 8,388,608 (2^23) and cannot be 16,777,216 (2^24).
Second, what makes this form useful is that c is always an integer and can be any integer in that range: the low digit of c is immediately to the left of the radix point, so the consecutive values representable in this range of the floating-point format are consecutive integers, and we can analyze them and write proofs about them using number theory.
You could use alternate forms that are mathematically equivalent and let the significand be 2^24, but then the low digit in such a form would have to be zero (because there is no bit in the format to represent it having any other value), so it is not particularly useful.
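The two forms are related by a shift of the radix point: q = e - (p - 1) and c = m * 2^(p - 1), where m is the fractional 1.xxx significand. A quick sanity check in Python for 2.0 in the f32 format (p = 24):

p = 24                           # precision of f32
s, e, m = 0, 1, 1.0              # 2.0 in the (-1)^s * 2^e * m form
q = e - (p - 1)                  # -22
c = int(m * 2 ** (p - 1))        # 8388608
assert (-1) ** s * c * 2 ** q == 2.0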

Multiplication of numbers that have sizes greater than a 64-bit register

For a 64-bit processor, the registers are 64 bits wide. So the widest multiplication whose result fits in a register at a time is a 32-bit number by a 32-bit number.
Is this true, or are there other factors that determine the number of bits that can be multiplied together?
No, you cannot say that.
It depends on how the multiplication algorithm is designed.
Think about this:
x * y = x + x + x + ... + x // y times
You can think of every multiplication as repeated addition.
Adding one number to another is not limited by register size: you add two digits at a time, save the result digit and the carry, and go on to the next pair of digits until the two numbers are fully added.
This way you can multiply very long numbers with a very small register, as long as there is enough memory to hold the result.
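As a concrete illustration (a Python sketch; Python's own big integers are used only to verify the result), here is schoolbook multiplication on 32-bit limbs, where each limb plays the role of a register-sized digit:

BASE = 2**32  # one "digit" (limb) fits a 32-bit register

def to_limbs(x):
    # little-endian list of 32-bit digits
    limbs = []
    while x:
        limbs.append(x % BASE)
        x //= BASE
    return limbs or [0]

def mul(a, b):
    # schoolbook multiplication, limb by limb, propagating the carry
    out = [0] * (len(a) + len(b))
    for i, ai in enumerate(a):
        carry = 0
        for j, bj in enumerate(b):
            t = out[i + j] + ai * bj + carry
            out[i + j] = t % BASE   # result digit
            carry = t // BASE       # carry-over to the next digit
        out[i + len(b)] += carry
    return out

x, y = 2**100 + 12345, 2**90 + 67890
product = mul(to_limbs(x), to_limbs(y))
assert sum(d * BASE**i for i, d in enumerate(product)) == x * y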

How to count binary sequence in binary number in Python?

I would like to count occurrences of the '01' sequence in 5760 binary bits.
First, I would like to combine several binary numbers, then count the number of '01' occurrences.
For example, I have a 64-bit integer. Say, 6291456. Then I convert it into binary. The most significant 4 bits are not used, so I'll get the 60-bit binary string 000...000011000000000000000000000
Then I need to combine them (just putting the bits together, since I only need to count '01'): the first 60 bits + the second 60 bits + ... until 96 chunks of 60 bits are stitched together.
Finally, I want to count how many '01' appears.
s = binToString(5760 binary bits)
cnt = s.count('01');
num = 6291226
binary = format(num, 'b')
print(binary)
print(binary.count('01'))
If I use the number given by you, i.e. 6291456, its binary representation is 11000000000000000000000, which gives 0 occurrences of '01'.
If you always want your number to be 60 bits in length you can use
binary = format(num,'060b')
It will add leading zeros to pad the string to the given length.
Say that nums is your list of 96 numbers, each of which can be stored in 64 bits. Since you want to throw away the 4 most significant bits, you are really taking each number modulo 2**60. Thus, to count the number of 01 in the resulting string, using the idea of @ShrikantShete to use the format function, you can do it all in one line:
''.join(format(n%2**60,'060b') for n in nums).count('01')
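For example, with three hypothetical stand-ins in place of the 96 real values:

nums = [6291226, 5, 2**60 - 1]   # hypothetical stand-ins for the 96 real values
bits = ''.join(format(n % 2**60, '060b') for n in nums)
print(len(bits))                 # 180: three 60-bit chunks stitched together
print(bits.count('01'))          # total '01' occurrences across the combined string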

Lookup table for counting number of set bits in an Integer

I was trying to solve this popular interview question - http://www.careercup.com/question?id=3406682
There are 2 approaches to this that I was able to grasp -
Brian Kernighan's algo -
Bits counting algorithm (Brian Kernighan) in an integer time complexity
Lookup table.
I assume when people say use a lookup table, they mean a HashMap with the integer as key and the count of its set bits as value.
How does one construct this lookup table? Do we use Brian's algo to count the number of set bits the first time we encounter an integer, put it in the hashtable, and the next time we encounter that integer, retrieve the value from the hashtable?
PS: I am aware of the hardware and software api's available to perform popcount (Integer.bitCount()), but in context of this interview question, we are not allowed to use those methods.
I was looking for an answer everywhere but could not find a satisfactory explanation.
Let's start by understanding the concept of left shifting. When we shift a number left, we multiply it by 2; shifting right divides it by 2.
For example, if we want to generate the number 20 (binary 10100) from the number 10 (binary 01010), we have to shift 10 to the left by one. The number of set bits in 10 and 20 is the same, except that the bits in 20 are shifted one position to the left compared to 10. From this we can conclude that the number of set bits in a number n is the same as the number of set bits in n/2 (if n is even).
In the case of an odd number, like 21 (10101), all bits are the same as in 20 except the last bit, which is set to 1 in 21, giving an odd number one extra set bit.
Let's generalize this formula:
the number of set bits in n is the number of set bits in n/2 if n is even;
the number of set bits in n is the number of set bits in n/2, plus 1, if n is odd (as the last bit of an odd number is set).
The more general formula would be:
BitsSetTable256[i] = (i & 1) + BitsSetTable256[i / 2];
where BitsSetTable256 is the table we are building for bit counts. For the base case we can set BitsSetTable256[0] = 0; the rest of the table can then be computed with the above formula in a bottom-up fashion.
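As a sketch, here is the same bottom-up construction in Python, combined with the byte-by-byte lookup the next answer describes:

table = [0] * 256                       # table[0] = 0 is the base case
for i in range(1, 256):
    table[i] = table[i >> 1] + (i & 1)  # bits(i) = bits(i/2) + last bit

def popcount32(x):
    # look up each byte of a 32-bit integer and sum the counts
    return (table[x & 0xFF] + table[(x >> 8) & 0xFF] +
            table[(x >> 16) & 0xFF] + table[(x >> 24) & 0xFF])

assert popcount32(0b10110100) == 4
assert popcount32(0xFFFFFFFF) == 32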
Integers can be used directly to index arrays;
e.g. you have just a simple array of unsigned 8-bit integers containing the set-bit count for 0x0001, 0x0002, 0x0003..., and you do a lookup with array[number_to_test].
You don't need to implement a hash function to map a 16-bit integer to something you can order just to have a lookup function!
To answer your question about how to compute this table:
int table[256] = {0}; /* for an 8-bit lookup; table[0] = 0 is the base case */
for (int i = 1; i < 256; i++) {
    table[i] = table[i/2] + (i&1);
}
Look up every byte of the given integer in this table and sum the values obtained.

How to extract dyadic fraction from float

Now, although floating- and double-precision numbers can approximate any sort of number (the same could be said of integers; floats are just more precise), they are represented as binary fractions internally. For example, one tenth would be approximated as
0.00011001100110011... (the expansion only goes to the computer's precision, not to infinity)
Now, any binary number with finitely many bits has what is called a dyadic fraction representation in mathematics (this has nothing to do with p-adic numbers). This means you represent it as a fraction whose denominator is a power of 2. For example, let's say our computer approximates one tenth as 0.00011. The dyadic fraction for that is 3/32, or 3/(2^5), which is close to one tenth. Now for my technical question: what would be the simplest way to extract the dyadic fraction from a floating-point number?
Irrelevant note: if you are wondering why I would want to do this, it is because I am creating a surreal number library in Haskell. Dyadic fractions are easily translated into surreal numbers, which is why it is convenient that binary is easily translated into dyadic. (I'm sure I'll have trouble with the rational numbers, though.)
The decodeFloat function seems useful for this. Technically, you should also check that floatRadix is 2, but as far as I can see this is always the case in GHC.
Just be careful, since it does not simplify the mantissa and exponent: if I evaluate decodeFloat (1.0 :: Double) I get an exponent of -52 and a mantissa of 2^52, which is not what I expected.
Also, toRational seems to generate a dyadic fraction. I am not sure this is always the case, though.
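As a cross-check outside Haskell (a sketch, not part of the original answer): Python's float.as_integer_ratio returns exactly this dyadic fraction in lowest terms, and math.frexp gives an unsimplified decodeFloat-style pair:

from math import frexp

x = 0.1
num, den = x.as_integer_ratio()  # exact dyadic fraction in lowest terms
print(num, den)                  # 3602879701896397 36028797018963968 (den == 2**55)
m, e = frexp(x)                  # x == m * 2**e with 0.5 <= m < 1
print(int(m * 2**53), e - 53)    # unsimplified pair, like decodeFloat: 7205759403792794 -56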
Hold your numbers in binary and convert to decimal for display.
Binary numbers are all dyadic: the number of digits after the radix point gives the power of two in the denominator, and the digits read without the radix point give the numerator. That's binary numbers for you.
There is an ideal representation for surreal numbers in binary. I call them "sinary". It's this:
0s is Not a number
1s is zero
10s is neg one
11s is one
100s is neg two
101s is neg half
110s is half
111s is two
... etc...
So you see that the standard binary count matches the surreal birth order of numeric values when evaluated in sinary. To determine the numeric value of a sinary, the 1s are rights and the 0s are lefts. We start with steps of +/-1; after the first change of direction the steps halve to 1/2, 1/4, 1/8, etc., with sign + for a 1 and - for a 0.
ex: evaluating sinary
1011011s
-> is the 91st surreal number (because 64+16+8+2+1 = 91)
-> with a value of −0.28125, because...
1011011
NLRRLRR
+-++-++
+ 0 − 1 + 1/2 + 1/4 − 1/8 + 1/16 + 1/32
= 0 − 32/32 + 16/32 + 8/32 − 4/32 + 2/32 + 1/32
= − 9/32
The surreal numbers form a binary tree, so there is an ideal binary format matching their location on the tree according to the Left/Right pattern used to reach the number. Assign 1 to right and 0 to left. Then the birth order of a surreal number is equal to the binary count of this representation, i.e. the 15th surreal number value represented in sinary is the 15th number in the standard binary count. The value of a sinary is the surreal label value: strip the leading bit from the representation, then start adding +1s or -1s depending on whether the number starts with 1 or 0 after the first one. Once the bit flips, begin adding and subtracting halved values (1/2, 1/4, 1/8, etc.) using + or - according to the bit value 1/0.
I have tested this format and it seems to work well. And there are some other secrets... such as that the left and right of any sinary representation are the same binary format with the tail clipped to the last 0 and the last 1, respectively. Conversion to a decimal or dyadic fraction is NOT required in order to perform the recursive functions requested by Conway.
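A minimal Python sketch of the evaluation rule described above (sinary_value is a hypothetical helper name; the argument drops the trailing "s"); it reproduces the -9/32 worked example:

from fractions import Fraction

def sinary_value(bits):
    # evaluate a sinary string: after the leading root bit, 1 = right (+), 0 = left (-);
    # steps are +/-1 until the first change of direction, then halve on every step
    assert bits.startswith('1'), "a sinary number starts with the root bit 1"
    value, step, halving, prev = Fraction(0), Fraction(1), False, None
    for b in bits[1:]:
        if prev is not None and b != prev:
            halving = True
        if halving:
            step /= 2
        value += step if b == '1' else -step
        prev = b
    return value

print(sinary_value('1011011'))  # -9/32, the 91st surreal number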
