About hash table - hashmap

A hash function h maps 16-bit inputs to 8-bit hash values. What is the largest k such that in any set of
1,000 inputs, there are at least k inputs that h maps to the same hash value?
I think k should be 3. Because 1000/256=3.~
However the answer key is 4. It is a GRE exam so I guess its answer is correct. Could any one help me explain it?

It is 4 because you need to round up here. There are 256 possible hash outputs; if you get at most 3 times each output value, then you have at most 256*3 = 768 inputs. So the answer is 1000/256, rounded up, hence 4.

Related

Finding duplicate value for calculated f(x) unique values of 2**128 except 2 duplicated

I have a function that calculates the next unique value on the fly. These values are 100% unique except one. Each value is 256 bit, total values are 2128. The amount of space needed to store this many values is obviously too large: (2128 x 256 bits)
How can i find these 2 duplicates indexes from starting point?
I tried to XOR each value with next, but when function returns duplicate i cant figure out that this is duplicate because XOR excludes it from crc.
simple example:
a=[755,167,986,343,566,996,245,343]
crc=a[0]
for i in range(1,len(a)):
crc=crc ^ a[i]
this code makes crc like
crc([755,167,986,566,996,245])
right answer for this example is 343 or #4 (3 if counting from 0)
best code that i can produce is double for from ending element till start, and from current element of main for till end:
for i in range(maxX,1,-1):
crc=crc^x1
for j in range(i+1,maxX):
if(x1==x2):
break
x2 = f(x2)#iterate x2 back like "f(x2,+1)"
x1=f(x1) #iterate x1 forward like "f(x1,-1)"
Are there any subexponential algorithms to solve this problem?
function produces values like :
f(x1)=13082069741720843297365566874415150979823939205374148607908286415326657186361
in fact there is no difference for me coding python/c/javascript/php/asm, so the code for answer accepted in any language
Any fastest algorithm for finding 2 duplicates for this?

Why does this hash calculating bit hack work?

For practice I've implemented the qoi specification in rust. In it there is a small hash function to store recently used pixels:
index_position = (r * 3 + g * 5 + b * 7 + a * 11) % 64
where r, g, b, and a are the red, green, blue and alpha channels respectively.
I assume this works as a hash because it creates a unique prime factorization for the numbers with the mod to limit the number of bytes. Anyways I implemented it naively in my code.
While looking at other implementations I came across this bit hack to optimize the hash calculation:
fn hash(rgba:[u8:4]) -> u8 {
let v = u32::from_ne_bytes(rgba);
let s = (((v as u64) << 32) | (v as u64)) & 0xFF00FF0000FF00FF;
s.wrapping_mul(0x030007000005000Bu64.to_le()).swap_bytes() as u8 & 63
}
I think I understand most of what's going on but I'm confused about the magic number (the multiplicand). To my understanding it should be flipped. As a step by step example:
let rgba = [0x12, 0x34, 0x56, 0x78].
On my machine (little endian) this gives v the value 0x78563412.
The bit shifting spreads the values, giving s = 0x7800340000560012.
Now here's where I get confused. The magic number has the values that should be multiplied aligned in a 64 bit field (3, 5, 7, 11), spaced the same way that the original values are. However they seem to be in reverse order from the values:
0x7800340000560012
0x030007000005000B
When multiplying it would seem that the highest value, the alpha channel (0x78), is being multiplied by 3, while the lowest value, the red channel (0x12), is being multiplied by 11. I'm also not entirely sure why this multiplication works anyway, after multiplying the values by various powers of 2.
I understand that the bytes are then swapped to big endian and trimmed, but that's not until after the multiplication step which loses me.
I know that the code produces the correct hash, but I don't understand why that's the case. Can anyone explain to me what I'm missing?
If you think about the way the math works, you want this flipped order, because it means all the results from each of the "logical" multiplications cluster in the same byte. The highest byte in the first value multiplied by the lowest byte in the second produces a result in the highest byte. The lowest byte in the first value's product with the highest byte in the second value produces a result in the same highest byte, and the same goes for the intermediate bytes.
Yes, the 0x78... and 0x03... are also multiplied by each other, but they overflow way past the top of the value and are lost. Having the order "backwards" means the result of the multiplications we care about all ends up summed in the uppermost byte (the total shift of the results we want is always 56 bits, because the 56th bit offset value is multiplied by the 0th, the 40th by the 16th, the 16th by the 40th, and the 0th by the 56th), with the rest of the multiplications we don't want having their results either overflow (and being lost) or appearing in lower bytes (which we ignore). If you flipped the bytes in the second value, the 0x78 * 0x0B (alpha value & multiplier) component would be lost to overflow, while the 0x12 * 0x03 (red value & multiplier) component wouldn't reach the target byte (every component we cared about would end up somewhat that wasn't the uppermost byte).
For a possibly more intuitive example, imagine doing the same work, but where all the bytes of one input except a single component are zero. If you multiply:
0x7800000000000000 * 0x030007000005000B
the logical result is:
0x1680348000258052800000000000000
but removing the overflow reduces that to:
0x2800000000000000
//^^ result we care about (actual product of 0x78 and 0x0B is 0x528, but only keeping low byte)
Similarly,
0x0000340000000000 * 0x030007000005000B
produces:
0x9c016c000104023c0000000000
overflowing to:
0x04023c0000000000
//^^ result we care about (actual product of 0x34 and 0x5 was 0x104, but only 04 kept)
In that case, the other multiplications did leave data in result (not all overflowed), but since we only look at the high byte, the rest gets ignored.
If you keep doing this math step by step and adding the results, you'll find that the high byte ends up the correct answer to the four individual multiplications you expected (mod 256); flip the order, and it won't work out that way.
The advantage to putting all the results in that high byte is that it allows you to use swap_bytes to move it cheaply to the low byte, and read the value directly (no need to even mask it on many architectures).

Maximum bit-width to store a summation of M n-bit binary numbers

I am trying to find the formula to calculate the maximum bit-width required to contain a sum of M n-bit unsigned binary numbers. Thanks!
The maximum bit-width needed should be ceil(log_2(M * (2^n - 1))).
Edit: Thanks to #MBurnham I realize now that it should be floor(log_2(M * (2^n - 1))) + 1 instead.
Assuming positive integers, you need floor(log2(x)) + 1 bits to store x. and the largest value the sum of m n-bit numbers can produce would be m * 2^n.
So I believe the formula should be
floor(log2(m * 2^n)) + 1
bits.
If I add 2 numbers the I need 1 bit more than the wider of the 2 numbers to store the result. So, if I add 2 n-bit numbers, I need n+1 bits to store the result.
if I add another n-bit number, I need (n+1)+1 bits to store the result (that's 3 n-bit numbers added so far)
if I add another n-bit number, I need ((n+1)+1)+1 bits to store the result (that's 4 n-bit numbers added so far)
if I add another n-bit number, I need (((n+1)+1)+1)+1 bits to store the result (that's 5 n-bit numbers added so far)
So, I think your formula is
n + M - 1

Lookup table for counting number of set bits in an Integer

Was trying to solve this popular interview question - http://www.careercup.com/question?id=3406682
There are 2 approaches to this that i was able to grasp -
Brian Kernighan's algo -
Bits counting algorithm (Brian Kernighan) in an integer time complexity
Lookup table.
I assume when people say use a lookup table, they mean a Hashmap with the Integer as key, and the count of number of set bits as value.
How does one construct this lookup table? Do we use Brian's algo to to count the number of bits the first time we encounter an integer, put it in hashtable, and next time we encounter that integer, retrieve value from hashtable?
PS: I am aware of the hardware and software api's available to perform popcount (Integer.bitCount()), but in context of this interview question, we are not allowed to use those methods.
I was looking for Answer everywhere but could not get the satisfactory explanation.
Let's start by understanding the concept of left shifting. When we shift a number left we multiply the number by 2 and shifting right will divide it by 2.
For example, if we want to generate number 20(binary 10100) from number 10(01010) then we have to shift number 10 to the left by one. we can see number of set bit in 10 and 20 is same except for the fact that bits in 20 is shifted one position to the left in comparison to number 10. so from here we can conclude that number of set bits in the number n is same as that of number of set bit in n/2(if n is even).
In case of odd numbers, like 21(10101) all bits will be same as number 20 except for the last bit, which will be set to 1 in case of 21 resulting in extra one set bit for odd number.
let's generalize this formual
number of set bits in n is number of set bits in n/2 if n is even
number of set bits in n is number of set bit in n/2 + 1 if n is odd (as in case of odd number last bit is set.
More generic Formula would be:
BitsSetTable256[i] = (i & 1) + BitsSetTable256[i / 2];
where BitsetTable256 is table we are building for bit count. For base case we can set BitsetTable256[0] = 0; rest of the table can be computed using above formula in bottom up approach.
Integers can directly be used to index arrays;
e.g. so you have just a simple array of unsigned 8bit integers containing the set-bit-count for 0x0001, 0x0002, 0x0003... and do a look up by array[number_to_test].
You don't need to implement a hash function to map an 16 bit integer to something that you can order so you can have a look up function!
To answer your question about how to compute this table:
int table[256]; /* For 8 bit lookup */
for (int i=0; i<256; i++) {
table[i] = table[i/2] + (i&1);
}
Lookup this table on every byte of the given integer and sum the values obtained.

hashing string to an int between 0-19

I was wondering how I would hash a string value (ex: "myObjectName") to int values between 0-19
I'm guaranteed to have no more than 20 unique string values.
Thanks
Adding my comment as an answer as suggested:
I would suggest that hashing isn't the exact path you should follow here.
One method would be using a dictionary (like the built in data structure in Python) that has a key-value pair of your string and a number from 1-20 (or 0 - 19)
As you read or see each string, you could check to see if a dictionary entry exists, if so, do whatever needs to be done, if not, create a new dictionary entry with the next available number (generated by looking at the number of existing entries in the dictionary).
You could use any sort of hashing you like, but in this case, you could do with adding up the ASCII values (or unicode code point, if you like) of the characters, and apply modulo 20 to the result. It will give you a number from 0 to 19.
But this is nog guaranteed to result in a number that uniquely identifies your 20 strings. No hashing algorithm will guarantee that hashing a collection of 20 random strings will result in a unique code for each string..
Do md5 sum, convert to number and do modulo 20. E.g. in PHP:
hexdec(substr(md5("hello"), 1, 8)) % 20
The substr() is needed so that the number can be converted to integer.

Resources