Why test for these numbers (2^16, 2^31 ....) - string

Going through Elisabeth Hendrickson's test heuristics cheat sheet, I see the following recommendations:
Numbers: 32768 (2^15), 32769 (2^15 + 1), 65536 (2^16), 65537 (2^16 + 1), 2147483648 (2^31), 2147483649 (2^31 + 1), 4294967296 (2^32), 4294967297 (2^32 + 1)
Does someone know the reason for testing all these cases? My gut feeling goes with the data type the developer may have used (integer, long, double...).
Similarly, with Strings:
Long (255, 256, 257, 1000, 1024, 2000, 2048 or more characters)

These represent boundaries
Integers
2^15 is the boundary of signed 16-bit integers (maximum value 32767 = 2^15 - 1)
2^16 is the boundary of unsigned 16-bit integers (maximum value 65535 = 2^16 - 1)
2^31 is the boundary of signed 32-bit integers (maximum value 2147483647 = 2^31 - 1)
2^32 is the boundary of unsigned 32-bit integers (maximum value 4294967295 = 2^32 - 1)
Testing values close to these common boundaries tests whether overflow is correctly handled: arithmetic overflow in the case of the various integer types, or buffer overflow in the case of long strings written into a fixed-size buffer.
Strings
255/256 is at the bounds of numbers that can be represented in 8 bits
1024 is at the bounds of numbers that can be represented in 10 bits
2048 is at the bounds of numbers that can be represented in 11 bits
I suspect that the recommendations such as 255, 256, 1000, 1024, 2000, 2048 are based on experience/observation that some developers may allocate a fixed-size buffer that they feel is "big enough no matter what" and fail to check input. That attitude leads to buffer overflow attacks.
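A hedged C sketch of the kind of bug those boundary lengths are meant to expose; the buffer size and function name here are made up for illustration:

#include <stdio.h>
#include <string.h>

#define NAME_BUF 256   /* hypothetical "big enough" fixed-size buffer */

int store_name(const char *input) {
    char buf[NAME_BUF];
    /* A 255-character input fits (255 chars + terminating NUL = 256 bytes).
     * A 256-character input does not: an unchecked strcpy would write 257
     * bytes and overflow buf, which is exactly what 255/256/257 tests catch. */
    if (strlen(input) >= NAME_BUF) {
        return -1;                      /* reject instead of overflowing */
    }
    strcpy(buf, input);
    printf("stored %zu characters\n", strlen(buf));
    return 0;
}

int main(void) {
    char just_fits[256], too_long[258];
    memset(just_fits, 'a', 255); just_fits[255] = '\0';  /* 255 chars: accepted */
    memset(too_long,  'a', 256); too_long[256]  = '\0';  /* 256 chars: rejected */
    store_name(just_fits);
    printf("256-char input accepted? %s\n",
           store_name(too_long) == 0 ? "yes" : "no");
    return 0;
}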

These are boundary values close to the maximum signed short, the maximum unsigned short, and the same for int. The reason to test them is to find bugs that occur close to the border values of typical data types.
For example, your code uses a signed short and you have tests that exercise values just below and just above the maximum of that type. If the first test passes and the second one fails, you can easily tell that overflow/truncation on the short was the reason.
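A minimal C sketch of that scenario, assuming a common two's-complement machine:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    int32_t inputs[] = { 32767, 32768, 32769 };   /* 2^15 - 1, 2^15, 2^15 + 1 */
    for (int i = 0; i < 3; i++) {
        /* Conversion of an out-of-range value to a signed 16-bit type is
         * implementation-defined; on two's-complement systems it wraps. */
        int16_t stored = (int16_t)inputs[i];
        printf("%d -> %d %s\n", (int)inputs[i], (int)stored,
               stored == inputs[i] ? "(ok)" : "(overflow/truncation)");
    }
    return 0;
}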

Those numbers are border cases on either side of the fence (-1, 0, and +1) around "whole and round" computer numbers, which are always powers of 2. Those powers of 2 are not random either: they represent the standard choices of integer width, namely 8, 16, 32, and so on bits.

Related

OverflowError: cannot fit 'int' into an index-sized integer while multiplying a string with a large int

k = int(input())
string = 'codeforce'
real = string + k*'s'
print(real)
I'm trying to multiply a string 's' by an int of 10^16.
But it gives me OverflowError: cannot fit 'int' into an index-sized integer
How can I get rid of it?
The result of multiplying a string by an integer k is that string repeated k times (in either order of the multiplication). For example:
>>> 3 * "s"
'sss'
>>> "s" * 3
'sss'
In this case, you have requested a string of length 10^16. This would require about 10 petabytes of virtual address space to store. Even if your Python implementation in principle allowed you to create an object of that size, it is extremely unlikely that your machine's physical hardware would allow it (even allowing for the use of swap space).
The exact maximum is likely to be implementation-dependent. For example, in Python running on x86_64 Linux, an OverflowError is raised when k is 2^63 or more, i.e. when the length cannot be stored in a 64-bit signed long. For values below that where memory would nonetheless be exhausted, a MemoryError is raised instead.
In your Python implementation, the cutoff seems to be lower than 2^63 (which is approximately 9e18). It is therefore possible that a 32-bit signed int is being used as the "index-sized integer", which would imply a maximum string length of about 2 GB. That is within the range of physical memory that is plausible on a real machine, so this is a limit that may actually matter in practice; in that case, you would need to redesign the code to reduce the length of the strings it builds.
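For a C-flavoured picture of what an "index-sized" limit means, here is a rough sketch assuming a 64-bit signed index type; the constants are illustrative, not CPython's actual source:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* On a 64-bit build, a signed index type can describe lengths
     * up to 2^63 - 1 (illustrative stand-in for the real limit). */
    int64_t max_len   = INT64_MAX;
    int64_t requested = 10000000000000000;  /* 10^16 characters requested */
    printf("fits in the index type: %s\n",
           requested <= max_len ? "yes" : "no");
    /* Fitting in the index type is not enough: 10^16 bytes is roughly
     * 10 PB, so allocation would still fail (MemoryError) long before. */
    return 0;
}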

In PBKDF2 is INT (i) signed?

Page 11 of RFC 2898 states that for U_1 = PRF (P, S || INT (i)), INT (i) is a four-octet encoding of the integer i, most significant octet first.
Does that mean that i is a signed value and if so what happens on overflow?
Nothing says that it would be signed. The fact that dkLen is capped at (2^32 - 1) * hLen suggests that it's an unsigned integer, and that it cannot roll over from 0xFFFFFFFF (2^32 - 1) to 0x00000000.
Of course, PBKDF2(MD5) wouldn't hit 2^31 until you've asked for 34,359,738,368 bytes. That's an awful lot of bytes.
SHA-1: 42,949,672,960
SHA-2-256 / SHA-3-256: 68,719,476,736
SHA-2-384 / SHA-3-384: 103,079,215,104
SHA-2-512 / SHA-3-512: 137,438,953,472
Since the .NET implementation (in Rfc2898DeriveBytes) is an iterative stream, it could be polled for 32 GB via a (long) series of calls. Most platforms expose PBKDF2 as a one-shot, so you'd need to give them a memory range of 32 GB (or more) to identify whether they had an error that far out. So even if most platforms get the sign bit wrong... it doesn't really matter.
PBKDF2 is a KDF (key derivation function), so used for deriving keys. AES-256 is 32 bytes, or 48 if you use the same PBKDF2 to generate an IV (which you really shouldn't). Generating a private key for the ECC curve with a 34,093 digit prime is (if I did my math right) 14,157 bytes. Well below the 32GB mark.
i ranges from 1 to l = CEIL (dkLen / hLen), and dkLen and hLen are positive integers. Therefore, i is strictly positive.
You can, however, store i in a signed, 32-bit integer type without any special handling. If i overflows the signed range (increments from 0x7FFFFFFF to 0x80000000), it will continue to be encoded correctly and continue to increment correctly: with two's-complement encoding, the bit-level results of addition, subtraction, and multiplication are the same whether the values are interpreted as signed or unsigned.
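A small C sketch of that encoding; the helper name int_be32 is made up here, and the unsigned cast is what keeps the wrapped counter well-defined:

#include <stdio.h>
#include <stdint.h>

/* INT(i): the four-octet, most-significant-octet-first encoding of the
 * block counter, as described in the RFC text quoted above. */
static void int_be32(uint32_t i, unsigned char out[4]) {
    out[0] = (unsigned char)(i >> 24);
    out[1] = (unsigned char)(i >> 16);
    out[2] = (unsigned char)(i >> 8);
    out[3] = (unsigned char)(i);
}

int main(void) {
    unsigned char buf[4];

    int_be32(1, buf);   /* the counter starts at 1 */
    printf("INT(1)     = %02x %02x %02x %02x\n", buf[0], buf[1], buf[2], buf[3]);

    /* A signed 32-bit counter produces the same octets once it passes
     * 0x7FFFFFFF: converting through uint32_t keeps the two's-complement
     * bit pattern, so the encoding keeps counting up as 0x80000000, ... */
    int32_t signed_counter = INT32_MAX;
    int_be32((uint32_t)signed_counter + 1u, buf);
    printf("INT(2^31)  = %02x %02x %02x %02x\n", buf[0], buf[1], buf[2], buf[3]);
    return 0;
}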

fixed point integer division ("fractional division") algorithm

The Honeywell DPS8 computer (and others) have/had a "divide fractional" instruction:
"This instruction divides a 71-bit fractional dividend (including sign) by a 36-bit
fractional divisor (including sign) to form a 36-bit fractional quotient (including
sign) and a 36-bit fractional remainder (including sign). Bit 35 of the remainder
corresponds to bit 70 of the dividend. The remainder sign is equal to the dividend
sign unless the remainder is zero."
So, as I understand it, this is integer division with the decimal point way over on the left.
.qqqqq / .ddddd
(I did scaled integer math in FORTH back in the day, but my memories of the techniques are lost in fog of time.)
To implement this instruction in a DPS8 emulator, I believe I need to start by creating two 70-bit numbers: the 71-bit dividend less its sign bit, and the 36-bit divisor less its sign bit and shifted 35 bits to the left so that the decimal points line up.
I think I can then form the remainder and quotient (in C) with '%' and '/', but I am unsure if those results need to be normalized (i.e. shifted).
I found an example of a "shift and subtract" algorithm ("Computer Arithmetic", slide 10), but I would prefer a more straightforward implementation.
Am I on the right track, or is the solution more nuanced? (Fixing up the signs and detecting errors have been elided here; those stages are well documented. The actual division is the issue.) Any pointers to C implementations of this kind of hardware emulation would be particularly helpful.
I do not have the definitive answer, but as a division is a division, you might find it helpful to look at some basic division routines.
Imagine that you have a 32-bit variable and you want an 8-bit fractional part.
You then have an integer part between 0 and 16777215, and a fractional part which is between 0 and 255.
0xiiiiiiff (where i is the integer part, f is the fractional part).
Imagine you have a 24-bit dividend (numerator), say the value 3, and a 24-bit divisor (denominator), say the value 13.
As we quickly will see, 3/13 is greater than zero and less than one. That means our fractional part is nonzero, but our integer part is filled completely with zeros.
So to do the above division using a standard divide function, we'll just shift the dividend left by N bits; that gives us N bits of precision in the fractional part.
quotient_fp = (dividend_ip << 8) / divisor_ip
So far, so good.
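As a quick sanity check, the 3/13 example above can be tried directly in C with plain integer arithmetic:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t dividend_ip = 3, divisor_ip = 13;
    /* 8 bits of fractional precision: shift the dividend up before dividing. */
    uint32_t quotient_fp = (dividend_ip << 8) / divisor_ip;
    printf("3/13 ~= %u/256 = %f (exact: %f)\n",
           (unsigned)quotient_fp, quotient_fp / 256.0, 3.0 / 13.0);
    return 0;
}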
But what if we want the divisor to have a fractional part as well?
If we just shift the divisor up by 8, then we'll have a problem:
(dividend_ip << 8) / (divisor_ip << 8)
- because we'll obviously lose our fractional part of the quotient (result).
Instead, we'll need to shift the dividend up by as many bits as we shift the fractional part up...
((dividend_ip << 8) << 8) / (divisor_ip << 8)
...That makes it...
(dividend_ip << (dividend_precision + divisor_precision)) / (divisor_ip << divisor_precision)
Now, let's put our fractional part math into the picture...
(((dividend_ip << dividend_precision) | dividend_fp) << divisor_precision) / ((divisor_ip << divisor_precision) | divisor_fp)
Our quotient's precision will be the same as dividend_precision, which is 8 bits.
Unfortunately, this eats a lot of bits.
Fortunately, in your case, the integer part is not important, so you'll have a lot of room for the fractional part.
Let's increase the precision to 15 bits; this can be tested using normal 32-bit integers...
(((dividend_ip << 15) | dividend_fp) << 15) / ((divisor_ip << 15) | divisor_fp)
Our quotient will now have a 15-bit precision.
OK, but since you're supplying only the fractional parts and the integer part is always zero anyway, you should be able to just toss the integer part. That makes it....
(((dividend_ip << 16) | dividend_fp) << 16) / ((divisor_ip << 16) | divisor_fp)
... reduced to ...
(dividend_fp << 16) / divisor_fp
... now let's use a 64-bit integer instead, we can get 32 bits of precision in the quotient...
(dividend_fp << 32) / divisor_fp
... some compilers support a 128-bit integer type (GCC and Clang provide __int128 on 64-bit platforms), so you might be able to use that type in order to get 128 bits easily. I have not tried it, but I've come across info on the Web about it; search for __int128, and you should find out how.
If you get the 128-bit type to work, you could make the dividend 128 bits, the divisor 64 bits and the quotient 64 bits...
quotient_fp = ((dividend_fp << 36) / divisor) >> (64 - 36)
... in order to get 36 bits precision.
Notice that since the result is in the top 36 bits of the quotient, the quotient needs to be shifted down (64 - 36) = 28 bits.
You could even go as high as (128 - 36) = 92 bits precision:
(dividend_fp << 92) / divisor
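Applied back to the original instruction, here is a minimal C sketch of the unsigned-magnitude core, under the reading in the question (binary point at the far left of both operands) and using the GCC/Clang __int128 extension; sign fix-up and the divide-check for an oversized quotient are assumed to happen elsewhere. Because the 70-bit dividend already carries 35 more fraction bits than the divisor, the fractional quotient is simply the integer quotient of the two magnitudes, and the remainder stays aligned with the low 35 bits of the dividend:

#include <stdio.h>
#include <stdint.h>

typedef unsigned __int128 u128;

/* dividend : 70-bit magnitude, value dividend  / 2^70
 * divisor  : 35-bit magnitude, value divisor   / 2^35
 * quotient : 35-bit magnitude, value quotient  / 2^35
 * remainder: value remainder / 2^70 ("bit 35 corresponds to bit 70") */
static void divide_fractional(u128 dividend, uint64_t divisor,
                              uint64_t *quotient, uint64_t *remainder) {
    *quotient  = (uint64_t)(dividend / divisor);
    *remainder = (uint64_t)(dividend % divisor);
}

int main(void) {
    u128     d = (u128)1 << 68;      /* 0.25 as a 70-bit fraction */
    uint64_t v = (uint64_t)1 << 34;  /* 0.50 as a 35-bit fraction */
    uint64_t q, r;
    divide_fractional(d, v, &q, &r);
    /* Expect q = 2^34 (i.e. 0.5) and r = 0. */
    printf("quotient = 0x%llx, remainder = 0x%llx\n",
           (unsigned long long)q, (unsigned long long)r);
    return 0;
}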
Now that you probably (hopefully) have a solution, I would recommend getting familiar with low-level binary division again (since you've been there a while ago).
The best sources seem to be descriptions of how hardware divides binary numbers: microcontrollers, CPUs and the like. Assembly-language dividers are also good for getting to know the inner workings; 32-bit divide routines that use bit-shifting are often very good sources.
Over time, I've come across a very clever implementation in ARM assembly language. Normally I wouldn't post references or assembly-language examples, but considering that the code is very small, I think it will be alright.
Taken from A Fast Hi Precision Fixed Point Divide
r0 is the numerator (dividend)
r2 is the denominator (divisor)
mov r1,#0
adds r0,r0,r0
.rept 32
adcs r1,r2,r1,lsl#1
subcc r1,r1,r2
adcs r0,r0,r0
.endr
r0 is the quotient (result)
r1 is the remainder (rest, modulo result)
The above routine contains the basics for an unsigned divide.
I hope this information will be useful. It may contain errors, as I have not tested any code or example mentioned. I'm confident, though, that it's not all wrong. ;)

Verilog shift extending result?

We have the following line of code, and we know that regF is 16 bits long, regD is 8 bits long, regE is 8 bits long, and regC is 3 bits long and assumed unsigned:
regF <= regF + ( ( regD << regC ) & { 16{ regE [ regC ]} }) ;
My question is : will the shift regD << regC assume that the result is 8 bits or will it extended to 16 bits because of the bitwise & with the 16 bit vector?
The shift sub-expression itself has a width of 8 bits; the bit width of a shift is always the bit width of the left operand (see table 5-22 in the 2005 LRM).
However, things get more complicated after that. The shift sub-expression appears as an operand of the & operator. The bit length of the & expression is the bit-length of the largest of the 2 operands; in this case, 16 bits.
This sub-expression now appears as an operand of the + expression; the result width of this expression is again the maximum width of the two operands of the +, which is again 16.
We now have an assignment. This is not technically an operand, but the same rules are used; in this case, the LHS is also 16 bits, so the size of the RHS is unaffected.
We now know that the overall expression size is 16 bits; this size is propagated back down to the operands, except the 'self-determined' operands. The only self-determined operand here is the RHS of the shift expression (regC), which isn't extended.
The signedness of the expressions is now determined. Propagation happens in the same way. The overall effect here, since we have at least one unsigned operand, is that the expression is unsigned, and all operands are coerced to unsigned. So, all (non-self-determined) operands are coerced to unsigned 16-bit before any operation is actually carried out.
So, in other words, the shift sub-expression actually ends up as a 16-bit shift, even though it appears to be 8-bit at first sight. Note that it's not 16-bit because the RHS of the & is 16-bit, but because the entire sizing process - the width propagation up the expression - came up with an answer of 16. If you'd assigned to an 18-bit reg, instead of the 16-bit regF, then your shift would have been extended to 18 bits.
This is all very complicated and non-intuitive, at least if you have any experience of mainstream languages. It's explained (more or less) in sections 5.4 and 5.5 of the 2005 LRM. If you want any advice, then never write expressions like this. Write defensively - break everything down to individual sub-expressions, and then combine the sub-expressions.

OpenJDK's rehashing mechanism

Found this code on http://www.docjar.com/html/api/java/util/HashMap.java.html after searching for a HashMap implementation.
static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
Can someone shed some light on this? The comment tells us why this code is here, but I would like to understand how it improves a bad hash value and how it guarantees that the positions have a bounded number of collisions. What do these magic numbers mean?
In order for it to make any sense, it has to be combined with an understanding of how HashMap allocates things into buckets. This is the trivial function by which a bucket index is chosen:
static int indexFor(int h, int length) {
    return h & (length - 1);
}
So you can see that, with a default table size of 16, only the 4 least significant bits of the hash actually matter for allocating buckets! (16 - 1 = 15, which masks the hash with 1111b.)
This could clearly be bad news if your hashCode function returned:
10101100110101010101111010111111
01111100010111011001111010111111
11000000010100000001111010111111
//etc etc etc
Such a hash function would not likely be "bad" in any way that is visible to its author. But if you combine it with the way the map allocates buckets, boom, MapFail(tm).
If you keep in mind that h is a 32-bit number, those are not magic numbers at all. The function systematically XORs the most significant bits of the number rightward into the least significant bits, so that "differences" that occur anywhere across the number, when viewed in binary, become visible down in the least significant bits.
Collisions become bounded because the number of distinct values that share the same relevant low bits is now much smaller: any differences anywhere in the binary representation are folded into the bits that matter for bucketing.
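A hedged C translation of the two Java snippets above (renamed supplemental_hash and index_for here), showing that the three example hash codes all land in bucket 15 before mixing but no longer collide in a single bucket after it:

#include <stdio.h>
#include <stdint.h>

static uint32_t supplemental_hash(uint32_t h) {      /* mirrors hash(int h) */
    h ^= (h >> 20) ^ (h >> 12);
    return h ^ (h >> 7) ^ (h >> 4);
}

static uint32_t index_for(uint32_t h, uint32_t length) {  /* mirrors indexFor */
    return h & (length - 1);
}

int main(void) {
    /* The three hash codes from the answer: they differ only in their
     * upper bits and share the same low bits. */
    uint32_t hashes[] = { 0xACD55EBFu, 0x7C5D9EBFu, 0xC0501EBFu };
    for (int i = 0; i < 3; i++) {
        printf("raw bucket %2u, mixed bucket %2u\n",
               index_for(hashes[i], 16),
               index_for(supplemental_hash(hashes[i]), 16));
    }
    return 0;
}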

Resources