Which seed when using `torch.manual_seed(seed)`? - python-3.x

I have trained a model on ImageNet. I got a new GPU and I also want to train the same model on that different GPU.
I want to compare whether the outcome differs, and therefore I want to use torch.manual_seed(seed).
After reading the docs (https://pytorch.org/docs/stable/generated/torch.manual_seed.html) it is still unclear which number I should take for the parameter seed.
How do I choose the seed parameter? Can I take 10, 100, or 1000, or even more or less, and why?

How do I choose the seed parameter? Can I take 10, 100, or 1000, or even more or less, and why?
The PyTorch doc page you are pointing to does not mention anything special, beyond stating that the seed is a 64-bit integer.
So yes, 1000 is OK. As you expect from a modern pseudo-random number generator, the statistical properties of the pseudo-random sequence you are relying on do NOT depend on the choice of seed.
Since for most runs you will probably reuse the seed from a previous run, a practical approach is to take the random seed as a command-line parameter. In those cases where you have to come up with a brand new seed, you can pick one off the top of your head, or just roll dice to get one.
The important thing is to have a record of the seed used for each run.
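For instance, here is a minimal sketch of the command-line pattern (my illustration, not from the PyTorch docs; the flag name --seed and the default 1000 are arbitrary):

import argparse
import torch

parser = argparse.ArgumentParser()
# Arbitrary flag name and default; any 64-bit integer is acceptable
parser.add_argument('--seed', type=int, default=1000,
                    help='integer seed passed to torch.manual_seed')
args = parser.parse_args()

torch.manual_seed(args.seed)     # fix the PyTorch RNG state for this run
print('Using seed:', args.seed)  # keep a record of the seed in the run log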
OK, but ...
That being said, a lot of people seem uncomfortable with the task of just “guessing” some arbitrary number. There are (at least) two common expedients to get some seed in a seemingly proper “random” fashion.
The first one is to use the operating system's official source for genuinely (not pseudo) random bits. In Python, this is typically rendered as os.urandom(). So to get a seed in Python as a 64-bit integer, you could use code like this:
import functools
import os

# returns a list of 8 random small integers between 0 and 255
def get8RandomBytesFromOS():
    r8 = os.urandom(8)  # official OS entropy source
    byteCodes = list(map(ord, r8.decode('Latin-1')))  # type conversion
    return byteCodes

# make a single long integer from a list of 8 integers between 0 and 255
def getIntFromBytes(bs):
    # force highest bit to 0 to avoid overflow
    bs2 = [bs[0] if (bs[0] < 128) else (bs[0] - 128)] + bs[1:8]
    num = functools.reduce(lambda acc, n: acc*256 + n, bs2)
    return num

# Main user entry point:
def getRandomSeedFromOS():
    rbs8 = get8RandomBytesFromOS()
    return getIntFromBytes(rbs8)
A second common expedient is to hash a string containing the current date and time, possibly with some prefix. In Python, the time includes microseconds. When a human user launches a script, the microseconds part of the launch time can be considered random. One can use code like this, using a version of SHA (Secure Hash Algorithm):
import hashlib
import datetime

def getRandomSeedFromTime():
    prefix = 'dl.cs.univ-stackoverflow.edu'
    timeString1 = str(datetime.datetime.now())
    timeString2 = prefix + ' ' + timeString1
    hashValue = hashlib.sha256(timeString2.encode('ascii'))
    hashBytes = hashValue.digest()[24:32]  # 8 rightmost bytes of the hash
    byteCodes = list(map(ord, hashBytes.decode('Latin-1')))  # type conversion
    return getIntFromBytes(byteCodes)  # helper from the previous snippet
But, again, 1000 is basically OK. The idea of hashing the time string, instead of just taking the number of microseconds since the Epoch as the seed, probably comes from the fact that some early random number generators used their seed as an offset into a common and not so large global sequence. Hence, if your program naïvely took two seeds in rapid sequence without hashing, there was a risk of overlap between your two sequences. Fortunately, pseudo-random number generators now are much better than what they used to be.
(Addendum - taken from comment)
Note that the seed is a peripheral thing. The important thing is the state of the automaton, which can be much larger than the seed, hence the very word “seed”. The commonly used Mersenne Twister scheme has ~20,000 bits of internal state, and you cannot ask the user to provide 20,000 bits. There are sometimes ill-behaved initial states, but it is always the responsibility of the random number library to expand somehow the user-provided arbitrary seed into a well-behaved initial state.

Related

How to efficiently perform billions of Bernoulli extractions using numpy?

I am working on a thesis about epidemiology, and I have to simulate an SI epidemic on a temporal network. At each time step there is a Bernoulli(beta) extraction for each infected-susceptible pair, and I am using np.random.binomial(size=whatever, n=1, p=beta) to make the computer decide. Now, I have to simulate the epidemic on the same network, making it start from each one of the nodes. This should be repeated K times to get statistically relevant results for each node, and, since the temporal network is stochastic too, everything should be repeated NET_REALIZATION times.
So, in a network with N=100, if K=500 and NET_REALIZATION=500, the epidemic should be repeated 25,000,000 times. If T=100, that means 2,500,000,000 extractions per set of S-I couples (which of course varies in time). If beta is small, which is often the case, this leads to a very time-consuming computation.
Given that, on my computer, one Bernoulli extraction takes 3.63 µs, this means I have to wait hours to get any results, which is really limiting the progress of my thesis.
The problem is that more than half of the time is spent just on random extractions.
I should use numpy, since the results of the extractions interact with other data structures. I tried numba, but it didn't seem to improve the extraction speed.
Is there a faster way to get the same results? I was thinking about doing one very, very big extraction up front, something like 10^12 extractions of 0s and 1s, and just importing a part of them for each different simulation (this would be repeated for several values of beta), but I wonder if there's a smarter move.
Thanks for the help.
If you can express your betas as increments of 2^-N (for example, increments of 1/256 if N is 8), then extract random N-bit chunks and determine whether each chunk is less than beta * 2^N. This works better if 32 is evenly divisible by N.
Note that numpy.random.uniform produces random floating-point numbers, and is expected to be slower than producing random integers or bits. This is especially because generating random floating-point numbers depends on generating random integers — not the other way around.
The following is an example of how this idea works.
import numpy

# Fixed seed for demonstration purposes
rs = numpy.random.RandomState(777778)
# Generate 10 integers in [0, 256)
ri = rs.randint(0, 256, 10)
# Now each integer x can be expressed, say, as a Bernoulli(5/256)
# variable which is 1 if x < 5, and 0 otherwise. I haven't tested
# the following, which is similar to an example you gave in a
# comment.
rbern = (ri < 5) * 1
If you can use NumPy 1.17 or later, the following alternative exists:
import numpy
rs = numpy.random.default_rng()
ri = rs.integers(0, 256, 10)
Note also that NumPy 1.17 introduces a new random number generation system alongside the legacy one. Perhaps it has better performance generating Bernoulli and binomial variables than the old one, especially because its default RNG, PCG64, is lighter-weight than the legacy system's default, Mersenne Twister. The following is an example.
import numpy
beta = 5.0/256
rs = numpy.random.default_rng()
rbinom = rs.binomial(10, beta)

Excel VBA understanding the randomize statement

I am working on a little program which generates standard, normally distributed numbers, given a source of uniformly distributed random numbers. Therefore, I need to generate a bunch of random numbers. I decided to use the Rnd function, as the program should be as fast as possible (so there is no extra seed function I'd like to use).
Doing some research, I found that the Rnd function works best when the Randomize statement is used immediately before it. But I don't understand the description of the optional number argument. I understood that if I don't give the Randomize statement any argument, it will use the system timer value as the new seed value.
Can anyone explain to me what the optional number actually does to the function? Is there a difference between using Randomize(1) and Randomize(99), or even Randomize("blabla")? I'd like to understand the theory behind this optional input number. Thank you!
A seed is used to initialize a pseudorandom number generator; you can think of it as the starting point for the sequence of random numbers. If the seed changes from run to run, the generated sequences differ, which is why the default is to use the current system time (as it changes continuously).
From the remarks in the MSDN article you posted:
Randomize uses number to initialize the Rnd function's random-number generator, giving it a new seed value. If you omit number, the value returned by the system timer is used as the new seed value.
So, if you specify the argument, you will always have the same seed, and thus the same sequence of numbers.
If Randomize is not used, the Rnd function (with no arguments) uses the same number as a seed the first time it is called, and thereafter uses the last generated number as a seed value.
Here the last random number generated is used as the seed for the next call, so the sequence simply continues deterministically.
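The same mechanics are easy to demonstrate in Python (an analogue of my own, since the seeding behaviour is language-independent): a fixed seed reproduces the same sequence, while seeding from the clock/OS does not.

import random

random.seed(99)  # fixed seed, analogous to Randomize(99)
print([random.randint(1, 6) for _ in range(3)])  # same three values every run

random.seed(99)  # re-seeding with the same value restarts the sequence
print([random.randint(1, 6) for _ in range(3)])  # identical to the line above

random.seed()    # no argument: seeded from OS entropy or the clock,
                 # analogous to a bare Randomize
print([random.randint(1, 6) for _ in range(3)])  # differs from run to run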
To quote from a very similar question on CrossValidated
Most pseudo-random number generators (PRNGs) are build (sic) on algorithms involving some kind of recursive method starting from a base value that is determined by an input called the "seed". The default PRNG in most statistical software (R, Python, Stata, etc.) is the Mersenne Twister algorithm MT19937, which is set out in Matsumoto and Nishimura (1998). This is a complicated algorithm, so it would be best to read the paper on it if you want to know how it works in detail. In this particular algorithm, there is a recurrence relation of degree $n$, and your input seed is an initial set of vectors $x_0, x_1, \ldots, x_{n-1}$. The algorithm uses a linear recurrence relation that generates:
$$x_{n+k} = f(x_k, x_{k+1}, x_{k+m}, r, A)$$
where $1 \le m \le n$ and $r$ and $A$ are objects that can be specified as parameters in the algorithm. Since the seed gives the initial set of vectors (and given other fixed parameters for the algorithm), the series of pseudo-random numbers generated by the algorithm is fixed. If you change the seed then you change the initial vectors, which changes the pseudo-random numbers generated by the algorithm. This is, of course, the function of the seed.
Now, it is important to note that this is just one example, using the MT19937 algorithm. There are many PRNGs that can be used in statistical software, and they each involve different recursive methods, and so the seed means a different thing (in technical terms) in each of them. You can find a library of PRNGs for R in this documentation, which lists the available algorithms and the papers that describe these algorithms.
The purpose of the seed is to allow the user to "lock" the pseudo-random number generator, to allow replicable analysis. Some analysts like to set the seed using a true random-number generator (TRNG) which uses hardware inputs to generate an initial seed number, and then report this as a locked number. If the seed is set and reported by the original user then an auditor can repeat the analysis and obtain the same sequence of pseudo-random numbers as the original user. If the seed is not set then the algorithm will usually use some kind of default seed (e.g., from the system clock), and it will generally not be possible to replicate the randomisation.
As your quote in the question shows, the VBA Randomize statement sets a new seed for the Rnd function, using either the system time as the seed or, if you provide an argument, that number as the new seed. If you don't call Randomize before calling Rnd, the Rnd function uses the previous number it returned as the new seed, so you may keep getting the same sequence of numbers.
I also recommend having a look at this answer.

Are the odds of a cryptographically secure random number generator generating the same uuid small enough that you do not need to check for uniqueness?

I'm using this with a length of 20 for a uuid. Is it common practice not to check whether a generated uuid is already in use, when it serves as a persistent unique value?
Or is it best practice to verify that it's not already being used by some part of your application, if it's essential to retain uniqueness?
You can calculate the probability of a collision using this formula from Wikipedia:
$$n(p; H) \approx \sqrt{2H \ln\frac{1}{1-p}}$$
where n(p; H) is the smallest number of samples you have to choose in order to find a collision with a probability of at least p, given H possible outputs with equal probability.
The same article also provides Python source code that you can use to calculate this value:
from math import log1p, sqrt

def birthday(probability_exponent, bits):
    probability = 10. ** probability_exponent
    outputs = 2. ** bits
    return sqrt(2. * outputs * -log1p(-probability))
So if you're generating UUIDs with 20 bytes (160 bits) of random data, how sure can you be that there won't be any collisions? Let's suppose you want there to be a probability of less than one in a quintillion (10^-18) that a collision will occur:
>>> birthday(-18,160)
1709679290002018.5
This means that after generating about 1.7 quadrillion UUIDs with 20 bytes of random data each, there is only a one in a quintillion chance that two of these UUIDs will be the same.
Basically, 20 bytes is perfectly adequate.
crypto.randomBytes is safe enough for most applications. If you want it to be completely secure, use a length of 16 bytes; at that length a collision will likely never occur within the next century. And it is definitely not a good idea to check an entire database for duplicates, because the odds are so low that the performance cost outweighs the benefit.
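For what it's worth, you can sanity-check the 16-byte case with the birthday() function from the answer above (my arithmetic, not part of the original answer):

# With 128 random bits, about 2.6e10 (~26 billion) IDs are needed before
# the collision probability reaches one in a quintillion
print(birthday(-18, 128))  # roughly 2.6e10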

After computing the hash, what is the significance of keeping only the last byte of the hash

Problem: to generate test and train sets so as to improve on the generalization error.
Possible solutions:
1. Split instances into 80% train and 20% test, train your model on the train set and test it on the test set. But repeating this again and again will eventually let the model cram the data, because successive random splits will put instances that were chosen for the test set the first time into the train set (random sampling).
This approach can also fail when we fetch an updated dataset.
2. Another approach is to select each instance's most stable feature(s) (possibly a combination) to create a unique and immutable identifier that remains robust even after dataset updates. After selecting one, we could compute a hash of each instance's identifier, keep only the last byte of the hash, and put the instance in the test set if that value is <= 256 * test_ratio. This will ensure that the test set remains consistent across multiple runs, even if the dataset is refreshed.
Question: what is the significance of taking just the last byte of the computed hash?
-----Thanks to Aurélien Géron-------
We need a solution to sample a unique test set even after fetching an updated dataset.
SOLUTION: use each instance's identifier to decide whether or not it should go into the test set (assuming that the instances have a unique and immutable identifier). We could compute a hash of each instance's identifier, keep only the last byte of the hash, and put the instance in the test set if that value is <= 256 * test_ratio, i.e. 51 for a 20% test set.
This ensures that the test set will remain consistent across multiple runs, even if you refresh the dataset. The new test set will contain 20% of the new instances, but it will not contain any instance that was previously in the train set.
First, a quick recap on hash functions:
A hash function f(x) is deterministic, such that if a==b, then f(a)==f(b).
Moreover, if a!=b, then with a very high probability f(a)!=f(b).
With this definition, a function such as f(x) = x % 12345678 (where % is the modulo operator) meets the criterion above, so it is technically a hash function. However, most hash functions go beyond this definition, and they act more or less like pseudo-random number generators, so if you compute f(1), f(2), f(3), ..., the output will look very much like a random sequence of (usually very large) numbers.
We can use such a "random-looking" hash function to split a dataset into a train set and a test set.
Let's take the MD5 hash function, for example. It is a random-looking hash function, but it outputs rather large numbers (128 bits), such as 136159519883784104948368321992814755841.
For a given instance in the dataset, there is a 50% chance that its MD5 hash will be smaller than 2^127 (assuming the hashes are unsigned integers), a 25% chance that it will be smaller than 2^126, and a 12.5% chance that it will be smaller than 2^125. So if I want to split the dataset into a train set and a test set, with 87.5% of the instances in the train set and 12.5% in the test set, then all I need to do is compute the MD5 hash of some unchanging feature of the instances, and put the instances whose MD5 hash is smaller than 2^125 into the test set.
If I want precisely 10% of the instances to go into the test set, then I need to check MD5 < 2^128 * 10 / 100.
This would work fine, and you can definitely implement it this way if you want. However, it means manipulating large integers, which is not always very convenient, especially given that Python's hashlib.md5() function outputs byte arrays, not long integers. So it's simpler to just take one or two bytes of the hash (anywhere you wish) and convert them to a regular integer. If you take just one byte, it will look like a random number from 0 to 255.
If you want 10% of the instances in the test set, you just need to check that the byte is smaller than or equal to 25. It won't be exactly 10%, but 26/256 = 10.15625%, which is close enough. If you want higher precision, you can take 2 or more bytes.
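Here is a minimal sketch of this byte-based split (my own illustration, not Géron's exact code; the use of MD5 and integer identifiers is an assumption):

import hashlib

def in_test_set(identifier, test_ratio=0.2):
    # Hash the unique, immutable identifier and keep only the last byte,
    # which behaves like a random number in [0, 255]
    last_byte = hashlib.md5(str(identifier).encode('ascii')).digest()[-1]
    # About test_ratio of the 256 possible byte values fall below this cutoff
    return last_byte < 256 * test_ratio

# The same identifiers land in the same set on every run
test_ids = [i for i in range(1000) if in_test_set(i)]
print(len(test_ids) / 1000)  # close to 0.2, and stable across runs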

Can you retrieve the original decimal number from the least significant bits of another operation?

I am performing an operation where a function F(k,x) takes two 64-bit values and returns the product of the two numbers. For example:
F(123,231) = 123 x 231 = 28413
The product is then converted into binary and the least significant bits are extracted, i.e. since 28413 = 0110111011111101 in binary, we take 11111101, which is 253 in decimal.
This function is part of a Feistel network in a security setting. When performing a chosen-plaintext attack, we get to the point where we have 253 and 231, but need to figure out 123.
Is there any way that is possible?
Your function is doing F(k,x) = k*x mod 256.
Your question is given F(k,x) and x, can you find k?
When x is odd, there are 2^56 solutions, all of which satisfy k = x^-1 * F(k,x) mod 256. That is, you compute the inverse of x mod 256, multiply it by F(k,x), and each possible 64-bit solution is derived by adding a multiple of 256 to that product.
When x is even, you can't compute the inverse, but you can still determine the solutions using a similar trick. You first need to compute the number of 2s that divide x, say there are t of them, and then divide 2^t out of x, out of F(k,x), and out of 256, and solve the problem from there, i.e. k = (x/2^t)^-1 * (F(k,x)/2^t) mod (256/2^t).
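A small sketch of the odd case in Python (my illustration; pow(x, -1, m), available since Python 3.8, computes the modular inverse):

# Recover k mod 256 from F(k, x) = (k * x) mod 256 when x is odd
k, x = 123, 231
f = (k * x) & 0xFF        # the 8 least significant bits of the product: 253
x_inv = pow(x, -1, 256)   # inverse of x mod 256, exists because x is odd
print((x_inv * f) % 256)  # prints 123: k is pinned down modulo 256
# Any 64-bit key of the form 123 + 256*m is consistent: 2^56 choices of m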
Generally using multiplies in cipher designs is dangerous, especially due to chosen plaintext attacks, because an attacker can make things disappear to simplify his attack. You can find examples of breaking ciphers like that on my blog (see attacks on chaotic hash function and multiprime).
No.
By dropping the most significant bits, the operation is made effectively one-way. In order to recover the 123 you would have to brute-force the function with every possibility until the result was the value you want.
I.e. run F(x, 231) for values of x until the result of F is 253.
That said, knowing one of the two inputs and the output makes it relatively easy to brute-force. It would depend on the number of valid values for x (e.g. is it always a 3-digit number? Always prime? Always odd?).
There may be some other shortcuts, depending on the patterns that multiplying by a particular number produces, but each value will have different patterns. E.g. if it were 9 instead of 231, you would know that the digits of the product always sum to a multiple of 9.
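For completeness, a brute-force version of that recovery (illustrative; it only needs to enumerate the low byte, since only k mod 256 affects the output):

# All byte values of k whose product with 231 ends in the bits 11111101
candidates = [k for k in range(256) if (k * 231) & 0xFF == 253]
print(candidates)  # [123] -- unique mod 256 because 231 is odd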
