Excel VBA: understanding the Randomize statement

I am working on a little program which generates standard, normally distributed numbers from a source of uniformly distributed random numbers. Therefore, I need to generate a bunch of random numbers. I decided to use the Rnd function, as the program should be as fast as possible (so there is no extra seed function I'd like to use).
Doing some research, I found that the Rnd function works better when the Randomize statement is used immediately before it. However, I don't understand the description of the optional number argument. As I understand it, if I don't give Randomize any argument, it uses the system timer value as the new seed value.
Can anyone explain to me what the optional number actually does? Is there a difference between using Randomize(1) and Randomize(99), or even Randomize("blabla")? I'd like to understand the theory behind this optional input number. Thank you!

A seed is used to initialize a pseudorandom number generator. You can think of it as the starting point from which the sequence of pseudorandom numbers is generated. If the seed changes between runs, the sequences differ, which is why the default is to use the current system time (as it changes continuously).
From the remarks in the MSDN article you posted:
Randomize uses number to initialize the Rnd function's random-number generator, giving it a new seed value. If you omit number, the value returned by the system timer is used as the new seed value.
So, if you always specify the same argument, you will always get the same seed, and therefore the same sequence of numbers.
If Randomize is not used, the Rnd function (with no arguments) uses the same number as a seed the first time it is called, and thereafter uses the last generated number as a seed value.
Here the last random number generated is used as the seed, so the sequence keeps evolving from call to call.
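The effect of a fixed versus omitted seed is easy to demonstrate. Here is a sketch in Python, whose random.seed plays the same role as Randomize (an analogy, not VBA itself):

```python
import random

random.seed(1)                       # fixed seed, like Randomize(1)
a = [random.random() for _ in range(3)]

random.seed(1)                       # same seed again...
b = [random.random() for _ in range(3)]

assert a == b                        # ...exact same sequence: no extra randomness

random.seed()                        # no argument: seeded from OS entropy or the
                                     # clock, like a plain Randomize
```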

To quote from a very similar question on CrossValidated
Most pseudo-random number generators (PRNGs) are build (sic) on algorithms involving some kind of recursive method starting from a base value that is determined by an input called the "seed". The default PRNG in most statistical software (R, Python, Stata, etc.) is the Mersenne Twister algorithm MT19937, which is set out in Matsumoto and Nishimura (1998). This is a complicated algorithm, so it would be best to read the paper on it if you want to know how it works in detail. In this particular algorithm, there is a recurrence relation of degree $n$, and your input seed is an initial set of vectors $x_0, x_1, \ldots, x_{n-1}$. The algorithm uses a linear recurrence relation that generates:
$$x_{k+n} = f(x_k, x_{k+1}, x_{k+m}, r, A)$$
where $1 \leqslant m \leqslant n$ and $r$ and $A$ are objects that can be specified as parameters in the algorithm. Since the seed gives the initial set of vectors (and given other fixed parameters for the algorithm), the series of pseudo-random numbers generated by the algorithm is fixed. If you change the seed then you change the initial vectors, which changes the pseudo-random numbers generated by the algorithm. This is, of course, the function of the seed.
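The idea of a recursion driven entirely by a seed can be illustrated with a much simpler scheme than MT19937, a linear congruential generator (the constants below are the classic Numerical Recipes values, chosen only for illustration):

```python
def lcg(seed, a=1664525, c=1013904223, m=2**32):
    # x_{k+1} = (a * x_k + c) mod m -- the whole stream is fixed by the seed
    x = seed
    while True:
        x = (a * x + c) % m
        yield x

g1, g2 = lcg(42), lcg(42)
assert [next(g1) for _ in range(5)] == [next(g2) for _ in range(5)]  # same seed, same stream
assert next(lcg(42)) != next(lcg(43))                                # new seed, new stream
```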
Now, it is important to note that this is just one example, using the MT19937 algorithm. There are many PRNGs that can be used in statistical software, and they each involve different recursive methods, and so the seed means a different thing (in technical terms) in each of them. You can find a library of PRNGs for R in this documentation, which lists the available algorithms and the papers that describe these algorithms.
The purpose of the seed is to allow the user to "lock" the pseudo-random number generator, to allow replicable analysis. Some analysts like to set the seed using a true random-number generator (TRNG) which uses hardware inputs to generate an initial seed number, and then report this as a locked number. If the seed is set and reported by the original user then an auditor can repeat the analysis and obtain the same sequence of pseudo-random numbers as the original user. If the seed is not set then the algorithm will usually use some kind of default seed (e.g., from the system clock), and it will generally not be possible to replicate the randomisation.
As the quote in your question shows, VBA's Randomize statement sets a new seed for the Rnd function, either using the system time as the seed or, if you provide an argument, using that number as the new seed for Rnd. If you never call Randomize, Rnd starts from the same default seed on every run of your program (and thereafter reuses the last generated number as the seed), so you may keep getting the same sequence of numbers run after run.
I also recommend having a look at this answer.

Related

Which seed when using `pytorch.manual_seed(seed)`?

I have trained a model with ImageNet. I got a new GPU and I also want to train the same model on a different GPU.
I want to compare if the outcome is different and therefore I want to use torch.manual_seed(seed).
After reading the docs https://pytorch.org/docs/stable/generated/torch.manual_seed.html, it is still unclear which number I should take for the parameter seed.
How to choose the seed parameter? Can I take 10, 100, or 1000 or even more or less and why?
How to choose the seed parameter? Can I take 10, 100, or 1000 or even more or less and why?
The PyTorch doc page you are pointing to does not mention anything special, beyond stating that the seed is a 64-bit integer.
So yes, 1000 is OK. As you expect from a modern pseudo-random number generator, the statistical properties of the pseudo-random sequence you are relying on do NOT depend on the choice of seed.
Since for most runs you will probably reuse the seed from a previous run, a practical approach is to take the random seed as a command-line parameter. In those cases where you have to come up with a brand new seed, you can pick one off the top of your head, or just roll dice to get it.
The important thing is to have a record of the seed used for each run.
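A minimal sketch of the command-line approach (the argparse flag name and fallback logic are illustrative assumptions, not part of PyTorch):

```python
import argparse
import random

parser = argparse.ArgumentParser()
parser.add_argument("--seed", type=int, default=None,
                    help="random seed; pass the one from a previous run to reproduce it")
args = parser.parse_args(["--seed", "1000"])   # stand-in for real command-line args

# if no seed was given, draw a fresh one -- and make sure it gets recorded
seed = args.seed if args.seed is not None else random.randrange(2**64)
print(f"seed = {seed}")                        # log it with the run's results
random.seed(seed)                              # torch.manual_seed(seed) in PyTorch
```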
OK, but ...
That being said, a lot of people seem uncomfortable with the task of just “guessing” some arbitrary number. There are (at least) two common expedients to get some seed in a seemingly proper “random” fashion.
The first one is to use the operating system's official source of genuinely (not pseudo) random bits. In Python, this is typically rendered as os.urandom(). So to get a seed in Python as a 64-bit integer, you could use code like this:
import functools
import os

# returns a list of 8 random small integers between 0 and 255
def get8RandomBytesFromOS():
    r8 = os.urandom(8)  # official OS entropy source
    byteCodes = list(map(ord, r8.decode('Latin-1')))  # type conversion
    return byteCodes

# make a single long integer from a list of 8 integers between 0 and 255
def getIntFromBytes(bs):
    # force highest bit to 0 to avoid overflow
    bs2 = [bs[0] if (bs[0] < 128) else (bs[0] - 128)] + bs[1:8]
    num = functools.reduce(lambda acc, n: acc * 256 + n, bs2)
    return num

# Main user entry point:
def getRandomSeedFromOS():
    rbs8 = get8RandomBytesFromOS()
    return getIntFromBytes(rbs8)
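On Python 3 the same thing can be written more directly, since a bytes object is already a sequence of integers and int.from_bytes does the accumulation (a compact equivalent in spirit, not code from the answer above):

```python
import os

# 8 bytes of OS entropy as one integer, with the top bit cleared (63-bit seed)
seed = int.from_bytes(os.urandom(8), "big") & ((1 << 63) - 1)
assert 0 <= seed < 2**63
```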
A second common expedient is to hash a string containing the current date and time, possibly with some prefix. With Python, the time includes microseconds. When a human user launches a script, the microseconds number of the launch time can be said to be random. One can use code like this, using a version of SHA (Secure Hash Algorithm):
import hashlib
import datetime

def getRandomSeedFromTime():
    prefix = 'dl.cs.univ-stackoverflow.edu'
    timeString1 = str(datetime.datetime.now())
    timeString2 = prefix + ' ' + timeString1
    hash = hashlib.sha256(timeString2.encode('ascii'))
    bytes = (hash.digest())[24:32]  # 8 rightmost bytes of hash
    byteCodes = list(map(ord, bytes.decode('Latin-1')))  # type conversion
    return (getIntFromBytes(byteCodes))
But, again, 1000 is basically OK. The idea of hashing the time string, instead of just taking the number of microseconds since the Epoch as the seed, probably comes from the fact that some early random number generators used their seed as an offset into a common and not so large global sequence. Hence, if your program naïvely took two seeds in rapid sequence without hashing, there was a risk of overlap between your two sequences. Fortunately, pseudo-random number generators now are much better than what they used to be.
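For contrast, the naïve variant the paragraph above alludes to would simply use the microsecond clock directly as the seed, with no hashing (harmless with modern PRNGs, risky with some early ones):

```python
import time

seed = time.time_ns() // 1000   # microseconds since the Epoch, used as-is
assert seed > 0
```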
(Addendum - taken from comment)
Note that the seed is a peripheral thing. The important thing is the state of the automaton, which can be much larger than the seed, hence the very word “seed”. The commonly used Mersenne Twister scheme has ~20,000 bits of internal state, and you cannot ask the user to provide 20,000 bits. There are sometimes ill-behaved initial states, but it is always the responsibility of the random number library to expand somehow the user-provided arbitrary seed into a well-behaved initial state.

Spark Hashing TF power of two feature dimension recommendation reasoning

According to https://spark.apache.org/docs/2.3.0/ml-features.html#tf-idf:
"HashingTF utilizes the hashing trick. A raw feature is mapped into an index (term) by applying a hash function. The hash function used here is MurmurHash 3."
...
"Since a simple modulo on the hashed value is used to determine the vector index, it is advisable to use a power of two as the feature dimension, otherwise the features will not be mapped evenly to the vector indices."
I tried to understand why using a power of two as the feature dimension maps words evenly, and tried to find some helpful documentation on the internet, but neither attempt was successful.
Does somebody know or have useful sources on why using the power two maps words evenly to vector indices?
The output of a hash function is b-bit, i.e., there are 2^b possible values to which a feature can be hashed. Additionally, we assume that the 2^b possible values appear uniformly at random.
If d is the feature dimension, an index for a feature f is determined as hash(f) MOD d. Again, hash(f) takes on 2^b possible values. It is easy to see that d has to be a power of two (i.e., a divisor of 2^b) itself in order for uniformity to be maintained.
For a counter-example, consider a 2-bit hash function and a 3-dimensional feature space. As per our assumptions, the hash function outputs 0, 1, 2, or 3 with probability 1/4 each. However, taking mod 3 results in 0 with probability 1/2, or 1 or 2 with probability 1/4 each. Therefore, uniformity is not maintained. On the other hand, if the feature space were 2-dimensional, it is easy to see that the result would be 0 or 1 with probability 1/2 each.
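The counter-example can be checked mechanically by enumerating all 2^2 equally likely hash outputs (a toy check, not Spark code):

```python
from collections import Counter

outputs = [0, 1, 2, 3]                  # all values of a 2-bit hash, equally likely

mod3 = Counter(h % 3 for h in outputs)  # 3-dimensional feature space
mod2 = Counter(h % 2 for h in outputs)  # 2-dimensional feature space

assert dict(mod3) == {0: 2, 1: 1, 2: 1}  # index 0 twice as likely: skewed
assert dict(mod2) == {0: 2, 1: 2}        # uniform
```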

TextRank Algorithm Space and Time Complexity

I am trying to determine the space and time complexity of the TextRank algorithm described in this paper:
https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf
Since it is using PageRank whose complexity is:
O(n + m) (n = number of nodes, m = number of arcs/edges),
and we run it for i iterations (or until convergence), I believe the complexity for keyword extraction would be O(i * (n + m)),
and the space complexity would be O(n^2) using an adjacency matrix.
For sentence extraction I believe it would be the same.
I'm really not sure, and any help would be great. Thank you.
If you repeat T times an algorithm (inner) with complexity O(n+m), or whatever other for that matter, it is correct to conclude that the new algorithm (outer) will have a complexity of O(T*(n+m)) provided:
The outer algorithm will only add a constant complexity every time it repeats the inner one.
Parameters n and m remain the same at every invocation of the inner algorithm.
In other words, the outer algorithm should prepare the inputs for the inner one in constant time, and the parameters of new inputs should remain well represented by n and m along the T iterations. Otherwise, if any of these two requirements fail to be proved, you should sum T times the complexities associated to the new parameters, say
O(n_1 + m_1) + ... + O(n_T + m_T)
and also take into account all the pre- and post-processing of the outer algorithm before and after using the inner.
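For TextRank specifically, both conditions hold: every iteration sweeps the same graph, so each sweep costs O(n + m) and i iterations cost O(i * (n + m)). A sketch of the iterated computation (a generic damped PageRank on an adjacency list, not the paper's exact code):

```python
def pagerank(adj, iters=20, d=0.85):
    # adj: node -> list of out-neighbours; one sweep below is O(n + m)
    n = len(adj)
    rank = {v: 1.0 / n for v in adj}
    for _ in range(iters):                     # total cost O(iters * (n + m))
        new = {v: (1 - d) / n for v in adj}
        for u, outs in adj.items():
            for v in outs:
                new[v] += d * rank[u] / len(outs)
        rank = new
    return rank

# a 3-cycle: by symmetry every node converges to rank 1/3
r = pagerank({0: [1], 1: [2], 2: [0]})
assert all(abs(x - 1/3) < 1e-9 for x in r.values())
```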

Mersenne twister: limitations used in agent based models

I am using the Mersenne Twister as the engine to generate random numbers in an agent-based model: it is fast and has an extremely long period before repeating.
Recently I did a literature review on this. While the Colt library's Java API recommends the Mersenne Twister, I came across two limitations:
the seed should not be 0. Is this something suggested in the Apache Commons Math library?
based on a cryptography paper, it was mentioned that "if the initial state has too many zeros then the generated sequence may also contain many zeros for more than 10000 generations and if the seeds are chosen systematically such as 0, 20, 30, ... the output sequences will be correlated".
Has anyone come across such issues, or is it something that has been fixed and is not the case anymore?
Is there any literature showing the spectral analysis of the Mersenne Twister vs others like the Linear Congruential Generator?
SFMT (the SIMD-oriented Fast Mersenne Twister) recovers better from a zero-excess initial state.
A usual tip to get rid of zero-excess initialization of the seed is to use another PRNG (which might have near-equal probability of zeros and ones in the output) to generate the seed itself.
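That tip can be sketched in Python, whose random.Random is itself MT19937: whiten a low-entropy seed through a hash (SHA-256 here, an arbitrary choice) before handing it to the generator:

```python
import hashlib
import random

user_seed = 0                              # low-entropy seed: almost all zero bits

# expand/whiten the seed before it reaches the Mersenne Twister
mixed = int.from_bytes(hashlib.sha256(str(user_seed).encode()).digest()[:8], "big")

rng = random.Random(mixed)                 # CPython's Random is MT19937
assert mixed != 0                          # the seed is no longer zero-heavy
assert rng.random() == random.Random(mixed).random()   # still reproducible
```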
See also a comment on "How to properly seed a mersenne twister RNG?"

After computing the hash, what is the significance of keeping only the last byte of the hash

Problem: to generate test and train sets in order to estimate the generalization error.
possible solutions:
1. Split instances into 80% train and 20% test, train your model on the train set and evaluate it on the test set. But repeating this again and again lets the model effectively cram the whole dataset: with random sampling, instances chosen for the test set on one split end up in the train set on a later split.
The above approach might fail when we fetch an updated dataset.
Another approach is to select each instance's most stable feature (or combination of features) to create a unique and immutable identifier that remains robust even after the dataset updates. Having selected one, we could compute a hash of each instance's identifier, keep only the last byte of the hash, and put the instance in the test set if the value is <= 256 * test_ratio. This ensures that the test set remains consistent across multiple runs, even if the dataset is refreshed.
Question: What is the significance of taking just the last byte of the computed hash?
-----Thanks to Aurélien Géron-------
We need a solution to sample a unique test set even after fetching an updated dataset.
SOLUTION: use each instance's identifier to decide whether or not it should go into the test set (assuming the instances have a unique and immutable identifier):
compute a hash of each instance's identifier, keep only the last byte of the hash, and put the instance in the test set if the value is <= 256 * test_ratio (i.e. 51 for a 20% ratio).
This ensures that the test set will remain consistent across multiple runs, even if you refresh the dataset. The new test set will contain 20% of the new instances, but it will not contain any instance that was previously in the train set.
First, a quick recap on hash functions:
A hash function f(x) is deterministic, such that if a==b, then f(a)==f(b).
Moreover, if a!=b, then with a very high probability f(a)!=f(b).
With this definition, a function such as f(x) = x % 12345678 (where % is the modulo operator) meets the criterion above, so it is technically a hash function. However, most hash functions go beyond this definition, and they act more or less like pseudo-random number generators, so if you compute f(1), f(2), f(3), ..., the output will look very much like a random sequence of (usually very large) numbers.
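Both halves of that remark can be seen directly, with MD5 standing in for any random-looking hash:

```python
import hashlib

def f(x):
    return x % 12345678        # deterministic, so technically a hash function

assert f(10**9) == f(10**9)    # a == b implies f(a) == f(b)

# a cryptographic hash of consecutive small inputs looks like large random numbers
outs = [int.from_bytes(hashlib.md5(str(i).encode()).digest(), "big") for i in (1, 2, 3)]
assert len(set(outs)) == 3 and all(o > 2**100 for o in outs)
```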
We can use such a "random-looking" hash function to split a dataset into a train set and a test set.
Let's take the MD5 hash function, for example. It is a random-looking hash function, but it outputs rather large numbers (128 bits), such as 136159519883784104948368321992814755841.
For a given instance in the dataset, there is a 50% chance that its MD5 hash will be smaller than 2^127 (assuming the hashes are unsigned integers), a 25% chance that it will be smaller than 2^126, and a 12.5% chance that it will be smaller than 2^125. So if I want to split the dataset into a train set and a test set, with 87.5% of the instances in the train set and 12.5% in the test set, then all I need to do is compute the MD5 hash of some unchanging features of the instances, and put the instances whose MD5 hash is smaller than 2^125 into the test set.
If I want precisely 10% of the instances to go into the test set, then I need to check MD5 < 2^128 * 10 / 100.
This would work fine, and you can definitely implement it this way if you want. However, it means manipulating large integers, which is not always very convenient, especially given that Python's hashlib.md5() function outputs byte arrays, not long integers. So it's simpler to just take one or two bytes in the hash (anywhere you wish), and convert them to a regular integer. If you just take one byte, it will look like a random number from 0 to 255.
If you want to have 10% of the instances in the test set, you just need to check that the byte is smaller than or equal to 25. It won't be exactly 10% but rather 26/256 = 10.15625%, which is close enough. If you want higher precision, you can take 2 or more bytes.
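A sketch of the whole scheme (the identifier type, hash function, and 20% ratio are illustrative choices):

```python
import hashlib

def in_test_set(identifier, test_ratio=0.20):
    # hash the immutable identifier, keep only the last byte (0..255)
    last_byte = hashlib.md5(str(identifier).encode()).digest()[-1]
    return last_byte < 256 * test_ratio        # ~20% of identifiers pass

ids = list(range(10_000))
test = [i for i in ids if in_test_set(i)]

assert in_test_set(42) == in_test_set(42)      # stable across runs and refreshes
assert 0.17 < len(test) / len(ids) < 0.23      # close to the requested 20%
```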
