I wrote a simplified sample below. In my actual program, elements are added to and deleted from a set, and a random element is chosen from the set on each iteration.
But even with the simplified code below, I get different output every time I run it. How can I make the output reproducible?
import random

random.seed(0)
x = set()
for i in range(40):
    x.add('a' + str(i))
print(random.sample(x, 1))
The problem is that a set's elements are unordered, and their order will vary between runs even if the random sampler picks the same position. Using random.sample on a set is in fact deprecated since Python 3.9, and in future versions you will need to pass in a sequence instead.
You could do this by converting the set to a sequence in a consistently ordered way, such as
x = sorted(x)
or, probably better, just use a list in the first place (which always produces ['a24'] in your example):
x = ['a' + str(i) for i in range(40)]
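Putting it together, a minimal reproducible version of the original snippet:

import random

random.seed(0)
x = ['a' + str(i) for i in range(40)]  # ordered population, identical on every run
print(random.sample(x, 1))  # deterministic; ['a24'] as noted above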
I've written a large program, with dependencies on libraries written in my lab. I'm getting wrong (and somewhat random) results, which are caused by floating-point errors.
I would like to do some Python magic and change all floats to decimals, or to some other more precise type.
I can't write the full code here, but the general flow is as follows:
def run(n):
    ...
    x = 0.5  # initialized as a float
    for _ in range(n):
        x = calc(x)
    ...
    return x
What I'm trying to avoid is going over every initialization in the code and adding a manual cast to Decimal.
Is there a trick to make Python initialize all float literals in lines such as x = 0.5 as decimals? Or perhaps a custom interpreter with more precise floats?
Thanks. I can't post the full code; I hope my edit makes it clearer.
I think you can use the decimal module for this:
from decimal import Decimal

x = Decimal(variable)  # converts an existing value to a Decimal
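One caveat worth knowing: constructing a Decimal from a float copies the float's binary rounding error into the Decimal, so for exact values construct from a string instead. A minimal illustration:

from decimal import Decimal

print(Decimal(0.1))    # 0.1000000000000000055511151231257827021181583404541015625
print(Decimal('0.1'))  # 0.1, exact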
I have trained a model on ImageNet. I got a new GPU, and I want to train the same model on that different GPU.
I want to compare whether the outcome differs, and therefore I want to use torch.manual_seed(seed).
After reading the docs at https://pytorch.org/docs/stable/generated/torch.manual_seed.html, it is still unclear which number I should pick for the parameter seed.
How to choose the seed parameter? Can I take 10, 100, or 1000, or even more or less, and why?
How to choose the seed parameter? Can I take 10, 100, or 1000, or even more or less, and why?
The PyTorch doc page you are pointing to does not say anything special beyond stating that the seed is a 64-bit integer.
So yes, 1000 is OK. As you would expect from a modern pseudo-random number generator, the statistical properties of the pseudo-random sequence you are relying on do NOT depend on the choice of seed.
Since for most runs you will probably reuse the seed from a previous run, a practical approach is to take the random seed as a command-line parameter, as in the sketch below. In those cases where you have to come up with a brand-new seed, you can pick one off the top of your head, or just roll dice to get it.
The important thing is to have a record of the seed used for each run.
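A minimal sketch of the command-line idea (the argument name and default are mine):

import argparse
import torch

parser = argparse.ArgumentParser()
parser.add_argument('--seed', type=int, default=1000,
                    help='RNG seed; record the value used for each run')
args = parser.parse_args()
torch.manual_seed(args.seed)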
OK, but ...
That being said, a lot of people seem uncomfortable with the task of just “guessing” some arbitrary number. There are (at least) two common expedients to get some seed in a seemingly proper “random” fashion.
The first one is to use the operating system's official source of genuinely (not pseudo-) random bits. In Python, this is typically rendered as os.urandom(). So to get a seed in Python as a 64-bit integer, you could use code like this:
import functools
import os

# returns a list of 8 random small integers between 0 and 255
def get8RandomBytesFromOS():
    r8 = os.urandom(8)  # official OS entropy source
    return list(r8)     # in Python 3, iterating bytes yields ints directly

# make a single long integer from a list of 8 integers between 0 and 255
def getIntFromBytes(bs):
    bs2 = [bs[0] & 0x7F] + bs[1:8]  # force highest bit to 0 to avoid overflow
    num = functools.reduce(lambda acc, n: acc*256 + n, bs2)
    return num

# Main user entry point:
def getRandomSeedFromOS():
    rbs8 = get8RandomBytesFromOS()
    return getIntFromBytes(rbs8)
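In Python 3 the same thing can also be written as a one-liner and passed straight to PyTorch:

import os
import torch

seed = int.from_bytes(os.urandom(8), 'big') & (2**63 - 1)  # same 63-bit range as above
torch.manual_seed(seed)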
A second common expedient is to hash a string containing the current date and time, possibly with some prefix. In Python, the time includes microseconds. When a human user launches a script, the microsecond part of the launch time can be considered random. One can use code like this, based on a version of SHA (Secure Hash Algorithm):
import datetime
import hashlib

def getRandomSeedFromTime():
    prefix = 'dl.cs.univ-stackoverflow.edu'
    timeString = prefix + ' ' + str(datetime.datetime.now())
    digest = hashlib.sha256(timeString.encode('ascii')).digest()
    byteCodes = list(digest[24:32])  # 8 rightmost bytes of the hash
    return getIntFromBytes(byteCodes)
But, again, 1000 is basically OK. The idea of hashing the time string, instead of just taking the number of microseconds since the Epoch as the seed, probably comes from the fact that some early random number generators used their seed as an offset into a common and not so large global sequence. Hence, if your program naïvely took two seeds in rapid sequence without hashing, there was a risk of overlap between your two sequences. Fortunately, pseudo-random number generators now are much better than what they used to be.
(Addendum - taken from comment)
Note that the seed is a peripheral thing. The important thing is the state of the automaton, which can be much larger than the seed, hence the very word “seed”. The commonly used Mersenne Twister scheme has ~20,000 bits of internal state, and you cannot ask the user to provide 20,000 bits. There are sometimes ill-behaved initial states, but it is always the responsibility of the random number library to expand somehow the user-provided arbitrary seed into a well-behaved initial state.
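You can see that large internal state directly in CPython's random module, which uses the Mersenne Twister:

import random

state = random.getstate()
print(len(state[1]))  # 625: the 624 32-bit state words plus an index (624*32 = 19968 bits)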
I need an array of the weighted sums of each cell's 3x3 neighborhood, with the weights given by a kernel, over an array of the same size (up to this point this is exactly scipy.ndimage.correlate). But when a value of the new array is calculated, it has to be used immediately for the next computations involving that cell, instead of the value from the original array. I have written this slow code to implement it myself; it works perfectly fine (although too slow for me) and delivers the expected result:
def laplaceNeighborDifference(x, y):
    global w, h, AArr
    # weighted sum of the eight wrapped (toroidal) neighbors minus the cell itself
    return (-AArr[y, x]
            + AArr[(y+1) % h, x] * .2 + AArr[(y-1) % h, x] * .2
            + AArr[y, (x+1) % w] * .2 + AArr[y, (x-1) % w] * .2
            + AArr[(y+1) % h, (x+1) % w] * .05 + AArr[(y-1) % h, (x+1) % w] * .05
            + AArr[(y+1) % h, (x-1) % w] * .05 + AArr[(y-1) % h, (x-1) % w] * .05)

for x in range(w):
    for y in range(h):
        AArr[y, x] += laplaceNeighborDifference(x, y)
In my approach the kernel weights are hard-coded directly, but written out as an array (to be used as a kernel) it would look like this:
[[.05,.2,.05],
[.2 ,-1,.2 ],
[.05,.2,.05]]
The SciPy implementation would work like this:
from scipy.ndimage import correlate

AArr += correlate(AArr, kernel, mode='wrap')
But obviously, when I use scipy.ndimage.correlate, it calculates all values from the original array and doesn't use the updated values as it computes them. At least I think that is the difference between my implementation and the SciPy one; feel free to point out other differences if I've missed any. My question is whether there is an existing function that produces the desired result, or whether there is a way to code it that is faster than mine.
Thank you for your time!
You can use Numba to do that efficiently:
import numba as nb

@nb.njit
def laplaceNeighborDifference(AArr, w, h, x, y):
    # same weighted neighbor sum as before, but with the array and sizes passed in
    return (-AArr[y, x]
            + AArr[(y+1) % h, x] * .2 + AArr[(y-1) % h, x] * .2
            + AArr[y, (x+1) % w] * .2 + AArr[y, (x-1) % w] * .2
            + AArr[(y+1) % h, (x+1) % w] * .05 + AArr[(y-1) % h, (x+1) % w] * .05
            + AArr[(y+1) % h, (x-1) % w] * .05 + AArr[(y-1) % h, (x-1) % w] * .05)

@nb.njit('void(float64[:,::1], int64, int64)')
def compute(AArr, width, height):
    for x in range(width):
        for y in range(height):
            AArr[y, x] += laplaceNeighborDifference(AArr, width, height, x, y)
Note that modulo operations are generally very slow. It is better to remove them by computing the borders separately from the main loop; the resulting code should be much faster without any modulo in the hot path, as in the sketch below.
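A minimal sketch of that border-splitting idea (my own, not tested against the code above; note that splitting changes the in-place update order along the borders, which matters for this sequential scheme):

@nb.njit
def compute_split(AArr):
    h, w = AArr.shape
    # interior cells: neighbors never wrap, so plain indexing suffices
    for x in range(1, w - 1):
        for y in range(1, h - 1):
            AArr[y, x] += (-AArr[y, x]
                           + (AArr[y+1, x] + AArr[y-1, x] + AArr[y, x+1] + AArr[y, x-1]) * .2
                           + (AArr[y+1, x+1] + AArr[y-1, x+1] + AArr[y+1, x-1] + AArr[y-1, x-1]) * .05)
    # border cells: fall back to the wrapped (modulo) version
    for x in range(w):
        AArr[0, x] += laplaceNeighborDifference(AArr, w, h, x, 0)
        AArr[h-1, x] += laplaceNeighborDifference(AArr, w, h, x, h-1)
    for y in range(1, h - 1):
        AArr[y, 0] += laplaceNeighborDifference(AArr, w, h, 0, y)
        AArr[y, w-1] += laplaceNeighborDifference(AArr, w, h, w-1, y)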
I need to repeat a scientific simulation based on random sampling N times, in short:
results = [mysimulation() for i in range(N)]
Since every simulation requires minutes, I'd like to parallelize them in order to reduce the execution time. Some weeks ago I successfully analyzed some simpler cases, for which I wrote my code in C using OpenMP and functions like rand_r() to avoid seed overlap. How could I obtain a similar effect in Python?
I tried reading more about Python 3 multithreading/parallelization, but I found nothing concerning random generation. Likewise, numpy.random does not suggest anything in this direction (as far as I found).
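A minimal sketch of one common approach, using multiprocessing together with numpy.random.SeedSequence.spawn so each worker gets a statistically independent stream (mysimulation here is a stand-in for the real simulation):

import multiprocessing as mp
import numpy as np

def mysimulation(rng):
    # stand-in for the real simulation; draws only from its own generator
    return rng.standard_normal()

def run_one(seed_seq):
    rng = np.random.default_rng(seed_seq)  # independent generator per task
    return mysimulation(rng)

if __name__ == '__main__':
    N = 8
    child_seeds = np.random.SeedSequence(12345).spawn(N)  # non-overlapping streams
    with mp.Pool() as pool:
        results = pool.map(run_one, child_seeds)
    print(results)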
I have an input source that gives me integers in [0..256].
I want to be able to locate spikes in this data, i.e., points where a new input appears.
I've tried using a rolling average in conjunction with the percent error, but this doesn't really work.
Basically, I want my program to find where a graph of the data would spike up, but I want it to ignore smooth transitions.
Thoughts?
A simple thought, following up on my comment. First,
>>> import numpy as np
Suppose we have the following time series
>>> sample = np.random.randint(0, 257, size=(100,))  # random_integers is deprecated; randint's upper bound is exclusive
To know whether or not a spike can be considered a rare event, we have to know the likelihood of each event. Since you are dealing with "rates of change", let us compute those
>>> sample_vars = np.abs(-1 + sample[1:] / sample[:-1])  # relative change; beware zeros in the data, which make this divide by zero
We can then define the variation that has at most a 5 percent (in-sample) chance of occurring
>>> spike_defining_threshold = np.percentile(sample_vars, 95)
Finally, if sample_vars[-1] > spike_defining_threshold, you can flag the latest observation as a spike.
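A minimal sketch wrapping the steps above into a reusable check (the function name and the history/new-value split are mine; history is a 1-D NumPy array with no zeros):

import numpy as np

def is_spike(history, new_value, q=95):
    changes = np.abs(-1 + history[1:] / history[:-1])  # relative changes over the history
    threshold = np.percentile(changes, q)  # a change rarer than q% counts as a spike
    new_change = abs(-1 + new_value / history[-1])
    return new_change > threshold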
Would be great if others have thoughts to share as well...