Calculating a custom probability distribution in python (numerically) - python-3.x

I have a custom (discrete) probability distribution defined somewhat in the form: f(x)/(sum(f(x')) for x' in a given discrete set X). Also, 0<=x<=1.
So I have been trying to implement it in python 3.8.2, and the problem is that the numerator and denominator both come out to be really small and python's floating point representation just takes them as 0.0.
After calculating these probabilities, I need to sample a random element from an array, where each index is selected with the corresponding probability in the distribution. So if my distribution is [p1,p2,p3,p4], and my array is [a1,a2,a3,a4], then the probability of selecting a2 is p2 and so on.
So how can I implement this in an elegant and efficient way?
Is there any way I could use the np.random.beta() in this case? Since the difference between the beta distribution and my actual distribution is only that the normalization constant differs and the domain is restricted to a few points.
Note: The probability mass function defined above is actually in the form given by Bayes' theorem, with f(x)=x^s*(1-x)^t, where s and t are fixed numbers for a given iteration. So the exact problem is that, when s or t becomes really large, this expression goes to 0.

You can avoid the underflow by working with logs. The point is that while both the numerator and denominator might underflow to 0, their logs won't unless your numbers are really astonishingly small.
You say
f(x) = x^s*(1-x)^t
so
logf (x) = s*log(x) + t*log(1-x)
and you want to compute, say
p = f(x) / Sum{ y in X | f(y)}
so
p = exp( logf(x) - log sum { y in X | f(y)})
  = exp( logf(x) - log sum { y in X | exp( logf(y))})
The only difficulty is in computing the second term, the log of a sum of exponentials. This is a common problem, usually called logsumexp.
Computing logsumexp is also easy enough to do by hand.
We want
S = log( sum{ i | exp(l[i])})
if L is the maximum of the l[i] then
S = log( exp(L)*sum{ i | exp(l[i]-L)})
= L + log( sum{ i | exp( l[i]-L)})
The last sum can be computed as written, because each term is now between 0 and 1 so there is no danger of overflow, and one of the terms (the one for which l[i]==L) is 1, and so if other terms underflow, that is harmless.
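The steps above can be sketched in plain Python as follows. This is a minimal sketch, not a definitive implementation: the helper names (`log_sum_exp`, `log_f`, `probabilities`), the example points and the example items are assumptions made for illustration, it assumes 0 < x < 1 strictly so that both logs are finite, and it uses the standard library's `random.choices` for the sampling step.

```python
import math
import random


def log_sum_exp(log_terms):
    """Return log(sum(exp(l))) by factoring out the largest term."""
    largest = max(log_terms)
    # Every exponent is <= 0, so nothing overflows; the largest term contributes 1.
    return largest + math.log(sum(math.exp(l - largest) for l in log_terms))


def log_f(x, s, t):
    """log of f(x) = x^s * (1-x)^t, assuming 0 < x < 1."""
    return s * math.log(x) + t * math.log1p(-x)


def probabilities(xs, s, t):
    """Normalised probabilities f(x) / sum(f(y)), computed via logs."""
    logs = [log_f(x, s, t) for x in xs]
    log_norm = log_sum_exp(logs)
    return [math.exp(l - log_norm) for l in logs]


# Hypothetical support points and exponents, chosen so the naive product underflows.
xs = [0.38, 0.39, 0.40, 0.41, 0.42]
probs = probabilities(xs, s=4000, t=6000)

# Pick one element of a (hypothetical) array with these probabilities.
items = ["a1", "a2", "a3", "a4", "a5"]
chosen = random.choices(items, weights=probs, k=1)[0]
```

With these exponents the direct product `x**s * (1-x)**t` evaluates to 0.0 for every point, while the log-based probabilities remain positive and sum to one.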
This may however lose a little accuracy. A refinement would be to recognize the set A of indices where
l[i]>=L-eps (eps a user set parameter, eg 1)
And then compute
N = Sum{ i in A | exp(l[i]-L)}
B = log1p( Sum{ i not in A | exp(l[i]-L)}/N)
S = L + log( N) + B
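The refinement can be sketched in the same way. The function name `log_sum_exp_refined` and the default `eps=1.0` are assumptions for illustration; the set A is the terms within `eps` of the maximum, as described above.

```python
import math


def log_sum_exp_refined(log_terms, eps=1.0):
    """log(sum(exp(l))) with the small terms folded in through log1p."""
    largest = max(log_terms)
    # Terms in A: within eps of the maximum (this always includes the maximum itself).
    near = sum(math.exp(l - largest) for l in log_terms if l >= largest - eps)
    # Remaining terms: these may be tiny or underflow to 0, which is harmless.
    far = sum(math.exp(l - largest) for l in log_terms if l < largest - eps)
    return largest + math.log(near) + math.log1p(far / near)
```

Because `near` is at least 1, the division is always safe, and `log1p` keeps the contribution of the small terms accurate when `far / near` is close to zero.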

Related

Given a exponential probability density function, how to generate random values using the random generator in Excel?

Based on a set of experiments, a probability density function (PDF) for an exponentially distributed variable was generated. Now the goal is to use this function in a Monte carlo simulation. I am vaguely familiar with PDF's and random generator, especially for normal and log-normal distributions. However, I am not quite able to figure this out. Would be great if someone can help.
Here's the function:
f = γ/(2R) * exp(-γl/(2R)) * (1-exp(-γ))^(-1) * H(2R-l)
f is the probability density function,
1/γ is the mean of the distribution,
R is a known fixed variable,
H is the heaviside step function,
l is the variable that is exponentially distributed
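I don't know how to do it in Excel, but with the inverse (CDF) method the answer is easy to get, assuming there is a RANDOM() function that returns uniform numbers in the [0...1] range and that LOG is the natural logarithm (in Excel these would be RAND() and LN(); Excel's LOG defaults to base 10):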
l = -(2R/γ)*LOG(1 - RANDOM()*(1-EXP(-γ)))
Easy to check boundary values
if RANDOM()=0, then l = 0
if RANDOM()=1, then l = 2R
UPDATE
So there is a PDF
PDF(l|R,γ) = γ/(2R) * exp(-lγ/(2R))/(1-exp(-γ)), l in the range [0...2R]
First, check that it is normalized
∫ PDF(l|R,γ) dl from 0 to 2R = 1
Ok, it is normalized
Then compute CDF(l|R,γ)
CDF(l|R,γ) = ∫ PDF(l|R,γ) dl from 0 to l =
(1 - exp(-lγ/(2R)))/(1-exp(-γ))
Check again, CDF(l=2R|R,γ) = 1, good.
Now set CDF(l|R,γ) = RANDOM(), solve it with respect to l, and you get the sampling expression. Check it for RANDOM() returning 0 and for RANDOM() returning 1: you should get the end points of the l interval.
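For checking the expression outside Excel, the same inverse method can be sketched in Python. The function names and the values of R and γ below are hypothetical, chosen only for illustration.

```python
import math
import random


def cdf(l, R, gamma):
    """CDF of the truncated exponential on [0, 2R]."""
    return (1.0 - math.exp(-l * gamma / (2.0 * R))) / (1.0 - math.exp(-gamma))


def inverse_cdf(u, R, gamma):
    """Solve cdf(l) = u for l, with u uniform in [0, 1]."""
    return -(2.0 * R / gamma) * math.log(1.0 - u * (1.0 - math.exp(-gamma)))


def sample_length(R, gamma):
    """Draw one value of l by the inverse method."""
    return inverse_cdf(random.random(), R, gamma)


R, gamma = 1.5, 2.0  # hypothetical parameter values
samples = [sample_length(R, gamma) for _ in range(10000)]
```

Every sample falls in [0, 2R], and feeding a sample back through `cdf` recovers the uniform number it was generated from.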

How to understand this efficient implementation of PageRank calculation

For reference, I'm using this page. I understand the original pagerank equation
but I'm failing to understand why the sparse-matrix implementation is correct. Below is their code reproduced:
import numpy as np

def compute_PageRank(G, beta=0.85, epsilon=10**-4):
    '''
    Efficient computation of the PageRank values using a sparse adjacency
    matrix and the iterative power method.
    Parameters
    ----------
    G : boolean adjacency matrix. np.bool8
        If the element j,i is True, means that there is a link from i to j.
    beta: 1-teleportation probability.
    epsilon: stop condition. Minimum allowed amount of change in the PageRanks
        between iterations.
    Returns
    -------
    output : tuple
        PageRank array normalized to one.
        Number of iterations.
    '''
    # Test adjacency matrix is OK
    n, _ = G.shape
    assert(G.shape == (n, n))
    # Constants Speed-UP
    deg_out_beta = G.sum(axis=0).T/beta  # vector
    # Initialize
    ranks = np.ones((n, 1))/n  # vector
    time = 0
    flag = True
    while flag:
        time += 1
        with np.errstate(divide='ignore'):  # Ignore division by 0 on ranks/deg_out_beta
            new_ranks = G.dot((ranks/deg_out_beta))  # vector
        # Leaked PageRank
        new_ranks += (1-new_ranks.sum())/n
        # Stop condition
        if np.linalg.norm(ranks-new_ranks, ord=1) <= epsilon:
            flag = False
        ranks = new_ranks
    return(ranks, time)
To start, I'm trying to trace the code and understand how it relates to the PageRank equation. For the line under the with statement (new_ranks = G.dot((ranks/deg_out_beta))), this looks like the first part of the equation (the beta times M) BUT it seems to be ignoring all divide by zeros. I'm confused by this because the PageRank algorithm requires us to replace zero columns with ones (except along the diagonal). I'm not sure how this is accounted for here.
The next line new_ranks += (1-new_ranks.sum())/n is what I presume to be the second part of the equation. I can understand what this does, but I can't see how this translates to the original equation. I would've thought we would do something like new_ranks += (1-beta)*ranks.sum()/n.
This works because of the column sums:
e.T * M * r = e.T * r
since every column of M sums to one by construction. The convex combination with coefficient beta then ensures that the entries of the new r vector again sum to 1. What the algorithm does instead is compute the first matrix-vector product b = beta*M*r and then find a constant c so that r_new = b + c*e has entries summing to one. In theory this is the same as what the formula says, but in floating-point practice this approach also corrects and prevents the accumulation of floating-point error in the sum of r.
Computing it this way also allows zero columns to be ignored, as the compensation for them is computed automatically.
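The equivalence can be checked numerically on a small example. This is a sketch under stated assumptions: the 3-node graph is hypothetical, a dense array stands in for the sparse matrix, and the "textbook" form assumes the common convention that a zero column is replaced by a uniform column of 1/n (other conventions exist).

```python
import numpy as np

# Hypothetical graph: G[j, i] is True when page i links to page j.
# Page 2 has no outgoing links, so column 2 is a zero column.
G = np.array([[0, 1, 0],
              [1, 0, 0],
              [1, 1, 0]], dtype=bool)
n = G.shape[0]
beta = 0.85
r = np.ones(n) / n  # current ranks, summing to one

out_degree = G.sum(axis=0)
# Column-normalised link matrix; zero columns are simply left as zeros.
M = G / np.maximum(out_degree, 1)

# Form used in the code: add back whatever probability mass "leaked".
b = beta * M @ r
leak_form = b + (1.0 - b.sum()) / n

# Textbook form: fix the zero columns first, then add the teleport term.
M_fixed = M.copy()
M_fixed[:, out_degree == 0] = 1.0 / n
textbook = beta * M_fixed @ r + (1.0 - beta) / n
```

Both vectors agree and sum to one: the leaked mass is (1 - beta) plus beta times the rank sitting on pages without outgoing links, which is exactly what the teleport term and the repaired zero columns contribute together.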

What is this normalization curve? Constant ^ (Constant ^ Observation Indexed to 100)

My apologies, but I'm not quite sure how to even ask this question. I have some normalization curves I've been using at work, and I'd like to know more about them so I speak about them intelligently. They have an s shape like a sigmoid function, but their general formula is the following:
Constant ^ (Constant ^ Observation Indexed to 100)
First, index a variable from 0 to 100 with the highest observation equal to 100, then insert into the equations below for curves with different slopes.
s1 = 0.0000000001 ^ (0.97 ^ Index)
s2 = 0.0000000002 ^ (0.962 ^ Index)
s3 = 0.0000000003 ^ (0.953 ^ Index)
And so on, up to s10. The resulting values are compressed between 0 and 1. s10 has the steepest slope with values that skew toward 1, and s1 has the shallowest slope with values that skew toward 0.
I think they're very clever, and they work well for our purposes, but I don't know what to even call them. Can anyone point me in the right direction? Again, apologies for the vagueness and if this is inappropriately tagged.
The functions you describe are special cases of the Gompertz functions; Gompertz functions have a sigmoidal shape and have many applications across different domains. For example in biology, Gompertz functions are used to model bacterial and tumour cell growth.
To see how your equations relate to the more general Gompertz functions, let's rewrite the equations for s
On a side note, we can see that taking the double log of s (strictly log(-log s), because log s is negative for 0 < s < 1) linearises the equation as a function of the index.
We can now compare this with the more general Gompertz function
Taking the natural logarithm gives
We then set a=1 and take the natural logarithm again
So the equations you give are algebraically identical to the Gompertz functions with parameters
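One way to write out the steps described above is sketched below. The symbols are assumptions: α and β stand for the two constants in the question, x for the index, and the Gompertz function is taken in the common parametrization with parameters a, b, c.

```latex
\begin{align*}
s &= \alpha^{\beta^{x}}
  &&\Rightarrow\quad \ln s = \beta^{x}\,\ln\alpha
  &&\Rightarrow\quad \ln(-\ln s) = \ln(-\ln\alpha) + x\,\ln\beta \\
f(x) &= a\,e^{-b\,e^{-c x}}
  &&\Rightarrow\quad \ln f = \ln a - b\,e^{-c x}
  &&\Rightarrow\quad \ln(-\ln f) = \ln b - c\,x \quad (a = 1)
\end{align*}
```

Matching the two linear forms term by term gives a = 1, b = -ln α and c = -ln β, both positive because 0 < α < 1 and 0 < β < 1.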
Let's plot the function for the three sets of parameters that you give in your post (I use R here but it's easy to do something similar in e.g. Python)
# Define a function f which takes the index and two parameters a and b
# We use a helper function scale01 to scale the values of f in the interval [0,1]
# using min-max scaling
scale01 <- function(x) (x - min(x)) / (max(x) - min(x))
f <- function(idx, a, b) scale01(a ^ (b ^ idx))
# Calculate s for the three different sets of parameters and
# using integer index values from 0 to 100
idx <- 0:100
lst <- lapply(
    list(
        s1 = list(a = 0.0000000001, b = 0.97),
        s2 = list(a = 0.0000000002, b = 0.962),
        s3 = list(a = 0.0000000003, b = 0.953)),
    function(pars) f(idx, a = pars$a, b = pars$b))
# Plot
library(ggplot2)
df <- cbind(idx = idx, stack(lst))
ggplot(df, aes(idx, values, colour = ind)) + geom_line()
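A similar check in Python, without the plot, is sketched below. The function names are hypothetical; `gompertz` assumes the common three-parameter form a*exp(-b*exp(-c*t)), and no min-max scaling is applied here.

```python
import math


def curve(index, a, b):
    """The normalization curve from the question: a ^ (b ^ index)."""
    return a ** (b ** index)


def gompertz(t, asymptote, displacement, rate):
    """Gompertz function asymptote * exp(-displacement * exp(-rate * t))."""
    return asymptote * math.exp(-displacement * math.exp(-rate * t))


a, b = 0.0000000001, 0.97  # the s1 parameters from the question
idx = range(0, 101)
s1 = [curve(i, a, b) for i in idx]
g1 = [gompertz(i, 1.0, -math.log(a), -math.log(b)) for i in idx]

# log(-log s) is linear in the index, with slope log(b).
double_log = [math.log(-math.log(s)) for s in s1]
```

The two lists agree to floating-point precision, and consecutive entries of `double_log` differ by log(0.97).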

Solving vector second order differential equation while indexing into an array

I'm attempting to solve the differential equation:
m(t) = M(x)x'' + C(x, x') + B x'
where x and x' are vectors with 2 entries representing the angles and angular velocity in a dynamical system. M(x) is a 2x2 matrix that is a function of the components of theta, C is a 2x1 vector that is a function of theta and theta' and B is a 2x2 matrix of constants. m(t) is a 2*1001 array containing the torques applied to each of the two joints at the 1001 time steps and I would like to calculate the evolution of the angles as a function of those 1001 time steps.
I've transformed it to standard form such that :
x'' = M(x)^-1 (m(t) - C(x, x') - B x')
Then substituting y_1 = x and y_2 = x' gives the first order system of equations:
y_2 = y_1'
y_2' = M(y_1)^-1 (m(t) - C(y_1, y_2) - B y_2)
(I've used theta and phi in my code for x and y)
def joint_angles(theta_array, t, torques, B):
    phi_1 = np.array([theta_array[0], theta_array[1]])
    phi_2 = np.array([theta_array[2], theta_array[3]])
    def M_func(phi):
        M = np.array([[a_1+2.*a_2*np.cos(phi[1]), a_3+a_2*np.cos(phi[1])],[a_3+a_2*np.cos(phi[1]), a_3]])
        return np.linalg.inv(M)
    def C_func(phi, phi_dot):
        return a_2 * np.sin(phi[1]) * np.array([-phi_dot[1] * (2. * phi_dot[0] + phi_dot[1]), phi_dot[0]**2])
    dphi_2dt = M_func(phi_1) @ (torques[:, t] - C_func(phi_1, phi_2) - B @ phi_2)
    return dphi_2dt, phi_2
t = np.linspace(0,1,1001)
initial = theta_init[0], theta_init[1], dtheta_init[0], dtheta_init[1]
x = odeint(joint_angles, initial, t, args = (torque_array, B))
I get the error that I cannot index into torques using the t array, which makes perfect sense, however I am not sure how to have it use the current value of the torques at each time step.
I also tried putting odeint command in a for loop and only evaluating it at one time step at a time, using the solution of the function as the initial conditions for the next loop, however the function simply returned the initial conditions, meaning every loop was identical. This leads me to suspect I've made a mistake in my implementation of the standard form but I can't work out what it is. It would be preferable however to not have to call the odeint solver in a for loop every time, and rather do it all as one.
If helpful, my initial conditions and constant values are:
theta_init = np.array([10*np.pi/180, 143.54*np.pi/180])
dtheta_init = np.array([0, 0])
L_1 = 0.3
L_2 = 0.33
I_1 = 0.025
I_2 = 0.045
M_1 = 1.4
M_2 = 1.0
D_2 = 0.16
a_1 = I_1+I_2+M_2*(L_1**2)
a_2 = M_2*L_1*D_2
a_3 = I_2
Thanks for helping!
The solver uses internal, adaptive time stepping. The time list you pass in is only the list of points at which the internal solution is interpolated to produce output samples. The internal and external time lists are not related; the internal steps depend only on the given tolerances.
There is therefore no natural relation between array indices and the times at which the solver evaluates your function.
Translating a given time into an index and constructing a sample value from the surrounding table entries is called interpolation (by a piecewise polynomial function).
Torque as a physical quantity is at least continuous, so piecewise linear interpolation is the easiest way to turn the given table of values into an actual continuous function. Of course one also needs the time array.
So use numpy.interp, or the more advanced routines in scipy.interpolate (such as interp1d), to define a torque function that can be evaluated at arbitrary times as demanded by the solver and its integration method.
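A minimal sketch of that approach follows. The torque table and the damping matrix B are hypothetical placeholders, since the question does not give their values; the helper names are also assumptions. Note that the derivative is returned in the order [x', x''] so that it matches the state ordering [x, x'], which differs from the function posted in the question.

```python
import numpy as np
from scipy.integrate import odeint

# Constants from the question
L_1, I_1, I_2, M_2, D_2 = 0.3, 0.025, 0.045, 1.0, 0.16
a_1 = I_1 + I_2 + M_2 * L_1**2
a_2 = M_2 * L_1 * D_2
a_3 = I_2


def mass_matrix(phi):
    c = np.cos(phi[1])
    return np.array([[a_1 + 2.0 * a_2 * c, a_3 + a_2 * c],
                     [a_3 + a_2 * c, a_3]])


def coriolis(phi, phi_dot):
    return a_2 * np.sin(phi[1]) * np.array(
        [-phi_dot[1] * (2.0 * phi_dot[0] + phi_dot[1]), phi_dot[0]**2])


def joint_angles(state, t, t_samples, torques, B):
    phi, phi_dot = state[:2], state[2:]
    # Piecewise-linear interpolation of each torque row at the solver's time t.
    m_t = np.array([np.interp(t, t_samples, row) for row in torques])
    rhs = m_t - coriolis(phi, phi_dot) - B @ phi_dot
    phi_ddot = np.linalg.solve(mass_matrix(phi), rhs)
    # Derivative of [phi, phi_dot] is [phi_dot, phi_ddot].
    return np.concatenate([phi_dot, phi_ddot])


t = np.linspace(0, 1, 1001)
# Hypothetical torque table (2 x 1001) and damping matrix, for illustration only.
torque_array = np.vstack([0.2 * np.sin(2 * np.pi * t), 0.1 * np.cos(2 * np.pi * t)])
B = np.array([[0.05, 0.025], [0.025, 0.05]])
initial = [10 * np.pi / 180, 143.54 * np.pi / 180, 0.0, 0.0]

x = odeint(joint_angles, initial, t, args=(t, torque_array, B))
```

Here `x[:, :2]` holds the two angles and `x[:, 2:]` the angular velocities at the 1001 output times. `np.linalg.solve` is used in place of an explicit matrix inverse, which is a design choice of this sketch rather than something the answer prescribes.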

Reverse Interpolation

I have a class implementing an audio stream that can be read at varying speed (including reverse and fast varying / "scratching"). I use linear interpolation for the read part and everything works quite decently.
But now I want to implement writing to the stream at varying speed as well, and that requires me to implement a kind of "reverse interpolation", i.e. deduce the input sample vector Z that, interpolated with vector Y, will produce the output X (which I'm trying to write).
I've managed to do it for constant speeds, but generalising to varying speeds (e.g. accelerating or decelerating) is proving more complicated.
I imagine this problem has been solved repeatedly, but I can't seem to find many clues online, so my specific question is if anyone has heard of this problem and can point me in the right direction (or, even better, show me a solution :)
Thanks!
I would not call it "reverse interpolation" as that does not exist (my first thought was you were talking about extrapolation!). What you are doing is still simply interpolation, just at an uneven rate.
Interpolation: finding a value between known values
Extrapolation: finding a value beyond known values
Interpolating to/from constant rates is indeed much much simpler than the generic quest of "finding a value between known values". I propose 2 solutions.
1) Interpolate to a significantly higher rate, and then just sub-sample to the nearest one (try adding dithering)
2) Solve the generic problem: for each point, use the neighboring N points and fit a polynomial of order N-1 to them.
N=2 would be linear and would add overtones (C0 continuity)
N=3 could leave you with step changes at the halfway point between your source samples (perhaps worse overtones than N=2!)
N=4 will get you C1 continuity (slope will match as you change to the next sample), surely enough for your application.
Let me explain that last one.
For each output sample use the 2 previous and 2 following input samples. Call them S0 to S3 on a unit time scale (multiply by your sample period later), and you are interpolating from time 0 to 1. Y is your output and Y' is the slope.
Y will be calculated from this polynomial and its differential (slope)
Y(t) = At^3 + Bt^2 + Ct + D
Y'(t) = 3At^2 + 2Bt + C
The constraints (the values and slope at the endpoints on either side)
Y(0) = S1
Y'(0) = (S2-S0)/2
Y(1) = S2
Y'(1) = (S3-S1)/2
Expanding the polynomial
Y(0) = D
Y'(0) = C
Y(1) = A+B+C+D
Y'(1) = 3A+2B+C
Plugging in the Samples
D = S1
C = (S2-S0)/2
A + B = S2 - C - D
3A+2B = (S3-S1)/2 - C
The last 2 are a system of equations that are easily solvable. Subtract 2x the first from the second.
3A+2B - 2(A+B)= (S3-S1)/2 - C - 2(S2 - C - D)
A = (S3-S1)/2 + C - 2(S2 - D)
Then B is
B = S2 - A - C - D
Once you have A, B, C and D you can put any time 't' into the polynomial to find a sample value between your known samples.
Repeat for every output sample, and reuse A, B, C and D if the next output sample is still between the same 2 input samples. Calculating t each time is similar to Bresenham's line algorithm; you're just advancing by a different amount each time.
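The N=4 scheme can be sketched in Python as follows. The function names are hypothetical, positions are measured in units of the input sample period, and the sketch assumes 1 <= position < len(samples) - 2 so that two neighbours exist on each side. For brevity it recomputes the coefficients for every output sample instead of reusing them.

```python
def cubic_coefficients(s0, s1, s2, s3):
    """Coefficients of Y(t) = A t^3 + B t^2 + C t + D between s1 (t=0) and s2 (t=1)."""
    d = s1
    c = (s2 - s0) / 2
    a = (s3 - s1) / 2 + c - 2 * (s2 - d)
    b = s2 - a - c - d
    return a, b, c, d


def interpolate(samples, position):
    """Value of the stream at a fractional position."""
    i = int(position)   # index of the input sample at or before the position
    t = position - i    # fractional part, between 0 and 1
    a, b, c, d = cubic_coefficients(samples[i - 1], samples[i],
                                    samples[i + 1], samples[i + 2])
    return ((a * t + b) * t + c) * t + d  # Horner form of the cubic


def read_at_varying_speed(samples, start, speeds):
    """Advance through the stream by a different amount for each output sample."""
    position = start
    output = []
    for speed in speeds:
        output.append(interpolate(samples, position))
        position += speed
    return output


# Hypothetical input: a short ramp, read while accelerating.
ramp = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
out = read_at_varying_speed(ramp, start=1.0, speeds=[0.25, 0.5, 0.75, 1.0])
```

On a ramp the cubic reduces to a straight line, so the output equals the read positions themselves; at t = 0 and t = 1 the interpolant passes exactly through the input samples.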
