Memory Leak when running PYMC3 in a FOR LOOP - python-3.x

I'm using PYMC3 to fit tennis players' serve ace rates using Bayesian fitting to a Beta curve. Each time the code loops through a player, the memory use increases a little. I'm trying to do this for 400+ players over 3 different surfaces and I run out of memory after about 200 players. I don't understand why the memory doesn't get re-set after each loop iteration as I don't think I'm using info from prior loop iterations.
I think the issue may be to do with the Trace. I saw advice somewhere that I should not have trace = pm.sample(...) but rather just pm.sample(...) and then grab that data after the program has run. I'm not sure how to implement that fix and I'm hoping there's a more straightforward solution out there to what I imagine would be a fairly common problem (though I haven't seen questions on it much online).
The relevant bits of the code are shown below. Thanks in advance for your help.
import numpy as np
import pymc3 as pm
from scipy.stats import beta

# Fit a Beta prior to the pooled ace rates (location and scale fixed to [0, 1])
prior_parameters = beta.fit(chart_data, floc = 0, fscale = 1)
prior_a, prior_b = prior_parameters[0:2]
for i in range(server_by_surface_pct.shape[0]):
    #srv_count is number of serves taken by player i on surface j
    srv_count = pivot_srv_count.iat[i, j]
    #Go to next iteration of loop if no serves for player i on surface j
    if np.isnan(srv_count):
        continue
    #ace_pct is the percent of serves from player i on surface j that are aces
    ace_pct = server_by_surface_pct.iat[i,j]
    #calculate ace_count (number of aces) by player i on surface j
    ace_count = round(srv_count*ace_pct,0)
    #zero aces is possible so replace NANs with ZERO
    if np.isnan(ace_count):
        ace_count = 0.0
    #pm = PYMC3 -- this is the Bayesian fitting model
    with pm.Model() as model:
        theta_prior = pm.Beta('prior', prior_a, prior_b)
        observations = pm.Binomial('obs',n = srv_count, p = theta_prior, observed = ace_count)
        start = pm.find_MAP()
        step = pm.NUTS(scaling=start)
        trace = pm.sample(1000, step=step, start=start, progressbar=True)
    #mean of the trace is the new fitted serve percent for player i on surface j
    server_by_surface_pct_fitted.iat[i,j] = np.mean(trace['prior'])

Related

Why does the time taken by an iterative algorithm to find the sum of a list not increase uniformly with size?

I wanted to see how drastic is the difference in time complexity between the iterative and recursive approaches to sum an array. So I plotted a 'time' versus 'size of the list' graph for a pretty decent range of values for size(995). What I got was pretty much what I wanted except something unexpected caught my eye.
The graph can be seen here.
What's confusing me here are those bumps that the green line suddenly takes only for certain values and then comes back down. Why does that happen?
Here is the code I had written:
import matplotlib.pyplot as plt
from time import time

def sum_rec(lst): # Sums recursively
    if len(lst) == 0:
        return 0
    return lst[0]+sum_rec(lst[1:])

def sum_iter(lst): # Sums iteratively
    Sum = 0
    for i in range(len(lst)):
        Sum += lst[i] # add the element, not the index
    return Sum

def check_time(lst): # Returns the time taken for both algorithms
    start = time()
    Sum = sum_iter(lst)
    end = time()
    t_iter = end - start
    start = time()
    Sum = sum_rec(lst)
    end = time()
    t_rec = end - start
    return t_iter, t_rec

N = [n for n in range(995)]
T1 = [] # for iterative function
T2 = [] # for recursive function
for n in N: # values on the x-axis
    lst = [i for i in range(n)]
    t_iter, t_rec = check_time(lst)
    T1.append(t_iter)
    T2.append(t_rec)
plt.plot(N,T1)
plt.plot(N,T2) # Both plotted on graph
plt.show()
I'd say both algorithms look linear over this range, but the recursive one has a higher constant factor, which causes the steeper slope. (Strictly speaking, the lst[1:] slice copies the list on every call, so the recursive version does quadratic work in total; for fewer than 1000 elements the per-call overhead dominates.)
Other than that:
(1) You're mixing up the two plots.
The iterative one stays close to the x-axis while the recursive one increases.
One possible explanation is that each recursive call pushes a new stack frame and so costs more time than an iteration of a loop.
(2) You need to increase the size of the array, as small sizes are more likely to show spikes due to locality of reference.
(3) You need to repeat the experiment several times so that random spikes caused by some other process hogging the machine are averaged out.
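Point (3) can be sketched with the standard-library timeit module. This is a minimal sketch rather than the asker's benchmark: the helper name best_time, the repeat counts, and the list sizes are illustrative assumptions. Taking the minimum over several repeats is one common way to suppress spikes caused by other processes.

```python
import timeit

def sum_iter(lst):
    """Sum the list with a plain loop."""
    total = 0
    for x in lst:
        total += x
    return total

def sum_rec(lst):
    """Sum the list recursively (each slice copies the remaining list)."""
    if not lst:
        return 0
    return lst[0] + sum_rec(lst[1:])

def best_time(func, lst, repeats=5, number=20):
    """Hypothetical helper: best per-call time over several repeated runs."""
    timings = timeit.repeat(lambda: func(lst), repeat=repeats, number=number)
    return min(timings) / number

# Sizes stay well below Python's default recursion limit of 1000.
sizes = range(0, 501, 100)
results = {}
for n in sizes:
    lst = list(range(n))
    results[n] = (best_time(sum_iter, lst), best_time(sum_rec, lst))

for n, (t_iter, t_rec) in results.items():
    print(f"n={n:4d}  iterative={t_iter:.2e}s  recursive={t_rec:.2e}s")
```

Plotting the two columns of results against n should give much smoother curves than timing each call once with time().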

How can I compute (for later use) a wave with a very high frequency?

I'm running a physics simulation related to visible light, and the resulting wave function has a very, very high frequency -- cyclic frequency is on the order of 1.0e15, and the spatial frequency k is on the order of 1.0e7. Thankfully, I only use the spatial frequency, but when I calculate it for later usage (using either math or numpy), I get something that resembles a beat wave, unless I use N ~= k sample points, because I have to calculate it over a much greater range (on the order of 1.0e-3 - 1.0e-1). It produces a beat wave so consistently I spent a few hours to make sure I'm not actually calculating one. I'll also have to use fft() on the resulting wave and I'm afraid it won't work properly with a misrepresented wave.
I've tried using various amounts of sample points, but unless it's extraordinarily high (takes a good minute or two to calculate), only the prominence of beating changes. Just in case I'm misusing numpy, I tried the same thing with appending wave.value calculated by math.sin to a float array, but it had the same result.
import numpy as np
import matplotlib.pyplot as plt

mmScale = 1.0e-3
nmScale = 1.0e-9
c = 3.0e8
N = 1000

class Wave:
    def __init__(self, amplitude, wavelength):
        self.wavelength = wavelength*nmScale
        self.amplitude = amplitude
        self.omega = 2*np.pi*c/self.wavelength
        self.k = 2*np.pi/self.wavelength

    def value(self, time, travel):
        return self.amplitude*np.sin(self.omega*time - self.k*travel)

x = np.linspace(50, 250, N)*mmScale
wave = Wave(1, 400)
y = wave.value(0.1, x)
plt.plot(x,y)
plt.show()
The code above produces a graph of the function, and you can put in different values for N to see how it gives different waveforms.
Your sampling spatial frequency is:
1/Ts = 1 / (((250-50)*mmScale) / N) = N / ((250-50)*mmScale) = 5000 [samples/meter]
Your wave's spatial frequency is:
1/Tw = 1 / wavelength = 1 / (400e-9) = 2500000 [wavelengths/meter]
You fail to satisfy the Nyquist criterion by a factor of (2*2500000) / 5000 = 1000.
Thus you must expect serious aliasing effects. See https://en.wikipedia.org/wiki/Aliasing.
Not much can be done to battle it. But there are some tricks that may help you, depending on the application. One is to represent the wave as a complex envelope around the carrier, whose wavelength here is 400e-9 m. Please provide more detail on what you do with the wave.
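The sampling arithmetic in this answer can be reproduced in a few lines. This is a minimal sketch using the numbers from the question (a 400 nm wavelength sampled with N = 1000 points between 50 mm and 250 mm); the variable names are illustrative.

```python
import math

wavelength = 400e-9            # metres
x_min, x_max = 50e-3, 250e-3   # sampled range in metres
N = 1000                       # number of sample points

sampling_rate = N / (x_max - x_min)   # samples per metre
wave_freq = 1.0 / wavelength          # wavelengths per metre
nyquist_rate = 2.0 * wave_freq        # minimum samples per metre to avoid aliasing

shortfall = nyquist_rate / sampling_rate
min_samples = math.ceil(nyquist_rate * (x_max - x_min))

print(f"sampling rate : {sampling_rate:.0f} samples/m")
print(f"Nyquist rate  : {nyquist_rate:.0f} samples/m")
print(f"shortfall     : {shortfall:.0f}x")
print(f"samples needed: about {min_samples}")
```

Under these assumptions roughly a million sample points are needed over the 200 mm range just to reach the Nyquist limit, which matches the observation that the beating only disappears for extraordinarily large N.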

Bayesian Linear Regression with PyMC3 and a large dataset - bracket nesting level exceeded maximum and slow performance

I would like to use a Bayesian multivariate linear regression to estimate the strength of players in team sports (e.g. ice hockey, basketball or soccer). For that purpose, I create a matrix, X, containing the players as columns and the matches as rows. For each match the player entry is either 1 (player plays in the home team), -1 (player plays in the away team) or 0 (player does not take part in this game). The dependent variable Y is defined as the scoring differences for both teams in each match (Score_home_team - Score_away_team).
Thus, the number of parameters will be quite large for one season (e.g. X is defined by 300 rows x 450 columns; i.e. 450 player coefficients + y-intercept). When running the fit I came across a compilation error:
('Compilation failed (return status=1): /Users/me/.theano/compiledir_Darwin-17.7.0-x86_64-i386-64bit-i386-3.6.5-64/tmpdxxc2379/mod.cpp:27598:32: fatal error: bracket nesting level exceeded maximum of 256.
I tried to handle this error by setting:
theano.config.gcc.cxxflags = "-fbracket-depth=1024"
Now, the sampling is running. However, it is so slow that even if I take only 35 of 300 rows the sampling is not completed within 20 minutes.
This is my basic code:
import pymc3 as pm

basic_model = pm.Model()
with basic_model:
    # Priors for beta coefficients - these are the coefficients of the players
    dict_betas = {}
    for col in X.columns:
        dict_betas[col] = pm.Normal(col, mu=0, sd=10)
    # Priors for unknown model parameters
    alpha = pm.Normal('alpha', mu=0, sd=10) # alpha is the y-intercept
    sigma = pm.HalfNormal('sigma', sd=1) # standard deviation of the observations
    # Expected value of outcome
    mu = alpha
    for col in X.columns:
        mu = mu + dict_betas[col] * X[col] # mu = alpha + beta_1 * Player_1 + beta_2 * Player_2 + ...
    # Likelihood (sampling distribution) of observations
    Y_obs = pm.Normal('Y_obs', mu=mu, sd=sigma, observed=Y)
The instantiation of the model runs within one minute for the large dataset. I do the sampling using:
with basic_model:
    # draw 500 posterior samples
    trace = pm.sample(500)
The sampling is completed for small sample sizes (e.g. 9 rows, 80 columns) within 7 minutes. However, the time is increasing substantially with increasing sample size.
Any suggestions on how I can get this Bayesian linear regression to run in a feasible amount of time? Are these kinds of problems doable using PyMC3 (remember I came across a bracket nesting error)? I saw in a recent publication that this kind of analysis is doable in R (https://arxiv.org/pdf/1810.08032.pdf). Therefore, I guess it should also somehow work with Python 3.
Any help is appreciated!
Eliminating the for loops should improve performance and might also take care of the nesting issue you are reporting. Theano TensorVariables and the PyMC3 random variables that derive from them are already multidimensional and support linear algebra operations. Try changing your code to something along the lines of
beta = pm.Normal('beta', mu=0, sd=10, shape=X.shape[1])
...
mu = alpha + pm.math.dot(X, beta)
...
If you need to specify different prior values for mu and/or sd, those arguments accept anything that theano.tensor.as_tensor_variable() accepts, so you can pass a list or numpy array.
I highly recommend getting familiar with the theano.tensor and pymc3.math operations since sometimes you must use these to properly manipulate random variables, and in general it should lead to more efficient code.
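The equivalence this answer relies on, that the column-by-column loop and a single dot product give the same linear predictor, can be checked with plain NumPy. This is only a stand-in sketch: the arrays below are random placeholders with the shapes mentioned in the question, not real match data, and NumPy arrays are used in place of Theano/PyMC3 tensors.

```python
import numpy as np

rng = np.random.default_rng(0)
n_matches, n_players = 300, 450

# Placeholder design matrix: 1 = home team, -1 = away team, 0 = not playing.
X = rng.choice([-1, 0, 1], size=(n_matches, n_players))
alpha = 0.5                                  # placeholder intercept
beta = rng.normal(0, 10, size=n_players)     # placeholder player coefficients

# Loop version: one term added per player, as in the question.
mu_loop = np.full(n_matches, alpha)
for col in range(n_players):
    mu_loop = mu_loop + beta[col] * X[:, col]

# Vectorised version: a single matrix-vector product, as in the answer.
mu_dot = alpha + X @ beta

print(np.allclose(mu_loop, mu_dot))
```

In the PyMC3 model the loop builds hundreds of separate graph nodes, while the dot product is a single operation on one vector-valued random variable, which is why the answer expects it to compile and sample more efficiently.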

Cosmic ray removal in spectra

Python developers
I am working on spectroscopy at a university. My experimental 1-D data sometimes shows a "cosmic ray", a 3-pixel-wide, ultra-high-intensity spike, which is not what I want to analyze. So I want to remove these kinds of spurious peaks.
Does anybody know how to fix this issue in Python 3?
Thanks in advance!!
A simple solution could be to use the algorithm proposed by Whitaker and Hayes, which applies modified z-scores to the derivative of the spectrum. This Medium post explains how it works and gives a Python implementation: https://towardsdatascience.com/removing-spikes-from-raman-spectra-8a9fdda0ac22 .
The idea is to calculate the modified z-scores of the spectrum's first differences and apply a threshold to detect the cosmic spikes. Afterwards, a fixer replaces each spike point with the mean of the surrounding non-spike pixels.
import numpy as np

# definition of a function to calculate the modified z score.
def modified_z_score(intensity):
    median_int = np.median(intensity)
    mad_int = np.median(np.abs(intensity - median_int)) # median absolute deviation
    modified_z_scores = 0.6745 * (intensity - median_int) / mad_int
    return modified_z_scores

# Once the spike detection works, the spectrum can be fixed by calculating the average of the previous and the next point to the spike. y is the intensity values of a spectrum, m is the window which we will use to calculate the mean.
def fixer(y,m):
    threshold = 7 # binarization threshold.
    spikes = abs(np.array(modified_z_score(np.diff(y)))) > threshold
    y_out = y.copy() # So we don't overwrite y
    for i in np.arange(len(spikes)):
        if spikes[i] != 0: # If we have a spike in position i
            # we select up to 2 m + 1 points around our spike, clipped to valid indices
            w = np.arange(max(i-m, 0), min(i+1+m, len(spikes)))
            w2 = w[spikes[w] == 0] # From such interval, we choose the ones which are not spikes
            y_out[i] = np.mean(y[w2]) # and we average the value
    return y_out
The answer depends on what your data looks like. If you have access to the two-dimensional CCD readouts that the one-dimensional spectra were created from, then you can use the lacosmic module to get rid of the cosmic rays there. If you have only one-dimensional spectra, but multiple spectra from the same source, then a quick ad-hoc fix is to make a rough normalisation of the spectra and remove those pixels that are several times brighter than the corresponding pixels in the other spectra. If you have only one one-dimensional spectrum from each source, then a less reliable option is to remove all pixels that are much brighter than their neighbours. (Depending on the shape of your cosmics, you may even want to remove the nearest 5 pixels or so, to catch the wings of the cosmic ray peak as well.)
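The last option, removing pixels that are much brighter than their neighbours, could be sketched as follows. This is one possible reading of that suggestion, not the answerer's code: the function name despike, the 9-pixel window, and the threshold of 7 robust standard deviations are all assumptions, and the spectrum is synthetic.

```python
import numpy as np

def despike(y, window=9, threshold=7.0):
    """Hypothetical helper: replace pixels far above their local median with that median."""
    y = np.asarray(y, dtype=float)
    half = window // 2
    padded = np.pad(y, half, mode="edge")
    # Running median over a window centred on each pixel.
    local_median = np.array([np.median(padded[i:i + window]) for i in range(len(y))])
    residual = y - local_median
    # Robust noise estimate from the median absolute deviation of the residuals.
    mad = np.median(np.abs(residual - np.median(residual)))
    sigma_est = 1.4826 * mad
    mask = residual > threshold * sigma_est
    cleaned = np.where(mask, local_median, y)
    return cleaned, mask

# Synthetic spectrum: a broad peak on a flat baseline, plus a 3-pixel "cosmic ray".
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
clean = 100 + 50 * np.exp(-((x - 0.5) / 0.1) ** 2) + rng.normal(0, 1, x.size)
spiked = clean.copy()
spiked[100:103] += 1000.0

cleaned, mask = despike(spiked)
print("flagged pixels:", np.flatnonzero(mask))
```

The window has to be more than twice as wide as the spike so that the running median is not pulled up by the spike itself, which is why 9 pixels is used here for a 3-pixel event.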

Incorporating uncertainty into a pymc3 model

I have a set of data for which I have the mean, standard deviation and number of observations for each point (i.e., I have knowledge regarding the accuracy of the measure). In a traditional pymc3 model where I look only at the means, I may do something along the lines of:
x = data['mean']
with pm.Model() as m:
    a = pm.Normal('a', mu=0, sd=1)
    b = pm.Normal('b', mu=1, sd=1)
    y = a + b*x
    eps = pm.HalfNormal('eps', sd=1)
    likelihood = pm.Normal('likelihood', mu=y, sd=eps, observed=x)
What is the best way to incorporate the information regarding the variance of the observations into the model? Obviously the result should weight low-variance observations more heavily than high-variance (less certain) observations.
One approach a statistician suggested was to do the following:
x = data['mean'] # mean of observation
x_sd = data['sd'] # sd of observation
x_n = data['n'] # number of measures for observation
x_sem = x_sd/np.sqrt(x_n) # standard error of the mean
with pm.Model() as m:
    a = pm.Normal('a', mu=0, sd=1)
    b = pm.Normal('b', mu=1, sd=1)
    y = a + b*x
    eps = pm.HalfNormal('eps', sd=1)
    obs = pm.Normal('obs', mu=x, sd=x_sem, shape=len(x))
    likelihood = pm.Normal('likelihood', mu=y, sd=eps, observed=obs)
However, when I run this I get:
TypeError: observed needs to be data but got: <class 'pymc3.model.FreeRV'>
I am running the master branch of pymc3 (3.0 has some performance issues resulting in very slow sample times).
You are close, you just need to make some small changes. The main reason is that for PyMC3 data is always constant. Check the following code:
with pm.Model() as m:
    a = pm.Normal('a', mu=0, sd=1)
    b = pm.Normal('b', mu=1, sd=1)
    mu = a + b*x
    mu_est = pm.Normal('mu_est', mu, x_sem, shape=len(x))
    likelihood = pm.Normal('likelihood', mu=mu_est, sd=x_sd, observed=x)
Notice that I keep the data fixed and introduce the observed uncertainty at two points: in the estimation of mu_est and in the likelihood. Of course, you are free not to use x_sem or x_sd and to estimate them instead, as you did in your code with the variable eps.
On a historical note, code with "random data" used to work on PyMC3 (at least for some models), but given that it was not really designed to work that way, developers decided to prevent the user from using random data, and that explains the message you got.
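To make the role of x_sem concrete, here is a small NumPy sketch with made-up numbers. The data values are purely illustrative, and the inverse-variance weights are shown only as a familiar way of expressing "low-variance observations count more"; the PyMC3 model above does not compute them explicitly.

```python
import numpy as np

# Made-up summary data: mean, standard deviation and count for three points.
x = np.array([1.0, 2.0, 3.0])
x_sd = np.array([0.5, 0.5, 2.0])
x_n = np.array([25, 4, 16])

# Standard error of each mean, as defined in the question.
x_sem = x_sd / np.sqrt(x_n)

# Inverse-variance weights, normalised to sum to one.
weights = 1.0 / x_sem**2
rel_weights = weights / weights.sum()

print("standard errors :", x_sem)
print("relative weights:", rel_weights)
```

The first point, with the smallest standard error, carries most of the weight. That is the behaviour the question asks for, and the one that passing x_sem as the sd of mu_est is meant to produce.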

Resources