Incorporating uncertainty into a pymc3 model - statistics

I have a set of data for which I have the mean, standard deviation and number of observations for each point (i.e., I have knowledge regarding the accuracy of the measure). In a traditional pymc3 model where I look only at the means, I may do something along the lines of:
x = data['mean']
with pm.Model() as m:
a = pm.Normal('a', mu=0, sd=1)
b = pm.Normal('b', mu=1, sd=1)
y = a + b*x
eps= pm.HalfNormal('eps', sd=1)
likelihood = pm.Normal('likelihood', mu=y, sd=eps, observed=x)
What is the best way to incorporate the information regarding the variance of the observations into the model? Obviously the result should weight low-variance observations more heavily than high-variance (less certain) observations.
One approach a statistician suggested was to do the following:
x = data['mean'] # mean of observation
x_sd = data['sd'] # sd of observation
x_n = data['n'] # of measures for observation
x_sem = x_sd/np.sqrt(x_n)
with pm.Model() as m:
a = pm.Normal('a', mu=0, sd=1)
b = pm.Normal('b', mu=1, sd=1)
y = a + b*x
eps = pm.HalfNormal('eps', sd=1)
obs = mc.Normal('obs', mu=x, sd=x_sem, shape=len(x))
likelihood = pm.Normal('likelihood', mu=y, eps=eps, observed=obs)
However, when I run this I get:
TypeError: observed needs to be data but got: <class 'pymc3.model.FreeRV'>
I am running the master branch of pymc3 (3.0 has some performance issues resulting in very slow sample times).

You are close, you just need to make some small changes. The main reason is that for PyMC3 data is always constant. Check the following code:
with pm.Model() as m:
a = pm.Normal('a', mu=0, sd=1)
b = pm.Normal('b', mu=1, sd=1)
mu = a + b*x
mu_est = pm.Normal('mu_est', mu, x_sem, shape=len(x))
likelihood = pm.Normal('likelihood', mu=mu_est, sd=x_sd, observed=x)
Notice than I keep the data fixed and I introduce the observed uncertainty at two points: for the estimation of mu_est and for the likelihood. Of course you are free to do not use x_sem or x_sd and instead estimate them, like you did in your code with the variable eps.
On a historical note, code with "random data" used to work on PyMC3 (at least for some models), but given that it was not really designed to work that way, developers decided to prevent the user from using random data, and that explains the message you got.

Related

How to define log-count ratio for multiclass text dataset (fastai)?

I am trying to follow Rachel Thomas path of sentiment classification with Naive Bayes. In the video she uses a binary dataset (pos. and neg. movie reviews). When it comes to apply Naive Bayes, this is what she does:
Defintion: log-count ratio r for each word f:
r = log (ratio of feature f in positive documents) / (ratio of feature f in negative documents)
where ratio of feature $f$ in positive documents is the number of times a positive document has a feature divided by the number of positive documents.
p1 = np.squeeze(np.asarray(x[y.items==positive].sum(0)))
p0 = np.squeeze(np.asarray(x[y.items==negative].sum(0)))
pr1 = (p1+1) / ((y.items==positive).sum() + 1)
pr0 = (p0+1) / ((y.items==negative).sum() + 1)
r = np.log(pr1/pr0)
--> it is very simple to apply the log-count-ratio to a dataset with 2 labels!
Problem:
My dataset is not binary! Lets assume I have 5 labels: label_1,...,label_5
How do I get the log-count ratio r for multilabel dataset?
My approach:
p4 = np.squeeze(np.asarray(x[y.items==label_5].sum(0)))
p3 = np.squeeze(np.asarray(x[y.items==label_4].sum(0)))
p2 = np.squeeze(np.asarray(x[y.items==label_3].sum(0)))
p1 = np.squeeze(np.asarray(x[y.items==label_2].sum(0)))
p0 = np.squeeze(np.asarray(x[y.items==label_1].sum(0)))
log-count-ratio:
pr1 = (p1+1) / ((y.items==label_2).sum() + 1)
pr1_not = (p1+1) / ((y.items!=label_2).sum() + 1)
r_1 = np.log(pr1/pr1_not)
log-count-ratio:
pr2 = (p2+1) / ((y.items==label_3).sum() + 1)
pr2_not = (p2+1) / ((y.items!=label_3).sum() + 1)
r_2 = np.log(pr2/pr2_not)
...
Is this correct? Does it mean I get multiple ratios?
Yes this is correct. The “negative class” is basically all the classes but the one you are considering. So yes, you will get multiple ratios (as many as number of classes you have).
From https://marvinlsj.github.io/2018/11/23/NBSVM%20for%20sentiment%20and%20topic%20classification/
, the log-count-ratio is derived from Posterior prob ratio which is good for comparing 2 classes to get insight into which is the most probable. I guess you're trying to do one-vs-one method for multi-class problem. This will end up with 5x4/2=10 pairs of ratios for classification. If you'd like to do classification only, we normally compute Posterior prob for each class and select the best. So in your case, you just select the best from sum(log(p1)), sum(log(p2)), ..., sum(log(p5)).

Difficulty in understanding the function and the function parameters

I am trying to build a program on Minutiae based fingerprint verification and for pre-processing I need to orient the ridges and I found a related function on the internet but could not understand it properly.
I tried reading many research papers on minutiae extraction.
This is the function I found on the internet
def ridge_orient(im, gradientsigma, blocksigma, orientsmoothsigma):
rows,cols = im.shape;
#Calculate image gradients.
sze = np.fix(6*gradientsigma);
if np.remainder(sze,2) == 0:
sze = sze+1;
gauss = cv2.getGaussianKernel(np.int(sze),gradientsigma);
f = gauss * gauss.T;
fy,fx = np.gradient(f); #Gradient of Gaussian
#Gx = ndimage.convolve(np.double(im),fx);
#Gy = ndimage.convolve(np.double(im),fy);
Gx = signal.convolve2d(im,fx,mode='same');
Gy = signal.convolve2d(im,fy,mode='same');
Gxx = np.power(Gx,2);
Gyy = np.power(Gy,2);
Gxy = Gx*Gy;
#Now smooth the covariance data to perform a weighted summation of the data.
sze = np.fix(6*blocksigma);
gauss = cv2.getGaussianKernel(np.int(sze),blocksigma);
f = gauss * gauss.T;
Gxx = ndimage.convolve(Gxx,f);
Gyy = ndimage.convolve(Gyy,f);
Gxy = 2*ndimage.convolve(Gxy,f);
# Analytic solution of principal direction
denom = np.sqrt(np.power(Gxy,2) + np.power((Gxx - Gyy),2)) + np.finfo(float).eps;
sin2theta = Gxy/denom; # Sine and cosine of doubled angles
cos2theta = (Gxx-Gyy)/denom;
if orientsmoothsigma:
sze = np.fix(6*orientsmoothsigma);
if np.remainder(sze,2) == 0:
sze = sze+1;
gauss = cv2.getGaussianKernel(np.int(sze),orientsmoothsigma);
f = gauss * gauss.T;
cos2theta = ndimage.convolve(cos2theta,f); # Smoothed sine and cosine of
sin2theta = ndimage.convolve(sin2theta,f); # doubled angles
orientim = np.pi/2 + np.arctan2(sin2theta,cos2theta)/2;
return(orientim);
This code computes the structure tensor, a method to compute the local orientation in a robust way. It obtains, per pixel, a symmetric 2x2 matrix (tensor) (variables Gxx, Gyy and Gxy) whose eigenvectors indicate local orientation and whose eigenvalues indicate strength of local variation. Because it returns a matrix rather than a simple gradient vector, you can distinguish uniform regions from structures like a cross, neither of which has a strong gradient, but the cross has a strong local variation.
Some criticism about the code
fy,fx = np.gradient(f);
is a really bad way of obtaining a Gaussian derivative kernel. Here you just lose all the benefits of a Gaussian gradient, and instead compute a finite difference approximation to the gradient.
Next, the code doesn’t use the separability of the Gaussian to reduce computational complexity, which is also a shame.
In this blog post I go over the theory of Gaussian filtering and Gaussian derivatives, and in this other one I detail some practical aspects (using MATLAB, but this should readily extend to Python).

Bayesian Linear Regression with PyMC3 and a large dataset - bracket nesting level exceeded maximum and slow performance

I would like to use a Bayesian multivariate linear regression to estimate the strength of players in team sports (e.g. ice hockey, basketball or soccer). For that purpose, I create a matrix, X, containing the players as columns and the matches as rows. For each match the player entry is either 1 (player plays in the home team), -1 (player plays in the away team) or 0 (player does not take part in this game). The dependent variable Y is defined as the scoring differences for both teams in each match (Score_home_team - Score_away_team).
Thus, the number of parameters will be quite large for one season (e.g. X is defined by 300 rows x 450 columns; i.e. 450 player coefficients + y-intercept). When running the fit I came across a compilation error:
('Compilation failed (return status=1): /Users/me/.theano/compiledir_Darwin-17.7.0-x86_64-i386-64bit-i386-3.6.5-64/tmpdxxc2379/mod.cpp:27598:32: fatal error: bracket nesting level exceeded maximum of 256.
I tried to handle this error by setting:
theano.config.gcc.cxxflags = "-fbracket-depth=1024"
Now, the sampling is running. However, it is so slow that even if I take only 35 of 300 rows the sampling is not completed within 20 minutes.
This is my basic code:
import pymc3 as pm
basic_model = pm.Model()
with basic_model:
# Priors for beta coefficients - these are the coefficients of the players
dict_betas = {}
for col in X.columns:
dict_betas[col] = pm.Normal(col, mu=0, sd=10)
# Priors for unknown model parameters
alpha = pm.Normal('alpha', mu=0, sd=10) # alpha is the y-intercept
sigma = pm.HalfNormal('sigma', sd=1) # standard deviation of the observations
# Expected value of outcome
mu = alpha
for col in X.columns:
mu = mu + dict_betas[col] * X[col] # mu = alpha + beta_1 * Player_1 + beta_2 * Player_2 + ...
# Likelihood (sampling distribution) of observations
Y_obs = pm.Normal('Y_obs', mu=mu, sd=sigma, observed=Y)
The instantiation of the model runs within one minute for the large dataset. I do the sampling using:
with basic_model:
# draw 500 posterior samples
trace = pm.sample(500)
The sampling is completed for small sample sizes (e.g. 9 rows, 80 columns) within 7 minutes. However, the time is increasing substantially with increasing sample size.
Any suggestions how I can get this Bayesian linear regression to run in a feasible amount of time? Are these kind of problems doable using PyMC3 (remember I came across a bracket nesting error)? I saw in a recent publication that this kind of analysis is doable in R (https://arxiv.org/pdf/1810.08032.pdf). Therefore, I guess it should also somehow work with Python 3.
Any help is appreciated!
Eliminating the for loops should improve performance and might also take care of the nesting issue you are reporting. Theano TensorVariables and the PyMC3 random variables that derive from them are already multidimensional and support linear algebra operations. Try changing your code to something along the lines of
beta = pm.Normal('beta', mu=0, sd=10, shape=X.shape[1])
...
mu = alpha + pm.math.dot(X, beta)
...
If you need specify different prior values for mu and/or sd, those arguments accept anything that theano.tensor.as_tensor_variable() accepts, so you can pass a list or numpy array.
I highly recommend getting familiar with the theano.tensor and pymc3.math operations since sometimes you must use these to properly manipulate random variables, and in general it should lead to more efficient code.

How to add a confusion matrix to Theano examples?

I want to make use of Theano's logistic regression classifier, but I would like to make an apples-to-apples comparison with previous studies I've done to see how deep learning stacks up. I recognize this is probably a fairly simple task if I was more proficient in Theano, but this is what I have so far. From the tutorials on the website, I have the following code:
def errors(self, y):
# check if y has same dimension of y_pred
if y.ndim != self.y_pred.ndim:
raise TypeError(
'y should have the same shape as self.y_pred',
('y', y.type, 'y_pred', self.y_pred.type)
)
# check if y is of the correct datatype
if y.dtype.startswith('int'):
# the T.neq operator returns a vector of 0s and 1s, where 1
# represents a mistake in prediction
return T.mean(T.neq(self.y_pred, y))
I'm pretty sure this is where I need to add the functionality, but I'm not certain how to go about it. What I need is either access to y_pred and y for each and every run (to update my confusion matrix in python) or to have the C++ code handle the confusion matrix and return it at some point along the way. I don't think I can do the former, and I'm unsure how to do the latter. I've done some messing around with an update function along the lines of:
def confuMat(self, y):
x=T.vector('x')
classes = T.scalar('n_classes')
onehot = T.eq(x.dimshuffle(0,'x'),T.arange(classes).dimshuffle('x',0))
oneHot = theano.function([x,classes],onehot)
yMat = T.matrix('y')
yPredMat = T.matrix('y_pred')
confMat = T.dot(yMat.T,yPredMat)
confusionMatrix = theano.function(inputs=[yMat,yPredMat],outputs=confMat)
def confusion_matrix(x,y,n_class):
return confusionMatrix(oneHot(x,n_class),oneHot(y,n_class))
t = np.asarray(confusion_matrix(y,self.y_pred,self.n_out))
print (t)
But I'm not completely clear on how to get this to interface with the function in question and give me a numpy array I can work with.
I'm quite new to Theano, so hopefully this is an easy fix for one of you. I'd like to use this classifer as my output layer in a number of configurations, so I could use the confusion matrix with other architectures.
I suggest using a brute force sort of a way. You need an output for a prediction first. Create a function for it.
prediction = theano.function(
inputs = [index],
outputs = MLPlayers.predicts,
givens={
x: test_set_x[index * batch_size: (index + 1) * batch_size]})
In your test loop, gather the predictions...
labels = labels + test_set_y.eval().tolist()
for mini_batch in xrange(n_test_batches):
wrong = wrong + int(test_model(mini_batch))
predictions = predictions + prediction(mini_batch).tolist()
Now create confusion matrix this way:
correct = 0
confusion = numpy.zeros((outs,outs), dtype = int)
for index in xrange(len(predictions)):
if labels[index] is predictions[index]:
correct = correct + 1
confusion[int(predictions[index]),int(labels[index])] = confusion[int(predictions[index]),int(labels[index])] + 1
You can find this kind of an implementation in this repository.

lsqcurvefit when expecting small coefficients

I've generated a plot of the attenutation seen in an electrical trace up to a frequency of 14e10 rad/s. The ydata ranges from approximately 1-10 Np/m. I'm trying to generate a fit of the form
y = A*sqrt(x) + B*x + C*x^2.
I expect A to be around 10^-6, B to be around 10^-11, and C to be around 10^-23. However, the smallest coefficient lsqcurvefit will return is 10^-7. Also, its will only return a nonzero coefficient for A, while returning 0 for B and C. The fit actually looks really good however the physics indicates that B and C should not be 0.
Here is how I'm calling the function
% measurement estimate
x_alpha = [1e-6 1e-11 1e-23];
lb = [1e-7, 1e-13, 1e-25];
ub = [1e-3, 1e-6, 1e-15];
x_alpha = lsqcurvefit(#modelfun, x_alpha, omega, alpha_t, lb,ub)
Here is the model function
function [ yhat ] = modelfun( x, xdata )
yhat = x(1)*xdata.^.5 + x(2)*xdata + x(3)*xdata.^2;
end
Is it possible to get lsqcurvefit to return such small coefficients? Is the error in rounding or is it something else? Any ways I can change the tolerance to see a fit closer to what I expect?
Found a stackoverflow page that seems to address this issue!
fit using lsqcurvefit

Resources