I'm trying to fit a simple Dirichlet-Multinomial model in TensorFlow Probability. The concentration parameters are gamma, and I have put a Gamma(1,1) prior distribution on them. This is the model, where S is the number of categories and N is the number of samples:
def dirichlet_model(S, N):
    gamma = ed.Gamma(tf.ones(S)*1.0, tf.ones(S)*1.0, name='gamma')
    y = ed.DirichletMultinomial(total_count=500., concentration=gamma, sample_shape=(N), name='y')
    return y

log_joint = ed.make_log_joint_fn(dirichlet_model)
However, when I try to sample from this using HMC, the acceptance rate is zero, and the initial draw for gamma contains negative values. Am I doing something wrong? Shouldn't negative proposals for the concentration parameters be rejected automatically? Below my sampling code:
def target_log_prob_fn(gamma):
    """Unnormalized target density as a function of states."""
    return log_joint(
        S=S, N=N,
        gamma=gamma,
        y=y_new)

num_results = 5000
num_burnin_steps = 3000

states, kernel_results = tfp.mcmc.sample_chain(
    num_results=num_results,
    num_burnin_steps=num_burnin_steps,
    current_state=[
        tf.ones([5], name='init_gamma')*5,
    ],
    kernel=tfp.mcmc.HamiltonianMonteCarlo(
        target_log_prob_fn=target_log_prob_fn,
        step_size=0.4,
        num_leapfrog_steps=3))

gamma = states

with tf.Session() as sess:
    [
        gamma_,
        is_accepted_,
    ] = sess.run([
        gamma,
        kernel_results.is_accepted,
    ])

num_accepted = np.sum(is_accepted_)
print('Acceptance rate: {}'.format(num_accepted / num_results))
Try reducing step size to increase acceptance rate. Optimal acceptance rate for HMC is around .651 (https://arxiv.org/abs/1001.4460). Not sure why you'd see negative values. Maybe floating point error near zero? Can you post some of the logs of your run?
I have trained ResNet50 for binary image classification.
I want to decrease the number of false negatives by reducing the threshold value.
How can I do that?
To decrease the number of false negatives (FN) i.e. increase the recall (since recall = TP / (TP + FN)) you should increase the positive weight (the weight of the occurrence of that class) above 1. For example nn.BCEWithLogitsLoss allows you to provide the pos_weight option:
pos_weight > 1 increases the recall, pos_weight < 1 increases the precision.
For example, if a dataset contains 100 positive and 300 negative examples of a single class, then pos_weight for the class should be equal to 300/100 = 3. The loss would act as if the dataset contains 3*100 = 300 positive examples.
As a side note, the explicit expression for the binary cross entropy with logits (where "with logits" should rather be understood as "from logits") is:
>>> z = torch.sigmoid(q)
>>> loss = -(w_p*p*torch.log(z) + (1-p)*torch.log(1-z))
Above, q are the raw logit values, p are the target labels (0 or 1), and w_p is the weight of the positive instance.
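A minimal sketch of both ideas in plain Python (no PyTorch needed), assuming binary labels in {0, 1}. The function names `weighted_bce_from_logits` and `count_false_negatives` and the toy logits are made up for illustration. The first function mirrors the expression above; the second shows that lowering the decision threshold on sigmoid(q) below 0.5 turns some false negatives into true positives, usually at the cost of more false positives.

```python
import math

def sigmoid(q):
    return 1.0 / (1.0 + math.exp(-q))

def weighted_bce_from_logits(logits, targets, pos_weight=1.0):
    """Mean of -(w_p*p*log(z) + (1-p)*log(1-z)) with z = sigmoid(q)."""
    total = 0.0
    for q, p in zip(logits, targets):
        z = sigmoid(q)
        total += -(pos_weight * p * math.log(z) + (1 - p) * math.log(1 - z))
    return total / len(logits)

def count_false_negatives(logits, targets, threshold=0.5):
    """Count positives (target 1) whose predicted probability is below threshold."""
    return sum(1 for q, p in zip(logits, targets)
               if p == 1 and sigmoid(q) < threshold)

# Toy data (hypothetical): four positives, two negatives
logits = [-1.2, -0.3, 0.4, 2.0, -2.5, 0.1]
targets = [1, 1, 1, 1, 0, 0]

fn_default = count_false_negatives(logits, targets, threshold=0.5)
fn_lowered = count_false_negatives(logits, targets, threshold=0.2)
loss_plain = weighted_bce_from_logits(logits, targets)
loss_weighted = weighted_bce_from_logits(logits, targets, pos_weight=3.0)
```

With a weight above 1 the misclassified positives contribute more to the loss, which pushes training towards higher recall; the threshold change acts only at prediction time and needs no retraining.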
Let's say I have a data set and used matplotlib to draw a histogram of said data set.
n, bins, patches = plt.hist(data, density=True)  # density=True replaces normed=1, which newer Matplotlib no longer accepts
How do I calculate the standard deviation, using the n and bins values that hist() returns? I'm currently doing this to calculate the mean:
s = 0
for i in range(len(n)):
    s += n[i] * ((bins[i] + bins[i+1]) / 2)
mean = s / numpy.sum(n)
which seems to work fine as I get pretty accurate results. However, if I try to calculate the standard deviation like this:
t = 0
for i in range(len(n)):
    t += (bins[i] - mean)**2
std = np.sqrt(t / numpy.sum(n))
my results are way off from what numpy.std(data) returns. Replacing the left bin limits with the central point of each bin doesn't change this either. I have the feeling that the problem is that the n and bins values don't actually contain any information on how the individual data points are distributed within each bin, but the assignment I'm working on clearly demands that I use them to calculate the standard deviation.
You haven't weighted the contribution of each bin with n[i]. Change the increment of t to
t += n[i]*(bins[i] - mean)**2
By the way, you can simplify (and speed up) your calculation by using numpy.average with the weights argument.
Here's an example. First, generate some data to work with. We'll compute the sample mean, variance and standard deviation of the input before computing the histogram.
In [54]: x = np.random.normal(loc=10, scale=2, size=1000)
In [55]: x.mean()
Out[55]: 9.9760798903061847
In [56]: x.var()
Out[56]: 3.7673459904902025
In [57]: x.std()
Out[57]: 1.9409652213499866
I'll use numpy.histogram to compute the histogram:
In [58]: n, bins = np.histogram(x)
mids is the midpoints of the bins; it has the same length as n:
In [59]: mids = 0.5*(bins[1:] + bins[:-1])
The estimate of the mean is the weighted average of mids:
In [60]: mean = np.average(mids, weights=n)
In [61]: mean
Out[61]: 9.9763028267760312
In this case, it is pretty close to the mean of the original data.
The estimated variance is the weighted average of the squared difference from the mean:
In [62]: var = np.average((mids - mean)**2, weights=n)
In [63]: var
Out[63]: 3.8715035807387328
In [64]: np.sqrt(var)
Out[64]: 1.9676136767004677
That estimate is within 2% of the actual sample standard deviation.
The following answer is equivalent to Warren Weckesser's, but may be more familiar to those who prefer to think of the mean as an expected value:
counts, bins = np.histogram(x)
mids = 0.5*(bins[1:] + bins[:-1])
probs = counts / np.sum(counts)
mean = np.sum(probs * mids)
sd = np.sqrt(np.sum(probs * (mids - mean)**2))
Note that in certain contexts you may want the unbiased sample variance, where the sum of squared deviations is divided by N-1 rather than N (N being the total count).
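A small sketch of the N versus N-1 choice in plain Python. It assumes the counts are raw integer counts, not density-normalized values, since N-1 only makes sense for an actual number of observations. The function name `binned_stats` and its `ddof` parameter are illustrative choices, not part of any answer above.

```python
import math

def binned_stats(counts, edges, ddof=0):
    """Estimate mean and standard deviation from histogram counts and bin edges.

    Every observation in a bin is treated as if it sat at the bin midpoint.
    ddof=0 divides by N; ddof=1 divides by N-1.
    """
    mids = [0.5 * (lo + hi) for lo, hi in zip(edges[:-1], edges[1:])]
    total = sum(counts)
    mean = sum(c * m for c, m in zip(counts, mids)) / total
    ss = sum(c * (m - mean) ** 2 for c, m in zip(counts, mids))
    return mean, math.sqrt(ss / (total - ddof))

# Tiny hypothetical histogram: three unit-width bins
counts = [1, 2, 1]
edges = [0.0, 1.0, 2.0, 3.0]
mean, sd_population = binned_stats(counts, edges)
_, sd_sample = binned_stats(counts, edges, ddof=1)
```

For large N the two estimates are nearly identical; the difference only matters for small samples.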
I have a histogram of sorted random numbers and a Gaussian overlay. The histogram represents observed values per bin (applying this base case to a much larger dataset) and the Gaussian is an attempt to fit the data. Clearly, this Gaussian does not represent the best fit to the histogram. The code below is the formula for a Gaussian.
normc, mu, sigma = 30.845, 50.5, 7 # normalization constant, avg, stdev
gauss = lambda x: normc * exp( (-1) * (x - mu)**2 / ( 2 * (sigma **2) ) )
I calculated the expectation values per bin (area under the curve) and calculated the number of observed values per bin. There are several methods to find the 'best' fit. I am concerned with the best fit possible by minimizing Chi-Squared. In this formula for Chi-Squared, the expectation value is the area under the curve per bin and the observed value is the number of occurrences of sorted data values per bin. So I want to fluctuate normc, mu, and sigma near their given values to find the right combination of normc, mu, and sigma that produce the smallest Chi-Square, as these will be the parameters I can plug into the code above to overlay the best fit Gaussian on my histogram. I am trying to use the scipy module to minimize my Chi-Square as done in this example. Since I need to fluctuate parameters, I will use the function gauss (defined above) to plot the Gaussian overlay, and will define a new function to find the minimum Chi-Squared.
def gaussmin(var,data):
    # var[0] = normc
    # var[1] = mu
    # var[2] = sigma
    # data is the sorted random numbers, represents unbinned observed values
    for index in range(len(data)):
        return var[0] * exp( (-1) * (data[index] - var[1])**2 / ( 2 * (var[2] **2) ) )
    # I realize this will return a new value for each index of data, any guidelines to fix?
After this, I am stuck. How can I fluctuate the parameters to find the normc, mu, sigma that produced the best fit? My last attempt at a solution is below:
var = [normc, mu, sigma]
result = opt.minimize(chi2, [normc,mu,sigma])
# chi2 is the chisquare value obtained via scipy
# chisquare input (a,b)
# where a is number of occurrences per bin, b is expected value per bin
# b is dependent upon normc, mu, sigma
print(result)
# data is a list, can I keep it as a constant and only fluctuate parameters in var?
There are plenty of examples online for scalar functions but I cannot find any for variable functions.
PS - I can post my full code so far, but it's a bit lengthy. If you would like to see it, just ask and I can post it here or provide a Google Drive link.
A Gaussian distribution is completely characterized by its mean and variance (or std deviation). Under the hypothesis that your data are normally distributed, the best fit will be obtained by using x-bar as the mean and s-squared as the variance. But before doing so, I'd check whether normality is plausible using, e.g., a q-q plot.
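As a rough sketch of this suggestion, the snippet below estimates mu and sigma from the unbinned data and then evaluates Chi-Squared against the binned counts, with the expected count per bin taken as the Gaussian area over that bin. Everything here is an assumption for illustration: the synthetic data, the bin edges, and the helper names. The sample mean and standard deviation are not guaranteed to be the exact Chi-Squared minimizer, but they should land close to it and make a good starting point.

```python
import math
import random
import statistics

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def expected_counts(edges, mu, sigma, n_total):
    """Expected observations per bin: n_total times the Gaussian area over each bin."""
    return [n_total * (normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma))
            for lo, hi in zip(edges[:-1], edges[1:])]

def observed_counts(data, edges):
    """Count data points per half-open bin [lo, hi); points outside all bins are ignored."""
    counts = [0] * (len(edges) - 1)
    for x in data:
        for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
            if lo <= x < hi:
                counts[i] += 1
                break
    return counts

def chi_square(observed, expected):
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

random.seed(0)
data = [random.gauss(50.5, 7.0) for _ in range(2000)]  # synthetic stand-in data
mu_hat = statistics.mean(data)      # x-bar
sigma_hat = statistics.stdev(data)  # s (N-1 in the denominator)

edges = [20.0 + 5.0 * i for i in range(13)]  # bins of width 5 from 20 to 80
obs = observed_counts(data, edges)
chi2_fit = chi_square(obs, expected_counts(edges, mu_hat, sigma_hat, len(data)))
chi2_off = chi_square(obs, expected_counts(edges, mu_hat + 5.0, sigma_hat, len(data)))
```

If a numerical minimization is still wanted, a function like `chi_square` composed with `expected_counts` can be handed to an optimizer with the data held fixed; `scipy.optimize.minimize` accepts such fixed extra arguments through its `args` parameter.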
I just applied the log loss in sklearn for logistic regression: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html
My code looks something like this:
def perform_cv(clf, X, Y, scoring):
    kf = KFold(X.shape[0], n_folds=5, shuffle=True)
    kf_scores = []
    for train, _ in kf:
        X_sub = X[train,:]
        Y_sub = Y[train]
        #Apply 'log_loss' as a loss function
        scores = cross_validation.cross_val_score(clf, X_sub, Y_sub, cv=5, scoring='log_loss')
        kf_scores.append(scores.mean())
    return kf_scores
However, I'm wondering why the resulting logarithmic losses are negative. I'd expect them to be positive since in the documentation (see my link above) the log loss is multiplied by a -1 in order to turn it into a positive number.
Am I doing something wrong here?
Yes, this is supposed to happen. It is not a 'bug' as others have suggested. The actual log loss is simply the positive version of the number you're getting.
scikit-learn's unified scoring API always maximizes the score, so metrics that should be minimized are negated to fit it. The returned score is therefore negated when it is a loss and left positive when it is a metric that should be maximized. Newer scikit-learn releases make this explicit by naming the scorer 'neg_log_loss'.
This is also described in "sklearn GridSearchCV with Pipeline" and in "scikit-learn cross validation, negative values with mean squared error"; a similar discussion can be found here.
In this way, a higher score means better performance (less loss).
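The sign convention can be illustrated without scikit-learn. The sketch below computes the ordinary (non-negative) log loss by hand and then negates it to obtain a "higher is better" score; the function names and the toy predictions are made up for the example.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary log loss (non-negative; lower is better)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def neg_log_loss_score(y_true, p_pred):
    """Score form of the same quantity: negated so that higher is better."""
    return -log_loss(y_true, p_pred)

y_true = [1, 0, 1, 1]
good = [0.9, 0.1, 0.8, 0.7]  # confident and mostly right
bad = [0.6, 0.5, 0.4, 0.5]   # hesitant and partly wrong

score_good = neg_log_loss_score(y_true, good)
score_bad = neg_log_loss_score(y_true, bad)
```

Both scores are negative, and the better model has the score closer to zero, so taking the maximum score picks the model with the smallest loss.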
I cross-checked the sklearn implementation with several other methods. It seems to be an actual bug within the framework. Instead, consider the following code for calculating the log loss:
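import numpy as np

def llfun(act, pred):
    """Mean binary log loss; act holds 0/1 labels, pred holds predicted P(y=1)."""
    act = np.asarray(act, dtype=float)
    epsilon = 1e-15
    # Clip predictions away from 0 and 1 so that log() stays finite
    pred = np.clip(pred, epsilon, 1 - epsilon)
    ll = np.sum(act*np.log(pred) + (1 - act)*np.log(1 - pred))
    ll = ll * -1.0/len(act)
    return ll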
Also take into account that act and pred have to be Nx1 column vectors.
When we train a CTR (click-through rate) model, we sometimes need to calculate the real CTR from historical data, like this:
ctr = #(clicks) / #(impressions)
We know that if the number of impressions is too small, the calculated CTR is not reliable. So we always set a threshold and keep only the items with enough impressions.
But we also know that the more impressions there are, the higher the confidence in the CTR. So my question is: is there an impressions-normalized statistical method to calculate the CTR?
Thanks!
You probably need a confidence interval for your estimated CTR. The Wilson score interval is a good one to try.
You need the following statistics to calculate the confidence interval:
\hat p is the observed ctr (fraction of #clicked vs #impressions)
n is the total number of impressions
zα/2 is the (1-α/2) quantile of the standard normal distribution
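For reference, the standard textbook form of the Wilson score interval in terms of these quantities is given below, writing z for z_{α/2}. It is reproduced from the usual definition rather than from the original post, whose formula did not survive.

```latex
\frac{\hat p + \dfrac{z^2}{2n} \;\pm\; z\sqrt{\dfrac{\hat p\,(1-\hat p)}{n} + \dfrac{z^2}{4n^2}}}{1 + \dfrac{z^2}{n}}
```

The minus sign gives the lower bound and the plus sign the upper bound.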
A simple implementation in Python is shown below. I use z = 1.96, which corresponds to a 95% confidence interval. I attached 3 test results at the end of the code.
# clicks    # impressions    # conf interval
2           10               (0.06, 0.51)
20          100              (0.13, 0.29)
200         1000             (0.18, 0.23)
Now you can set up some threshold to use the calculated confidence interval.
from math import sqrt

def confidence(clicks, impressions):
    """Wilson score interval (lower, upper) for clicks / impressions."""
    n = impressions
    if n == 0:
        return 0
    z = 1.96  # 1.96 -> 95% confidence
    phat = float(clicks) / n
    denorm = 1. + (z*z/n)
    enum1 = phat + z*z/(2*n)
    enum2 = z * sqrt(phat*(1-phat)/n + z*z/(4*n*n))
    return (enum1-enum2)/denorm, (enum1+enum2)/denorm

def wilson(clicks, impressions):
    if impressions == 0:
        return 0
    else:
        return confidence(clicks, impressions)

if __name__ == '__main__':
    print(wilson(2, 10))
    print(wilson(20, 100))
    print(wilson(200, 1000))

"""
--------------------
results (rounded to 4 decimals):
(0.0567, 0.5098)
(0.1334, 0.2888)
(0.1764, 0.2259)
"""
If you treat this as a binomial parameter, you can do Bayesian estimation. If your prior on ctr is uniform (a Beta distribution with parameters (1,1)), then your posterior is Beta(1+#click, 1+#impressions-#click). Your posterior mean is (#click+1) / (#impressions+2) if you want a single summary statistic of this posterior, but you probably don't, and here's why:
I don't know what your method is for determining whether ctr is high enough, but let's say you're interested in everything with ctr > 0.9. You can then use the cumulative distribution function of the beta distribution to look at what proportion of probability mass is over the 0.9 threshold (this will just be 1 - the cdf at 0.9). In this way, your threshold will naturally incorporate uncertainty about the estimate because of limited sample size.
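A minimal sketch of this calculation using only the standard library, assuming the uniform prior described above. With integer counts the Beta tail can be written as a binomial sum: the mass of Beta(1+clicks, 1+impressions-clicks) above a threshold t equals P(X <= clicks) for X ~ Binomial(impressions+1, t). The function names are illustrative. For very large counts the big binomial coefficients may overflow when converted to float, and a library routine such as `scipy.stats.beta.sf` would be the more usual choice.

```python
from math import comb

def posterior_mean(clicks, impressions):
    """Mean of the Beta(1 + clicks, 1 + impressions - clicks) posterior."""
    return (clicks + 1) / (impressions + 2)

def prob_ctr_above(clicks, impressions, threshold):
    """P(ctr > threshold | data) under the uniform-prior Beta posterior.

    Uses the identity with the Binomial(impressions + 1, threshold) CDF,
    which holds because both Beta parameters are integers here.
    """
    m = impressions + 1
    return sum(comb(m, j) * threshold**j * (1 - threshold)**(m - j)
               for j in range(clicks + 1))

# Same observed ctr of 0.2, very different certainty that ctr exceeds 0.1
p_small = prob_ctr_above(2, 10, 0.1)
p_large = prob_ctr_above(200, 1000, 0.1)
```

Both items have the same raw ctr, but the one with more impressions puts almost all of its posterior mass above the threshold, which is the impressions-aware behaviour the question asks for.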
There are many ways to calculate this confidence interval. An alternative to the Wilson score interval is the Clopper-Pearson interval, which I found useful in spreadsheets.
Upper bound: B(1 - alpha/2; x + 1, n - x)
Lower bound: B(alpha/2; x, n - x + 1)
Where
B() is the inverse Beta distribution (its quantile function)
alpha is the error level, i.e. 1 minus the confidence level (e.g. for a 95% confidence level, alpha is 5%)
n is the number of samples (e.g. impressions)
x is the number of successes (e.g. clicks)
In Excel an implementation for B() is provided by the BETA.INV formula.
There is no equivalent formula for B() in Google Sheets, but a Google Apps Script custom function can be adapted from the JavaScript Statistical Library (e.g. search GitHub for jStat).
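Outside a spreadsheet, the same interval can be sketched in plain Python without an inverse Beta function. The sketch uses the equivalent binomial-tail definition: the lower bound is the p at which P(X >= x) equals alpha/2, the upper bound is the p at which P(X <= x) equals alpha/2, and both are found by bisection. The function names and the iteration count are assumptions made for the example.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

def _bisect(f, lo=0.0, hi=1.0, iters=100):
    """Root of a function that is positive at lo and negative at hi."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def clopper_pearson(x, n, alpha=0.05):
    """Clopper-Pearson interval for x successes (clicks) in n trials (impressions)."""
    if x == 0:
        lower = 0.0  # by convention when there are no successes
    else:
        # P(X >= x) grows with p, so alpha/2 - P(X >= x) goes from + to -
        lower = _bisect(lambda p: alpha / 2 - (1 - binom_cdf(x - 1, n, p)))
    if x == n:
        upper = 1.0  # by convention when every trial is a success
    else:
        # P(X <= x) shrinks with p, so P(X <= x) - alpha/2 goes from + to -
        upper = _bisect(lambda p: binom_cdf(x, n, p) - alpha / 2)
    return lower, upper

lower, upper = clopper_pearson(2, 10)
```

This interval is typically somewhat wider than the Wilson interval for the same data, since it is built to guarantee at least the nominal coverage.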