How to avoid impression bias when calculate the ctr? - statistics

When we train a ctr(click through rate) model, sometimes we need calcute the real ctr from the history data, like this
#(click)
ctr = ----------------
#(impressions)
We know that, if the number of impressions is too small, the calculted ctr is not real. So we always set a threshold to filter out the large enough impressions.
But we know that the higher impressions, the higher confidence for the ctr. Then my question is that: Is there a impressions-normalized statistic method to calculate the ctr?
Thanks!

You probably need a representation of confidence interval for your estimated ctr. Wilson score interval is a good one to try.
You need below stats to calculate the confidence score:
\hat p is the observed ctr (fraction of #clicked vs #impressions)
n is the total number of impressions
zα/2 is the (1-α/2) quantile of the standard normal distribution
A simple implementation in python is shown below, I use z(1-α/2)=1.96 which corresponds to a 95% confidence interval. I attached 3 test results at the end of the code.
# clicks # impressions # conf interval
2 10 (0.07, 0.45)
20 100 (0.14, 0.27)
200 1000 (0.18, 0.22)
Now you can set up some threshold to use the calculated confidence interval.
from math import sqrt
def confidence(clicks, impressions):
n = impressions
if n == 0: return 0
z = 1.96 #1.96 -> 95% confidence
phat = float(clicks) / n
denorm = 1. + (z*z/n)
enum1 = phat + z*z/(2*n)
enum2 = z * sqrt(phat*(1-phat)/n + z*z/(4*n*n))
return (enum1-enum2)/denorm, (enum1+enum2)/denorm
def wilson(clicks, impressions):
if impressions == 0:
return 0
else:
return confidence(clicks, impressions)
if __name__ == '__main__':
print wilson(2,10)
print wilson(20,100)
print wilson(200,1000)
"""
--------------------
results:
(0.07048879557839793, 0.4518041980521754)
(0.14384999046998084, 0.27112660859398174)
(0.1805388068716823, 0.22099327100894336)
"""

If you treat this as a binomial parameter, you can do Bayesian estimation. If your prior on ctr is uniform (a Beta distribution with parameters (1,1)) then your posterior is Beta(1+#click, 1+#impressions-#click). Your posterior mean is #click+1 / #impressions+2 if you want a single summary statistic of this posterior, but you probably don't, and here's why:
I don't know what your method for determining whether ctr is high enough, but let's say you're interested in everything with ctr > 0.9. You can then use the cumulative density function of the beta distribution to look at what proportion of probability mass is over the 0.9 threshold (this will just be 1 - the cdf at 0.9). In this way, your threshold will naturally incorporate uncertainty about the estimate because of limited sample size.

There are many ways to calculate this confidence interval. An alternative to the Wilson Score is the Clopper-Perrson interval, which I found useful in spreadsheets.
Upper Bound Equation
Lower Bound Equation
Where
B() is the the Inverse Beta Distribution
alpha is the confidence level error (e.g for 95% confidence-level, alpha is 5%)
n is the number of samples (e.g. impressions)
x is the number of successes (e.g. clicks)
In Excel an implementation for B() is provided by the BETA.INV formula.
There is no equivalent formula for B() in Google Sheets, but a Google Apps Script custom function can be adapted from the JavaScript Statistical Library (e.g search github for jstat)

Related

Ryan Joiner Normality Test P-value

i tried a lot to calculate the Ryan Joiner(RJ) P-Value. is there any method or formula to calculate RJ P-Value.
i found how to calculate RJ value but unable to find the manual calcuation part for RJ P-Value. Minitab is calculating by some startegy . i want to know that how calculate it in manually.
please support me on this.
The test statistic RJ needs to be compared to a critical value CV in order to make a determination of whether to reject or fail to reject the null hypothesis.
The value of CV depends on the sample size and confidence level desired, and the values are empirically derived: generate large numbers of normally distributed datasets for each sample size n, calculate RJ statistic for each, then CV for a=0.10 is the 10th percentile value of RJ.
Sidenote: For some reason I'm seeing a 90% confidence level used many places for Ryan-Joiner, when a 95% confidence is commonly used for other normality tests. I'm not sure why.
I recommend reading the original Ryan-Joiner 1976 paper:
https://www.additive-net.de/de/component/jdownloads/send/70-support/236-normal-probability-plots-and-tests-for-normality-thomas-a-ryan-jr-bryan-l-joiner
In that paper, the following critical value equations were empirically derived (I wrote out in python for convenience):
def rj_critical_value(n, a=0.10)
if a == 0.1:
return 1.0071 - (0.1371 / sqrt(n)) - (0.3682 / n) + (0.7780 / n**2)
elif a == 0.05:
return 1.0063 - (0.1288 / sqrt(n)) - (0.6118 / n) + (1.3505 / n**2)
elif a == 0.01:
return 0.9963 - (0.0211 / sqrt(n)) - (1.4106 / n) + (3.1791 / n**2)
else:
raise Exception("a must be one of [0.10, 0.05, 0.01]")
The RJ test statistic then needs to be compared to that critical value:
If RJ < CV, then the determination is NOT NORMAL.
If RJ > CV, then the determination is NORMAL.
Minitab is going one step further - working backwards to determine the value of a at which CV == RJ. This value would be the p-value you're referencing in your original question.

Bayesian Linear Regression with PyMC3 and a large dataset - bracket nesting level exceeded maximum and slow performance

I would like to use a Bayesian multivariate linear regression to estimate the strength of players in team sports (e.g. ice hockey, basketball or soccer). For that purpose, I create a matrix, X, containing the players as columns and the matches as rows. For each match the player entry is either 1 (player plays in the home team), -1 (player plays in the away team) or 0 (player does not take part in this game). The dependent variable Y is defined as the scoring differences for both teams in each match (Score_home_team - Score_away_team).
Thus, the number of parameters will be quite large for one season (e.g. X is defined by 300 rows x 450 columns; i.e. 450 player coefficients + y-intercept). When running the fit I came across a compilation error:
('Compilation failed (return status=1): /Users/me/.theano/compiledir_Darwin-17.7.0-x86_64-i386-64bit-i386-3.6.5-64/tmpdxxc2379/mod.cpp:27598:32: fatal error: bracket nesting level exceeded maximum of 256.
I tried to handle this error by setting:
theano.config.gcc.cxxflags = "-fbracket-depth=1024"
Now, the sampling is running. However, it is so slow that even if I take only 35 of 300 rows the sampling is not completed within 20 minutes.
This is my basic code:
import pymc3 as pm
basic_model = pm.Model()
with basic_model:
# Priors for beta coefficients - these are the coefficients of the players
dict_betas = {}
for col in X.columns:
dict_betas[col] = pm.Normal(col, mu=0, sd=10)
# Priors for unknown model parameters
alpha = pm.Normal('alpha', mu=0, sd=10) # alpha is the y-intercept
sigma = pm.HalfNormal('sigma', sd=1) # standard deviation of the observations
# Expected value of outcome
mu = alpha
for col in X.columns:
mu = mu + dict_betas[col] * X[col] # mu = alpha + beta_1 * Player_1 + beta_2 * Player_2 + ...
# Likelihood (sampling distribution) of observations
Y_obs = pm.Normal('Y_obs', mu=mu, sd=sigma, observed=Y)
The instantiation of the model runs within one minute for the large dataset. I do the sampling using:
with basic_model:
# draw 500 posterior samples
trace = pm.sample(500)
The sampling is completed for small sample sizes (e.g. 9 rows, 80 columns) within 7 minutes. However, the time is increasing substantially with increasing sample size.
Any suggestions how I can get this Bayesian linear regression to run in a feasible amount of time? Are these kind of problems doable using PyMC3 (remember I came across a bracket nesting error)? I saw in a recent publication that this kind of analysis is doable in R (https://arxiv.org/pdf/1810.08032.pdf). Therefore, I guess it should also somehow work with Python 3.
Any help is appreciated!
Eliminating the for loops should improve performance and might also take care of the nesting issue you are reporting. Theano TensorVariables and the PyMC3 random variables that derive from them are already multidimensional and support linear algebra operations. Try changing your code to something along the lines of
beta = pm.Normal('beta', mu=0, sd=10, shape=X.shape[1])
...
mu = alpha + pm.math.dot(X, beta)
...
If you need specify different prior values for mu and/or sd, those arguments accept anything that theano.tensor.as_tensor_variable() accepts, so you can pass a list or numpy array.
I highly recommend getting familiar with the theano.tensor and pymc3.math operations since sometimes you must use these to properly manipulate random variables, and in general it should lead to more efficient code.

How to calculate the standard deviation from a histogram? (Python, Matplotlib)

Let's say I have a data set and used matplotlib to draw a histogram of said data set.
n, bins, patches = plt.hist(data, normed=1)
How do I calculate the standard deviation, using the n and bins values that hist() returns? I'm currently doing this to calculate the mean:
s = 0
for i in range(len(n)):
s += n[i] * ((bins[i] + bins[i+1]) / 2)
mean = s / numpy.sum(n)
which seems to work fine as I get pretty accurate results. However, if I try to calculate the standard deviation like this:
t = 0
for i in range(len(n)):
t += (bins[i] - mean)**2
std = np.sqrt(t / numpy.sum(n))
my results are way off from what numpy.std(data) returns. Replacing the left bin limits with the central point of each bin doesn't change this either. I have the feeling that the problem is that the n and bins values don't actually contain any information on how the individual data points are distributed within each bin, but the assignment I'm working on clearly demands that I use them to calculate the standard deviation.
You haven't weighted the contribution of each bin with n[i]. Change the increment of t to
t += n[i]*(bins[i] - mean)**2
By the way, you can simplify (and speed up) your calculation by using numpy.average with the weights argument.
Here's an example. First, generate some data to work with. We'll compute the sample mean, variance and standard deviation of the input before computing the histogram.
In [54]: x = np.random.normal(loc=10, scale=2, size=1000)
In [55]: x.mean()
Out[55]: 9.9760798903061847
In [56]: x.var()
Out[56]: 3.7673459904902025
In [57]: x.std()
Out[57]: 1.9409652213499866
I'll use numpy.histogram to compute the histogram:
In [58]: n, bins = np.histogram(x)
mids is the midpoints of the bins; it has the same length as n:
In [59]: mids = 0.5*(bins[1:] + bins[:-1])
The estimate of the mean is the weighted average of mids:
In [60]: mean = np.average(mids, weights=n)
In [61]: mean
Out[61]: 9.9763028267760312
In this case, it is pretty close to the mean of the original data.
The estimated variance is the weighted average of the squared difference from the mean:
In [62]: var = np.average((mids - mean)**2, weights=n)
In [63]: var
Out[63]: 3.8715035807387328
In [64]: np.sqrt(var)
Out[64]: 1.9676136767004677
That estimate is within 2% of the actual sample standard deviation.
The following answer is equivalent to Warren Weckesser's, but maybe more familiar to those who prefer to want mean as the expected value:
counts, bins = np.histogram(x)
mids = 0.5*(bins[1:] + bins[:-1])
probs = counts / np.sum(counts)
mean = np.sum(probs * mids)
sd = np.sqrt(np.sum(probs * (mids - mean)**2))
Do take note in certain context you may want the unbiased sample variance where the weights are not normalized by N but N-1.

Euler method to model terminal velocity NOT converging?

As per the title I am trying to model a parachutist's decent in Python 3.
I need to use an Euler method for a kinematic body. The graph plotting speed shows no signs of tending to a terminal velocity, so I'm clearly doing something very wrong!
For fault finding I have printed the list of t and speed values to the screen. Here is the code, for a function that returns lists of the values, then plots them.
def Euler_n2l(y_ini,v_yini):
delta_t=float(input("Type time interval size (in s): "))
t_n=0
t_list=[0]
y_list=[]
v_list=[v_yini]
#Calculating k from the specified parameters for: drag coefficient, cross-sectional area and air density.
k=(userC*rho0*ca)/2
while (y_ini>0): #Ending the simulation when the ground is reached
t_n += delta_t
v_yini -= delta_t*(9.81+((k/m)*(v_yini)**2))
y_ini += delta_t*v_yini
t_list.append(t_n)
y_list.append(y_ini)
v_list.append(v_yini)
if y_ini < 0:
del t_list[-2:]
del y_list[-1]
del v_list[-2:]
return t_list,y_list,v_list
ca=0.96 #Approximation for an average human cross sectional area=1.6*0.6 m^2
userC=float(input("Type a value for the drag coefficient, C_d. Sensible values are from ~1.0 - 1.3: "))
rho0=1.2 #Value given the instruction for ambient temperature and pressure
m=80 #approx weight of a man in Kg
output=Euler_n2l(39000,0)
t_list,y_list,v_list=output
plt.plot(t_list,v_list)
plt.xlabel("$Time (s)$", size=12)
plt.ylabel("$Speed (m/s)$", size=12)
plt.show()

Calculating 95 % confidence interval for the mean in python

I need little help. If I have 30 random sample with mean of 52 and variance of 30 then how can i calculate the 95 % confidence interval for the mean with estimated and true variance of 30.
Here you can combine the powers of numpy and statsmodels to get you started:
To produce normally distributed floats with mean of 52 and variance of 30 you can use numpy.random.normal with numbers = np.random.normal(loc=52, scale=30, size=30) where the parameters are:
Parameters
----------
loc : float
Mean ("centre") of the distribution.
scale : float
Standard deviation (spread or "width") of the distribution.
size : int or tuple of ints, optional
Output shape. If the given shape is, e.g., ``(m, n, k)``, then
``m * n * k`` samples are drawn. Default is None, in which case a
single value is returned.
And here's a 95% confidence interval of the mean using DescrStatsW.tconfint_mean:
import statsmodels.stats.api as sms
conf = sms.DescrStatsW(numbers).tconfint_mean()
conf
# output
# (36.27, 56.43)
EDIT - 1
That's not the whole story though... Depending on your sample size, you should use the Z score and not t score that's used by sms.DescrStatsW(numbers).tconfint_mean() here. And I have a feeling that its not coincidental that the rule-of-thumb threshold is 30, and that you have 30 observations in your question. Z vs t also depends on whether or not you know the population standard deviation or have to rely on an estimate from your sample. And those are calculated differently as well. Take a look here. If this is something you'd like me to explain and demonstrate further, I'll gladly take another look at it over the weekend.

Resources