Welford's online variance algorithm, but for Interquartile Range? - statistics

Short Version
Welford's Online Algorithm lets you keep a running value for variance - meaning you don't have to keep all the values (e.g. in a memory constraned system).
Is there something similar for Interquartile Range (IQR)? An online algorithm that lets me know the middle 50% range without having to keep all historical values?
Long Version
Keeping a running average of data, where you are memory constrainted, is pretty easy:
Double sum
Int64 count
And from this you can compute the mean:
mean = sum / count
This allows hours, or years, of observations to quietly be collected, but only take up 16-bytes.
Welford's Algorithm for Variance
Normally when you want the variance (or standard deviation), you have to have all your readings, because you have to computer reading - mean for all previous readings:
Double sumOfSquaredError = 0;
foreach (Double reading in Readings)
sumOfSquaredError += Math.Square(reading - mean);
Double variance = sumOfSquaredError / count
Which is why it was nice when Welford came up with an online algorithm for computing variance of a stream of readings:
It is often useful to be able to compute the variance in a single pass, inspecting each value xi only once; for example, when the data is being collected without enough storage to keep all the values, or when costs of memory access dominate those of computation.
The algorithm for adding a new value to the running variance is:
void addValue(Double newValue) {
Double oldMean = sum / count;
sum += newValue;
count += 1;
Double newMean = sum / count;
if (count > 1)
variance = ((count-2)*variance + (newValue-oldMean)*(newValue-newMean)) / (count-1);
else
variance = 0;
}
How about an online algorithm for Interquartile Range (IQR)?
Interquartile Range (IRQ) is another method of getting the spread of data. It tells you how wide the middle 50% of the data is:
And from that people then generally draw a IQR BoxPlot:
Or at the very least, have the values Q1 and Q3.
Is there a way to calculate the Interquartile Range without having to keep all the recorded values?
In other words:
Is there something like Welford's online variance algorithm, but for Interquartile Range?
Knuth, Seminumerical Algorithms
You can find Welford's algorithm explained in Knuth's 2nd volume Seminumerical Algorithms:
(just in case anyone thought this isn't computer science or programming related)
Research Effort
Stackoverflow: Simple algorithm for online outlier detection of a generic time series
Stats: Simple algorithm for online outlier detection of a generic time series
Online outlier detection for data streams (IDEAS '11: Proceedings of the 15th Symposium on International Database Engineering & Applications, September 2011, Pages 88–96)
Stats: Robust outlier detection in financial timeseries
Stats: Online outlier detection
Distance-based outlier detection in data streams (Proceedings of the VLDB Endowment, Volume 9, Issue 12, August 2016, pp 1089–1100) pdf
Online Outlier Detection Over Data Streams (Hongyin Cui, Masters Thesis, 2005)

There's a useful paper by Ben-Haim, and Tom-Tov published in 2010 in the Journal of Machine Learning Research
A Streaming Parallel Decision Tree Algorithm
Short PDF: https://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=0667E2F91B9E0E5387F85655AE9BC560?doi=10.1.1.186.7913&rep=rep1&type=pdf
Full paper: https://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
It describes an algoritm to automatically create a histogram from a online (streaming) data, that does not require unlimited memory.
you add a value to the history
the algorithm dynamically creates buckets
including bucket sizes
The paper is kind of dense (as all math papers are), but the algorithm is fairly simple.
Lets start with some sample data. For this answer i'll use digits of PI as the source for incoming floating point numbers:
Value
3.14159
2.65358
9.79323
8.46264
3.38327
9.50288
4.19716
9.39937
5.10582
0.97494
4.59230
7.81640
6.28620
8.99862
8.03482
5.34211
...
I will define that i want 5 bins in my histogram.
We add the first value (3.14159), which causes the first bin to be created:
Bin
Count
3.14159
1
The bins in this histogram don't have any width; they are purely a point:
And then we add the 2nd value (2.65358) to the histogram:
And we continue adding points, until we reach our arbitrary limit of 5 "buckets":
That is all 5 buckets filled.
We add our 6th value (9.50288) to the histogram, except that means we now have 6 buckets; but we decided we only want five:
Now is where the magic starts
In order to do the streaming part, and limit memory usage to less-than-infinity, we need to merge some of the "bins".
Look at each pair of bins - left-to-right - and see which two are closest together. Which in our case is these two buckets:
These two buckets are merged, and given a new x value dependant on their relative heights (i.e. counts)
ynew = yleft + yright = 1 + 1 = 2
xnew = xleft × (yleft/ynew) + xright×(yright/ynew) = 3.14159×(1/2) + 3.38327×(1/2) = 3.26243
And now we repeat.
add a new value
merge the two neighboring buckets that are closet to each other
deciding on the new x position between based on their relative heights
Eventually giving you (although i screwed it up as i was doing it manually in Excel for this answer):
Practical Example
I wanted a histogram of 20 buckets. This allows me to extract some useful statistics. For a histogram of 11 buckets, containing 38,000 data points, it only requires 40 bytes of memory:
With these 20 buckets, i can now computer the Probably Density Function (PDF):
Bin
Count
PDF
2.113262834
3085
5.27630%
6.091181608
3738
6.39313%
10.13907062
4441
7.59548%
14.38268188
5506
9.41696%
18.92107481
6260
10.70653%
23.52148965
6422
10.98360%
28.07685659
5972
10.21396%
32.55801082
5400
9.23566%
36.93292359
4604
7.87426%
41.23715698
3685
6.30249%
45.62006198
3136
5.36353%
50.38765223
2501
4.27748%
55.34957161
1618
2.76728%
60.37095192
989
1.69149%
65.99939004
613
1.04842%
71.73292736
305
0.52164%
78.18427775
140
0.23944%
85.22261376
38
0.06499%
90.13115876
12
0.02052%
96.1987941
4
0.00684%
And with the PDF, you can now calculate the Expected Value (i.e. mean):
Bin
Count
PDF
EV
2.113262834
3085
5.27630%
0.111502092
6.091181608
3738
6.39313%
0.389417244
10.13907062
4441
7.59548%
0.770110873
14.38268188
5506
9.41696%
1.354410824
18.92107481
6260
10.70653%
2.025790219
23.52148965
6422
10.98360%
2.583505901
28.07685659
5972
10.21396%
2.86775877
32.55801082
5400
9.23566%
3.00694827
36.93292359
4604
7.87426%
2.908193747
41.23715698
3685
6.30249%
2.598965665
45.62006198
3136
5.36353%
2.446843872
50.38765223
2501
4.27748%
2.155321935
55.34957161
1618
2.76728%
1.531676732
60.37095192
989
1.69149%
1.021171415
65.99939004
613
1.04842%
0.691950026
71.73292736
305
0.52164%
0.374190474
78.18427775
140
0.23944%
0.187206877
85.22261376
38
0.06499%
0.05538763
90.13115876
12
0.02052%
0.018498245
96.1987941
4
0.00684%
0.006581183
Which gives:
Expected Value: 27.10543
Cumulative Density Function CDF
We can now also get the Cumulative Density Function (CDF):
Bin
Count
PDF
EV
CDF
2.113262834
3085
5.27630%
0.11150
5.27630%
6.091181608
3738
6.39313%
0.38942
11.66943%
10.13907062
4441
7.59548%
0.77011
19.26491%
14.38268188
5506
9.41696%
1.35441
28.68187%
18.92107481
6260
10.70653%
2.02579
39.38839%
23.52148965
6422
10.98360%
2.58351
50.37199%
28.07685659
5972
10.21396%
2.86776
60.58595%
32.55801082
5400
9.23566%
3.00695
69.82161%
36.93292359
4604
7.87426%
2.90819
77.69587%
41.23715698
3685
6.30249%
2.59897
83.99836%
45.62006198
3136
5.36353%
2.44684
89.36188%
50.38765223
2501
4.27748%
2.15532
93.63936%
55.34957161
1618
2.76728%
1.53168
96.40664%
60.37095192
989
1.69149%
1.02117
98.09814%
65.99939004
613
1.04842%
0.69195
99.14656%
71.73292736
305
0.52164%
0.37419
99.66820%
78.18427775
140
0.23944%
0.18721
99.90764%
85.22261376
38
0.06499%
0.05539
99.97264%
90.13115876
12
0.02052%
0.01850
99.99316%
96.1987941
4
0.00684%
0.00658
100.00000%
And the CDF is where we can start to get the values i want.
The median (50th percentile), where the CDF reaches 50%:
From interpolation of the data, we can find the x value where the CDF is 50%:
Bin
Count
PDF
EV
CDF
18.92107481
6260
10.70653%
2.02579
39.38839%
23.52148965
6422
10.98360%
2.58351
50.37199%
t = (50-39.38839)/(50.37199-39.38839) = 10.61161/10.9836 = 0.96613
xmedian = (1-t)*18.93107481 + (t)*23.52148965 = 23.366
So now we know:
Expected Value (mean): 27.10543
Median: 23.366
My original ask was the IQV - the x values that account from 25%-75% of the values. Once again we can interpolate the CDF:
Bin
Count
PDF
EV
CDF
10.13907062
4441
7.59548%
0.77011
19.26491%
12.7235
25.00000%
14.38268188
5506
9.41696%
1.35441
28.68187%
23.366
50.00000% (median)
23.52148965
6422
10.98360%
2.58351
50.37199% (mode)
27.10543
mean
28.07685659
5972
10.21396%
2.86776
60.58595%
32.55801082
5400
9.23566%
3.00695
69.82161%
35.4351
75.00000%
36.93292359
4604
7.87426%
2.90819
77.69587%
This can be continued to get other useful stats:
Mean (aka average, expected value)
Median (50%)
Middle quintile (middle 20%)
IQR (middle 50% range)
middle 3 quintiles (middle 60%)
1 standard deviation range (middle 68.26%)
middle 80%
middle 90%
middle 95%
2 standard deviations range (middle 95.45%)
middle 99%
3 standard deviations range (middle 99.74%)
Short Version
https://github.com/apache/spark/blob/4c7888dd9159dc203628b0d84f0ee2f90ab4bf13/sql/catalyst/src/main/java/org/apache/spark/sql/util/NumericHistogram.java

Related

Bayesian Linear Regression with PyMC3 and a large dataset - bracket nesting level exceeded maximum and slow performance

I would like to use a Bayesian multivariate linear regression to estimate the strength of players in team sports (e.g. ice hockey, basketball or soccer). For that purpose, I create a matrix, X, containing the players as columns and the matches as rows. For each match the player entry is either 1 (player plays in the home team), -1 (player plays in the away team) or 0 (player does not take part in this game). The dependent variable Y is defined as the scoring differences for both teams in each match (Score_home_team - Score_away_team).
Thus, the number of parameters will be quite large for one season (e.g. X is defined by 300 rows x 450 columns; i.e. 450 player coefficients + y-intercept). When running the fit I came across a compilation error:
('Compilation failed (return status=1): /Users/me/.theano/compiledir_Darwin-17.7.0-x86_64-i386-64bit-i386-3.6.5-64/tmpdxxc2379/mod.cpp:27598:32: fatal error: bracket nesting level exceeded maximum of 256.
I tried to handle this error by setting:
theano.config.gcc.cxxflags = "-fbracket-depth=1024"
Now, the sampling is running. However, it is so slow that even if I take only 35 of 300 rows the sampling is not completed within 20 minutes.
This is my basic code:
import pymc3 as pm
basic_model = pm.Model()
with basic_model:
# Priors for beta coefficients - these are the coefficients of the players
dict_betas = {}
for col in X.columns:
dict_betas[col] = pm.Normal(col, mu=0, sd=10)
# Priors for unknown model parameters
alpha = pm.Normal('alpha', mu=0, sd=10) # alpha is the y-intercept
sigma = pm.HalfNormal('sigma', sd=1) # standard deviation of the observations
# Expected value of outcome
mu = alpha
for col in X.columns:
mu = mu + dict_betas[col] * X[col] # mu = alpha + beta_1 * Player_1 + beta_2 * Player_2 + ...
# Likelihood (sampling distribution) of observations
Y_obs = pm.Normal('Y_obs', mu=mu, sd=sigma, observed=Y)
The instantiation of the model runs within one minute for the large dataset. I do the sampling using:
with basic_model:
# draw 500 posterior samples
trace = pm.sample(500)
The sampling is completed for small sample sizes (e.g. 9 rows, 80 columns) within 7 minutes. However, the time is increasing substantially with increasing sample size.
Any suggestions how I can get this Bayesian linear regression to run in a feasible amount of time? Are these kind of problems doable using PyMC3 (remember I came across a bracket nesting error)? I saw in a recent publication that this kind of analysis is doable in R (https://arxiv.org/pdf/1810.08032.pdf). Therefore, I guess it should also somehow work with Python 3.
Any help is appreciated!
Eliminating the for loops should improve performance and might also take care of the nesting issue you are reporting. Theano TensorVariables and the PyMC3 random variables that derive from them are already multidimensional and support linear algebra operations. Try changing your code to something along the lines of
beta = pm.Normal('beta', mu=0, sd=10, shape=X.shape[1])
...
mu = alpha + pm.math.dot(X, beta)
...
If you need specify different prior values for mu and/or sd, those arguments accept anything that theano.tensor.as_tensor_variable() accepts, so you can pass a list or numpy array.
I highly recommend getting familiar with the theano.tensor and pymc3.math operations since sometimes you must use these to properly manipulate random variables, and in general it should lead to more efficient code.

Understanding Data Leakage and getting perfect score by exploiting test data

I have read an article on data leakage. In a hackathon there are two sets of data, train data on which participants train their algorithm and test set on which performance is measured.
Data leakage helps in getting a perfect score in test data, with out viewing train data by exploiting the leak.
I have read the article, but I am missing the crux how the leakage is exploited.
Steps as shown in article are following:
Let's load the test data.
Note, that we don't have any training data here, just test data. Moreover, we will not even use any features of test objects. All we need to solve this task is the file with the indices for the pairs, that we need to compare.
Let's load the data with test indices.
test = pd.read_csv('../test_pairs.csv')
test.head(10)
pairId FirstId SecondId
0 0 1427 8053
1 1 17044 7681
2 2 19237 20966
3 3 8005 20765
4 4 16837 599
5 5 3657 12504
6 6 2836 7582
7 7 6136 6111
8 8 23295 9817
9 9 6621 7672
test.shape[0]
368550
For example, we can think that there is a test dataset of images, and each image is assigned a unique Id from 0 to N−1 (N -- is the number of images). In the dataframe from above FirstId and SecondId point to these Id's and define pairs, that we should compare: e.g. do both images in the pair belong to the same class or not. So, for example for the first row: if images with Id=1427 and Id=8053 belong to the same class, we should predict 1, and 0 otherwise.
But in our case we don't really care about the images, and how exactly we compare the images (as long as comparator is binary).
print(test['FirstId'].nunique())
print(test['SecondId'].nunique())
26325
26310
So the number of pairs we are given to classify is very very small compared to the total number of pairs.
To exploit the leak we need to assume (or prove), that the total number of positive pairs is small, compared to the total number of pairs. For example: think about an image dataset with 1000 classes, N images per class. Then if the task was to tell whether a pair of images belongs to the same class or not, we would have 1000*N*(N−1)/2 positive pairs, while total number of pairs was 1000*N(1000N−1)/2.
Another example: in Quora competitition the task was to classify whether a pair of qustions are duplicates of each other or not. Of course, total number of question pairs is very huge, while number of duplicates (positive pairs) is much much smaller.
Finally, let's get a fraction of pairs of class 1. We just need to submit a constant prediction "all ones" and check the returned accuracy. Create a dataframe with columns pairId and Prediction, fill it and export it to .csv file. Then submit
test['Prediction'] = np.ones(test.shape[0])
sub=pd.DataFrame(test[['pairId','Prediction']])
sub.to_csv('sub.csv',index=False)
All ones have accuracy score is 0.500000.
So, we assumed the total number of pairs is much higher than the number of positive pairs, but it is not the case for the test set. It means that the test set is constructed not by sampling random pairs, but with a specific sampling algorithm. Pairs of class 1 are oversampled.
Now think, how we can exploit this fact? What is the leak here? If you get it now, you may try to get to the final answer yourself, othewise you can follow the instructions below.
Building a magic feature
In this section we will build a magic feature, that will solve the problem almost perfectly. The instructions will lead you to the correct solution, but please, try to explain the purpose of the steps we do to yourself -- it is very important.
Incidence matrix
First, we need to build an incidence matrix. You can think of pairs (FirstId, SecondId) as of edges in an undirected graph.
The incidence matrix is a matrix of size (maxId + 1, maxId + 1), where each row (column) i corresponds i-th Id. In this matrix we put the value 1to the position [i, j], if and only if a pair (i, j) or (j, i) is present in a given set of pais (FirstId, SecondId). All the other elements in the incidence matrix are zeros.
Important! The incidence matrices are typically very very sparse (small number of non-zero values). At the same time incidence matrices are usually huge in terms of total number of elements, and it is impossible to store them in memory in dense format. But due to their sparsity incidence matrices can be easily represented as sparse matrices. If you are not familiar with sparse matrices, please see wiki and scipy.sparse reference. Please, use any of scipy.sparseconstructors to build incidence matrix.
For example, you can use this constructor: scipy.sparse.coo_matrix((data, (i, j))). We highly recommend to learn to use different scipy.sparseconstuctors, and matrices types, but if you feel you don't want to use them, you can always build this matrix with a simple for loop. You will need first to create a matrix using scipy.sparse.coo_matrix((M, N), [dtype]) with an appropriate shape (M, N) and then iterate through (FirstId, SecondId) pairs and fill corresponding elements in matrix with ones.
Note, that the matrix should be symmetric and consist only of zeros and ones. It is a way to check yourself.
import networkx as nx
import numpy as np
import pandas as pd
import scipy.sparse
import matplotlib.pyplot as plt
test = pd.read_csv('../test_pairs.csv')
x = test[['FirstId','SecondId']].rename(columns={'FirstId':'col1', 'SecondId':'col2'})
y = test[['SecondId','FirstId']].rename(columns={'SecondId':'col1', 'FirstId':'col2'})
comb = pd.concat([x,y],ignore_index=True).drop_duplicates(keep='first')
comb.head()
col1 col2
0 1427 8053
1 17044 7681
2 19237 20966
3 8005 20765
4 16837 599
data = np.ones(comb.col1.shape, dtype=int)
inc_mat = scipy.sparse.coo_matrix((data,(comb.col1,comb.col2)), shape=(comb.col1.max() + 1, comb.col1.max() + 1))
rows_FirstId = inc_mat[test.FirstId.values,:]
rows_SecondId = inc_mat[test.SecondId.values,:]
f = rows_FirstId.multiply(rows_SecondId)
f = np.asarray(f.sum(axis=1))
f.shape
(368550, 1)
f = f.sum(axis=1)
f = np.squeeze(np.asarray(f))
print (f.shape)
Now build the magic feature
Why did we build the incidence matrix? We can think of the rows in this matix as of representations for the objects. i-th row is a representation for an object with Id = i. Then, to measure similarity between two objects we can measure similarity between their representations. And we will see, that such representations are very good.
Now select the rows from the incidence matrix, that correspond to test.FirstId's, and test.SecondId's.
So do not forget to convert pd.series to np.array
These lines should normally run very quickly
rows_FirstId = inc_mat[test.FirstId.values,:]
rows_SecondId = inc_mat[test.SecondId.values,:]
Our magic feature will be the dot product between representations of a pair of objects. Dot product can be regarded as similarity measure -- for our non-negative representations the dot product is close to 0 when the representations are different, and is huge, when representations are similar.
Now compute dot product between corresponding rows in rows_FirstId and rows_SecondId matrices.
From magic feature to binary predictions
But how do we convert this feature into binary predictions? We do not have a train set to learn a model, but we have a piece of information about test set: the baseline accuracy score that you got, when submitting constant. And we also have a very strong considerations about the data generative process, so probably we will be fine even without a training set.
We may try to choose a thresold, and set the predictions to 1, if the feature value f is higer than the threshold, and 0 otherwise. What threshold would you choose?
How do we find a right threshold? Let's first examine this feature: print frequencies (or counts) of each value in the feature f.
For example use np.unique function, check for flags
Function to count frequency of each element
from scipy.stats import itemfreq
itemfreq(f)
array([[ 14, 183279],
[ 15, 852],
[ 19, 546],
[ 20, 183799],
[ 21, 6],
[ 28, 54],
[ 35, 14]])
Do you see how this feature clusters the pairs? Maybe you can guess a good threshold by looking at the values?
In fact, in other situations it can be not that obvious, but in general to pick a threshold you only need to remember the score of your baseline submission and use this information.
Choose a threshold below:
pred = f > 14 # SET THRESHOLD HERE
pred
array([ True, False, True, ..., False, False, False], dtype=bool)
submission = test.loc[:,['pairId']]
submission['Prediction'] = pred.astype(int)
submission.to_csv('submission.csv', index=False)
I want to understand the idea behind this. How we are exploiting the leak from the test data only.
There's a hint in the article. The number of positive pairs should be 1000*N*(N−1)/2, while the number of all pairs is 1000*N(1000N−1)/2. Of course, the number of all pairs is much, much larger if the test set was sampled at random.
As the author mentions, after you evaluate your constant prediction of 1s on the test set, you can tell that the sampling was not done at random. The accuracy you obtain is 50%. Had the sampling been done correctly, this value should've been much lower.
Thus, they construct the incidence matrix and calculate the dot product (the measure of similarity) between the representations of our ID features. They then reuse the information about the accuracy obtained with constant predictions (at 50%) to obtain the corresponding threshold (f > 14). It's set to be greater than 14 because that constitutes roughly half of our test set, which in turn maps back to the 50% accuracy.
The "magic" value didn't have to be greater than 14. It could have been equal to 14. You could have adjusted this value after some leader board probing (as long as you're capturing half of the test set).
It was observed that the test data was not sampled properly; same-class pairs were oversampled. Thus there is a much higher probability of each pair in the training set to have target=1 than any random pair. This led to the belief that one could construct a similarity measure based only on the pairs that are present in the test, i.e., whether a pair made it to the test is itself a strong indicator of similarity.
Using this insight one can calculate an incidence matrix and represent each id j as a binary array (the i-th element representing the presence of i-j pair in test, and thus representing the strong probability of similarity between them). This is a pretty accurate measure, allowing one to find the "similarity" between two rows just by taking their dot product.
The cutoff arrived at is purely by the knowledge of target-distribution found by leaderboard probing.

Calculating 95 % confidence interval for the mean in python

I need little help. If I have 30 random sample with mean of 52 and variance of 30 then how can i calculate the 95 % confidence interval for the mean with estimated and true variance of 30.
Here you can combine the powers of numpy and statsmodels to get you started:
To produce normally distributed floats with mean of 52 and variance of 30 you can use numpy.random.normal with numbers = np.random.normal(loc=52, scale=30, size=30) where the parameters are:
Parameters
----------
loc : float
Mean ("centre") of the distribution.
scale : float
Standard deviation (spread or "width") of the distribution.
size : int or tuple of ints, optional
Output shape. If the given shape is, e.g., ``(m, n, k)``, then
``m * n * k`` samples are drawn. Default is None, in which case a
single value is returned.
And here's a 95% confidence interval of the mean using DescrStatsW.tconfint_mean:
import statsmodels.stats.api as sms
conf = sms.DescrStatsW(numbers).tconfint_mean()
conf
# output
# (36.27, 56.43)
EDIT - 1
That's not the whole story though... Depending on your sample size, you should use the Z score and not t score that's used by sms.DescrStatsW(numbers).tconfint_mean() here. And I have a feeling that its not coincidental that the rule-of-thumb threshold is 30, and that you have 30 observations in your question. Z vs t also depends on whether or not you know the population standard deviation or have to rely on an estimate from your sample. And those are calculated differently as well. Take a look here. If this is something you'd like me to explain and demonstrate further, I'll gladly take another look at it over the weekend.

How to avoid impression bias when calculate the ctr?

When we train a ctr(click through rate) model, sometimes we need calcute the real ctr from the history data, like this
#(click)
ctr = ----------------
#(impressions)
We know that, if the number of impressions is too small, the calculted ctr is not real. So we always set a threshold to filter out the large enough impressions.
But we know that the higher impressions, the higher confidence for the ctr. Then my question is that: Is there a impressions-normalized statistic method to calculate the ctr?
Thanks!
You probably need a representation of confidence interval for your estimated ctr. Wilson score interval is a good one to try.
You need below stats to calculate the confidence score:
\hat p is the observed ctr (fraction of #clicked vs #impressions)
n is the total number of impressions
zα/2 is the (1-α/2) quantile of the standard normal distribution
A simple implementation in python is shown below, I use z(1-α/2)=1.96 which corresponds to a 95% confidence interval. I attached 3 test results at the end of the code.
# clicks # impressions # conf interval
2 10 (0.07, 0.45)
20 100 (0.14, 0.27)
200 1000 (0.18, 0.22)
Now you can set up some threshold to use the calculated confidence interval.
from math import sqrt
def confidence(clicks, impressions):
n = impressions
if n == 0: return 0
z = 1.96 #1.96 -> 95% confidence
phat = float(clicks) / n
denorm = 1. + (z*z/n)
enum1 = phat + z*z/(2*n)
enum2 = z * sqrt(phat*(1-phat)/n + z*z/(4*n*n))
return (enum1-enum2)/denorm, (enum1+enum2)/denorm
def wilson(clicks, impressions):
if impressions == 0:
return 0
else:
return confidence(clicks, impressions)
if __name__ == '__main__':
print wilson(2,10)
print wilson(20,100)
print wilson(200,1000)
"""
--------------------
results:
(0.07048879557839793, 0.4518041980521754)
(0.14384999046998084, 0.27112660859398174)
(0.1805388068716823, 0.22099327100894336)
"""
If you treat this as a binomial parameter, you can do Bayesian estimation. If your prior on ctr is uniform (a Beta distribution with parameters (1,1)) then your posterior is Beta(1+#click, 1+#impressions-#click). Your posterior mean is #click+1 / #impressions+2 if you want a single summary statistic of this posterior, but you probably don't, and here's why:
I don't know what your method for determining whether ctr is high enough, but let's say you're interested in everything with ctr > 0.9. You can then use the cumulative density function of the beta distribution to look at what proportion of probability mass is over the 0.9 threshold (this will just be 1 - the cdf at 0.9). In this way, your threshold will naturally incorporate uncertainty about the estimate because of limited sample size.
There are many ways to calculate this confidence interval. An alternative to the Wilson Score is the Clopper-Perrson interval, which I found useful in spreadsheets.
Upper Bound Equation
Lower Bound Equation
Where
B() is the the Inverse Beta Distribution
alpha is the confidence level error (e.g for 95% confidence-level, alpha is 5%)
n is the number of samples (e.g. impressions)
x is the number of successes (e.g. clicks)
In Excel an implementation for B() is provided by the BETA.INV formula.
There is no equivalent formula for B() in Google Sheets, but a Google Apps Script custom function can be adapted from the JavaScript Statistical Library (e.g search github for jstat)

Ways to calculate similarity

I am doing a community website that requires me to calculate the similarity between any two users. Each user is described with the following attributes:
age, skin type (oily, dry), hair type (long, short, medium), lifestyle (active outdoor lover, TV junky) and others.
Can anyone tell me how to go about this problem or point me to some resources?
Another way of computing (in R) all the pairwise dissimilarities (distances) between observations in the data set. The original variables may be of mixed types. The handling of nominal, ordinal, and (a)symmetric binary data is achieved by using the general dissimilarity coefficient of Gower (Gower, J. C. (1971) A general coefficient of similarity and some of its properties, Biometrics 27, 857–874). For more check out this on page 47. If x contains any columns of these data-types, Gower's coefficient will be used as the metric.
For example
x1 <- factor(c(10, 12, 25, 14, 29))
x2 <- factor(c("oily", "dry", "dry", "dry", "oily"))
x3 <- factor(c("medium", "short", "medium", "medium", "long"))
x4 <- factor(c("active outdoor lover", "TV junky", "TV junky", "active outdoor lover", "TV junky"))
x <- cbind(x1,x2,x3,x4)
library(cluster)
daisy(x, metric = "euclidean")
you'll get :
Dissimilarities :
1 2 3 4
2 2.000000
3 3.316625 2.236068
4 2.236068 1.732051 1.414214
5 4.242641 3.741657 1.732051 2.645751
If you are interested on a method for dimensionality reduction for categorical data (also a way to arrange variables into homogeneous clusters) check this
Give each attribute an appropriate weight, and add the differences between values.
enum SkinType
Dry, Medium, Oily
enum HairLength
Bald, Short, Medium, Long
UserDifference(user1, user2)
total := 0
total += abs(user1.Age - user2.Age) * 0.1
total += abs((int)user1.Skin - (int)user2.Skin) * 0.5
total += abs((int)user1.Hair - (int)user2.Hair) * 0.8
# etc...
return total
If you really need similarity instead of difference, use 1 / UserDifference(a, b)
You probably should take a look for
Data Mining and Data Warehousing (Essential)
Machine Learning (Extra)
Artificial Neural Networks (Especially SOM)
Pattern Recognition (Related)
These topics will let you your program recognize similarities and clusters in your users collection and try to adapt to them...
You can then know different hidden common groups of related users... (i.e users with green hair usually do not like watching TV..)
As an advice, try to use ready implemented tools for this feature instead of implementing it yourself...
Take a look at Open Directory Data Mining Projects
Three steps to achieve a simple subjective metric for difference between two datapoints that might work fine in your case:
Capture all your variables in a representative numeric variable, for example: skin type (oily=-1, dry=1), hair type (long=2, short=0, medium=1),lifestyle (active outdoor lover=1, TV junky=-1), age is a number.
Scale all numeric ranges so that they fit the relative importance you give them for indicating difference. For example: An age difference of 10 years is about as different as the difference between long and medium hair, and the difference between oily and dry skin. So 10 on the age scale is as different as 1 on the hair scale is as different as 2 on the skin scale, so scale the difference in age by 0.1, that in hair by 1 and and that in skin by 0.5
Use an appropriate distance metric to combine the differences between two people on the various scales in one overal difference. The smaller this number, the more similar they are. I'd suggest simple quadratic difference as a first attempt at your distance function.
Then the difference between two people could be calculated with (I assume Person.age, .skin, .hair, etc. have already gone through step 1 and are numeric):
double Difference(Person p1, Person p2) {
double agescale=0.1;
double skinscale=0.5;
double hairscale=1;
double lifestylescale=1;
double agediff = (p1.age-p2.age)*agescale;
double skindiff = (p1.skin-p2.skin)*skinscale;
double hairdiff = (p1.hair-p2.hair)*hairscale;
double lifestylediff = (p1.lifestyle-p2.lifestyle)*lifestylescale;
double diff = sqrt(agediff^2 + skindiff^2 + hairdiff^2 + lifestylediff^2);
return diff;
}
Note that diff in this example is not on a nice scale like (0..1). It's value can range from 0 (no difference) to something large (high difference). Also, this method is almost completely unscientific, it is just designed to quickly give you a working difference metric.
Look at algorithms for computing srting difference. Its very similar to what you need. Store your attributes as a bit string and compute the distance between the strings
You should read these two topics.
Most popular clustering algorithm k - means
And similarity matrix are essential in clustering

Resources