Are there conditions where KL divergence becomes arg-symmetric? Specifically, when KL(X,Y) is maximized, is KL(Y,X) also maximized?

The Kullback–Leibler divergence is famously asymmetric: KL(X,Y) != KL(Y,X).
However, let X* = arg max_X KL(X,Y). Then what do we know about KL(Y,X*)? Is it as large as possible?
Suppose I have a binary variable Y and a much more complicated, multidimensional (but discrete) distribution X.
If I find an X that maximizes KL(X,Y), does that X also maximize KL(Y,X) (for the same Y)?
Suppose the outcome Y is getting a loan. Only 10% of people in the dataset get a loan. P(Y) = .1
However, among white males the probability of getting a loan increases to 20% P(Y|white,male) = .2
Furthermore, let's say white males make up 30% of the dataset: P(WM) = .30
From this we can also deduce, via Bayes' rule P(WM|Y) = P(Y|WM) P(WM) / P(Y) = .2 * .3 / .1, that WM get 60% of all loans: P(WM|Y) = .6
We get
KL(WM,Y) = .2 * ln(.2/.1) + .8 * ln(.8/.9)
In the other direction we have
KL(Y,WM) = .6 * ln(.6/.3) + .4 * ln(.4/.7)
Now obviously these two values do not equal each other. However, can we prove that no other X makes KL(Y,X) larger than this one does?
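For reference, here is a quick numeric check of the two divergences in the loan example (plain Python, natural logs as in the formulas above). It only evaluates the two expressions; it does not settle the maximization question.
from math import log

# P(Y|WM) = 0.2 against the marginal P(Y) = 0.1
kl_wm_y = 0.2 * log(0.2 / 0.1) + 0.8 * log(0.8 / 0.9)

# P(WM|Y) = 0.6 against the marginal P(WM) = 0.3
kl_y_wm = 0.6 * log(0.6 / 0.3) + 0.4 * log(0.4 / 0.7)

print(kl_wm_y)  # about 0.044
print(kl_y_wm)  # about 0.192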

Related

P-values for two-tailed binomial test exceed 1

Say I want to test if a coin is fair.
An experiment is performed to determine whether a coin flip is fair (50% chance
of landing heads or tails) or unfairly biased, either toward heads (> 50% chance of
landing heads) or toward tails (< 50% chance of landing heads). Since we consider both biased alternatives, a two-tailed test is performed.
H0 = Coin is fair
H1 = Coin is unfair
Here is the experiment result: 10 Heads and 10 Tails
Then I calculate the probability of the relevant events assuming H0, i.e. that the coin is fair.
The probability of exactly, or more than, 10 Heads out of 20 tosses is p = .588
By symmetry, the probability of exactly, or more than, 10 Tails out of 20 tosses is the same, .588
Thus, the p-value for the coin turning up the same face at least 10 times out of 20 total flips is .588 + .588 = 1.176 > 1
But p-value cannot be larger than 1, may I know what is wrong here?
Ref:
PROBABILITY VALUE (p-Value)
Binomial Test Calculator
The case of exactly 10 Heads + 10 Tails is counted in both probabilities.
You can see that P(exactly 10 Tails) = 0.176, and 0.588 + 0.588 = 1.176 = 1 + P(exactly 10 Heads and 10 Tails).
The general relationship for summing event probabilities is
P(A∪B) = P(A) + P(B) - P(A∩B)
Expressed in words, you can only use a simple sum of probabilities for disjoint events.
Your events are A = {#Heads ≥ 10} and B = {#Tails ≥ 10}. Since {#Tails ≥ 10} is the same event as {#Heads ≤ 10}, it is clear that the two events are not disjoint: they both include the outcome {#Heads = 10}. Your sum exceeds 1 because you've neglected the intersection term, which has P{#Heads = 10} = 0.176.
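As a numerical check of that inclusion-exclusion argument (a sketch assuming scipy is available):
from scipy.stats import binom

n, p = 20, 0.5
p_heads = binom.sf(9, n, p)    # P(#Heads >= 10), about 0.588
p_tails = binom.sf(9, n, p)    # P(#Tails >= 10), the same by symmetry
p_both = binom.pmf(10, n, p)   # P(#Heads = 10), about 0.176, the overlap

p_value = p_heads + p_tails - p_both
print(p_value)  # 1.0 (up to floating point): the observed 10/10 split is the least extreme outcome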

How do you find the sample space of flipping unfair coins?

So, for an unbiased coin, the probability of getting 2 heads out of 3 flips is 3C2 * 1/2 * 1/2 * 1/2 = 3/8. Since the formula for probability is favourable events divided by all possible events, we can say that there are 8 possible events here.
Now flip a biased coin where the probability of getting heads is 80%,
so the probability of getting 2 heads out of 3 flips is
3C2 * 0.8 * 0.8 * 0.2 = 3/7.8125. So is the sample space 7.8125 here?
It is still 8: there are 8 possible results. It's all about the classical definition of probability.
In the first example (p = 50%) each possible result (for example {head, head, not_head}) has the same probability, which is why we can calculate
**total_prob = count_success/count_total = 3*1.000/8 = 0.375**
In the second (p = 80%) we don't have this assumption anymore, so we cannot use the classical definition of probability (count_success/count_total); instead we have to calculate
**total_prob = sum_success/count_total = 3*1.024/8 = 0.384**
In general, you can imagine that in the 1st example each result has weight = 1.000, while in the 2nd example the results have different weights (for example {head, head, not_head} has weight = 1.024 and {not_head, not_head, not_head} has weight = 0.064).
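To make the weights picture concrete, here is a small sketch (plain Python, with the p = 0.8 coin from the question) that enumerates the same 8-element sample space and sums the outcome probabilities:
from itertools import product

p_head = 0.8
outcomes = list(product("HT", repeat=3))   # still 8 possible results
print(len(outcomes))                        # 8

def prob(outcome):
    # product of per-flip probabilities; with p = 0.5 every outcome would get 1/8
    result = 1.0
    for flip in outcome:
        result *= p_head if flip == "H" else 1 - p_head
    return result

print(sum(prob(o) for o in outcomes))                       # about 1.0, probabilities still sum to 1
print(sum(prob(o) for o in outcomes if o.count("H") == 2))  # about 0.384, matching 3C2 * 0.8 * 0.8 * 0.2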

How to convert intensities to Probabilities in a point pattern using Spatstat in R?

I have two point pattern (ppp) objects, p1 and p2. There are X and Y points in p1 and p2 respectively. I have fitted a ppm model (with location coordinates as independent variables) to p1 and then used it to predict the "intensity" at each of the Y points in p2.
Now I want to get the probability of event occurrence at each of those points/zones in p2. How can I use the predicted intensities for this purpose?
Can I do this using spatstat?
Are there any other alternatives?
The intensity is the expected number of points per unit area. In small areas (such as pixels) you can just multiply the intensity by the pixel area to get the probability of presence of a point in the pixel.
fit <- ppm(p1, .......)
inten <- predict(fit)
pixarea <- with(inten, xstep * ystep)
prob <- inten * pixarea
This rule is accurate provided the prob values are smaller than about 0.4.
In a larger region W, the expected number of points is the integral of the intensity function over that region:
EW <- integral(inten, domain = W)
The result EW is a numeric value, the expected total number of points in W. To get the probability of at least one point,
P <- 1 - exp(-EW)
You can also compute prediction intervals for the number of points, using predict.ppm with argument interval="prediction".
Your question, objective and current method are not very clear to me. It would
be beneficial if you could provide code and graphics that explain more clearly
what you have done and what you are trying to obtain. If you cannot share your
data you can use e.g. the built-in dataset chorley as an example (or simply
simulate artificial data):
library(spatstat)
plot(chorley, cols = c(rgb(0,0,0,1), rgb(.8,0,0,.2)))
X <- split(chorley)
X1 <- X$lung
X2 <- X$larynx
mod <- ppm(X1 ~ polynom(x, y, 2))
inten <- predict(mod)
summary(inten)
#> real-valued pixel image
#> 128 x 128 pixel array (ny, nx)
#> enclosing rectangle: [343.45, 366.45] x [410.41, 431.79] km
#> dimensions of each pixel: 0.18 x 0.1670312 km
#> Image is defined on a subset of the rectangular grid
#> Subset area = 315.291058349571 square km
#> Subset area fraction = 0.641
#> Pixel values (inside window):
#> range = [0.002812544, 11.11172]
#> integral = 978.5737
#> mean = 3.103715
plot(inten)
Predicted intensities at the 58 locations in X2
intenX2 <- predict.ppm(mod, locations = X2)
summary(intenX2)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.1372 4.0025 6.0544 6.1012 8.6977 11.0375
These predicted intensities intenX2[i] say that in a small neighbourhood
around each point X2[i] the estimated number of points from X1 is Poisson
distributed with mean intenX2[i] times the area of the small neighbourhood.
So in fact you have estimated a model that gives, for any small area, a
probability distribution over the number of points occurring in that area. If
you want the distribution in a bigger region you just have to integrate the
intensity over that region.
To get a better answer you have to provide more details about your problem.
Created on 2018-12-12 by the reprex package (v0.2.1)

How to calculate the standard deviation from a histogram? (Python, Matplotlib)

Let's say I have a data set and used matplotlib to draw a histogram of said data set.
n, bins, patches = plt.hist(data, normed=1)
How do I calculate the standard deviation, using the n and bins values that hist() returns? I'm currently doing this to calculate the mean:
s = 0
for i in range(len(n)):
    s += n[i] * ((bins[i] + bins[i+1]) / 2)
mean = s / numpy.sum(n)
which seems to work fine as I get pretty accurate results. However, if I try to calculate the standard deviation like this:
t = 0
for i in range(len(n)):
    t += (bins[i] - mean)**2
std = np.sqrt(t / numpy.sum(n))
my results are way off from what numpy.std(data) returns. Replacing the left bin limits with the central point of each bin doesn't change this either. I have the feeling that the problem is that the n and bins values don't actually contain any information on how the individual data points are distributed within each bin, but the assignment I'm working on clearly demands that I use them to calculate the standard deviation.
You haven't weighted the contribution of each bin with n[i]. Change the increment of t to
t += n[i]*(bins[i] - mean)**2
By the way, you can simplify (and speed up) your calculation by using numpy.average with the weights argument.
Here's an example. First, generate some data to work with. We'll compute the sample mean, variance and standard deviation of the input before computing the histogram.
In [54]: x = np.random.normal(loc=10, scale=2, size=1000)
In [55]: x.mean()
Out[55]: 9.9760798903061847
In [56]: x.var()
Out[56]: 3.7673459904902025
In [57]: x.std()
Out[57]: 1.9409652213499866
I'll use numpy.histogram to compute the histogram:
In [58]: n, bins = np.histogram(x)
mids is the midpoints of the bins; it has the same length as n:
In [59]: mids = 0.5*(bins[1:] + bins[:-1])
The estimate of the mean is the weighted average of mids:
In [60]: mean = np.average(mids, weights=n)
In [61]: mean
Out[61]: 9.9763028267760312
In this case, it is pretty close to the mean of the original data.
The estimated variance is the weighted average of the squared difference from the mean:
In [62]: var = np.average((mids - mean)**2, weights=n)
In [63]: var
Out[63]: 3.8715035807387328
In [64]: np.sqrt(var)
Out[64]: 1.9676136767004677
That estimate is within 2% of the actual sample standard deviation.
The following answer is equivalent to Warren Weckesser's, but may be more familiar to those who prefer to think of the mean as an expected value:
counts, bins = np.histogram(x)
mids = 0.5*(bins[1:] + bins[:-1])
probs = counts / np.sum(counts)
mean = np.sum(probs * mids)
sd = np.sqrt(np.sum(probs * (mids - mean)**2))
Do take note that in certain contexts you may want the unbiased sample variance, where the sum of squared deviations is normalized by N - 1 rather than N.
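For example, a minimal sketch of both normalizations, treating the bin counts as raw frequencies (same setup as above; the variable names are just for illustration):
import numpy as np

x = np.random.normal(loc=10, scale=2, size=1000)
counts, bins = np.histogram(x)
mids = 0.5 * (bins[1:] + bins[:-1])

N = counts.sum()
mean = np.sum(counts * mids) / N
# biased (population-style) estimate divides by N, as above
var_biased = np.sum(counts * (mids - mean)**2) / N
# unbiased (sample) estimate divides by N - 1 instead
var_unbiased = np.sum(counts * (mids - mean)**2) / (N - 1)
print(np.sqrt(var_biased), np.sqrt(var_unbiased))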

How to avoid impression bias when calculating the CTR?

When we train a CTR (click-through rate) model, sometimes we need to calculate the real CTR from the historical data, like this:
ctr = #(clicks) / #(impressions)
We know that if the number of impressions is too small, the calculated CTR is not reliable, so we usually set a threshold and only keep items with enough impressions.
But we also know that the more impressions, the higher the confidence in the CTR. So my question is: is there an impression-normalized statistical method to calculate the CTR?
Thanks!
You probably need a representation of the confidence interval for your estimated CTR. The Wilson score interval is a good one to try.
You need the following quantities to calculate the interval:
p̂ is the observed CTR (the fraction of clicks over impressions)
n is the total number of impressions
z_{α/2} is the (1 − α/2) quantile of the standard normal distribution
A simple implementation in Python is shown below. I use z_{α/2} = 1.96, which corresponds to a 95% confidence interval. Three test results are attached at the end of the code.
clicks    impressions    conf interval
2         10             (0.07, 0.45)
20        100            (0.14, 0.27)
200       1000           (0.18, 0.22)
Now you can set up some threshold to use the calculated confidence interval.
from math import sqrt

def confidence(clicks, impressions):
    # Wilson score interval for the observed CTR
    n = impressions
    if n == 0:
        return 0
    z = 1.96  # 1.96 -> 95% confidence
    phat = float(clicks) / n
    denorm = 1. + (z * z / n)
    enum1 = phat + z * z / (2 * n)
    enum2 = z * sqrt(phat * (1 - phat) / n + z * z / (4 * n * n))
    return (enum1 - enum2) / denorm, (enum1 + enum2) / denorm

def wilson(clicks, impressions):
    if impressions == 0:
        return 0
    else:
        return confidence(clicks, impressions)

if __name__ == '__main__':
    print(wilson(2, 10))
    print(wilson(20, 100))
    print(wilson(200, 1000))
"""
--------------------
results:
(0.07048879557839793, 0.4518041980521754)
(0.14384999046998084, 0.27112660859398174)
(0.1805388068716823, 0.22099327100894336)
"""
If you treat this as a binomial parameter, you can do Bayesian estimation. If your prior on the CTR is uniform (a Beta distribution with parameters (1,1)), then your posterior is Beta(1 + #clicks, 1 + #impressions - #clicks). The posterior mean is (#clicks + 1) / (#impressions + 2) if you want a single summary statistic of this posterior, but you probably don't, and here's why:
I don't know what your method for determining whether the CTR is high enough is, but let's say you're interested in everything with CTR > 0.9. You can then use the cumulative distribution function of the Beta distribution to look at what proportion of the probability mass lies above the 0.9 threshold (this is just 1 minus the CDF at 0.9). In this way, your threshold naturally incorporates the uncertainty in the estimate due to the limited sample size.
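A minimal sketch of that calculation, assuming scipy is available (the counts and the 0.9 threshold are just example values):
from scipy.stats import beta

clicks, impressions = 20, 100
threshold = 0.9

# uniform Beta(1, 1) prior -> Beta(1 + clicks, 1 + impressions - clicks) posterior
posterior = beta(1 + clicks, 1 + impressions - clicks)

posterior_mean = posterior.mean()          # equals (clicks + 1) / (impressions + 2)
prob_above = 1 - posterior.cdf(threshold)  # posterior mass above the threshold
print(posterior_mean, prob_above)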
There are many ways to calculate this confidence interval. An alternative to the Wilson score is the Clopper-Pearson interval, which I found useful in spreadsheets.
Upper bound = B(1 − α/2; x + 1, n − x)
Lower bound = B(α/2; x, n − x + 1)
Where
B() is the inverse Beta distribution (the Beta quantile function)
alpha is the confidence-level error (e.g. for a 95% confidence level, alpha is 5%)
n is the number of samples (e.g. impressions)
x is the number of successes (e.g. clicks)
In Excel an implementation for B() is provided by the BETA.INV formula.
There is no equivalent formula for B() in Google Sheets, but a Google Apps Script custom function can be adapted from a JavaScript statistical library (e.g. search GitHub for jstat).
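Outside a spreadsheet, a minimal sketch of the same interval using scipy's Beta quantile function (beta.ppf plays the role of B() here):
from scipy.stats import beta

def clopper_pearson(x, n, alpha=0.05):
    # exact (Clopper-Pearson) confidence interval for a binomial proportion
    lower = beta.ppf(alpha / 2, x, n - x + 1) if x > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, x + 1, n - x) if x < n else 1.0
    return lower, upper

print(clopper_pearson(200, 1000))  # e.g. 200 clicks out of 1000 impressions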
