Conditional probability for fake reviews

I am working on a conditional probability question.
A = the event that a review is legit
B = the event that the classifier guesses correctly
P(A) = 0.98 → P(A') = 0.02
P(B|A') = 0.95
P(B|A) = 0.90
The question asks for: P(A'|B) = ?
P(A'|B) = P(B|A') * P(A') / P(B)
P(B) = P(B and A') + P(B and A)
     = P(B|A') * P(A') + P(B|A) * P(A)
     = 0.95 * 0.02 + 0.90 * 0.98
     = 0.901
P(A'|B) = P(B|A') * P(A') / P(B)
        = 0.95 * 0.02 / 0.901
        = 0.021
However, my result is not among the answer choices. Can you please tell me if I am missing anything, or if my logic is incorrect?

Example with numbers
This example is meant as an intuitive way to understand how Bayes' formula works:
Let's assume we have 10,000 typical reviews. We calculate what we would expect to happen to these 10,000 reviews:
9,800 are real
200 are fake
To predict how many reviews are classified as fake:
Of the 9,800 real ones, 10% are classified as fake → 9800 * 0.10 = 980
Of the 200 fake ones, 95% are classified as fake → 200 * 0.95 = 190
980 + 190 = 1,170 are classified as fake.
Now we have all the pieces we need to calculate the probability that a review is fake, given that it is classified as such:
All reviews that are classified as fake → 1,170
Of those, actually fake → 190
190 / 1170 = 0.1623, or 16.23%
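The same counting argument takes only a few lines of Python (numbers taken straight from the text):
total = 10_000
real, fake = 9_800, 200                # P(A) = 0.98, P(A') = 0.02
flagged_real = real * 0.10             # real reviews classified as fake
flagged_fake = fake * 0.95             # fake reviews classified as fake
flagged = flagged_real + flagged_fake  # 980 + 190 = 1170
print(flagged_fake / flagged)          # 0.1623... ≈ 16.23%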
Using general Bayes' theorem
Let's set up the events. Note that my version of event B is slightly different from yours.
A: real review
A': fake review
B: predicted real
B': predicted fake
P(A'|B'): the probability that a review is actually fake, given that it is predicted to be fake
Now that we have our events defined, we can go ahead with Bayes:
P(A'|B') = P(A' and B') / P(B') # Bayes' formula
= P(A' and B') / (P(A and B') + P(A' and B')) # Law of total probability
We also know the following, by the multiplication rule for conditional probability:
P(A and B') = P(A) * P(B'|A)
= 0.98 * 0.10
= 0.098
P(A' and B') = P(A') * P(B'|A')
= 0.02 * 0.95
= 0.019
Putting the pieces together yields:
P(A'|B') = 0.019 / (0.098 + 0.019) = 0.1623
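As a quick check, here is the same computation as a Python sketch, using only the probabilities given above:
p_real, p_fake = 0.98, 0.02
p_flag_given_fake = 0.95  # P(B'|A')
p_flag_given_real = 0.10  # P(B'|A)
p_flag = p_real * p_flag_given_real + p_fake * p_flag_given_fake
print(p_fake * p_flag_given_fake / p_flag)  # 0.16239... ≈ 16.23%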


Are there conditions where KL divergence becomes arg-symmetric? Specifically, when KL(X,Y) is maximized, is KL(Y,X) also maximized?

The Kullback–Leibler divergence is famously asymmetric: KL(X,Y) != KL(Y,X).
However, let X* = argmax_X KL(X,Y). Then what do we know about KL(Y,X*)? Is it as large as possible?
Suppose I have a binary variable Y and a much more complicated, multidimensional (but discrete) distribution X.
If I find an X that maximizes KL(X,Y), does that X also maximize KL(Y,X) (for the same Y)?
Suppose the outcome Y is getting a loan. Only 10% of people in the dataset get a loan: P(Y) = .1
However, among white males the probability of getting a loan increases to 20%: P(Y|white, male) = .2
Furthermore, let's say white males make up 30% of the dataset: P(WM) = .30
From this we can also deduce that white males get 60% of all loans: P(WM|Y) = .6
We get
KL(WM,Y) = .2 * ln(.2/.1) + .8 * ln(.8/.9) ≈ 0.044
In the other direction we have
KL(Y,WM) = .6 * ln(.6/.3) + .4 * ln(.4/.7) ≈ 0.192
Now obviously these two values do not equal each other. However, can we prove that no other X will push KL(Y,X) higher than this?
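A quick numerical check of both directions in Python (Bernoulli distributions, as in the example):
from math import log

def kl_bern(p, q):
    # KL divergence between Bernoulli(p) and Bernoulli(q), in nats
    return p * log(p / q) + (1 - p) * log((1 - p) / (1 - q))

print(kl_bern(0.2, 0.1))  # KL(WM, Y) ≈ 0.0444
print(kl_bern(0.6, 0.3))  # KL(Y, WM) ≈ 0.1921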

Python - export the final random forests tree for Graphviz

I have Python code with a decision tree and random forests. The decision tree finds the biggest contributor using:
contr = decisiontree.feature_importances_.max() * 100
contr_full = decisiontree.feature_importances_ * 100
#Showing name
location = pd.to_numeric(np.where(contr_full == contr)[0][0])
result = list(df_dmy)[location + 1]
This returns the biggest contributor in my dataset; the tree is then exported to Graphviz format using:
tree.export_graphviz(rpart, out_file=path_file + '\\Decision Tree Code for Graphviz.dot',
                     filled=True,
                     feature_names=list(df_dmy.drop(['Reason of Removal'], axis=1).columns),
                     impurity=False, label=None, proportion=True,
                     class_names=['Unscheduled', 'Scheduled'], rounded=True)
In the case of random forests, I have managed to export every tree that is used there (100 trees):
i = 0
for tree_data in rf.estimators_:
    with open('tree_' + str(i) + '.dot', 'w') as my_file:
        my_file = tree.export_graphviz(tree_data, out_file=my_file)
    i = i + 1
This, of course, generates 100 .dot files containing the different trees. Not every tree, however, contains the information that is needed, since some trees show a different result. I do know the biggest contributor of the classifier, but I also want to see the decision tree with that result.
What I tried was:
i = 0
for tree_data in rf.estimators_:
    # Feature importance
    df_trees = tree_data.tree_.threshold
    contr = df_trees.max() * 100
    contr_full = df_trees * 100
    # Showing name
    location = pd.to_numeric(np.where(contr_full == contr)[0][0])
    result = print(list(df_dmy)[location + 1])
Using this, I get the error:
IndexError: list index out of range
and I have no idea what is wrong here.
I wanted a dataframe of the biggest contributors together with their contributions, in order to filter down to the actual biggest contributor and its contribution. See example:
Result (in a dataframe) =
Result Contribution
0 Car 0.74
1 Bike 0.71
2 Car 0.79
Python already knows that the random forest gave 'Car' as the biggest contributor, so the first filter is to remove everything except 'Car':
Result Contribution
0 Car 0.74
2 Car 0.79
Then it has to search for the highest contribution and retrieve the index.
Result Contribution
2 Car 0.79
Then it has to export the tree information corresponding to that index.
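In pandas, that filter-then-max step could look roughly like this (df here is the small hypothetical dataframe sketched above):
import pandas as pd

df = pd.DataFrame({'Result': ['Car', 'Bike', 'Car'],
                   'Contribution': [0.74, 0.71, 0.79]})
best = df[df['Result'] == 'Car']     # keep only rows for the overall winner
idx = best['Contribution'].idxmax()  # index of the strongest tree -> 2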
I know it is quite a long story, but I hope someone knows how to finish this code.
Regards, Ganesh
names = []
contributors = []
df = pd.DataFrame(columns=['Parameter', 'Value'])
for tree_data in rf.estimators_:
    # Feature importance
    contr = tree_data.feature_importances_.max() * 100
    contr_full = tree_data.feature_importances_ * 100
    contr_location = pd.to_numeric(np.where(contr_full == contr)[0][0])
    names.append(list(titanic_dmy.columns)[contr_location + 1])
    contributors.append(contr)
df['Parameter'] = np.array(names)
df['Value'] = np.array(contributors)
idx = df['Value'].idxmax()  # index of the first tree with the largest contribution

# Export to Graphviz
tree.export_graphviz(rf.estimators_[idx], out_file=path_file + '\\RF Decision Tree for Graphviz.dot',
                     filled=True, max_depth=graphviz_leafs,
                     feature_names=list(titanic_dmy.drop(['survived'], axis=1).columns),
                     impurity=False, label=None, proportion=True,
                     class_names=['Unscheduled', 'Scheduled'], rounded=True, precision=2)

How can I interpret the P-value of the nonlinear term under rms's anova()?

Here I demonstrate a survival model with an rcs() term. Is anova() in the rms package the way to test the linearity of the association? And how should I interpret the P-value of the Nonlinear term (0.094 here)? Does it support adding an rcs() term to the Cox model?
library(rms)
data(pbc)
d <- pbc
rm(pbc, pbcseq)
d$status <- ifelse(d$status != 0, 1, 0)
dd = datadist(d)
options(datadist='dd')
# rcs model
m2 <- cph(Surv(time, status) ~ rcs(albumin, 4), data=d)
anova(m2)
Wald Statistics          Response: Surv(time, status)

Factor      Chi-Square d.f. P
albumin     82.80      3    <.0001
 Nonlinear   4.73      2    0.094
TOTAL       82.80      3    <.0001
The proper way to test is to compare the log-likelihood (equivalently, the deviance) across two models, a full and a reduced one:
m2 <- cph(Surv(time, status) ~ rcs(albumin, 4), data=d)
anova(m2)
m <- cph(Surv(time, status) ~ albumin, data=d)
p.val <- 1 - pchisq(2 * (m2$loglik[2] - m$loglik[2]), 2)  # LR statistic = 2 * difference in log-likelihood
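Here the quantity passed to pchisq() is the likelihood-ratio statistic,
LR = 2 * (loglik_full - loglik_reduced),
which is asymptotically chi-squared with 2 degrees of freedom: one for each of the two nonlinear spline terms that rcs(albumin, 4) adds to the linear model.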
You can see the difference in inference between the less accurate Wald statistic (which in your case was not significant anyway, since the p-value was > 0.05) and this more accurate method in the example that Harrell uses on his ?cph help page. Using his example:
> anova(f)
Wald Statistics          Response: S

Factor      Chi-Square d.f. P
age         57.75      3    <.0001
 Nonlinear   8.17      2    0.0168
sex         18.75      1    <.0001
TOTAL       75.63      4    <.0001
You would incorrectly conclude that the nonlinear term was "significant" at the conventional 0.05 level, despite the fact that the code creating the model was constructed to be entirely linear in age (on the log-hazard scale):
h <- .02*exp(.04*(age-50)+.8*(sex=='Female'))
Create a reduced model and compare:
f0 <- cph(S ~ age + sex, x=TRUE, y=TRUE)
anova(f0)
#-------------
Wald Statistics          Response: S

Factor  Chi-Square d.f. P
age     56.64      1    <.0001
sex     16.26      1    1e-04
TOTAL   75.85      2    <.0001
The difference in deviance is not significant with 2 degrees of freedom difference:
1-pchisq((f$loglik[2]- f0$loglik[2]),2)
[1] 0.1243212
I don't know why Harrell leaves this example in, because I've taken his RMS course and know that he endorses the cross-model comparison of deviance as the more accurate approach.

JAGS: node inconsistent with parents

I am trying to apply Benedict Escoto's method from the paper "Bayesian Claim Severity with Mixed Distributions," published in Variance. I seem to be running into a JAGS simulation problem. When I run the code, JAGS gives me the following error:
Error in node ones[1] Node inconsistent with parents
The data are twofold. First, there is a list of ground-up insurance claims, with information on their age in years, their deductibles (truncation), and whether they have been capped (censoring, true or false). Second, there is a list of prior means and corresponding probability weights for a mixed exponential distribution.
One thing I noticed is that the code works for one set of priors for the mixed exponential distribution but fails for another.
It works with:
Mean Weight
50 0.3
100 0.25
500 0.25
1500 0.1
5000 0.07
20000 0.03
But it fails with:
Mean Weight
3 0.72
14 0.19
42 0.05
138 0.02
503 0.01
1501 0.01
So it may have to do with parameter requirements for the mixed exponential distribution.
The data packet for JAGS is assembled as below:
jags.data <- list(claims=(claim.df$x[!claim.df$capped]
                          - claim.df$truncation[!claim.df$capped]),
                  capped.claims=(claim.df$x[claim.df$capped]
                                 - claim.df$truncation[claim.df$capped]),
                  alpha=alpha,
                  means=actual.means,
                  ones=rep(1, sum(claim.df$capped)),
                  ages=claim.df$age[!claim.df$capped],
                  capped.ages=claim.df$age[claim.df$capped],
                  trend.shape=trend.shape,
                  trend.rate=1/trend.scale)
Notice that the object "ones" is given a value of 1 for each capped claim.
The initial values are supplied as below:
jags.init <- list(means=list(weights=prior.weights),
                  equal=list(weights=rep(1/m, m)))
Some miscellaneous values are provided as follows:
m <- length(actual.means)
alpha0 <- 20
alpha <- prior.weights * alpha0
trend.prior.mu <- .05
trend.prior.sigma <- .01
trend.scale <- trend.prior.sigma^2 / (1+trend.prior.mu)
trend.shape <- (1+trend.prior.mu)/trend.scale
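As an aside, the last four lines moment-match a gamma prior for the trend factor. A quick check in Python (values copied from above) confirms the implied mean of 1 + trend.prior.mu and standard deviation of trend.prior.sigma:
mu, sigma = 0.05, 0.01
scale = sigma**2 / (1 + mu)
shape = (1 + mu) / scale
print(shape * scale)              # mean = shape/rate: 1.05
print((shape * scale**2) ** 0.5)  # sd: 0.01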
The JAGS model is coded as below:
model <- "model {
weights ~ ddirch(alpha)
trend.factor ~ dgamma(trend.shape, trend.rate)
for (i in 1:length(claims)) {
buckets[i] ~ dcat(weights)
mu[i] <- means[buckets[i]] / trend.factor^ages[i]
claims[i] ~ dexp(1/mu[i])
}
for (i in 1:length(capped.claims)) {
capped.buckets[i] ~ dcat(weights)
capped.mu[i] <- means[capped.buckets[i]]/trend.factor^capped.ages[i]
prob.capped[i] <- exp(-capped.claims[i]/capped.mu[i])
ones[i] ~ dbern(prob.capped[i])
}
}"
Dirichlet, categorical, and gamma distributions are used for the priors. The "ones" vector is Bernoulli distributed to characterize claims as capped: each observed 1 contributes prob.capped[i] = P(claim exceeds its cap) to the likelihood (the "ones trick" for censoring).
Finally, the model is run in JAGS with the following:
model.out <- autorun.jags(model, data=jags.data, inits=jags.init,
                          monitor=c("weights", "trend.factor"),
                          startburnin=1000, startsample=5000,
                          n.chains=n.chains, interactive=FALSE, thin=thin.factor)
Does anyone have an idea what is going wrong? Thanks.

How to find the fundamental frequency of a guitar string sound?

I want to build a guitar tuner app for iPhone. My goal is to find the fundamental frequency of the sound generated by a guitar string. I have used bits of code from the aurioTouch sample provided by Apple to calculate the frequency spectrum, and I find the frequency with the highest amplitude. It works fine for pure tones (those that have only one frequency), but for sounds from a guitar string it produces wrong results. I have read that this is because of the overtones generated by the guitar string, which might have higher amplitudes than the fundamental. How can I find the fundamental frequency so that it works for guitar strings? Is there an open-source library in C/C++/Obj-C for sound analysis (or signal processing)?
You can use the signal's autocorrelation, which is the inverse transform of the magnitude squared of the DFT. If you're sampling at 44100 samples/s, then an 82.4 Hz fundamental is about 535 samples, whereas 1479.98 Hz is about 30 samples. Look for the peak positive lag in that range (e.g. from 28 to 560). Make sure your window is at least two periods of the longest fundamental, which would be 1070 samples here. Rounding up to the next power of two gives a 2048-sample buffer. For better frequency resolution and a less biased estimate, use a longer buffer, but not so long that the signal is no longer approximately stationary. Here's an example in Python:
from pylab import *
import wave

fs = 44100.0  # sample rate
K = 3         # number of windows
L = 8192      # 1st pass window overlap, 50%
M = 16384     # 1st pass window length
N = 32768     # 1st pass DFT length: acyclic correlation

# load a sample of guitar playing an open string 6
# with a fundamental frequency of 82.4 Hz (in theory),
# but this sample is actually at about 81.97 Hz
g = frombuffer(wave.open('dist_gtr_6.wav').readframes(-1),
               dtype='int16')
g = g / float64(max(abs(g)))  # normalize to +/- 1.0
mi = len(g) // 4              # start index

def welch(x, w, L, N):
    # Welch's method
    M = len(w)
    K = (len(x) - L) // (M - L)
    Xsq = zeros(N//2 + 1)  # len(N-point rfft) = N/2 + 1
    for k in range(K):
        m = k * (M - L)
        xt = w * x[m:m+M]
        # use rfft for efficiency (assumes x is real-valued)
        Xsq = Xsq + abs(rfft(xt, N)) ** 2
    Xsq = Xsq / K
    Wsq = abs(rfft(w, N)) ** 2
    bias = irfft(Wsq)       # for unbiasing Rxx and Sxx
    p = dot(x, x) / len(x)  # avg power, used as a check
    return Xsq, bias, p

# first pass: acyclic autocorrelation
x = g[mi:mi + K*M - (K-1)*L]  # len(x) = 32768
w = hamming(M)  # hamming[m] = 0.54 - 0.46*cos(2*pi*m/M)
                # reduces the side lobes in the DFT
Xsq, bias, p = welch(x, w, L, N)
Rxx = irfft(Xsq)               # acyclic autocorrelation
Rxx = Rxx / bias               # unbias (bias is tapered)
mp = argmax(Rxx[28:561]) + 28  # index of 1st peak in 28 to 560

# 2nd pass: cyclic autocorrelation
N = M = L - (L % mp)  # window an integer number of periods,
                      # shortened to ~8192 for stationarity
x = g[mi:mi + K*M]    # data for K windows
w = ones(M); L = 0    # rectangular, non-overlapping
Xsq, bias, p = welch(x, w, L, N)
Rxx = irfft(Xsq)               # cyclic autocorrelation
Rxx = Rxx / bias               # unbias (bias is constant)
mp = argmax(Rxx[28:561]) + 28  # index of 1st peak in 28 to 560

Sxx = Xsq / bias[0]
Sxx[1:-1] = 2 * Sxx[1:-1]  # fold the freq axis
Sxx = Sxx / N              # normalize S for avg power
n0 = N // mp
np = argmax(Sxx[n0-2:n0+3]) + n0 - 2  # bin of the nearest peak power

# check
print("\nAverage Power")
print(" p:", p)
print("Rxx:", Rxx[0])          # should equal the dot product, p
print("Sxx:", sum(Sxx), '\n')  # should equal Rxx[0]

figure().subplots_adjust(hspace=0.5)
subplot2grid((2,1), (0,0))
title('Autocorrelation, R$_{xx}$'); xlabel('Lags')
mr = r_[:3 * mp]
plot(Rxx[mr]); plot(mp, Rxx[mp], 'ro')
xticks(mp/2 * r_[1:6])
grid(); axis('tight'); ylim(1.25*min(Rxx), 1.25*max(Rxx))
subplot2grid((2,1), (1,0))
title('Power Spectral Density, S$_{xx}$'); xlabel('Frequency (Hz)')
fr = r_[:5 * np]; f = fs * fr / N
vlines(f, 0, Sxx[fr], colors='b', linewidth=2)
xticks((fs * np/N * r_[1:5]).round(3))
grid(); axis('tight'); ylim(0, 1.25*max(Sxx[fr]))
show()
Output:
Average Power
p: 0.0410611012542
Rxx: 0.0410611012542
Sxx: 0.0410611012542
The peak lag is 538, which is 44100/538 = 81.97 Hz. The first-pass acyclic DFT shows the fundamental at bin 61, which is 82.10 +/- 0.67 Hz. The 2nd pass uses a window length of 538*15 = 8070, so the DFT frequencies include the fundamental period and harmonics of the string. This enables an unbiased cyclic autocorrelation for an improved PSD estimate with less harmonic spreading (i.e. the correlation can wrap around the window periodically).
Edit: Updated to use Welch's method to estimate the autocorrelation. Overlapping the windows compensates for the Hamming window. I also calculate the tapered bias of the Hamming window to unbias the autocorrelation.
Edit: Added a 2nd pass with cyclic correlation to clean up the power spectral density. This pass uses 3 non-overlapping rectangular windows of length 538*15 = 8070 (short enough to be nearly stationary). The bias for cyclic correlation is a constant, instead of the Hamming window's tapered bias.
Finding the musical pitches in a chord is far more difficult than estimating the pitch of one single string or note played at a time. The overtones for the multiple notes in a chord might all be overlapping and interleaving. And all the notes in common chords may themselves be at overtone frequencies for one or more non-existent lower pitched notes.
For single notes, autocorrelation is a common technique used by some guitar tuners. But with autocorrelation, you have to be aware of some potential octave uncertainty, as guitars may produce inharmonic and decaying overtones which thus don't exactly match from pitch period to pitch period. Cepstrum and Harmonic Product Spectrum are two other pitch estimation methods which may or may not have different problems, depending on the guitar and the note.
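To make the single-note case concrete, here is a minimal sketch of that autocorrelation approach (the function name and lag bounds are illustrative, and it will show exactly the octave ambiguity described above on inharmonic signals):
import numpy as np

def acf_pitch(x, fs, fmin=70.0, fmax=1000.0):
    # autocorrelation at non-negative lags
    r = np.correlate(x, x, mode='full')[len(x)-1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + np.argmax(r[lo:hi])  # strongest lag in the allowed range
    return fs / lag                 # pitch estimate in Hz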
RAPT appears to be one published algorithm for more robust pitch estimation. YIN is another.
Also, Objective-C is a superset of ANSI C, so you can use any C DSP routines you find for pitch estimation within an Objective-C app.
Use libaubio (link) and be happy. Trying to implement a fundamental frequency estimator myself was one of my biggest time sinks. If you want to do it yourself, I advise you to follow the YINFFT method (link).
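For reference, a rough sketch of YINFFT-based detection with aubio's Python bindings (the C API is analogous; the file name and buffer sizes here are placeholders):
import aubio

samplerate, buf_size, hop_size = 44100, 2048, 512
detector = aubio.pitch("yinfft", buf_size, hop_size, samplerate)
detector.set_unit("Hz")

# stream the file hop by hop and print one pitch estimate per hop
src = aubio.source('dist_gtr_6.wav', samplerate, hop_size)
while True:
    samples, read = src()
    print(detector(samples)[0])  # estimated f0 in Hz for this hop
    if read < hop_size:
        break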
