I'm working on an implementation of a Naive Bayes Classifier. Programming Collective Intelligence introduces this subject by describing Bayes Theorem as:
Pr(A | B) = Pr(B | A) x Pr(A)/Pr(B)
As well as a specific example relevant to document classification:
Pr(Category | Document) = Pr(Document | Category) x Pr(Category) / Pr(Document)
I was hoping someone could explain the notation used here. What do Pr(A | B) and Pr(A) mean? It looks like some sort of function, but then what does the pipe ("|") mean?
Pr(A | B) = Probability of A happening given that B has already happened
Pr(A) = Probability of A happening
But the above is with respect to the calculation of conditional probability. What you want is a classifier, which uses this principle to decide whether something belongs to a category based on the previous probability.
See http://en.wikipedia.org/wiki/Naive_Bayes_classifier for a complete example
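To connect the notation to the document classification formula, here is a minimal sketch of a word-count Naive Bayes classifier in plain Python. The training data, the function names and the add-one smoothing are assumptions made for illustration; they are not taken from the book.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training data: (category, document text).
TRAINING = [
    ("spam", "cheap pills buy now"),
    ("spam", "buy cheap watches"),
    ("ham", "meeting notes for monday"),
    ("ham", "lunch on monday"),
]

def train(examples):
    """Count documents per category and words per category."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for category, text in examples:
        doc_counts[category] += 1
        for word in text.split():
            word_counts[category][word] += 1
            vocab.add(word)
    return doc_counts, word_counts, vocab

def posteriors(text, doc_counts, word_counts, vocab):
    """Return Pr(Category | Document) for every category."""
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for category, n_docs in doc_counts.items():
        # log Pr(Category)
        log_score = math.log(n_docs / total_docs)
        n_words = sum(word_counts[category].values())
        for word in text.split():
            # Pr(word | Category) with add-one smoothing (an assumption),
            # so that unseen words do not force the product to zero.
            p_word = (word_counts[category][word] + 1) / (n_words + len(vocab))
            log_score += math.log(p_word)
        log_scores[category] = log_score
    # Dividing by Pr(Document) is the same as normalising over categories.
    top = max(log_scores.values())
    unnormalised = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(unnormalised.values())
    return {c: v / z for c, v in unnormalised.items()}

model = train(TRAINING)
result = posteriors("buy cheap pills", *model)
print(result)
```

The "naive" part is the product over words: each word is treated as independent of the others given the category, which is what lets Pr(Document | Category) be computed from simple counts.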
I think they've got you covered on the basics.
Pr(A | B) = Pr(B | A) x Pr(A)/Pr(B)
reads: the probability of A given B is the same as the probability of B given A times the probability of A divided by the probability of B. It's usually used when you can measure the probability of B and you are trying to figure out if B is leading us to believe in A. Or, in other words, we really care about A, but we can measure B more directly, so let's start with what we can measure.
Let me give you one derivation that makes this easier for writing code. It comes from Judea Pearl. I struggled with this a little, but after I realized how Pearl helps us turn theory into code, the light turned on for me.
Prior Odds:
O(H) = P(H) / (1 - P(H))
Likelihood Ratio:
L(e|H) = P(e|H) / P(e|¬H)
Posterior Odds:
O(H|e) = L(e|H)O(H)
In English, we are saying that the odds of something you're interested in (H for hypothesis) are simply the number of times you find something to be true divided by the times you find it not to be true. So, say one house is robbed every day out of 10,000. That means that you have a 1/10,000 chance of being robbed, without any other evidence being considered.
The next one measures the evidence you're looking at: the probability of seeing the evidence when your hypothesis is true, divided by the probability of seeing it when your hypothesis is not true. Say you hear your burglar alarm go off. How often does the alarm go off when it's supposed to (someone opens a window while the alarm is on) versus when it's not supposed to (the wind sets it off)? If you have a 95% chance of a burglar setting off the alarm and a 1% chance of something else setting off the alarm, then you have a likelihood ratio of 95.
Your overall belief, the posterior odds, is just the likelihood ratio times the prior odds. In this case it is:
((0.95/0.01) * ((10**-4)/(1 - (10**-4))))
# => 0.0095009500950095
I don't know if this makes it any more clear, but it tends to be easier to have some code that keeps track of prior odds, other code to look at likelihoods, and one more piece of code to combine this information.
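As a rough sketch of that separation, here is the burglar alarm example split into one function per formula. The function names are made up for illustration, and the last helper, which turns odds back into a probability, is an addition that the formulas above do not spell out.

```python
def prior_odds(p_h):
    """O(H) = P(H) / (1 - P(H))"""
    return p_h / (1.0 - p_h)

def likelihood_ratio(p_e_given_h, p_e_given_not_h):
    """L(e|H) = P(e|H) / P(e|not H)"""
    return p_e_given_h / p_e_given_not_h

def posterior_odds(odds_h, ratio):
    """O(H|e) = L(e|H) * O(H)"""
    return ratio * odds_h

def odds_to_probability(odds):
    """Convert odds back into a probability: p = odds / (1 + odds)."""
    return odds / (1.0 + odds)

o_h = prior_odds(10**-4)              # one house in 10,000 is robbed
l_e = likelihood_ratio(0.95, 0.01)    # alarm given burglar vs. alarm otherwise
o_post = posterior_odds(o_h, l_e)
p_post = odds_to_probability(o_post)
print(o_post, p_post)
```

The posterior odds come out at roughly 0.0095, so even with the alarm ringing the probability of a burglary is still just under 1%, because the prior is so small.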
I have implemented it in Python. It's very easy to understand because all formulas for Bayes theorem are in separate functions:
# Bayes' theorem
def get_outcomes(sample_space, f_name='', e_name=''):
    # Count outcomes, optionally restricted to one outer key (f_name)
    # and/or one inner key (e_name). An empty string means "no restriction".
    outcomes = 0
    for e_k, e_v in sample_space.items():
        if f_name == '' or f_name == e_k:
            for se_k, se_v in e_v.items():
                if e_name != '' and se_k == e_name:
                    outcomes += se_v
                elif e_name == '':
                    outcomes += se_v
    return outcomes

def p(sample_space, f_name):
    # P(F)
    return get_outcomes(sample_space, f_name) / get_outcomes(sample_space, '', '')

def p_inters(sample_space, f_name, e_name):
    # P(F and E)
    return get_outcomes(sample_space, f_name, e_name) / get_outcomes(sample_space, '', '')

def p_conditional(sample_space, f_name, e_name):
    # P(E | F) = P(F and E) / P(F)
    return p_inters(sample_space, f_name, e_name) / p(sample_space, f_name)

def bayes(sample_space, f, given_e):
    # P(F | E) = P(F) * P(E | F) / P(E), with P(E) from the law of total probability
    total = 0
    for e_k, e_v in sample_space.items():
        total += p(sample_space, e_k) * p_conditional(sample_space, e_k, given_e)
    return p(sample_space, f) * p_conditional(sample_space, f, given_e) / total

sample_space = {'UK': {'Boy': 10, 'Girl': 20},
                'FR': {'Boy': 10, 'Girl': 10},
                'CA': {'Boy': 10, 'Girl': 30}}
print('Probability of being from FR:', p(sample_space, 'FR'))
print('Probability to be French Boy:', p_inters(sample_space, 'FR', 'Boy'))
print('Probability of being a Boy given a person is from FR:', p_conditional(sample_space, 'FR', 'Boy'))
print('Probability to be from France given person is Boy:', bayes(sample_space, 'FR', 'Boy'))

sample_space = {'Grow':  {'Up': 160, 'Down': 40},
                'Slows': {'Up': 30, 'Down': 70}}
print('Probability economy is growing when stock is Up:', bayes(sample_space, 'Grow', 'Up'))
Pr(A | B): Conditional probability of A : i.e. probability of A, given that all we know is B
Pr(A) : Prior probability of A
Pr is the probability, Pr(A|B) is the conditional probability.
Check wikipedia for details.
the pipe (|) means "given".
The probability of A given B is equal to the probability of B given A x Pr(A)/Pr(B)
Based on your question I can strongly advise that you need to read some undergraduate book on Probability Theory first. Without this you will not advance properly with your task on Naive Bayes Classifier.
I would recommend you this book http://www.athenasc.com/probbook.html or look at MIT OpenCourseWare.
The pipe is used to represent conditional probability.
Pr(A | B) = Probability of A given B
Example:
Let's say you are not feeling well and you surf the web for the symptoms. And the internet tells you that if you have these symptoms then you have XYZ disease.
In this case:
Pr(A | B) is what you are trying to find out, which is:
The probability of you having XYZ GIVEN THAT you have certain symptoms.
Pr(A) is the probability of having the disease XYZ
Pr(B) is the probability of having those symptoms
Pr(B | A) is what you find out from the internet, which is:
The probability of having the symptoms GIVEN THAT you have the disease.
Related
I have a custom (discrete) probability distribution defined somewhat in the form: f(x)/(sum(f(x')) for x' in a given discrete set X). Also, 0<=x<=1.
So I have been trying to implement it in python 3.8.2, and the problem is that the numerator and denominator both come out to be really small and python's floating point representation just takes them as 0.0.
After calculating these probabilities, I need to sample a random element from an array, whose each index may be selected with the corresponding probability in the distribution. So if my distribution is [p1,p2,p3,p4], and my array is [a1,a2,a3,a4], then probability of selecting a2 is p2 and so on.
So how can I implement this in an elegant and efficient way?
Is there any way I could use the np.random.beta() in this case? Since the difference between the beta distribution and my actual distribution is only that the normalization constant differs and the domain is restricted to a few points.
Note: The Probability Mass function defined above is actually in the form given by the Bayes theorem and f(x)=x^s*(1-x)^f, where s and f are fixed numbers for a given iteration. So the exact problem is that, when s or f become really large, this thing goes to 0.
You could well compute things by working with logs. The point is that while both the numerator and denominator might underflow to 0, their logs won't unless your numbers are really astonishingly small.
You say
f(x) = x^s*(1-x)^t
so
logf (x) = s*log(x) + t*log(1-x)
and you want to compute, say
p = f(x) / Sum{ y in X | f(y)}
so
p = exp( logf(x) - log sum { y in X | f(y)})
= exp( logf(x) - log sum { y in X | exp( logf( y))})
The only difficulty is in computing the second term, but this is a common problem, for example here
On the other hand, computing logsumexp is easy enough to do by hand.
We want
S = log( sum{ i | exp(l[i])})
if L is the maximum of the l[i] then
S = log( exp(L)*sum{ i | exp(l[i]-L)})
= L + log( sum{ i | exp( l[i]-L)})
The last sum can be computed as written, because each term is now between 0 and 1 so there is no danger of overflow, and one of the terms (the one for which l[i]==L) is 1, and so if other terms underflow, that is harmless.
This may however lose a little accuracy. A refinement would be to recognize the set A of indices where
l[i]>=L-eps (eps a user set parameter, eg 1)
And then compute
N = Sum{ i in A | exp(l[i]-L)}
B = log1p( Sum{ i not in A | exp(l[i]-L)}/N)
S = L + log( N) + B
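Putting the basic version together (without the refinement), a minimal sketch in plain Python could look like the following. The support xs, the array items and the values of s and t are made up for illustration, and the sketch assumes 0 < x < 1 strictly, since log(0) is undefined at the endpoints.

```python
import math
import random

def log_f(x, s, t):
    """log of x**s * (1 - x)**t, for 0 < x < 1."""
    return s * math.log(x) + t * math.log1p(-x)

def logsumexp(logs):
    """log(sum(exp(l) for l in logs)), shifted by the maximum for stability."""
    top = max(logs)
    return top + math.log(sum(math.exp(v - top) for v in logs))

def probabilities(xs, s, t):
    """Normalised probabilities f(x) / sum(f(y) for y in xs)."""
    logs = [log_f(x, s, t) for x in xs]
    log_z = logsumexp(logs)
    return [math.exp(v - log_z) for v in logs]

xs = [0.1, 0.3, 0.5, 0.7, 0.9]          # hypothetical discrete set X
items = ["a1", "a2", "a3", "a4", "a5"]  # hypothetical array to sample from
s, t = 4000, 6000                       # large exponents that underflow directly

naive = 0.5 ** s * 0.5 ** t             # underflows to 0.0
probs = probabilities(xs, s, t)

# Pick one element of items, with items[i] chosen with probability probs[i].
choice = random.choices(items, weights=probs, k=1)[0]
print(naive, probs, choice)
```

Terms far below the maximum still underflow to 0.0 after the shift, but as noted above that is harmless because the largest term contributes exactly 1 to the sum.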
I am writing a Python script to calculate implied normal volatility, in line with the Jäckel article (industry standard).
https://jaeckel.000webhostapp.com/ImpliedNormalVolatility.pdf
They say they are using a Generalized Incomplete Gamma Function Inverse.
For a call:
F(x) = v/(K - F) -> find the x that makes this true
Where F(x) is the Inverse Incomplete Gamma Function (not the forward F)
And x = (K - F)/(sigma*sqrt(T)) ; v is the value of a call
For that x, IV = (K - F)/(x*sqrt(T))
Example I am working with:
F=40
K=38
T=100/365
v=5.25
Vol= 20%
Using the equations I should be able to backout Vol of 20%
Scipy has upper and lower Incomplete Gamma Function Inverse in their special functions.
Lower: scipy.special.gammaincinv(a, y) : {a must be positive param}
Upper: scipy.special.gammainccinv(a, y) : {a must be positive param}
Implementation:
from scipy import optimize, special

F = 40
T = 100 / 365
K = 38

def Objective(sig):
    arg = ((F - K) ** 2) / (2 * T * sig ** 2)
    return special.gammaincinv(.5, arg) + special.gammainccinv(.5, arg) + 5.25 / (K - F)

# full_output=True makes brentq return (root, RootResults)
x, result = optimize.brentq(Objective, -20.00, 20.00, args=(), xtol=1.48e-8, rtol=1.48e-8, maxiter=1000, full_output=True)
IV = (K - F) / (x * T ** .5)
print(IV)
I know I am wrong, but Where am I going wrong / how do I fix it and use what I read in the article ?
Did you also post this on the Quantitative Finance Stack Exchange? You may get a better response there.
This is not my field, but it looks like your main problem is that brentq requires the passed Objective function to return values with opposite signs when passed the -20 and 20 arguments. However, this will not end up happening because according to the scipy docs, gammaincinv and gammainccinv always return a value between 0 and infinity.
I'm not sure how to fix this, unfortunately. Did you try implementing the analytic solution (rather than iterative root finding) in the second part of the paper?
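For comparison, here is a sketch of the plain iterative route: root-finding directly on the normal (Bachelier) call price, using a bracket whose endpoints do give opposite signs. This is not the analytic method from the article, just a substitute that is easy to check. It uses the standard undiscounted Bachelier formula and a hand-written bisection so that it needs only the standard library; the function names and the default bracket are assumptions.

```python
import math

def norm_pdf(d):
    return math.exp(-0.5 * d * d) / math.sqrt(2.0 * math.pi)

def norm_cdf(d):
    return 0.5 * (1.0 + math.erf(d / math.sqrt(2.0)))

def bachelier_call(forward, strike, expiry, sigma):
    """Undiscounted call price under the normal (Bachelier) model."""
    s = sigma * math.sqrt(expiry)
    d = (forward - strike) / s
    return (forward - strike) * norm_cdf(d) + s * norm_pdf(d)

def implied_normal_vol(price, forward, strike, expiry, lo=1e-8, hi=1e3, tol=1e-10):
    """Bisection on sigma. The price is increasing in sigma, so the root is
    bracketed whenever price lies between the values at lo and hi."""
    low_price = bachelier_call(forward, strike, expiry, lo)
    high_price = bachelier_call(forward, strike, expiry, hi)
    if not low_price <= price <= high_price:
        raise ValueError("price is not bracketed by [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bachelier_call(forward, strike, expiry, mid) < price:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Round trip with a made-up volatility: price a call, then back the vol out.
sigma_true = 8.0
price = bachelier_call(40, 38, 100 / 365, sigma_true)
sigma_back = implied_normal_vol(price, 40, 38, 100 / 365)
print(price, sigma_back)
```

The same bracket check is what brentq performs: if the objective has the same sign at both ends of the interval, there is nothing for the solver to converge to.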
I am trying to replicate the result of an academic paper below to practice how to apply statistical methods in R.
This is what the paper states:
Pre- and postoutbreak differences in voter intentions.
Across the 32 elections included in primary analyses, the mean voter-intention difference score was greater than zero (M=1.02%), d=0.84, t(31)=2.34, p=.026.
This result is consistent with the pre- and postelection difference in nationwide polling results for the House of Representatives elections, which indicates a general postoutbreak shift toward favoring Republican rather than Democratic candidates. (If the two outliers were included in the analysis, the mean voter-intention difference score was not meaningfully different from zero, p=.937.) (Source screenshot)
The t-test result matched the values in the paper. However, I cannot figure out how to produce the right Cohen's d estimate. I looked into the documentation for the cohen.d function over and over again, googled how to do this, and even watched some boring YouTube videos, but to no avail. The code itself runs, but it gives me a wrong value. Am I missing something? Can someone help me with how I should format the arguments?
# excluding outliers from the dataset
no_outliers <- study2 %>%
  filter(StateSenateRace != "Rhode Island", StateSenateRace != "Hawaii")
# paired t-test
t.test(no_outliers$OctMeanVoterIntentionIndex, no_outliers$SeptMeanVoterIntentionIndex, paired = TRUE, var.equal = TRUE, na.rm = TRUE)
cohen.d(no_outliers$SeptMeanVoterIntentionIndex, no_outliers$OctMeanVoterIntentionIndex, na.rm = TRUE)
Here's the result I got.
Cohen's d
d estimate: -0.02130406 (negligible)
95 percent confidence interval:
lower upper
-0.5171041 0.4744960
Thank you in advance--I wish I could contribute to Stack Overflow as an R expert someday!
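One way to narrow this down is to work backwards from the reported statistics. The call above does not pass a paired argument, so it is presumably computing a two-group d from a pooled standard deviation, which would explain the near-zero estimate. The sketch below (plain Python arithmetic, only the reported t and degrees of freedom are used) tries two common ways of getting an effect size from a paired t statistic. That the paper used either of them is an assumption.

```python
import math

t_stat = 2.34   # reported t(31)
df = 31
n = df + 1      # 32 paired observations

# Effect size on the difference scores: mean(diff) / sd(diff) = t / sqrt(n)
d_z = t_stat / math.sqrt(n)

# A conversion often applied to t statistics: d = 2t / sqrt(df)
d_from_t = 2 * t_stat / math.sqrt(df)

print(round(d_z, 2), round(d_from_t, 2))   # 0.41 0.84
```

The second value matches the reported d=0.84, which suggests (but does not prove) that the paper converted the t statistic rather than standardizing the mean difference directly.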
In a transport problem, I'm trying to insert the following rule into the objective function:
If the supply from BC is below 19,000 tons, then we will have a penalty of $125/MT
I added a constraint to check the condition but would like to apply the penalty in the objective function.
I was able to do this in Excel Solver, but the values do not match. I've already checked both, and debugged the code, but I could not figure out what's wrong.
Here is the constraint:
def bc_rule(model):
    return sum(model.x[supplier, market] for supplier in model.suppliers
               for market in model.markets
               if 'BC' in supplier) >= 19000
model.bc_rules = Constraint(rule=bc_rule, doc='Minimum production')
The problem is in the objective rule:
def objective_rule(model):
    PENALTY_THRESHOLD = 19000
    PENALTY_COST = 125
    cost = sum(model.costs[supplier, market] * model.x[supplier, market] for supplier in model.suppliers for market in model.markets)
    # what is the problem here?
    bc = sum(model.x[supplier, market] for supplier in model.suppliers
             for market in model.markets
             if 'BC' in supplier)
    if bc < PENALTY_THRESHOLD:
        cost += (PENALTY_THRESHOLD - bc) * PENALTY_COST
    return cost
model.objective = Objective(rule=objective_rule, sense=minimize, doc='Define objective function')
I'm getting a much lower value than found in Excel Solver.
Your condition (if) depends on a variable in your model.
Normally, ifs should never be used in a mathematical model, and that is not only for Pyomo. Even in Excel, if statements in formulas are simply converted to scalar value before optimization, so I would be very careful when saying that it is the real optimal value.
The good news is that if statements are easily converted into mathematical constraints.
For that, you need to add a binary variable (0/1) to your model. It will take the value 1 if bc < PENALTY_THRESHOLD. Let's call this variable y, defined as model.y = Var(domain=Binary).
You will add model.y * PENALTY_COST as a term of your objective function to include the penalty cost. Note that this charges a flat PENALTY_COST whenever y = 1; a penalty per missing ton would need the shortfall itself in the objective.
Then, for the constraint, add the following piece of code:
def y_big_M(model):
    PENALTY_THRESHOLD = 19000
    bigM = PENALTY_THRESHOLD  # Must be at least the largest possible shortfall, which is
                              # PENALTY_THRESHOLD when bc = 0. Keep it around the same order of
                              # magnitude as the other numbers in your model. Avoid utterly big
                              # numbers like 1e12 and up if you don't need them, since numbers
                              # that are too large cause problems.
    return PENALTY_THRESHOLD - sum(
        model.x[supplier, market]
        for supplier in model.suppliers
        for market in model.markets
        if 'BC' in supplier
    ) <= model.y * bigM
model.y_big_M = Constraint(rule=y_big_M)
The previous constraint ensures that y will take a value greater than 0 (i.e. 1) when the sum that calculates bc is smaller than PENALTY_THRESHOLD. Any positive value of this difference forces the model to set y to 1, since with y=1 the right-hand side of the constraint is 1 * bigM, which must be big enough that the shortfall PENALTY_THRESHOLD - bc can never exceed it.
Please also check your Excel model to see if your if statements really work during the solver computations. Last time I checked, Excel Solver does not convert if statements into bigM constraints. The modeling technique I showed you works in any modeling tool, even in Excel.
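The logic of the big-M constraint can be checked by hand without a solver. Below is a minimal sketch in plain Python; it does not call Pyomo, and the helper names are hypothetical. The first helper finds the smallest binary y that satisfies the constraint for a given bc, which also shows what happens when bigM is smaller than the largest possible shortfall. The second helper is an alternative that is not part of the answer above: it computes the per-ton penalty from the question as the smallest shortfall s with s >= 0 and s >= PENALTY_THRESHOLD - bc, which is how a continuous shortfall variable would behave in a minimization.

```python
PENALTY_THRESHOLD = 19000
PENALTY_COST = 125

def smallest_feasible_y(bc, big_m):
    """Smallest y in {0, 1} with PENALTY_THRESHOLD - bc <= y * big_m,
    or None when neither value satisfies the constraint."""
    for y in (0, 1):
        if PENALTY_THRESHOLD - bc <= y * big_m:
            return y
    return None

def shortfall_penalty(bc):
    """Per-ton penalty: cost times the smallest s >= 0 with
    s >= PENALTY_THRESHOLD - bc."""
    shortfall = max(0, PENALTY_THRESHOLD - bc)
    return PENALTY_COST * shortfall

print(smallest_feasible_y(20000, big_m=19000))  # 0: no penalty is forced
print(smallest_feasible_y(18000, big_m=19000))  # 1: penalty is forced
print(smallest_feasible_y(5000, big_m=10000))   # None: big-M too small
print(shortfall_penalty(18000))                 # 125000
```

The third call is the case to watch for: with a big-M below the largest possible shortfall, solutions with a low bc are cut off entirely instead of being penalized.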
I am dealing with a problem which is a variant of a subset-sum problem, and I am hoping that the additional constraint could make it easier to solve than the classical subset-sum problem. I have searched for a problem with this constraint but I have been unable to find a good example with an appropriate algorithm either on StackOverflow or through googling elsewhere.
The problem:
Assume you have two lists of positive numbers A1,A2,A3... and B1,B2,B3... with the same number of elements N. There are two sums Sa and Sb. The problem is to find the simultaneous set Q where |sum (A{Q}) - Sa| <= epsilon and |sum (B{Q}) - Sb| <= epsilon. So, if Q is {1, 5, 7} then A1 + A5 + A7 - Sa <= epsilon and B1 + B5 + B7 - Sb <= epsilon. Epsilon is an arbitrarily small positive constant.
Now, I could solve this as two completely separate subset sum problems, but removing the simultaneity constraint results in the possibility of erroneous solutions (where Qa != Qb). I also suspect that the additional constraint should make this problem easier than the two NP complete problems. I would like to solve an instance with 18+ elements in both lists of numbers, and most subset-sum algorithms have a long run time with this number of elements. I have investigated the pseudo-polynomial run time dynamic programming algorithm, but this has the problems that a) the speed relies on a short bit-depth of the list of numbers (which does not necessarily apply to my instance) and b) it does not take into account the simultaneity constraint.
Any advice on how to use the simultaneity constraint to reduce the run time? Is there a dynamic programming approach I could use to take into account this constraint?
If I understand your description of the problem correctly (I'm confused about why you have the distance symbols around "sum (A{Q}) - Sa" and "sum (B{Q}) - Sb", it doesn't seem to fit the rest of the explanation), then it is NP-hard.
You can see this by making a reduction from Subset sum (SUB) to Simultaneous subset sum (SIMSUB).
If you have a SUB problem consisting of a set X = {x1,x2,...,xn} and a target called t, and you have an algorithm that solves SIMSUB when given two sets A = {a1,a2,...,an} and B = {b1,b2,...,bn}, two integers Sa and Sb and a value for epsilon, then we can solve SUB like this:
Let A = X and let B be a set of length n consisting of only 0's. Set Sa = t, Sb = 0 and epsilon = 0. You can now run the SIMSUB algorithm on this problem and get the solution to your SUB problem.
This shows that SIMSUB is at least as hard as SUB and therefore NP-hard.
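Hardness is a statement about the worst case as N grows. For around 18 elements there are only 2^18 = 262,144 subsets, so plain exhaustive enumeration may already be enough in practice. Here is a minimal sketch, assuming the absolute-value reading of the condition and 0-based indices; the function name and the sample data are made up for illustration.

```python
def simultaneous_subsets(a, b, sa, sb, eps):
    """Return every index tuple Q with |sum(a[Q]) - sa| <= eps
    and |sum(b[Q]) - sb| <= eps, by checking all non-empty subsets."""
    if len(a) != len(b):
        raise ValueError("a and b must have the same length")
    n = len(a)
    found = []
    for mask in range(1, 1 << n):
        sum_a = 0
        sum_b = 0
        indices = []
        for i in range(n):
            if (mask >> i) & 1:     # element i is in this subset
                sum_a += a[i]
                sum_b += b[i]
                indices.append(i)
        if abs(sum_a - sa) <= eps and abs(sum_b - sb) <= eps:
            found.append(tuple(indices))
    return found

# Hypothetical data: two subsets hit both targets at once.
a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
solutions = simultaneous_subsets(a, b, sa=5, sb=50, eps=0)
print(solutions)
```

Because the same index set is used for both sums, the simultaneity constraint is enforced by construction, and the cost is one pass over the subsets rather than two separate searches.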