Interpreting coefficients from Logistic Regression from R - statistics

All,
I ran a logistic Regression on a set of variables both categorical and continuous with a binary event as dependent variable.
After fitting the model, I see several categorical variables with negative coefficients. I take this to mean that when such a variable equals 1, the probability of the dependent event is lower.
But when I compute the percentage of events at each level of that variable, I see the opposite trend, so the result seems counterintuitive. Why could this happen? I have tried to explain with a pseudo-example below.
Dependent Variable - E
Predictors:
1. Categorical Var - Cat1 with 2 levels (0,1)
2. Continuous Var - Con1
3. Categorical Var - Cat2 with 2 levels (0,1)
Post Modelling:
Say all are significant and the coefficients are like below,
Cat1 - (-0.6)
Con1- (0.3)
Cat2 - (-0.4)
But when I calculate the percentage of occurrences of event E by level of Cat1, the percentage is higher when Cat1 is 1, which I find counterintuitive.
Please help me understand this.

Coefficients of a logistic regression are not directly related to the change in the probability of the event; a coefficient is the change in the log-odds of the event, so its exponential is an odds ratio. This article has a detailed derivation of how to interpret the coefficients of logistic regression. In your context, a coefficient of -0.6 for Cat1 means that p(E|Cat1 = 1) < p(E|Cat1 = 0) when the other predictors are held fixed; it says nothing about how big p(E|Cat1 = 1) itself is. The raw percentages you computed do not hold the other predictors fixed, which is why they can point the other way.
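As a rough illustration of the odds interpretation, here is a minimal Python sketch. The coefficient -0.6 comes from the question; the intercept value is purely hypothetical (the question does not give one), and Con1 and Cat2 are assumed to be held at values that contribute nothing to the linear predictor.

```python
import math

def sigmoid(z):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

coef_cat1 = -0.6   # coefficient from the question
intercept = 0.2    # hypothetical value, not given in the question

# exp(coefficient) is the odds ratio for Cat1 = 1 versus Cat1 = 0.
odds_ratio = math.exp(coef_cat1)

# Probabilities with the other predictors held fixed (assumed to add 0 here).
p0 = sigmoid(intercept)              # Cat1 = 0
p1 = sigmoid(intercept + coef_cat1)  # Cat1 = 1

# The ratio of the two odds reproduces the odds ratio exactly.
odds_ratio_check = (p1 / (1 - p1)) / (p0 / (1 - p0))

print(f"odds ratio: {odds_ratio:.3f}")
print(f"p(E | Cat1 = 0) = {p0:.3f}, p(E | Cat1 = 1) = {p1:.3f}")
```

With a different intercept the two probabilities change, but the odds ratio stays the same, which is the sense in which the coefficient is a relative measure.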


Question(s) regarding computational intensity, prediction of time required to produce a result

Introduction
I have written code to give me a set of numbers in '36 by q' format (1 <= q <= 36), subject to the following conditions:
Each row must use numbers from 1 to 36.
No number must repeat itself in a column.
Method
The first row is generated randomly. Each number in each subsequent row is checked against the above conditions. If a number fails one of the conditions, it is not picked again for that specific place in that specific row. If the algorithm runs out of acceptable values, it starts over again.
Problem
Low q values are fast (q = 15 takes less than a second to compute), but the main objective is q = 36, and it has been running for more than 24 hours on my PC.
Questions
Can I predict the time required by it using the data I have from lower q values? How?
Is there any better algorithm to perform this in less time?
How can I calculate the average number of cycles it requires? (using combinatorics or otherwise).
Can I predict the time required by it using the data I have from lower q values? How?
Usually, you should be able to determine the running time of your algorithm in terms of its input. Refer to big O notation.
If I understood your question correctly, you shouldn't need hours to compute a 36x36 matrix satisfying your conditions. Most probably you are stuck in an infinite loop or something similar. It would be clearer if you could share a code snippet.
Is there any better algorithm to perform this in less time?
Well, I tried to do what you described and it works in O(q) (assuming that number of rows is constant).
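import random

def rotate(arr):
    # Move the last element to the front (cyclic shift by one).
    return arr[-1:] + arr[:-1]

y = set(range(1, 37))  # numbers not used yet
n = 36
q = 36
res = []
i = 0
while i < n:
    # Draw a random row from the numbers that are still available.
    x = []
    for j in range(q):
        if y:
            el = random.choice(list(y))
            y.remove(el)
            x.append(el)
    res.append(x)
    # Each rotation of that row becomes the next row.
    for j in range(q-1):
        x = rotate(x)
        res.append(x)
        i += 1
    i += 1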
Basically, I draw a random row from the set {1..36}, then rotate that row q-1 times and use the rotated rows as the next q-1 rows.
This guarantees both conditions you have mentioned.
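To check that claim for the square case, here is a small sketch that builds the grid the same way (a shuffled first row followed by its rotations) and then tests both conditions directly. The helper names `cyclic_grid` and `is_valid` are made up for this illustration. Note that this construction only produces cyclic arrangements, so it is not a uniformly random choice among all valid grids.

```python
import random

def rotate(row):
    """Move the last element to the front (cyclic shift by one)."""
    return row[-1:] + row[:-1]

def cyclic_grid(n):
    """Build an n x n grid: a shuffled first row followed by its rotations."""
    first = list(range(1, n + 1))
    random.shuffle(first)
    grid = [first]
    for _ in range(n - 1):
        grid.append(rotate(grid[-1]))
    return grid

def is_valid(grid, n):
    """Check that every row uses 1..n and that no column repeats a number."""
    full = set(range(1, n + 1))
    rows_ok = all(set(row) == full for row in grid)
    cols_ok = all(len(set(col)) == len(col) for col in zip(*grid))
    return rows_ok and cols_ok

grid = cyclic_grid(36)
print(is_valid(grid, 36))
```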
How can I calculate the average number of cycles it requires?( Using combinatorics or otherwise).
If you cannot calculate the computation time in terms of the input (because the code is too complex), then fitting a curve seems to be the right approach.
Or you could create an ML model with iterations as data and time for each iteration as label and perform linear regression. But that seems to be overkill in your example.
Graph q vs time.
Fit a curve.
Extrapolate to q = 36.
You might want to also graph q vs log(time) as that may give an easier fitted curve.
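A minimal sketch of that log(time) idea follows. It assumes the running time grows roughly exponentially in q, which is only a guess about this algorithm; a restart-based search may grow faster, in which case the extrapolation will be too optimistic. The timing values are hypothetical placeholders, not measurements, and should be replaced with your own.

```python
import math

# Hypothetical (q, seconds) measurements; replace with your own timings.
timings = [(10, 0.02), (12, 0.05), (14, 0.2), (15, 0.6), (16, 1.5)]

def fit_log_linear(points):
    """Least-squares fit of log(time) = a + b * q; returns (a, b)."""
    n = len(points)
    xs = [q for q, _ in points]
    ys = [math.log(t) for _, t in points]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def predict_time(a, b, q):
    """Extrapolated running time in seconds for a given q."""
    return math.exp(a + b * q)

a, b = fit_log_linear(timings)
print(f"estimated seconds for q = 36: {predict_time(a, b, 36):.3g}")
```

If the points on the q vs log(time) plot bend upwards instead of lying near a line, the exponential assumption is too mild and a different curve should be fitted.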

Inverse CDF of Poisson dist in Excel

Is there a function to calculate the inverse CDF of the Poisson distribution? I want to use the inverse CDF to generate a set of Poisson-distributed random numbers.
A) Inverse CDF of Poisson distribution
The inverse CDF at q is also referred to as the q quantile of a distribution. For a discrete distribution, the inverse CDF at q is the smallest integer x such that CDF[dist,x] ≥ q. The Poisson distribution is a discrete distribution that models the number of events based on a constant rate of occurrence. The Poisson distribution can be used as an approximation to the binomial when the number of independent trials is large and the probability of success is small. A common application of the Poisson distribution is predicting the number of events over a specific time, such as the number of cars arriving at a toll plaza in 1 minute.
Formula
The probability mass function (PMF) is:
$P(X = x) = \frac{e^{-\lambda} \lambda^{x}}{x!}, \quad x = 0, 1, 2, \ldots$
mean = λ
variance = λ
Notation
e: base of the natural logarithm
Reference: Methods and Formulas for Inverse Cumulative Distribution Functions
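The definition above (smallest integer x whose CDF reaches q) can be sketched in a few lines of Python, together with the inverse-transform idea from the question: feed uniform random numbers into the inverse CDF to get Poisson-distributed values. The function names `poisson_inv` and `poisson_sample` are made up for this illustration, and the simple summation is only suitable for moderate means, since exp(-lam) underflows to zero for very large lam.

```python
import math
import random

def poisson_inv(p, lam):
    """Smallest integer x with Poisson CDF(x; lam) >= p, for 0 <= p < 1."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    if lam <= 0:
        raise ValueError("lam must be positive")
    pmf = math.exp(-lam)  # P(X = 0)
    if pmf == 0.0:
        raise ValueError("lam is too large for this simple summation")
    x = 0
    cdf = pmf
    while cdf < p:
        x += 1
        pmf *= lam / x    # P(X = x) from P(X = x - 1)
        cdf += pmf
        if pmf == 0.0 and x > lam:
            break         # the tail has underflowed; stop instead of looping forever
    return x

def poisson_sample(lam, size):
    """Inverse-transform sampling: apply the inverse CDF to uniform draws."""
    return [poisson_inv(random.random(), lam) for _ in range(size)]

sample = poisson_sample(4.0, 1000)
print(sum(sample) / len(sample))  # sample mean, expected to be near lam
```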
B) Excel Function: Excel provides the following function for the Poisson distribution:
POISSON(x, μ, cum)
where μ = the mean of the distribution and cum takes the values TRUE and FALSE
POISSON(x, μ, FALSE) = probability mass function value f(x) at the value x for the Poisson distribution with mean μ.
POISSON(x, μ, TRUE)= cumulative probability distribution function F(x) at the value x for the Poisson distribution with mean μ.
Excel 2010/2013/2016 provide the additional function POISSON.DIST which is equivalent to POISSON.
Reference: Office Support POISSON.DIST Function
C) Excel doesn’t provide a worksheet function for the inverse of the Poisson distribution.
Instead you can use the following function provided by the Real Statistics Resource Pack, a free download available for various versions of Excel.
POISSON_INV(p, μ) = smallest integer x such that POISSON(x, μ, TRUE) ≥ p
Note that the maximum value of x is 1,024,000,000. A value higher than this indicates an error.
Reference: Real Statistics Using Excel
D) A thread on the MrExcel.com forum asks a closely related question, quoted below.
Not sure if anyone can help with this. Basically I'm trying to find out how to apply the reverse of the Poisson function in excel. So as of now I have poisson(x value, mean, true-cumulative) and that lets me get the probability for that occurrence. Basically I want to know how I can get the minimum/maximum x value based on a given probability.
So if I have a list of data (700 rows) and I want to find out what the minimum starting value should be given a desired average and the fact that I want the lowest value to be at the 0.05% probability. So 0.05% = (x, 35, True) solve for x. I know I can probably do this with Solver, but I am trying to figure out a way to do this with a formula, without having to use Solver (as I may have to use this many times).
The code referred to here covers the inverse of the poisson formula when using True in the excel formula. It does not cover the inverse of the poisson formula when using False in the excel formula.
Re: Reverse Poisson?
Originally Posted by shg
A further mod to accommodate large means:
Code:
Function PoissonInv(Prob As Double, Mean As Double) As Variant
' shg 2011, 2012, 2014, 2015-0415
' For a Poisson process with mean Mean, returns a three-element array:
' o The smallest integer N such that POISSON(N, Mean, True) >= Prob
' o The CDF for N-1 (which is < Prob)
' o The CDF for N (which is >= Prob)
Reference: https://www.mrexcel.com/forum/excel-questions/507508-reverse-poisson-2.html
E) Why doesn't Excel have a POISSON.INV function?
The discussion on the referenced web page points to some formulas for calculating what the OP wants.
You could use the following.
With the Poisson mean named lambda, enter the following in a newly inserted worksheet.
A1: =IF(ROWS(A$1:A1)<=4*lambda,POISSON(ROWS(A$1:A1)-1,lambda,1))
Fill A1 down into A2:A1000 (4 times as many rows as your most typical lambda value). Name the A1:A1000 range POISSON.CDF. Then use the formula
=MATCH(n,POISSON.CDF)-1
to give the results a POISSON.INV(n,lambda) function would.
If you want this for varying lambda, use the array formula
=MATCH(n,POISSON(ROW($A$1:INDEX($A:$A,4*lambda+1)),lambda,1))-1
Reference Shared Link
Hope That Helps.
=MATCH(RAND(),MMULT((ROW(INDIRECT(ADDRESS(1,1)&":"&ADDRESS(MAX(lambda,5+lambda* 45/50)+6* SQRT(lambda)+3,1)))=COLUMN(INDIRECT(ADDRESS(1,1)&":"&ADDRESS(1,MAX(lambda,5+lambda* 45/50)+6* SQRT(lambda)+2))))+0,MMULT((ROW(INDIRECT(ADDRESS(1,1)&":"&ADDRESS(MAX(lambda,5+lambda* 45/50)+6* SQRT(lambda)+2,1)))=(COLUMN(INDIRECT(ADDRESS(1,1)&":"&ADDRESS(1,MAX(lambda,5+lambda* 45/50)+6* SQRT(lambda)+1)))+1))+0,POISSON(ROW($A$1:INDEX($A:$A,MAX(lambda,5+lambda* 45/50)+6* SQRT(lambda)+1))-1,lambda,1)))+(ROW(INDIRECT(ADDRESS(1,1)&":"&ADDRESS(MAX(lambda,5+lambda* 45/50)+6* SQRT(lambda)+3,1)))=(COLUMN(INDIRECT(ADDRESS(1,1)&":"&ADDRESS(1,1)))+FLOOR(MAX(lambda,5+lambda* 45/50)+6* SQRT(lambda)+2,1)))+0)-1
It is quite slow for lambda >1000.
This expands on the array formula
=MATCH(C4,POISSON(ROW($A$1:INDEX($A:$A,4*lambda+1)),lambda,1))-1
shared above by skkakkar, by prepending the array with 0 and appending 1, following "Is there a way to concatenate two arrays in Excel without VBA?".
The rest is mostly making the array shorter by replacing 4* lambda with 6* SQRT(lambda).

how to obtain estimation from regression in excel?

I use data in Excel to produce a chart.
Then I fit a regression and get an equation. I'd like to know what value the regression gives at a particular x (for example, x = 7,6 is the value for which I want an estimate of y).
The fit is a degree-6 polynomial.
One simple method would be to plug x into the equation myself, since I have it.
However, I wondered whether there is a faster way. For example, can I enter 7,6 somewhere and get the result quickly?
If you are looking at a linear regression line (straight line), you could try the FORECAST formula:
=FORECAST(X, Known Ys, Known Xs)
You could also build your own equation automatically from
=LINEST(...)
I found the following on a site describing the capabilities of the linest function in excel:
In addition to using LOGEST to calculate statistics for other
regression types, you can use LINEST to calculate a range of other
regression types by entering functions of the x and y variables as the
x and y series for LINEST. For example, the following formula:
=LINEST(yvalues, xvalues^COLUMN($A:$C))
works when you have a single column of y-values and a single column of
x-values to calculate the cubic (polynomial of order 3) approximation
of the form:
y = m1*x + m2*x^2 + m3*x^3 + b
You can adjust this formula to calculate other types of regression,
but in some cases it requires the adjustment of the output values and
other statistics.
or look at:
=trend
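For comparison outside Excel, here is a rough Python sketch of the same idea: fit a polynomial by least squares, then evaluate it at the x of interest. It uses only the standard library and solves the normal equations directly, which is acceptable for low degrees and well-scaled x but becomes numerically fragile for high degrees such as 6. The function names and the sample data are hypothetical; the data are generated from a known cubic so the result can be checked.

```python
def polyfit_ls(xs, ys, degree):
    """Least-squares polynomial coefficients, lowest power first."""
    m = degree + 1
    # Normal equations (V^T V) c = V^T y, where V[i][j] = xs[i] ** j.
    a = [[sum(x ** (r + c) for x in xs) for c in range(m)] for r in range(m)]
    b = [sum(y * x ** r for x, y in zip(xs, ys)) for r in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, m):
            factor = a[r][col] / a[col][col]
            for c in range(col, m):
                a[r][c] -= factor * a[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * m
    for r in range(m - 1, -1, -1):
        s = sum(a[r][c] * coeffs[c] for c in range(r + 1, m))
        coeffs[r] = (b[r] - s) / a[r][r]
    return coeffs

def poly_eval(coeffs, x):
    """Evaluate the polynomial at x using Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result

# Hypothetical data generated from y = 2 + x - 0.5*x^2 + 0.1*x^3.
xs = list(range(8))
ys = [2 + x - 0.5 * x ** 2 + 0.1 * x ** 3 for x in xs]
coeffs = polyfit_ls(xs, ys, 3)
print(poly_eval(coeffs, 7.6))  # estimate of y at x = 7.6
```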

formula Amplitude using FFT

I want to ask about the amplitude formula below. I am using a Fast Fourier Transform, which returns real and imaginary parts.
After that I must find the amplitude for each frequency.
My formula is
amplitude = 10 * log (real*real + imagined*imagined)
What is the source of this formula? I have searched but have not found one. Can anybody tell me where it comes from?
This is a combination of two equations:
1: Finding the magnitude of a complex number (the result of an FFT at a particular bin) - the equation for which is
m = sqrt(r^2 + i ^2)
2: Calculating relative power in decibels from an amplitude value - the equation for which is p = 10 * log10(A^2/Aref^2) == 20 * log10(A/Aref), where Aref is some reference value.
By inserting m from equation 1 as A in equation 2, with Aref = 1, we get:
p = 10 * log10(r^2 + i^2)
Note that this gives you a measure of relative signal power rather than amplitude.
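The two steps can be sketched in Python as follows. A naive DFT stands in for the FFT here (the FFT computes the same coefficients, only faster), the function names are made up for this illustration, and the reference value is assumed to be 1 as in the derivation above. The test signal is a hypothetical unit-amplitude cosine with exactly one cycle over 8 samples.

```python
import cmath
import math

def dft(signal):
    """Naive discrete Fourier transform; an FFT returns the same values."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def magnitude(c):
    """Equation 1: m = sqrt(r^2 + i^2)."""
    return math.sqrt(c.real ** 2 + c.imag ** 2)

def power_db(c, ref=1.0):
    """Equation 2 with m inserted: 10 * log10((r^2 + i^2) / ref^2)."""
    return 10 * math.log10((c.real ** 2 + c.imag ** 2) / ref ** 2)

# Hypothetical test signal: one cycle of a unit-amplitude cosine, 8 samples.
signal = [math.cos(2 * math.pi * t / 8) for t in range(8)]
bin1 = dft(signal)[1]

print(magnitude(bin1))  # modulus of the coefficient at bin 1
print(power_db(bin1))   # relative power in dB, same as 20 * log10(magnitude)
```

Bins whose magnitude is exactly zero would need special handling before taking the logarithm.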
The first part of the formula likely comes from the definition of Decibel, with the reference P0 set to 1, assuming with log you meant a logarithm with base 10.
The second part, i.e. the P1=real^2 + imagined^2 in the link above, is the square of the modulus of the Fourier coefficient cn at the n-th frequency you are considering.
A Fourier coefficient is in general a complex number (See its definition in the case of a DFT here), and P1 is by definition the square of its modulus. The FFT that you mention is just one way of calculating the DFT. In your case, likely the real and complex numbers you refer to are actually the real and imaginary parts of this coefficient cn.
sqrt(P1) is the modulus of the Fourier coefficient cn of the signal at the n-th frequency.
sqrt(P1)/N, is the amplitude of the Fourier component of the signal at the n-th frequency (i.e. the amplitude of the harmonic component of the signal at that frequency), with N being the number of samples in your signal. To convince yourself you need to divide by N, see this equation. However, the division factor depends on the definition/convention of Fourier transform that you use, see the note just above here, and the discussion here.

NORMDIST function is not giving the correct output

I'm trying to use NORMDIST function in Excel to create a bell curve, but the output is strange.
My mean is 0,0000583 and my standard deviation is 0,0100323. When I plug these into NORMDIST(0,0000583; 0,0000583; 0,0100323; FALSE), I expect to get something close to 0,5: since x equals the mean, the probability should be 50%. Instead the function returns 39,77, which is clearly not correct.
Why is it like this?
A probability cannot have values greater than 1, but a density can.
The integral of a density function over its entire range equals 1, but the density can take values greater than 1 on a specific interval. For example, a uniform distribution on the interval [0, ½] has probability density f(x) = 2 for 0 ≤ x ≤ ½ and f(x) = 0 elsewhere.
=NORMDIST(x, mean, dev, FALSE) returns the density function. A density is a probability per unit of x: multiplied by the width of a very small interval around a point, it approximates the probability of falling in that interval (the density is the derivative of the cumulative distribution at that point).
shg's answer here explains how to get a probability on a given interval with NORMDIST, and also when it can return a density greater than 1.
For a continuous variable, the probability of any particular value is zero, because there are an infinite number of values.
If you want to know the probability that a continuous random variable with a normal distribution falls in the range of a to b, use:
=NORMDIST(b, mean, dev, TRUE) - NORMDIST(a, mean, dev, TRUE)
The peak value of the density function occurs at the mean (i.e., =NORMDIST(mean, mean, dev, FALSE) ), and the value is:
=1/(SQRT(2*PI())*dev)
The peak value will exceed 1 when the deviation is less than 1 / sqrt(2pi) ~ 0.399,
which was your case.
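The same numbers can be reproduced outside Excel. This sketch uses Python's standard `statistics.NormalDist` (Python 3.8 or later) with the mean and deviation from the question; the one-deviation interval at the end is just an illustrative choice.

```python
import math
from statistics import NormalDist

mean = 0.0000583
dev = 0.0100323
dist = NormalDist(mu=mean, sigma=dev)

# Density at the mean, the counterpart of NORMDIST(mean; mean; dev; FALSE).
density_at_mean = dist.pdf(mean)

# Closed-form peak of the normal density: 1 / (sqrt(2*pi) * dev).
peak_formula = 1 / (math.sqrt(2 * math.pi) * dev)

# Cumulative probability at the mean, the counterpart of using TRUE.
cdf_at_mean = dist.cdf(mean)

# Probability of falling within one deviation of the mean.
a, b = mean - dev, mean + dev
prob_within_one_dev = dist.cdf(b) - dist.cdf(a)

print(density_at_mean)      # about 39.77, a density and not a probability
print(cdf_at_mean)          # 0.5
print(prob_within_one_dev)  # about 0.68
```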
This is an amazing answer on Cross Validated Stack Exchange (statistics) from a moderator (whuber), that addresses this issue very thoughtfully.
It is returning the probability density function whereas I think you want the cumulative distribution function (so try TRUE in place of FALSE).
