NORMDIST function is not giving the correct output - Excel

I'm trying to use the NORMDIST function in Excel to create a bell curve, but the output is strange.
My mean is 0,0000583 and my standard deviation is 0,0100323. When I plug these into the function as NORMDIST(0,0000583; 0,0000583; 0,0100323; FALSE), I expect to get something close to 0,5 (since I'm using the mean itself, the probability of that value should be 50%), but the function returns 39,77, which is clearly not correct.
Why is it like this?

A probability cannot have values greater than 1, but a density can.
The integral of a density function over its entire range equals 1, but the density itself can exceed 1 on specific intervals. For example, a uniform distribution on the interval [0, ½] has probability density f(x) = 2 for 0 ≤ x ≤ ½ and f(x) = 0 elsewhere.
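As a quick check (a worked equation of my own, not from the original answer), this density is still valid because it integrates to 1 even though its value exceeds 1:
∫₀^½ f(x) dx = ∫₀^½ 2 dx = 2 · ½ = 1.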
=NORMDIST(x, mean, dev, FALSE) returns the density function. Densities are probabilities per unit. It is almost the probability of a point, but over a very tiny interval around it (the derivative of the CDF at that point).
shg's answer here explains how to get a probability over a given interval with NORMDIST, and also in what circumstances it can return a density greater than 1.
For a continuous variable, the probability of any particular value is zero, because there are an infinite number of values.
If you want to know the probability that a continuous random variable with a normal distribution falls in the range of a to b, use:
=NORMDIST(b, mean, dev, TRUE) - NORMDIST(a, mean, dev, TRUE)
The peak value of the density function occurs at the mean (i.e., =NORMDIST(mean, mean, dev, FALSE) ), and the value is:
=1/(SQRT(2*PI())*dev)
The peak value will exceed 1 when the deviation is less than 1 / sqrt(2pi) ~ 0.399,
which was your case.
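As a sanity check outside Excel, here is a minimal Python sketch using scipy (the tooling choice is mine; the numbers are those from the question):

from math import pi, sqrt
from scipy.stats import norm

mean = 0.0000583
dev = 0.0100323

# Density at the mean, which is what NORMDIST(mean, mean, dev, FALSE) returns:
print(norm.pdf(mean, loc=mean, scale=dev))  # ~39.77
print(1 / (sqrt(2 * pi) * dev))             # same value, from the peak formula

# An actual probability: NORMDIST(b, ..., TRUE) - NORMDIST(a, ..., TRUE)
a, b = mean - dev, mean + dev
print(norm.cdf(b, mean, dev) - norm.cdf(a, mean, dev))  # ~0.6827, within one dev of the mean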
This is an amazing answer on Cross Validated Stack Exchange (statistics) from a moderator (whuber) that addresses this issue very thoughtfully.

It is returning the probability density function, whereas I think you want the cumulative distribution function (so try TRUE in place of FALSE); see ref.

Related

Difference in T-distribution value calculated from Excel and manually checked from table

I want to find the value of a T-distribution for a level of significance set at 5% and degrees of freedom equaling 10 in an Excel sheet. When calculating manually from a table, I found the value is 2.228, but Excel gives a value of 0.961. Am I doing something wrong here?
I used the following equation in Excel for the two-tailed test.
T.DIST.2T(0.05,10) = 0.961
This is the t-distribution table.
From this table, the value for the 5% level of significance and 10 degrees of freedom is 2.228.
The function to find the critical value needs to be the inverse function. If you refer to the T.DIST.2T function documentation, you can see that the X argument is the value at which to evaluate the distribution, not the level of significance.
What you need is the inverse function; see the T.INV.2T function documentation.
T.INV.2T(0,05;10) = 2,228138852
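For reference, both directions can be reproduced outside Excel with a short Python sketch using scipy (my own illustration, not part of the original answer):

from scipy.stats import t

alpha, df = 0.05, 10
# Two-tailed critical value, equivalent to Excel's T.INV.2T(0.05, 10):
print(t.ppf(1 - alpha / 2, df))  # ~2.228139
# And the forward direction, equivalent to T.DIST.2T(0.05, 10):
print(2 * t.sf(0.05, df))        # ~0.961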

Question(s) regarding computational intensity, prediction of time required to produce a result

Introduction
I have written code to give me a set of numbers in '36 by q' format (1 ≤ q ≤ 36), subject to the following conditions:
Each row must use numbers from 1 to 36.
No number must repeat itself in a column.
Method
The first row is generated randomly. Each number in the following rows is checked against the above conditions. If a number fails to satisfy one of the given conditions, it doesn't get picked again for that specific place in that specific row. If the algorithm runs out of acceptable values, it starts over again.
Problem
Low q values are quick (say q = 15, which takes less than a second to compute), but the main objective is q = 36. It has been more than 24 hours since it started running for q = 36 on my PC.
Questions
Can I predict the time required by it using the data I have from lower q values? How?
Is there any better algorithm to perform this in less time?
How can I calculate the average number of cycles it requires? (using combinatorics or otherwise).
Can I predict the time required by it using the data I have from lower q values? How?
Usually, you should be able to determine the running time of your algorithm in terms of its input. Refer to big O notation.
If I understood your question correctly, you shouldn't need to spend hours computing a 36x36 matrix satisfying your conditions. Most probably you are stuck in an infinite loop or something similar. It would be clearer if you could share a code snippet.
Is there any better algorithm to perform this in less time?
Well, I tried to do what you described, and it works in O(q) (assuming the number of rows is constant).
import random

def rotate(arr):
    # Move the last element to the front.
    return arr[-1:] + arr[:-1]

y = set(range(1, 37))  # pool of candidate numbers 1..36
n = 36                 # number of rows
q = 36                 # row length
res = []
i = 0
while i < n:
    # Build one row by drawing from the pool without replacement.
    x = []
    for j in range(q):
        if y:
            el = random.choice(list(y))
            y.remove(el)
            x.append(el)
    res.append(x)
    # Each rotation moves every value to a new column, so no column repeats.
    for j in range(q - 1):
        x = rotate(x)
        res.append(x)
        i += 1
    i += 1
Basically, I choose random numbers from the set {1..36} for one row, then rotate that row repeatedly and assign the rotated copies to the next q-1 rows.
This guarantees both conditions you have mentioned.
How can I calculate the average number of cycles it requires? (Using combinatorics or otherwise.)
If you cannot calculate the computation time in terms of the input (the code is too complex), then fitting a curve seems right.
Or you could create an ML model with iterations as data and time for each iteration as label and perform linear regression. But that seems to be overkill in your example.
Graph q vs time.
Fit a curve.
Extrapolate to q = 36.
You might also want to graph q vs log(time), as that may give an easier curve to fit.
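A minimal Python sketch of that procedure (the timing numbers below are placeholders, not measured data):

import numpy as np

# Hypothetical measurements: q values and their running times in seconds.
qs = np.array([5, 10, 15, 20])
times = np.array([0.01, 0.05, 0.4, 3.0])  # placeholder data

# Fit a line to q vs log(time), i.e. assume roughly exponential growth.
slope, intercept = np.polyfit(qs, np.log(times), 1)

# Extrapolate to q = 36.
print(np.exp(intercept + slope * 36))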

Inverse CDF of Poisson dist in Excel

I want to know: is there a function to calculate the inverse CDF of the Poisson distribution? I would like to use the inverse CDF of the Poisson to generate a set of Poisson-distributed random numbers.
A) Inverse CDF of Poisson distribution
The inverse CDF at q is also referred to as the q quantile of a distribution. For a discrete distribution, the inverse CDF at q is the smallest integer x such that CDF(x) ≥ q. The Poisson distribution is a discrete distribution that models the number of events based on a constant rate of occurrence. The Poisson distribution can be used as an approximation to the binomial when the number of independent trials is large and the probability of success is small. A common application of the Poisson distribution is predicting the number of events over a specific time, such as the number of cars arriving at a toll plaza in 1 minute.
Formula
The probability mass function (PMF) is:
f(x) = (e^(−λ) · λ^x) / x!,  for x = 0, 1, 2, …
mean = λ
variance = λ
where e is the base of the natural logarithm.
Reference: Methods and Formulas for Inverse Cumulative Distribution Functions
B) Excel Function: Excel provides the following function for the Poisson distribution:
POISSON(x, μ, cum)
where μ = the mean of the distribution and cum takes the values TRUE and FALSE
POISSON(x, μ, FALSE) = probability mass function value f(x) at the value x for the Poisson distribution with mean μ.
POISSON(x, μ, TRUE) = cumulative distribution function value F(x) at the value x for the Poisson distribution with mean μ.
Excel 2010/2013/2016 provide the additional function POISSON.DIST which is equivalent to POISSON.
Reference: Office Support POISSON.DIST Function
C) Excel doesn't provide a worksheet function for the inverse of the Poisson distribution.
Instead you can use the following function provided by the Real Statistics Resource Pack, a free download for various versions of Excel.
POISSON_INV(p, μ) = smallest integer x such that POISSON(x, μ, TRUE) ≥ p
Note that the maximum value of x is 1,024,000,000. A value higher than this indicates an error.
Reference: Real Statistics Using Excel
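Outside Excel, the same definition is straightforward to express. Here is a Python sketch with scipy (my own illustration, mirroring the POISSON_INV definition above, not Real Statistics code):

import random
from scipy.stats import poisson

def poisson_inv(p, mu):
    # Smallest integer x such that the Poisson CDF at x is >= p.
    x = 0
    while poisson.cdf(x, mu) < p:
        x += 1
    return x

mu = 35
print(poisson_inv(0.95, mu))  # quantile at p = 0.95
print(poisson.ppf(0.95, mu))  # scipy's built-in equivalent

# Inverse-CDF sampling, as the question asks: feed uniform draws
# through the inverse CDF to get Poisson-distributed random numbers.
print([poisson_inv(random.random(), mu) for _ in range(5)])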
D)
On the MREXCEL.COM web site, the query quoted below seems to be related to your question.
Not sure if anyone can help with this. Basically I'm trying to find out how to apply the reverse of the Poisson function in Excel. So as of now I have POISSON(x value, mean, true-cumulative) and that lets me get the probability for that occurrence. Basically I want to know how I can get the minimum/maximum x value based on a given probability.
So if I have a list of data (700 rows) and I want to find out what the minimum starting value should be, given a desired average and the fact that I want the lowest value to be at the 0.05% probability. So 0.05% = (x, 35, TRUE), solve for x. I know I can probably do this with Solver, but I am trying to figure out a way to do this formulaically without having to use Solver (as I may have to use this many times).
The code referred to here covers the inverse of the Poisson formula when using TRUE in the Excel formula. It does not cover the inverse of the Poisson formula when using FALSE.
Re: Reverse Poisson?
Originally Posted by shg
A further mod to accommodate large means:
Code:
Function PoissonInv(Prob As Double, Mean As Double) As Variant
' shg 2011, 2012, 2014, 2015-0415
' For a Poisson process with mean Mean, returns a three-element array:
' o The smallest integer N such that POISSON(N, Mean, True) >= Prob
' o The CDF for N-1 (which is < Prob)
' o The CDF for N (which is >= Prob)
Reference: https://www.mrexcel.com/forum/excel-questions/507508-reverse-poisson-2.html
E) Why doesn't Excel have a POISSON.INV function?
The discussion on the referenced web page includes references to some formulas for calculating the information desired by the OP.
You could use the following.
With the Poisson mean named lambda, enter the following in a newly inserted worksheet.
A1: =IF(ROWS(A$1:A1)<=4*lambda,POISSON(ROWS(A$1:A1)-1,lambda,1))
Fill A1 down into A2:A1000 (4 times as many rows as your most typical lambda value). Name the A1:A1000 range POISSON.CDF. Then use the formula
=MATCH(n,POISSON.CDF)-1
to give the results a POISSON.INV(n,lambda) function would.
If you want this for varying lambda, use the array formula
=MATCH(n,POISSON(ROW($A$1:INDEX($A:$A,4*lambda+1)),lambda,1))-1
Reference Shared Link
Hope That Helps.
=MATCH(RAND(),MMULT((ROW(INDIRECT(ADDRESS(1,1)&":"&ADDRESS(MAX(lambda,5+lambda* 45/50)+6* SQRT(lambda)+3,1)))=COLUMN(INDIRECT(ADDRESS(1,1)&":"&ADDRESS(1,MAX(lambda,5+lambda* 45/50)+6* SQRT(lambda)+2))))+0,MMULT((ROW(INDIRECT(ADDRESS(1,1)&":"&ADDRESS(MAX(lambda,5+lambda* 45/50)+6* SQRT(lambda)+2,1)))=(COLUMN(INDIRECT(ADDRESS(1,1)&":"&ADDRESS(1,MAX(lambda,5+lambda* 45/50)+6* SQRT(lambda)+1)))+1))+0,POISSON(ROW($A$1:INDEX($A:$A,MAX(lambda,5+lambda* 45/50)+6* SQRT(lambda)+1))-1,lambda,1)))+(ROW(INDIRECT(ADDRESS(1,1)&":"&ADDRESS(MAX(lambda,5+lambda* 45/50)+6* SQRT(lambda)+3,1)))=(COLUMN(INDIRECT(ADDRESS(1,1)&":"&ADDRESS(1,1)))+FLOOR(MAX(lambda,5+lambda* 45/50)+6* SQRT(lambda)+2,1)))+0)-1
It is quite slow for lambda >1000.
This expands on the array formula
=MATCH(C4,POISSON(ROW($A$1:INDEX($A:$A,4*lambda+1)),lambda,1))-1
shared above by skkakkar, by prepending the array with 0 and appending 1, following Is there a way to concatenate two arrays in Excel without VBA?.
The rest is mostly making the array shorter by replacing 4*lambda with 6*SQRT(lambda).

Meaning of V value, Wilcoxon signed rank test

I have a question about my results of the Wilcoxon signed rank test:
My data consists of a trial with 2 groups (paired) in which a treatment was used. The results were scored in %. Groups consist of 131 people.
When I run the test in R, I got the following result:
wilcox.test(no.treatment, with.treatment, paired=T)
# Wilcoxon signed rank test with continuity correction
# data: no.treatment and with.treatment V = 3832, p-value = 0.7958
# alternative hypothesis: true location shift is not equal to 0
I am wondering what the V value means. I read somewhere that it has something to do with the number of positive scores (?), but I am wondering if it could tell me anything about the data and interpretation?
I'll give a little bit of background before answering your question.
The Wilcoxon signed rank test compares two values measured on the same N people (here 131); for example, blood values measured for 131 people at two time points. The purpose of the test is to see whether the blood values have changed.
The V statistic you are getting does not have a direct interpretation. It is based on the pairwise differences between your two groups and is a value of a variable that is supposed to follow a certain probability distribution. Intuitively speaking, the larger the value of V, the larger the difference between the two groups you sampled.
As always in hypothesis testing, you (well, the wilcox.test function) calculate the probability that the value (V) of that variable is equal to 3832 or larger:
prob('observing a value of 3832 or larger, when the groups are actually the same')
If there is really no difference between the two groups, V will be close to its expected value under the null hypothesis (neither unusually large nor unusually small). Whether the V you see is unusual depends on the probability distribution, which is not straightforward for this variable; luckily that doesn't matter, since wilcox.test knows the distribution and calculates the probability for you (0.7958).
In short
Your groups do not significantly differ and V doesn't have a clear interpretation.
The V statistic produced by the function wilcox.test() can be calculated in R as follows:
# Create random data between -0.5 and 0.5
da_ta <- runif(1e3, min=-0.5, max=0.5)
# Perform Wilcoxon test using function wilcox.test()
wilcox.test(da_ta)
# Calculate the V statistic produced by wilcox.test()
sum(rank(abs(da_ta))[da_ta > 0])
The user MrFlick provided the above answer in reply to this question:
How to get same results of Wilcoxon sign rank test in R and SAS.
The Wilcoxon W statistic is not the same as the V statistic, and can be calculated in R as follows:
# Calculate the Wilcoxon W statistic
sum(sign(da_ta) * rank(abs(da_ta)))
The above statistic can be compared with the Wilcoxon probability distribution to obtain the p-value. There is no simple formula for the Wilcoxon distribution, but it can be simulated using Monte Carlo simulation.
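For illustration, such a Monte Carlo simulation might look like this Python sketch (the sample size and observed W are arbitrary placeholders of mine):

import numpy as np

rng = np.random.default_rng(0)
n = 15             # number of non-zero paired differences (placeholder)
ranks = np.arange(1, n + 1)

# Under H0 each rank 1..n carries a random sign; W = sum(sign * rank).
w_null = np.array([(rng.choice([-1, 1], n) * ranks).sum() for _ in range(100_000)])

w_obs = 40         # placeholder observed W statistic
print((np.abs(w_null) >= abs(w_obs)).mean())  # two-sided Monte Carlo p-value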
The value of V is not the number of positive scores, but the sum of the ranks of the positive differences.
There is likewise a sum of the negative ranks, which this test does not report. A brief script for calculating the sums of positive and negative ranks is provided in the following example:
a <- c(214, 159, 169, 202, 103, 119, 200, 109, 132, 142, 194, 104, 219, 119, 234)
b <- c(159, 135, 141, 101, 102, 168, 62, 167, 174, 159, 66, 118, 181, 171, 112)
diff <- c(a - b) # calculate the vector of differences
diff <- diff[ diff!=0 ] # delete all differences equal to zero
diff.rank <- rank(abs(diff)) # rank the absolute values of the differences
diff.rank.sign <- diff.rank * sign(diff) # attach to each rank the sign of its difference
ranks.pos <- sum(diff.rank.sign[diff.rank.sign > 0]) # sum of the ranks assigned to the positive differences
ranks.neg <- -sum(diff.rank.sign[diff.rank.sign < 0]) # sum of the ranks assigned to the negative differences
ranks.pos # this is the value V of the Wilcoxon signed rank test
[1] 80
ranks.neg
[1] 40
CREDITS: https://www.r-bloggers.com/wilcoxon-signed-rank-test/
(They also provide a nice context for it.)
You can also compare both of these numbers to their average (in this case, 60), which would be the expected value for each side: positive ranks summing to 60 and negative ranks summing to 60 would mean complete equivalence of the sides. Can positive ranks summing to 80 and negative ranks summing to 40 still be considered equivalent? (i.e., could we attribute this difference of 20 to chance, or is it large enough for us to reject the hypothesis of no difference?)
So, as they explain, the critical interval for this case is [25, 95]. Checking a table of critical values for the Wilcoxon signed rank test, the critical value for this example is 25 (15 pairs at 5% on a two-tailed test; and 120 − 25 = 95). This means the interval [40, 80] is not "wide enough" to discard the possibility that the differences are purely due to random sampling. (Consistently, the p-value is above alpha.)
Comparing the sum of positive ranks to the sum of negative ranks helps to determine the significance of the difference, and it enriches the analysis. Also, the positive ranks themselves are input for the calculation of the test's p-value, hence the interest in them.
But extracting meaning from the reported sum of positive ranks (V) alone is not straightforward. The least you can do is also check the sum of the negative ranks, to have a more consistent idea of what is happening (of course, along with general info such as sample size, p-value, etc.).
I, too, was confused about this seemingly mysterious "V" statistic. I realize there are already some helpful answers here, but I did not really understand them when I first read them. So here I am explaining it again in the way that I finally understood it. Hopefully it helps others who are still confused.
The V statistic is the sum of the ranks assigned to the differences with positive signs. That is, when you run a Wilcoxon signed rank test, it calculates a sum of negative ranks (W-) and a sum of positive ranks (W+). The test statistic (W) is usually the minimum of (W-) and (W+), whereas the V statistic is just (W+).
To understand why this matters: if the null hypothesis is true, (W+) and (W-) will be similar. This is because, given the number of samples (n), (W+) and (W-) have a fixed combined value: (W+) + (W-) = n(n+1)/2. If this total is divided somewhat evenly, then there is not much of a difference between the paired sample sets and we accept the null. If there is a large difference between (W+) and (W-), then there is a large difference between the paired sample sets, and we have evidence for the alternative hypothesis. The degree of the difference and its significance relates to the critical value chart for W.
Here are particularly helpful sites to check out if the concept is still not 100%:
1.) https://mathcracker.com/wilcoxon-signed-ranks
2.) https://sphweb.bumc.bu.edu/otlt/mph-modules/bs/bs704_nonparametric/BS704_Nonparametric6.html
3.) https://www.youtube.com/watch?v=TqCg2tb4wJ0
TL;DR: the V statistic reported by R is the same as the W statistic in cases where (W+) is the smaller of (W+) and (W-).
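As a quick numerical check of this, here is a Python sketch using scipy and the same example data as the R snippet above (the tooling choice is mine, not from the original answers):

import numpy as np
from scipy.stats import rankdata, wilcoxon

a = np.array([214, 159, 169, 202, 103, 119, 200, 109, 132, 142, 194, 104, 219, 119, 234])
b = np.array([159, 135, 141, 101, 102, 168, 62, 167, 174, 159, 66, 118, 181, 171, 112])

d = a - b
d = d[d != 0]            # drop zero differences, as the test does
r = rankdata(np.abs(d))  # rank the absolute differences
w_plus = r[d > 0].sum()  # 80.0, which is R's V
w_minus = r[d < 0].sum() # 40.0

# scipy's two-sided statistic is min(W+, W-), i.e. 40 here,
# while R's wilcox.test reports V = W+ = 80.
print(w_plus, w_minus, wilcoxon(a, b).statistic)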

Formula for amplitude using FFT

I want to ask about the formula for amplitude below. I am using the Fast Fourier Transform, which returns complex numbers (real and imaginary parts).
After that I must find the amplitude for each frequency.
My formula is
amplitude = 10 * log (real*real + imagined*imagined)
I want to ask about this formula. What is its source? I have searched, but I haven't found any source. Can anybody tell me where it comes from?
This is a combination of two equations:
1: Finding the magnitude of a complex number (the result of an FFT at a particular bin), the equation for which is
m = sqrt(r^2 + i^2)
2: Calculating relative power in decibels from an amplitude value, the equation for which is p = 10 * log10(A^2 / Aref^2) == 20 * log10(A / Aref), where Aref is some reference value.
By inserting m from equation 1 as A in equation 2, with Aref = 1, we get:
p = 10 * log10(r^2 + i^2)
Note that this gives you a measure of relative signal power rather than amplitude.
The first part of the formula likely comes from the definition of the decibel, with the reference P0 set to 1, assuming that by log you mean a base-10 logarithm.
The second part, i.e. the P1 = real^2 + imagined^2 in the link above, is the square of the modulus of the Fourier coefficient c_n at the n-th frequency you are considering.
A Fourier coefficient is in general a complex number (see its definition in the case of a DFT here), and P1 is by definition the square of its modulus. The FFT that you mention is just one way of calculating the DFT. In your case, the real and complex numbers you refer to are likely the real and imaginary parts of this coefficient c_n.
sqrt(P1) is the modulus of the Fourier coefficient c_n of the signal at the n-th frequency.
sqrt(P1)/N is the amplitude of the Fourier component of the signal at the n-th frequency (i.e. the amplitude of the harmonic component of the signal at that frequency), with N being the number of samples in your signal. To convince yourself that you need to divide by N, see this equation. However, the division factor depends on the definition/convention of the Fourier transform you use; see the note just above here, and the discussion here.
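To make the pieces concrete, here is a small Python/NumPy sketch (the signal parameters are my own example; as noted above, the amplitude scaling factor depends on the FFT convention):

import numpy as np

fs = 1000.0  # sample rate in Hz (example value)
N = 1000     # number of samples
t = np.arange(N) / fs
x = 3.0 * np.sin(2 * np.pi * 50.0 * t)  # a 50 Hz tone with amplitude 3

X = np.fft.rfft(x)        # one-sided FFT: complex coefficients
k = np.argmax(np.abs(X))  # index of the strongest bin (the 50 Hz bin)

# The question's formula: relative power in dB at that bin.
print(10 * np.log10(X[k].real**2 + X[k].imag**2))

# Actual amplitude: modulus scaled by N (the extra factor 2 is because
# rfft is one-sided, so the tone's energy is split between +/- frequencies).
print(2 * np.abs(X[k]) / N)  # ~3.0, matching the input tone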
