Excel's TDIST in SciPy - excel-formula

I have an Excel sheet that runs a one-sample, two-tailed t-test to calculate statistical significance.
The last step of the calculation uses the Excel TDIST formula with these arguments:
t-stat
degrees of freedom
tails (called with 2 for a two-tailed test)
What is the exact SciPy equivalent?

from scipy import stats
stats.t.sf(t_statistic, df=degrees_of_freedom) * 2
This answer has more information.
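As a minimal sketch, the one-liner above can be wrapped in a helper that mirrors the three TDIST arguments. The name excel_tdist is made up for illustration and is not part of SciPy. The check for negative input reflects the fact that Excel's TDIST rejects a negative x.

```python
from scipy import stats


def excel_tdist(x, deg_freedom, tails):
    """Hypothetical helper mimicking Excel's TDIST(x, deg_freedom, tails)."""
    if x < 0:
        # Excel's TDIST returns an error for negative x.
        raise ValueError("x must be non-negative")
    if tails not in (1, 2):
        raise ValueError("tails must be 1 or 2")
    # sf is the survival function, 1 - cdf, i.e. the upper-tail probability.
    return tails * stats.t.sf(x, df=deg_freedom)


# Example values chosen for illustration only.
p_two_tailed = excel_tdist(2.228, 10, 2)
print(p_two_tailed)
```

If the t-statistic can be negative, passing abs(t_statistic) gives the two-tailed p-value.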

Related

Find the Sum of Cumulative Probability in Excel

I'm wondering how I would calculate the sum of a cumulative probability in Excel.
I have attached the column of values that I am working with. Any help is appreciated.
I have tried finding the mean and the standard deviation of the values, applying the normal distribution function, and then summing those values, but this doesn't seem to produce the right value.
You can use NORM.DIST(x,mean,standard_dev,cumulative), which lets you specify the mean and the standard deviation. If the last argument is TRUE, it returns the cumulative probability. This assumes that your data follow a normal distribution. If you are not sure about that, run a normality test first to confirm it (many natural phenomena are approximately normally distributed).
For the mean, you can use the AVERAGE function, and for the Standard Deviation STDEV.S.
So on cell D4 put the following formula to calculate the cumulative probability for 0.25:
=NORM.DIST(D3,D1,D2, TRUE)
So if your data correspond to a Normal Distribution, then the cumulative probability for 0.25 will be 0.361494.
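For comparison, here is a rough Python sketch of the same calculation. The data values are made up for illustration, since the original column is not shown. statistics.stdev is the sample standard deviation (like STDEV.S), and scipy.stats.norm.cdf plays the role of NORM.DIST with the last argument set to TRUE.

```python
import statistics

from scipy import stats

# Hypothetical data standing in for the attached column.
values = [0.10, 0.20, 0.25, 0.30, 0.45]

mean = statistics.mean(values)  # like AVERAGE
sd = statistics.stdev(values)   # like STDEV.S (sample standard deviation)

# Cumulative probability P(X <= 0.25) under Normal(mean, sd).
cumulative_p = stats.norm.cdf(0.25, loc=mean, scale=sd)
print(cumulative_p)
```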

Plotly 25th and 75th Percentile is different from Pandas and Numpy 25th and 75th Percentile

I am using a plotly boxplot, but I found that the Q1 and Q3 numbers are very different from the 25th and 75th percentile numbers from pandas and numpy, which are what I want my plotly boxplot to show.
Is there any way to solve this issue?
Percentile from Pandas describe function
DateTime Mean Median 25% Percentile 75% Percentile
254 2020-12-24 09:00:00 19479.529412 18155.0 17695.0 19259.0
Percentile from Numpy
DateTimeStarted mean median percentile_25 percentile_75
254 2020-12-24 09:00:00 19479.529412 18155.0 17695.0 19259.0
Refer to https://numpy.org/doc/stable/reference/generated/numpy.percentile.html#r08bde0ebf37b-1
Notice there are multiple methods (9 at the moment of writing) of calculating the percentile in numpy, the default is 'linear'.
=================================================================
Now looking at the plotly documentation, unfortunately the method for calculating percentile can only be changed under the plotly.graph_objects.Box() function, but not the higher level plotly.express
Refer to this link:
https://plotly.github.io/plotly.py-docs/generated/plotly.graph_objects.Box.html
Scroll down you will see one of the parameters:
"quartilemethod – Sets the method used to compute the sample’s Q1 and Q3
quartiles. The “linear” method uses the 25th percentile for Q1 and 75th percentile for Q3 as computed using method #10 (listed on http://www.amstat.org/publications/jse/v14n3/langford.html). The “exclusive” method uses the median to divide the ordered dataset into two halves if the sample is odd, it does not include the median in either half - Q1 is then the median of the lower half and Q3 the median of the upper half. The “inclusive” method also uses the median to divide the ordered dataset into two halves but if the sample is odd, it includes the median in both halves - Q1 is then the median of the lower half and Q3 the median of the upper half."
Of course, the 'linear' method specified here can be different from the 'linear' method in numpy and pandas. The link provided in the plotly doc is dead, so I have no idea what exact method they are using. There are only three options: 'exclusive', 'inclusive' and 'linear'. So I suggest you try all three to see which one matches the numpy default result.
In any case, the difference should be small and should only affect the "borderline" outliers.
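To see how much the choice of method matters, here is a small sketch that compares numpy's default percentiles with the "exclusive" and "inclusive" median-of-halves rules as described in the quoted documentation. The function quartiles_median_of_halves is a made-up name for illustration, not a plotly or numpy API, and the sample data is invented. Whether plotly's own 'linear' method agrees with numpy's default is not something this sketch confirms.

```python
import numpy as np


def quartiles_median_of_halves(values, inclusive):
    """Q1 and Q3 as medians of the lower and upper halves of the sorted data.

    For an odd-sized sample, inclusive=True keeps the median in both halves
    and inclusive=False leaves it out of both, following the quoted text.
    """
    data = np.sort(np.asarray(values, dtype=float))
    n = data.size
    half = n // 2
    if n % 2 == 1 and inclusive:
        lower, upper = data[: half + 1], data[half:]
    else:
        lower, upper = data[:half], data[n - half:]
    return float(np.median(lower)), float(np.median(upper))


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # made-up data

q1_numpy, q3_numpy = np.percentile(sample, [25, 75])  # numpy default ('linear')
q1_excl, q3_excl = quartiles_median_of_halves(sample, inclusive=False)
q1_incl, q3_incl = quartiles_median_of_halves(sample, inclusive=True)

print("numpy default:", q1_numpy, q3_numpy)
print("exclusive:    ", q1_excl, q3_excl)
print("inclusive:    ", q1_incl, q3_incl)
```

Running the same comparison on your own column shows which rule reproduces the pandas and numpy numbers.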

stats, t-test, lists, and only one output

t-tests have been yielding just one output...but I want 10 t-tests.
Each t-test should compare each one of the 10 values in list to 0.
I have tried the below:
import scipy
from scipy import stats
list2=[0.10415380403918414, 0.09142102934943379, 0.08340408682911706, 0.07791383429638124, 0.0738177221067812, 0.07111840615962706, 0.0673345711222398, 0.06431875318226271, 0.06074216826770115, 0.052948996685723906]
print(scipy.stats.ttest_ind(list2,[0]*10))
Each t-test should compare each one of the 10 values in list to 0. that is, I should get 10 t-test comparison, so 10 t-tests should be outputted
All of this is to say: I am seeking 10 rows of output (each corresponding to a unique t-test, therefore I am seeking 10 t-tests), but the code I have now just provides me with one row output, i.e. just one test
listofzeros=[0,0,0,0,0,0,0,0,0,0]
for i in range(10):
    print(scipy.stats.ttest_ind(list2,listofzeros))
First, there is no need to use stats.ttest_ind and create a list of zeros with the same length as the sample. You can just use stats.ttest_1samp, as follows:
print(scipy.stats.ttest_1samp(list2, 0))
That gives the same t-statistic without the workaround (the p-value differs, because the two tests use different degrees of freedom). It returns the t-statistic and the p-value for the mean of the input sample, not one result per value in the sample.
To be more precise, the one-sample t-test is used to determine whether the sample mean is statistically significantly different from a hypothesized population mean.
What you are doing with ttest_ind is a two-sample t-test, which compares the means of the two lists, not each pair of associated values in the two samples.
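The following sketch shows why there is only one result: the one-sample t-statistic is built from the mean and the standard deviation of the whole sample. The manual calculation is added here for illustration and uses the standard one-sample formula t = mean / (sd / sqrt(n)). A single value on its own has no sample standard deviation, so a t-test per value is not defined.

```python
import math

from scipy import stats

# list2 from the question.
values = [0.10415380403918414, 0.09142102934943379, 0.08340408682911706,
          0.07791383429638124, 0.0738177221067812, 0.07111840615962706,
          0.0673345711222398, 0.06431875318226271, 0.06074216826770115,
          0.052948996685723906]

# One test for the whole sample against a population mean of 0.
result = stats.ttest_1samp(values, popmean=0)

# The same t-statistic computed by hand.
n = len(values)
mean = sum(values) / n
sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
t_manual = mean / (sd / math.sqrt(n))

print(result.statistic, result.pvalue)
print(t_manual)
```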

binomial distribution z-score value too large

I am trying to solve the question below using
n = 500, p = 0.9/100 and q = 1 - 0.9/100
but I am getting a very large z-score and mean.
Paycheck Errors The payroll department of a hospital has found that in one year, 0.9% of its paychecks are calculated incorrectly. The hospital has 500 employees.
(a) What is the probability that in one month’s records no paycheck errors are made?
(b) What is the probability that in one month’s records at least one paycheck error is made?
Z transformation is a poor approximation to the binomial distribution for npq < 10. For your problem npq == 4.4595, so the Z approximation is a no-go.
You'd do better to calculate it exactly as a binomial using software, or approximate it as a Poisson with rate λ=np. Once you solve part (a), part (b) is just the complement.
I went ahead and calculated part (a) both ways. The Poisson approximation differs from the exact calculation by only 0.00022.
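A short sketch of both calculations with scipy.stats, assuming (as in the question) n = 500 paychecks and p = 0.009 per paycheck:

```python
from scipy import stats

n = 500    # paychecks
p = 0.009  # probability that a single paycheck is wrong

# Part (a): probability of zero errors, exact binomial vs Poisson with rate n*p.
p_none_exact = stats.binom.pmf(0, n, p)
p_none_poisson = stats.poisson.pmf(0, n * p)

# Part (b): at least one error is the complement of part (a).
p_at_least_one = 1 - p_none_exact

print(p_none_exact, p_none_poisson, p_none_poisson - p_none_exact)
print(p_at_least_one)
```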
You should use the binomial distribution formula rather than the sampling distribution formula.

Generating random numbers with normal distribution in Excel

I want to produce 100 random numbers with normal distribution (with µ=10, σ=7) and then draw a quantity diagram for these numbers.
How can I produce random numbers with a specific distribution in Excel 2010?
One more question:
When I produce, for example, 20 random numbers with RANDBETWEEN(Bottom,Top), the numbers change every time the sheet recalculates. How can I keep this from happening?
Use the NORMINV function together with RAND():
=NORMINV(RAND(),10,7)
To keep your set of random values from changing, select all the values, copy them, and then paste (special) the values back into the same range.
Sample output (column A), 500 numbers generated with this formula:
If you have Excel 2007, you can use
=NORMSINV(RAND())*SD+MEAN
because Excel 2010 changed many of its statistical functions (for example, it introduced NORM.INV alongside the older NORMINV).
As @osknows said in a comment above (rather than in an answer, which is why I am adding this), the Analysis ToolPak includes a Random Number Generation tool for generating a set of numbers, alongside worksheet functions such as NORM.DIST and NORM.INV. A good summary link is at http://www.bettersolutions.com/excel/EUN147/YI231420881.htm.
Rand() does generate a uniform distribution of random numbers between 0 and 1, but the norminv (or norm.inv) function is taking the uniform distributed Rand() as an input to generate the normally distributed sample set.
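The same idea (inverse transform sampling) can be sketched in Python for comparison. Here scipy.stats.norm.ppf stands in for NORMINV and NumPy's uniform generator stands in for RAND(). The function name normal_via_inverse_cdf and the seed are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats


def normal_via_inverse_cdf(count, mean, sd, rng):
    """Draw uniform numbers, then map them through the normal inverse CDF."""
    uniform = rng.random(count)  # uniform on [0, 1), like RAND()
    return stats.norm.ppf(uniform, loc=mean, scale=sd)  # like NORMINV(u, mean, sd)


rng = np.random.default_rng(seed=0)
samples = normal_via_inverse_cdf(100, mean=10, sd=7, rng=rng)
print(samples.mean(), samples.std(ddof=1))
```

With only 100 draws, the sample mean and standard deviation will be close to 10 and 7 but not equal to them.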
About the recalculation:
You can keep your set of random values from changing every time you make an adjustment, by adjusting the automatic recalculation, to: manual recalculate. (Re)calculations are then only done when you press F9. Or shift F9.
See this link (though for older excel version than the current 2013) for some info about it: https://support.office.com/en-us/article/Change-formula-recalculation-iteration-or-precision-73fc7dac-91cf-4d36-86e8-67124f6bcce4.
Take a look at the Wikipedia article on random numbers as it talks about using sampling techniques. You can find the equation for your normal distribution by plugging into this one
(equation via Wikipedia)
As for the second issue, go into Options under the circle Office icon, go to formulas, and change calculations to "Manual". That will maintain your sheet and not recalculate the formulas each time.
Another interesting way to do this is using the Box-Muller Method. This lets you generate a normal distribution with mean of 0 and standard deviation σ (or variance σ²) of 1 using two uniform random distributions between 0 and 1. Then you can take this Norm(0,1) distribution and scale it to whatever mean and standard deviation you want.
Here's the formula in excel for a normal(0, 1) distribution:
=SQRT(-2*LN( RAND()))*COS(2 * PI()*RAND())
Then use this formula to scale your normal distribution to mean 10 and standard deviation of 7:
Norm(µ=b, σ=a) = a*Norm(µ=0, σ²=1) + b
This would make the equation in Excel:
=7* SQRT(-2*LN( RAND()))*COS(2 * PI()*RAND()) + 10
You can read more about the math behind this Box-Muller Equation on en.Wikipedia
Note that this equation only works if you calculate the cosine function using radians.
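For reference, here is a minimal Python sketch of the same Box-Muller formula, scaled to mean 10 and standard deviation 7. The function name box_muller_normal is made up for illustration. One deliberate difference from the Excel formula: the first uniform number is shifted to the interval (0, 1] so that the logarithm never receives 0.

```python
import math
import random


def box_muller_normal(mean, sd, rng=random):
    """One normal draw from two uniform draws (cosine branch of Box-Muller)."""
    u1 = 1.0 - rng.random()  # in (0, 1], avoids log(0)
    u2 = rng.random()        # in [0, 1)
    # math.cos works in radians, as the formula requires.
    z = math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)
    return sd * z + mean     # scale Norm(0, 1) to Norm(mean, sd)


random.seed(0)  # arbitrary seed so the run is repeatable
samples = [box_muller_normal(10, 7) for _ in range(100)]
print(sum(samples) / len(samples))
```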
Note that RAND() on its own is uniformly distributed. Once it is passed through
=NORMINV(RAND(),10,7)
the resulting numbers are normally distributed, so there is no need to write a custom function.
