I have a population with an average weight of 5 lbs. I've taken a sample of 58,759 observations from the population, with a total weight of 410,522 lbs. I'm trying to figure out whether the sample has a significantly higher average weight than the population, assuming the population is normally distributed. I'm trying to use proportions_ztest from statsmodels, but I'm not sure if I'm using the count and nobs arguments correctly. Can someone tell me whether I'm using the function correctly, or suggest another function? I'm trying to get the p-value.
code:

import statsmodels.api as sm

cnt = 410522  # total weight of the sample (lbs)
nbs = 58759   # number of observations in the sample
vL = 5        # population average weight (lbs)

sm.stats.proportions_ztest(cnt,
                           nbs,
                           vL,
                           alternative='larger')[1]
Note that proportions_ztest tests proportions (a count of successes out of a number of trials), not means, so it isn't the right tool for a question about average weight.
You can use scipy.stats.ttest_1samp(a, popmean) to get the t statistic and p-value.
This is a two-sided test of the null hypothesis that the expected value (mean) of a sample of independent observations a equals the given population mean, popmean. See the SciPy documentation for more detail.
If you want to test whether the sample has a significantly higher average weight than the population, divide the two-sided p-value by 2 to get the right-tailed p-value.
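For concreteness, here is a minimal sketch of that test, assuming you still have the raw per-observation weights (ttest_1samp needs the individual values, not just the total; the normal draw below is only a stand-in for your real data):

import numpy as np
from scipy import stats

pop_mean = 5  # population average weight (lbs)

# stand-in for your actual 58,759 observed weights
weights = np.random.normal(loc=7, scale=2, size=58759)

# two-sided one-sample t-test against the population mean
t_stat, p_two_sided = stats.ttest_1samp(weights, pop_mean)

# right-tailed p-value: halve the two-sided value when the sample mean
# is above the population mean
p_right = p_two_sided / 2 if weights.mean() > pop_mean else 1 - p_two_sided / 2
print(t_stat, p_right)

On SciPy 1.6 or newer you can instead pass alternative='greater' to ttest_1samp and use the returned p-value directly.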
I am working on a Bitcoin price predictor, and I realize that it makes no sense to predict an exact price at a given time.
What we want when predicting a currency's price can be summarized by this question: "What is the probability of the price reaching value X within a specific time range?"
I'm having a hard time integrating this thinking into an RNN/LSTM architecture. My first thought was to build a custom loss function that compares the output of the RNN (typically, a predicted price) with the real lower and upper price of the next day; if lower_price < predicted_value < upper_price, the RNN output would be "classified" as correct (loss = 0), otherwise the loss would be > 0. But I am sure there already exists a better solution for this kind of problem.
Any ideas?
Thank you
There are a number of different ways to do what you are asking. However, I think what you are looking for is a quantile loss function:

import keras.backend as K

def tilted_loss(q, y, f):
    # q: target quantile in (0, 1); y: true values; f: model predictions
    e = y - f
    # under-predictions (e > 0) are weighted by q, over-predictions by (1 - q)
    return K.mean(K.maximum(q * e, (q - 1) * e), axis=-1)
Notebook with full source code. Or if you prefer PyTorch you can find an implementation here.
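In case it helps, here is a hedged usage sketch with standalone Keras (the tiny LSTM model and the 30-step, single-feature input shape are placeholders, not something from the notebook):

from keras.models import Sequential
from keras.layers import LSTM, Dense

# hypothetical minimal model; replace with your own RNN/LSTM architecture
model = Sequential([LSTM(32, input_shape=(30, 1)), Dense(1)])

quantile = 0.9  # train toward the 90th percentile of the next-day price
model.compile(optimizer='adam',
              loss=lambda y, f: tilted_loss(quantile, y, f))

Training two such models (or two output heads), say for q = 0.1 and q = 0.9, gives you a predictive interval rather than a point estimate, which maps directly onto your lower/upper price framing.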
Say we are converting data from one bookstore application to another. How would one go about calculating the sample size of books to review after the data conversion to be sure that 90±5% of all books converted correctly?
Say our existing book list contains 30 books. How many books would we have to review in the new application after the data conversion to be 85-95% certain that all books converted correctly?
OK, let's assume the variable X = proportion of books converted correctly, approximately normally distributed, with values between 0 and 1.
Sample size = this is what we want to determine.
Population size = 30
"Existing book list contains 30 books."
Estimated value = 0.90
That is, the value of X that you think is the real one:
"90±5% of all books converted correctly."
If you have no idea what the actual value is, use 0.5 instead.
Error margin = 0.05
The difference between the real value and the estimated value. As you stated above, this is the ±5%.
Confidence level = 0.95
This is NOT the same as the error margin. You are making a prediction: how sure do you want to be of your prediction? That is the confidence level. You gave a range above:
"to be 85-95% certain that all books converted correctly"
So we're going with 95%, just to be sure.
The recommended sample size is 25
You can use this calculator to arrive at the same result:
https://select-statistics.co.uk/calculators/sample-size-calculator-population-proportion/
And it also has a magnificent explanation of all the input values above.
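If you'd rather reproduce the number than trust the calculator, here is a minimal Python sketch of the usual normal-approximation sample size formula with a finite population correction; I'm assuming this is roughly what the calculator does under the hood:

import math
from scipy.stats import norm

N = 30       # population size (books)
p = 0.90     # estimated proportion converted correctly
e = 0.05     # error margin
conf = 0.95  # confidence level

z = norm.ppf(1 - (1 - conf) / 2)  # two-sided z-score, about 1.96
n0 = z**2 * p * (1 - p) / e**2    # sample size for an infinite population
n = n0 / (1 + (n0 - 1) / N)       # finite population correction
print(math.ceil(n))               # -> 25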
Hope it works for you. Cheers!
I've been attempting to program a PowerPivot workbook that I use to calculate a weighted standard deviation.
The problem is that when I use the code:
(The quality metric Q is weighted by the product tons for each record to get weighted statistics for variable periods, i.e. weeks, months, years.)
Product Q-St.d :=
SQRT(
    SUMX('Table', ([PRODUCT_Q] - [W_Avg_Q])^2 * [TOTAL_PRODUCT_TONS])
    / (((COUNTX('Table', [Production_Q]) - 1) * [Product Tons]) / COUNTX('Table', [Production_Q]))
)
It calculates [W_Avg_Q], the weighted average of Q, for each row as it iterates, instead of computing a single weighted average for the whole filter context. I've learned pretty much all my DAX on the job or from this site, so I'm hoping there's some command to make the weighted average be calculated first. Does anyone know such a command, or another method of getting a weighted standard deviation out of DAX?
I think what you want to do is declare [W_Avg_Q] as a variable and then use it in your formula.
Product Q-St.d :=
VAR WtdAvg = [W_Avg_Q]
RETURN
    SQRT(
        SUMX('Table', ([PRODUCT_Q] - WtdAvg)^2 * [TOTAL_PRODUCT_TONS])
        / (((COUNTX('Table', [Production_Q]) - 1) * [Product Tons]) / COUNTX('Table', [Production_Q]))
    )
This way it gets calculated once in the proper context and then stored and reused within the formula.
I am simulating an M/M/1 queue in Excel, where I want to generate random values for the arrival rate (lambda) and service rate (mu) in two columns such that:
arrival rate (lambda) = 2
service rate (mu) = 5
(but the average of mu always remains 1 in its column, despite the values for mu being generated randomly using RAND()).
How can I generate random values using RAND() for lambda and mu such that their averages are restricted to a fixed value in the respective columns? I need to run the simulation for different values of utilization, where utilization = lambda/mu (I need to use 0.2, 0.4, 0.6 and 0.8 for lambda/mu), but the values in both columns should be random and the average of mu should be fixed at 1. I will then change lambda for the different utilization ratios.
Excel lacks an exponential random variate generating function, but you can generate your own with the desired rate via inversion: -log(U) / rate has an exponential distribution, where U is a uniform(0,1) random value and rate is the desired rate (mu or lambda). In Excel that is =-LN(RAND())/rate.
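If you want to sanity-check the inversion approach outside Excel, here is a small Python sketch (NumPy assumed) showing that -log(U) / rate has the right average:

import numpy as np

rng = np.random.default_rng(42)
rate = 1.0                   # desired rate, so the mean should be 1/rate = 1
u = rng.random(100_000)      # uniform(0,1) draws, like RAND() in Excel
samples = -np.log(u) / rate  # inversion: exponential with the given rate
print(samples.mean())        # approximately 1.0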
That said, I really wouldn't recommend using Excel for anything other than toy problems unless you've read papers such as this one and understand the issues you may be dealing with.
As an aside, in case you haven't seen it before, there's an easy way to model a first-come-first-served single-server queue using iterative logic, as sketched below.
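The iterative logic referred to is, I assume, Lindley's recurrence: each customer's waiting time is max(0, previous wait + previous service time - interarrival gap). A sketch, staying in Python for consistency even though the recurrence is one formula per row in Excel:

import numpy as np

rng = np.random.default_rng(7)
lam, mu, n = 2.0, 5.0, 100_000  # arrival rate, service rate, customers

interarrivals = -np.log(rng.random(n)) / lam
services = -np.log(rng.random(n)) / mu

wait = 0.0
total = 0.0
for a, s in zip(interarrivals, services):
    # Lindley recurrence: next customer's wait in queue
    wait = max(0.0, wait + s - a)
    total += wait
# for lambda=2, mu=5 theory gives Wq = lambda / (mu * (mu - lambda)) ~ 0.133
print(total / n)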
I have 2 columns and multiple rows of data in Excel. Each column represents an algorithm, and the values in the rows are its results for different parameters. I want to run a statistical significance test on these two algorithms with Excel. Can anyone suggest a function?
As a result, it would be nice to state something like "Algorithm A performs 8% better than Algorithm B with .9 probability (or a 95% confidence interval)".
The wikipedia article explains accurately what I need:
http://en.wikipedia.org/wiki/Statistical_significance
It seems like a very easy task, but I have failed to find a scientific measurement function.
Any advice about a built-in Excel function or code snippets would be appreciated.
Thanks.
Edit:
After tharkun's comments, I realized I should clarify some points:
The results are merely real numbers between 1 and 100 (they are percentage values). Each row represents a different parameter, so the values in a row are the two algorithms' results for that parameter. The results do not depend on each other.
When I take the average of all values for Algorithm A and Algorithm B, I see that the mean of Algorithm A's results is 10% higher than Algorithm B's. But I don't know whether this is statistically significant. In other words, maybe Algorithm A scored 100 percent higher than Algorithm B for one parameter while Algorithm B has higher scores for the rest, and the difference in averages is 10% just because of that one result.
And I want to do this calculation using just Excel.
Thanks for the clarification. In that case you want to do an independent two-sample t-test, i.e. compare the means of two independent data sets.
Excel has a function for this: TTEST (called T.TEST in newer versions); that's what you need.
For your example you should probably use two tails and type 2, e.g. =TTEST(A2:A31, B2:B31, 2, 2) if the two algorithms' results sit in columns A and B.
The formula outputs a probability known as the probability of an alpha error: the error you would make if you concluded the two data sets are different when they actually are not. The lower the alpha error probability, the higher the chance your sets are different.
You should only accept that the two data sets differ if the value is lower than 0.01 (1%), or for critical outcomes even 0.001 or lower. You should also know that the t-test needs at least around 30 values per data set to be reliable enough, and that the type 2 test assumes equal variances of the two data sets. If equal variances are not given, you should use the type 3 test.
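If you ever want to double-check the Excel output outside Excel, the same test is one call in SciPy (the lists below are placeholder values, not your data):

from scipy import stats

a = [72.5, 80.1, 65.3, 90.0, 71.2]  # Algorithm A results (placeholders)
b = [60.2, 70.4, 58.9, 75.5, 66.1]  # Algorithm B results (placeholders)

# equal_var=True matches Excel's type 2 (equal variances);
# equal_var=False matches type 3 (Welch's t-test)
t_stat, p_value = stats.ttest_ind(a, b, equal_var=True)
print(p_value)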
http://depts.alverno.edu/nsmt/stats.htm