Likelihood of a Distribution of Values Occurring Randomly - excel

I have a data matrix depicting the number of telephone calls from one telephone to another, all calls are unidirectional. The rows represent days and the columns represent hours. The data is not a sample - it is the full population. Rows are days of the week and columns are one hour blocks of a 24 hour clock. Values in the cells represent the number of telephone calls from telephone A to telephone B for that specific hour.
I would like to have a repeatable measure that enables me to tell my audience that the likelihood of this distribution occurring randomly is <x.
I'd like the formula for Excel 2007 or, as a last resort, VBA code.
I've searched and found answers that tell me how to statistically determine the significance of differences between two different data sets but not how to measure for just one data set against a random outcome.
Thanx in advance.

If the total number of calls in a given hour is T, and the total calling population is P; then the number of calls from A to B should be about T/P if "random". To test whether this is really the case you'd use the Chi-squared test. I'm afraid I don't have time to give you the full answer - but it'd be the testvalue=sum((observed_i/P - (T/P))^2/(T/P)) where you check the testvalue against the chi-squared table, and you can pick off the probability too. Excel can calculate these values. Refer Chi-Squared Test for more details.

Related

Number of days for delivery and number of orders delivered in two separate columns. Is there a way to get summary statistics about orders?

I've had a bit of trouble explaining this so please bear with me. I'm also very new to using excel so if there's a simple fix, I apologize in advance!
I have two columns, one listing number of days starting from 0 and increasing consecutively. The other column has the number of orders delivered. The two correspond to each other. For example, I've typed out how it would look below. It would mean that there were 100 orders delivered in 1 day, 150 orders delivered in 2 days, 800 orders delivered in 3 days, etc.
Is there a way to get summary statistics (mean, median, mode, upper and lower quartiles) for the number of days it took for the average order to get delivered? The only way I can think of solving this is to manually punch in "1" 100 times, "2" 150 times, etc. into a new column and take median, mean, and upper & lower quartile from that, but that seems extremely inefficient. Would I use a pivot table for this? Thank you in advance!
I tried using the data analysis add-on and doing summary statistics that way, but it didn't work. It just gave me the mean, median, mode, and quartiles of each individual column. It would have given me 3 for median number of days for delivery and 300 for median number of orders.
Method 1
The mean is just
=SUMPRODUCT(A2:A6,B2:B6)/SUM(B2:B6)
Mode is the value with highest frequency
=INDEX(A2:A6,MATCH(MAX(B2:B6),B2:B6,0))
The quartiles and median (or any other quantile by varying the value of p) from first principles following this reference
=LET(p,0.25,
values,A2:A6,
freq,B2:B6,
N,SUM(freq),
h,(N+1)*p,
floorh,FLOOR(h,1),
ceilh,CEILING(h,1),
frac,h-floorh,
cusum,SCAN(0,SEQUENCE(ROWS(values)),LAMBDA(a,c,IF(c=1,0,a+INDEX(freq,c-1)))),
xlower,XLOOKUP(floorh-1,cusum,values,,-1),
xupper,XLOOKUP(ceilh-1,cusum,values,,-1),
xlower+(xupper-xlower)*frac)
Method 2
If you don't like doing it this way, you can always expand the data like this:
=AVERAGE(XLOOKUP(SEQUENCE(SUM(B2:B6),1,0),SCAN(0,SEQUENCE(ROWS(A2:A6)),LAMBDA(a,c,IF(c=1,0,INDEX(B2:B6,c-1)+a))),A2:A6,,-1))
=MODE(XLOOKUP(SEQUENCE(SUM(B2:B6),1,0),SCAN(0,SEQUENCE(ROWS(A2:A6)),LAMBDA(a,c,IF(c=1,0,INDEX(B2:B6,c-1)+a))),A2:A6,,-1))
=QUARTILE.EXC(XLOOKUP(SEQUENCE(SUM(B2:B6),1,0),SCAN(0,SEQUENCE(ROWS(A2:A6)),LAMBDA(a,c,IF(c=1,0,INDEX(B2:B6,c-1)+a))),A2:A6,,-1),1)
=MEDIAN(XLOOKUP(SEQUENCE(SUM(B2:B6),1,0),SCAN(0,SEQUENCE(ROWS(A2:A6)),LAMBDA(a,c,IF(c=1,0,INDEX(B2:B6,c-1)+a))),A2:A6,,-1))
and
=QUARTILE.EXC(XLOOKUP(SEQUENCE(SUM(B2:B6),1,0),SCAN(0,SEQUENCE(ROWS(A2:A6)),LAMBDA(a,c,IF(c=1,0,INDEX(B2:B6,c-1)+a))),A2:A6,,-1),3)

Counting number of occurrences in a time series for a set of condition(s)

I am required to do some summary statistics on the attached table as an example.
Some of the questions to answer include:
1) How many countries with valid time series (countries that have at least one value/number for a given indicator name over the time period of 2010-2015)
e.g: Count how many countries have valid times series for the indicator: "Number of completed applications"
2) For a given country and indicator what is the number of year(s) with valid time series.
e.g: For the indicator number of completed applications and the country Canada? (Answer: 2 --> 2014, 2015)
Alternatively, if the table look like this instead (which is a typical csv format) what approach could be taken to answer the two summary statistics questions above?
I have tried method of sumproduct formula for the pivoted table. Is there a better way than this method?
=SUMPRODUCT(N((B2:B14>0)+(C2:C14>0)+(D2:D14>0)+(E2:E14>0)+(F2:F14>0)+(G2:G14>0)+(H2:H14>0)+(I2:I14>0)+(J2:J14>0)>0))
But what about when it is a flat table?
So, an example of countifs() and also sumifs():
From Nevsky -- Thanks a lot for the example! I took the liberty to modify it a bit as follows :

Statistical method for time-course data comparison

I have a question for statistical method which i cant find in my textbook. I want to compare data of two groups. For example, both group have data of day 0, but one group have data of day 2, and another day 6. How can I analyse the outcome with the data and the date? i.e. I want to show that the if data taken on day XX are YY, it has an impact on the outcome.
Thanks in advance.
I'd use a repeated measures ANOVA in this case. However, since you don't have a complete dataset, day X and Y would be just operationalized as the endpoint of your dependent variable. If you'd have measures of all days I'd include.all of them in the analysis in order to fully compare the two timelines. You could then also compare the days of interest directly by using post-hoc tests (e.g. Bonferroni)

Simulation in Excel using probability

I am trying to create a spreadsheet that can find the most likely probability that a student scored a specific grade on a test.
Only one student can score a grade and only one grade can have a student.
I have limited information about each student.
There are 5 students (1,2,3,4,5)
and the grades possible are only (100,90,80,70,60)
In the spreadsheet a 1 denotes that the student DIDN'T score that grade.
Does anyone know how to make a simulation that I can find the most likely probability of what student scored what grade?
Link:
https://docs.google.com/spreadsheets/d/1a8uUIRzUKsY3DolTM1A0ISqMd-42WCUCiDsxmUT5TKI/edit?usp=sharing
Based on your response in comments, each student has an equal likelihood of getting each grade. No simulation is necessary.
If you want to simulate it anyway, don't use Excel*. Create a vector of students, and pair it with a shuffled vector of the grades. Lather, rinse, repeat as many times as you want to see that the student-to-grade matching is uniformly distributed.
* - To get an idea of how bad Excel can be for random variate generation, enable the Analysis Toolpak, go to "Data -> Data Analysis" on the ribbon, and select "Random Number Generation". Fill in the tabs that you want 10 variables, number of random numbers 2000, a "Normal" distribution, leave the mean and std dev at 0 and 1, and enter a "Random Seed" value of 123. You will find that the resulting table contains 3 instances of the value "-9.35764". Values that extreme should occur about once per twenty thousand years if you generate a billion a second. Getting three of them is so extreme that it should happen once per 1030 times the current estimated age of the universe. Conclude that a) it's your lucky day, or b) Excel sucks at random numbers, and despite being informed about this as far back as 1998 Microsoft hasn't bothered to fix it.

Compute statistical significance with Excel

I have 2 columns and multiple rows of data in excel. Each column represents an algorithm and the values in rows are the results of these algorithms with different parameters. I want to make statistical significance test of these two algorithms with excel. Can anyone suggest a function?
As a result, it will be nice to state something like "Algorithm A performs 8% better than Algorithm B with .9 probability (or 95% confidence interval)"
The wikipedia article explains accurately what I need:
http://en.wikipedia.org/wiki/Statistical_significance
It seems like a very easy task but I failed to find a scientific measurement function.
Any advice over a built-in function of excel or function snippets are appreciated.
Thanks..
Edit:
After tharkun's comments, I realized I should clarify some points:
The results are merely real numbers between 1-100 (they are percentage values). As each row represents a different parameter, values in a row represents an algorithm's result for this parameter. The results do not depend on each other.
When I take average of all values for Algorithm A and Algorithm B, I see that the mean of all results that Algorithm A produced are 10% higher than Algorithm B's. But I don't know if this is statistically significant or not. In other words, maybe for one parameter Algorithm A scored 100 percent higher than Algorithm B and for the rest Algorithm B has higher scores but just because of this one result, the difference in average is 10%.
And I want to do this calculation using just excel.
Thanks for the clarification. In that case you want to do an independent sample T-Test. Meaning you want to compare the means of two independent data sets.
Excel has a function TTEST, that's what you need.
For your example you should probably use two tails and type 2.
The formula will output a probability value known as probability of alpha error. This is the error which you would make if you assumed the two datasets are different but they aren't. The lower the alpha error probability the higher the chance your sets are different.
You should only accept the difference of the two datasets if the value is lower than 0.01 (1%) or for critical outcomes even 0.001 or lower. You should also know that in the t-test needs at least around 30 values per dataset to be reliable enough and that the type 2 test assumes equal variances of the two datasets. If equal variances are not given, you should use the type 3 test.
http://depts.alverno.edu/nsmt/stats.htm

Resources