I've had a bit of trouble explaining this so please bear with me. I'm also very new to using excel so if there's a simple fix, I apologize in advance!
I have two columns, one listing number of days starting from 0 and increasing consecutively. The other column has the number of orders delivered. The two correspond to each other. For example, I've typed out how it would look below. It would mean that there were 100 orders delivered in 1 day, 150 orders delivered in 2 days, 800 orders delivered in 3 days, etc.
Is there a way to get summary statistics (mean, median, mode, upper and lower quartiles) for the number of days it took for the average order to get delivered? The only way I can think of solving this is to manually punch in "1" 100 times, "2" 150 times, etc. into a new column and take median, mean, and upper & lower quartile from that, but that seems extremely inefficient. Would I use a pivot table for this? Thank you in advance!
I tried using the data analysis add-on and doing summary statistics that way, but it didn't work. It just gave me the mean, median, mode, and quartiles of each individual column. It would have given me 3 for median number of days for delivery and 300 for median number of orders.
Method 1
The mean is just
=SUMPRODUCT(A2:A6,B2:B6)/SUM(B2:B6)
Mode is the value with highest frequency
=INDEX(A2:A6,MATCH(MAX(B2:B6),B2:B6,0))
The quartiles and median (or any other quantile by varying the value of p) from first principles following this reference
=LET(p,0.25,
values,A2:A6,
freq,B2:B6,
N,SUM(freq),
h,(N+1)*p,
floorh,FLOOR(h,1),
ceilh,CEILING(h,1),
frac,h-floorh,
cusum,SCAN(0,SEQUENCE(ROWS(values)),LAMBDA(a,c,IF(c=1,0,a+INDEX(freq,c-1)))),
xlower,XLOOKUP(floorh-1,cusum,values,,-1),
xupper,XLOOKUP(ceilh-1,cusum,values,,-1),
xlower+(xupper-xlower)*frac)
Method 2
If you don't like doing it this way, you can always expand the data like this:
=AVERAGE(XLOOKUP(SEQUENCE(SUM(B2:B6),1,0),SCAN(0,SEQUENCE(ROWS(A2:A6)),LAMBDA(a,c,IF(c=1,0,INDEX(B2:B6,c-1)+a))),A2:A6,,-1))
=MODE(XLOOKUP(SEQUENCE(SUM(B2:B6),1,0),SCAN(0,SEQUENCE(ROWS(A2:A6)),LAMBDA(a,c,IF(c=1,0,INDEX(B2:B6,c-1)+a))),A2:A6,,-1))
=QUARTILE.EXC(XLOOKUP(SEQUENCE(SUM(B2:B6),1,0),SCAN(0,SEQUENCE(ROWS(A2:A6)),LAMBDA(a,c,IF(c=1,0,INDEX(B2:B6,c-1)+a))),A2:A6,,-1),1)
=MEDIAN(XLOOKUP(SEQUENCE(SUM(B2:B6),1,0),SCAN(0,SEQUENCE(ROWS(A2:A6)),LAMBDA(a,c,IF(c=1,0,INDEX(B2:B6,c-1)+a))),A2:A6,,-1))
and
=QUARTILE.EXC(XLOOKUP(SEQUENCE(SUM(B2:B6),1,0),SCAN(0,SEQUENCE(ROWS(A2:A6)),LAMBDA(a,c,IF(c=1,0,INDEX(B2:B6,c-1)+a))),A2:A6,,-1),3)
I have an excel dataset of 24-hour moving averages for PM10 air pollution concentration levels, and need to obtain the individual hourly readings from them. The moving average data is updated every hour, so at hour t, the reading is the average of the 24 readings from t-23 to t hours, and at hour t+1, the reading is the average of t-22 to t+1, etc. I do not have any known data points to extrapolate from, just the 24-hour moving averages.
Is there any way I can obtain the individual hourly readings for time t, t+1, etc, from the moving average?
The dataset contains data over 3 years, so with 24 readings a day (at every hour), the dataset has thousands of readings.
I have tried searching for a possible way to implement a simple excel VBA code to do this, but come up empty. Most of the posts I have seen on Stackoverflow and stackexchange, or other forums, involve calculating moving averages from discrete data, which is the reverse of what I need to do here.
The few I have seen involve using matrices, which I am not very sure how to implement.
(https://stats.stackexchange.com/questions/67907/extract-data-points-from-moving-average)
(https://stats.stackexchange.com/questions/112502/estimating-original-series-from-their-moving-average)
Any suggestions would be greatly appreciated!
Short answer: you can't.
Consider a moving average on 3 points. And even consider we multiply each MA term by 3, so we really have sums of consecutive
Data: a b c d e f g
MA a+b+c
b+c+d
c+d+e
d+e+f
e+f+g
With initial values, you can do something. To find the value of d, you would need to know b+c, hance to know a (since a+b+c is known). Then to find e, you know c+d+e and d, so you must find c, and since a is already needed, you will need also need b.
More generally, for a MA of length n, if you know the first n-1 values (hence also the nth, since you know the sum), then you can find all subsequent values. You can also start from the end. But basically, if you don't have enough original data, you are lost: there is a 1-1 relation between the n-1 first values of your data and the possible MA series. If you don't have enough information, there are infinitely many possibilities, and you can't decide which one is right.
Here I consider the simplest MA where the coefficient of each variable is 1/n (hence you compute the sum and divide by n). But this would apply to any MA, with slightly more complexity to account for different coefficients for each term in the sum.
I have got 1500 rows of travels. In column A I have got total time on travel, in column B total km driven. In column C I did calculation on the average speed of specific travel. Whats the best way to calculate the average speed of all travels? The lengths are from 0 to 20 kms approx, time always shorter than one hour.
First I eliminated all travels shorter than 2 km then
I managed to do a frequency table and have written frequencies of speeds in 0-5,5-10,... km/h. Now I can do a histogram, but should I eliminate more data or how to approach this problem?
In another cell enter:
=SUM(B:B)/SUM(A:A)
A common error would be to try to average the values in column C.
it depends on your data. if it is statistics, don't throw data away, use them.
you have column A for travel time, and column B for travle distance. using this two column you can find the total average speed like what Gary's student suggest i.e. SUM(B:B)/SUM(A:A).
you also have column C the average speed for each travel, you can use this two counter check. simply do SUMPRODUCT(A:A,C:C), you should find the result equals to SUM(B:B). if the results match, then i'll say "ok i'm satisfied with my calculation".
I made a little test machine that accidentally created a 'big' data set:
6 columns with +/- 550.000 rows.
The end result I am looking for is a graph with 6 lines, horizontal axis 1 - 550.000 measurements and vertically the values in the rows. (capped at 200 or so). Data is a resistance measurement that should be between 0 - 30 or very big (borken), the software writes 'inf' in these cases.
My skill is limited to excel, so what have I done until now:
Imported in Excel. The measurements are valuable between 0 - 30 and inf is not good for a graph, so I did: if(cell>200){200}else{keep cell value}.
Now making a graph is a timely exercise and excel does not like this, result is not good.
So I would like to take the average value of 60 measurements to reduce the rows to below 10.000. So =AVERAGE(H1:H60)
But I cannot get this to work.
Questions:
How do I reduce this data set and get a good graph.
Should I switch
to other software that is more applicable?
FYI: I already changed the software of the testing device to take the average value of a bunch of measurements the next time... But I cannot repeat this test.
Download link of data set comma separated file 17MB
I think you are on the right track, however my guess is that you only want to get an average every 60 rows and are unsure how to do this.
Using MOD(Number, Divisor) inside an if statement will let you specify that the average should be calculated only once in every x number of cells.
Assuming you'll have one row above your data table for headers, you are looking for something along the lines of:
=IF(MOD(ROW(A61),60) = 1,AVERAGE(H2:H61),"")
Once you have this you can filter your average column to non-blank values and use this to create your graph.
I have a data matrix depicting the number of telephone calls from one telephone to another, all calls are unidirectional. The rows represent days and the columns represent hours. The data is not a sample - it is the full population. Rows are days of the week and columns are one hour blocks of a 24 hour clock. Values in the cells represent the number of telephone calls from telephone A to telephone B for that specific hour.
I would like to have a repeatable measure that enables me to tell my audience that the likelihood of this distribution occurring randomly is <x.
I'd like the formula for Excel 2007 or, as a last resort, VBA code.
I've searched and found answers that tell me how to statistically determine the significance of differences between two different data sets but not how to measure for just one data set against a random outcome.
Thanx in advance.
If the total number of calls in a given hour is T, and the total calling population is P; then the number of calls from A to B should be about T/P if "random". To test whether this is really the case you'd use the Chi-squared test. I'm afraid I don't have time to give you the full answer - but it'd be the testvalue=sum((observed_i/P - (T/P))^2/(T/P)) where you check the testvalue against the chi-squared table, and you can pick off the probability too. Excel can calculate these values. Refer Chi-Squared Test for more details.