I've had a bit of trouble explaining this so please bear with me. I'm also very new to using excel so if there's a simple fix, I apologize in advance!
I have two columns, one listing number of days starting from 0 and increasing consecutively. The other column has the number of orders delivered. The two correspond to each other. For example, I've typed out how it would look below. It would mean that there were 100 orders delivered in 1 day, 150 orders delivered in 2 days, 800 orders delivered in 3 days, etc.
Is there a way to get summary statistics (mean, median, mode, upper and lower quartiles) for the number of days it took for the average order to get delivered? The only way I can think of solving this is to manually punch in "1" 100 times, "2" 150 times, etc. into a new column and take median, mean, and upper & lower quartile from that, but that seems extremely inefficient. Would I use a pivot table for this? Thank you in advance!
I tried using the data analysis add-on and doing summary statistics that way, but it didn't work. It just gave me the mean, median, mode, and quartiles of each individual column. It would have given me 3 for median number of days for delivery and 300 for median number of orders.
Method 1
The mean is just
=SUMPRODUCT(A2:A6,B2:B6)/SUM(B2:B6)
Mode is the value with highest frequency
=INDEX(A2:A6,MATCH(MAX(B2:B6),B2:B6,0))
The quartiles and median (or any other quantile by varying the value of p) from first principles following this reference
=LET(p,0.25,
values,A2:A6,
freq,B2:B6,
N,SUM(freq),
h,(N+1)*p,
floorh,FLOOR(h,1),
ceilh,CEILING(h,1),
frac,h-floorh,
cusum,SCAN(0,SEQUENCE(ROWS(values)),LAMBDA(a,c,IF(c=1,0,a+INDEX(freq,c-1)))),
xlower,XLOOKUP(floorh-1,cusum,values,,-1),
xupper,XLOOKUP(ceilh-1,cusum,values,,-1),
xlower+(xupper-xlower)*frac)
Method 2
If you don't like doing it this way, you can always expand the data like this:
=AVERAGE(XLOOKUP(SEQUENCE(SUM(B2:B6),1,0),SCAN(0,SEQUENCE(ROWS(A2:A6)),LAMBDA(a,c,IF(c=1,0,INDEX(B2:B6,c-1)+a))),A2:A6,,-1))
=MODE(XLOOKUP(SEQUENCE(SUM(B2:B6),1,0),SCAN(0,SEQUENCE(ROWS(A2:A6)),LAMBDA(a,c,IF(c=1,0,INDEX(B2:B6,c-1)+a))),A2:A6,,-1))
=QUARTILE.EXC(XLOOKUP(SEQUENCE(SUM(B2:B6),1,0),SCAN(0,SEQUENCE(ROWS(A2:A6)),LAMBDA(a,c,IF(c=1,0,INDEX(B2:B6,c-1)+a))),A2:A6,,-1),1)
=MEDIAN(XLOOKUP(SEQUENCE(SUM(B2:B6),1,0),SCAN(0,SEQUENCE(ROWS(A2:A6)),LAMBDA(a,c,IF(c=1,0,INDEX(B2:B6,c-1)+a))),A2:A6,,-1))
and
=QUARTILE.EXC(XLOOKUP(SEQUENCE(SUM(B2:B6),1,0),SCAN(0,SEQUENCE(ROWS(A2:A6)),LAMBDA(a,c,IF(c=1,0,INDEX(B2:B6,c-1)+a))),A2:A6,,-1),3)
Background
I wish to compare menu sales mix ratios for two periods.
A menu is defined as a collection of products. (i.e., a hamburger, a club sandwich, etc.)
A sales mix ratio is defined as a product's sales volume in units (i.e., 20 hamburgers) relative to the total number of menu units sold (i.e., 100 menu items were sold). In the hamburger example, the sales mix ratio for hamburgers is 20% (20 burgers / 100 menu items). This represents the share of total menu unit sales.
A period is defined as a time range used for comparative purposes (i.e., lunch versus dinner, Mondays versus Fridays, etc.).
I am not interested in overall changes in the volume (I don't care whether I sold 20 hamburgers in one period and 25 in another). I am only interested in changes in the distribution of the ratios (20% of my units sold were hamburgers in one period and 25% were hamburgers in another period).
Because the sales mix represents a share of the whole, the mean average for each period will be the same; the mean difference between the periods will always be 0%; and, the sum total for each set of data will always be 100%.
Objective:
Test whether the sales distribution (sales mix percentage of each menu item relative to other menu items) changed significantly from one period to another.
Null Hypothesis: the purchase patterns and preferences of customers in period A are the same as those for customers in period B.
Example of potential data input:
[Menu Item] [Period A] [Period B]
Hamburger 25% 28%
Cheeseburger 25% 20%
Salad 20% 25%
Club Sandwich 30% 27%
Question:
Do common methods exist to test whether the distribution of share-of-total is significantly different between two sets of data?
A paired T-Test would have worked if I was measuring a change in the number of actual units sold, but not (I believe) for a change in share of total units.
I've been searching online and a few text books for a while with no luck. I may be looking for the wrong terminology.
Any direction, be it search terms or (preferably) the actual names appropriate tests, are appreciated.
Thanks,
Andrew
EDIT: I am considering a Pearson Correlation test as a possible solution - forgetting that each row of data are independent menu items, the math shouldn't care. A perfect match (identical sales mix) would receive a coefficient of 1 and the greater the change the lower the coefficient would be. One potential issue is that unlike a regular correlation test, the changes may be amplified because any change to one number automatically impacts the others. Is this a viable solution? If so, is there a way to temper the amplification issue?
Consider using a Chi Squared Goodness-of-Fit test as a simple solution to this problem:
H0: the proportion of menu items for month B is the same as month A
Ha: at least one of the proportions of menu items for month B is
different to month A
There is a nice tutorial here.
I want to use the RAND() function in Excel to generate a random number between 0 and 1.
However, I would like 80% of the values to fall between 0 and 0.2, 90% of the values to fall between 0 and 0.3, 95% of the values to fall between 0 and 0.5, etc.
This reminds me that I took an applied statistics course once upon a time, but not of what was actually in the course...
How is the best way to go about achieving this result using an Excel formula. Alternatively, what is this kind of statistical calculation called / any other pointers that I can Google around for.
=================
Use case:
I have a single column of meter readings, which I would like to duplicate 7 times (each column for a new month). each column has 55 000 rows. While the meter readings need to vary for each month, when taken as a time series, each meter number should have 7 realistic readings.
The aim is to produce realistic data to turn into heat maps (i.e. flag outlying meter readings)
I don't think that there is a formula which would fit exactly to your requirements. I would use a very straightforward solution:
Generate 80% of data using =RANDBETWEEN(0,20)/100
Generate 10% of data using =RANDBETWEEN(20,30)/100
Generate 5% of data using =RANDBETWEEN(30,50)/100
and so on
You can easily change the precision of generated data by modifying the parameters, for example: =RANDBETWEEN(0,2000)/10000 will generate data with up to 4 digits after decimal point.
UPDATE
Use a normal distribution for the use case, for example:
=NORMINV(RAND(), 20, 5)
where 20 is a mean value and 5 is a standard deviation.
Ive created a game and in that game played 5 users which collected few points, Ive gived gifts manually but for next games how can i split or make in excel to calculate number of gifts,
this is ok using number format with 0 decimal places, 6+1+1+1 = 9
but in cases like this:
1+6+1+1+1 = 10, how can I make that only 9 gifts results?
You should be comparing their percent (B2/SUM(B2:B6)) against each prize as it relates to the total prize (e.g. 1/9). Since you are comparing decimal numbers with another decimal number and expecting an integer (no. of prizes), you will be rounding either up or down depending on whether you are favoring a wider distribution of the prizes or favoring the top score.
Either way you are going to have to decide whether the lowest score should always receive a prize or if the highest score should benefit from the points awarded.
The three possible formulas to start with would be,
=MROUND(C2, 1/9)*9 ◄ closest to even distribution
=FLOOR(C2, 1/9)*9 ◄ favours wider prize distribution
=CEILING(C2, 1/9)*9 ◄ rewards highest awarded points
Fill down as necessary.
Now you have to either take the highest or lowest score and adjust that to compensate for rounding the division of decimal numbers to an integer. MROUND doesn't play well with SUMPRODUCT but these two may give you a solution that you can live with.
=FLOOR($C2, 1/9)*9-((SUMPRODUCT(FLOOR($C$2:$C$6, 1/9)*9)-9)*($C2=MAX($C$2:$C$6)))
=CEILING($C2, 1/9)*9-((SUMPRODUCT(CEILING($C$2:$C$6, 1/9)*9)-9)*($C2=MAX($C$2:$C$6)))
Fill down as necessary.
If the MROUND solution is best suited to your prize distribution model, use a helper column that can determine the MROUND returns and then adjust the high score according to the sum of the helper column without circular references.
I have a data matrix depicting the number of telephone calls from one telephone to another, all calls are unidirectional. The rows represent days and the columns represent hours. The data is not a sample - it is the full population. Rows are days of the week and columns are one hour blocks of a 24 hour clock. Values in the cells represent the number of telephone calls from telephone A to telephone B for that specific hour.
I would like to have a repeatable measure that enables me to tell my audience that the likelihood of this distribution occurring randomly is <x.
I'd like the formula for Excel 2007 or, as a last resort, VBA code.
I've searched and found answers that tell me how to statistically determine the significance of differences between two different data sets but not how to measure for just one data set against a random outcome.
Thanx in advance.
If the total number of calls in a given hour is T, and the total calling population is P; then the number of calls from A to B should be about T/P if "random". To test whether this is really the case you'd use the Chi-squared test. I'm afraid I don't have time to give you the full answer - but it'd be the testvalue=sum((observed_i/P - (T/P))^2/(T/P)) where you check the testvalue against the chi-squared table, and you can pick off the probability too. Excel can calculate these values. Refer Chi-Squared Test for more details.