EDIT: To clarify, I'm specifically attempting to project the number of points that will be scored by a particular player in the NBA on any given night. So the groups below display the points scored by Player A, Player B, and Player C. Then, when all three players are on my team, their combined average is 73.25. What I'm trying to calculate is the standard deviation of the combined points that all three players have scored. In other words, if I put all three players on my team and they would combine to score an average of 73.25 points, what would the STDEV on that average be, based on the number sets that I have?
I'm trying to figure out what the Standard Deviation of a group of multiple subsets would be once they are combined.
I have 3 sets of numbers, as follows:
Group A:
9
18
27
26
Avg: 20
STDEV: 7.25
Group B:
15
27
32
18
Avg: 23
STDEV: 6.82
Group C:
19
48
34
20
Avg: 30.25
STDEV: 11.84
So I have the Standard Deviation for each individual group, but if I were to combine the three groups, for example, the average would be 73.25 (the sum of all of the averages). How would I calculate the Standard Deviation for the total I would get for all of these sets combined?
If I take the STDEV of every number I get 9.91, but the average of 73.25 and STDEV of 9.91 doesn't seem right to me. Is there another way to do it?
If I am understanding this correctly, groups A, B and C are actually players A, B and C. Given that, I would think it would be meaningful to say something like: player B averages 23 points per game with a standard deviation of 6.8, so we could project that he scores in the 16 to 30 points per game range.
I am further assuming that the four numbers given are for four games, and that the 73.25 average is the average number of combined points scored by the three players across those four games. So my thinking is that to get the standard deviation of the three players combined, you would take the standard deviation of the game totals (43, 93, 93 and 64). This gives a standard deviation of about 21 points, meaning that you could project the combined points of the three players to be anywhere from 52 to 94 points in a game. (Notice that the first game turns out to be an outlier game for the three players combined and for players A and B individually.)
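The calculation described above can be sketched in Python. This is only a sketch under two assumptions that the question does not state outright: that the four numbers for each player line up game by game, and that the quoted STDEV figures are population standard deviations (which is what reproduces 7.25, 6.82 and 11.84).

```python
from statistics import mean, pstdev

# Points per game, copied from the question (one entry per game).
player_a = [9, 18, 27, 26]
player_b = [15, 27, 32, 18]
player_c = [19, 48, 34, 20]

# Combined points of the three players in each game
# (assumes the lists are aligned game by game).
game_totals = [a + b + c for a, b, c in zip(player_a, player_b, player_c)]

# pstdev is the population standard deviation (Excel's STDEVP / STDEV.P).
print(game_totals)                    # [43, 93, 93, 64]
print(mean(game_totals))              # 73.25
print(round(pstdev(game_totals), 2))  # 21.1

# Pooling all twelve scores instead describes a single player-game,
# not a three-player total, which is where the 9.91 figure comes from.
print(round(pstdev(player_a + player_b + player_c), 2))  # 9.91
```

The 9.91 figure measures the spread of one score around about 24 points, so it does not belong next to an average of 73.25; the spread of the per-game totals does.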
Related
Can a t-test be calculated on large samples with a non-normal distribution?
For example, the number of users in group A is 100K, and the number of users in group B is 100K. I want to test whether the difference in average session duration between these two groups is statistically significant.
1st method) We calculated the average session duration of these users on the day after the AB test (DAY1) as
31.2 min for group A
30.2 min for group B.
We know that users in groups A and B have a non-normal distribution of DAY1 session values.
In such a case, would it be correct to use a two-sample t-test to compare the DAY1 average session durations of the two groups? (We will accept n=100K)
(Some sources say that calculating t-scores for large samples will give accurate results even with non-normal distribution.)
2nd method) Would it be a correct method to calculate the t-score over the daily average session durations for the days the A/B test is open?
E.g., in the scenario below, the average daily session duration of the 100K users in groups A and B is calculated. We will accept the number of days here as the number of observations and get n=30.
We will then run the two-sample t-test over n=30.
Group  day0 avg duration  day1 avg duration  day2 avg duration  ...  day30 avg duration
A      30.2               31.2               32.4               ...  33.2
B      29.1               30.2               30.4               ...  30.1
Do these methods give correct results or is it necessary to apply another method in such scenarios?
Would it make sense to run a t-test on large samples in an A/B test?
The t-test assumes that the means of different samples taken from a population are normally distributed. It doesn't assume that the population itself is normally distributed.
For a population with finite variance, the central limit theorem says that the means of samples from the population are approximately normally distributed once the samples are large enough. However, the sample size needed for the distribution of means to be approximately normal depends on how far the population is from normal. The t-test is not reliable for small samples from non-normal population distributions, but is valid for large samples from non-normal distributions.
Method 1 works for this reason (large sample size, ~100K), and you are correct that calculating t-scores for large samples will give accurate results even with a non-normal distribution. [You may also consider using a z-test for the sample sizes you're working with (100K). T-tests are more appropriate for smaller sample sizes, such as n < 30]
Method 2 works because the daily averages should be normally distributed given enough samples per the central limit theorem. Time-spent datasets may be skewed but generally work well.
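A minimal sketch of Method 1 using only the Python standard library. The data here is synthetic (exponentially distributed, so clearly skewed) and is not the poster's data; `welch_t` and `two_sided_p_normal` are illustrative helper names, and the p-value uses the normal approximation, in line with the z-test remark above.

```python
import math
import random
from statistics import NormalDist, mean, variance

def welch_t(a, b):
    """Return (t, df) for a two-sample t-test with unequal variances."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

def two_sided_p_normal(t):
    """Two-sided p-value from the normal approximation (large df only)."""
    return 2 * (1 - NormalDist().cdf(abs(t)))

# Synthetic, skewed session durations in minutes (assumed, for illustration).
rng = random.Random(0)
group_a = [rng.expovariate(1 / 31.2) for _ in range(10_000)]
group_b = [rng.expovariate(1 / 30.2) for _ in range(10_000)]

t_stat, dof = welch_t(group_a, group_b)
print(f"t = {t_stat:.3f}, df = {dof:.0f}, p = {two_sided_p_normal(t_stat):.4f}")
```

With degrees of freedom in the thousands, the t distribution and the normal distribution are practically indistinguishable, which is why the normal approximation is used for the p-value here.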
I'm having difficulty figuring out how to distribute 2 percentage fields across 3.
Let's say we have 2 teams. I think that if they both have 50/50 chances, the draw should get 50 percent, with 25 for the first team and 25 for the second team. So I'm stuck on finding the formula to calculate the draw chances.
Another example: if I believe that the first team's win chance is 70% and the second team's is 30%, what would the % for the draw be?
I have tried dividing each team's % by 3 and multiplying by 2, then deducting the sum of these results from 100%, but obviously I get 33% for the draw in every case. For example, I get 33% for the draw when the chances of both teams are 50/50, instead of 50 for the draw, 25 for the first team and 25 for the second team.
Thank you!
I am working on a tool for Fantasy Football that calculates the average value a player offers per million pounds of cost. It essentially boils down to their average points per game divided by their cost.
So for example, a player who costs £10m and scores an average of 5 points per game offers 0.5 points per game per million. Whereas a player who costs £8m and scores an average of 5 points per game offers 0.625 points per game per million. Clearly the player who costs £8m is better value.
My problem is, players are capable of scoring negatively, and so how do I account for that in calculating the value of a player?
To give another example, a player who costs £10m and scores an average of -2 points per game offers -0.2 points per game per million. Whereas a player who costs £8m and scores an average of -2 points per game offers -0.25 points per game per million.
Now the player who costs £10m appears to be better value because their PPG/£m is higher. This shouldn't be true, they can't be better value if they cost more but score the same points. So if I have a list of players sorted by their value, calculated in this manner, some players will incorrectly show higher than players that are technically better value.
Is there a way to account for this problem? Or is it just an unfortunate fact of the system I'm using?
One simple trick would be to slightly change your formula for PPG/£m to the ratio of the square of the player's average points to their cost.
If you are particular about the scale, take its positive square root.
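To make the comparison concrete, here is a small Python sketch of the original ratio next to the suggested squared ratio, using the four players from the question. The function names are made up for illustration.

```python
import math

def value_ratio(ppg, cost):
    """Original metric: average points per game per million of cost."""
    return ppg / cost

def value_squared(ppg, cost):
    """Suggested metric: square of the average points divided by cost."""
    return ppg ** 2 / cost

def value_squared_root(ppg, cost):
    """Positive square root of the squared metric, to restore the scale."""
    return math.sqrt(value_squared(ppg, cost))

# (average points per game, cost in millions) from the question.
for ppg, cost in [(5, 10), (5, 8), (-2, 10), (-2, 8)]:
    print(ppg, cost, value_ratio(ppg, cost), value_squared(ppg, cost))
```

Under the squared metric, the £8m player with a -2 average scores 0.5 and the £10m player scores 0.4, so the cheaper player ranks higher again. One thing to keep in mind is that squaring discards the sign, so a -2 average and a +2 average at the same cost get the same score.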
You need 100 lbs of bird feed. John's bag can carry 15 lbs and Mark's bag can carry 25 lbs. Both guys have to contribute exactly the same total amount each. What's the lowest number of trips each will have to take?
I have calculated this using systems of equations.
15x + 25y = 100
15x - 25y = 0
This equals out to:
John would have 3.33 trips and Mark would have 2 trips. Only one problem: you can't have 1/3 of a trip.
The correct answer is:
John would take 5 trips (75 lbs) and Mark would take 3 trips (75 lbs).
How do you calculate this? Is there an Excel formula which can do both layers of this?
Assuming you put the total bird feed required in A1 and John's and Mark's bag limits in B1 and B2 respectively, then this formula in C1:
=MATCH(TRUE,INDEX(2*ROW(INDIRECT("1:100"))*LCM($B$1:$B$2)>=$A$1,,),0)*LCM($B$1:$B$2)/B1
will give the lowest number of trips required of John. Copying this formula down to C2 will give the equivalent result for Mark.
Note that the 100 in the part:
ROW(INDIRECT("1:100"))
was arbitrarily chosen and will give correct results providing neither John nor Mark is required to make more than twice that number of trips, i.e. 200. Obviously you can amend this value if you feel it necessary (up to a theoretical limit of 2^20).
Regards
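For readers who want to check the logic outside Excel, the same search can be sketched in Python. This is an illustrative equivalent rather than a translation of the formula's internals; `trips_by_search` and `max_multiples` are hypothetical names, with `max_multiples` playing the role of the arbitrary 100.

```python
from math import gcd

def trips_by_search(total, cap_john, cap_mark, max_multiples=100):
    """Try 1, 2, 3, ... multiples of the LCM per person until the two
    together carry at least `total`, then convert that load to trips."""
    lcm = cap_john * cap_mark // gcd(cap_john, cap_mark)
    for k in range(1, max_multiples + 1):
        if 2 * k * lcm >= total:
            return k * lcm // cap_john, k * lcm // cap_mark
    raise ValueError("no solution found; increase max_multiples")

print(trips_by_search(100, 15, 25))  # (5, 3)
```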
Since John and Mark need to carry the same total amount of bird feed, the amount each carries has to be a multiple of the least common multiple (LCM) of their bag capacities.
Since they both carry that amount, the total will always be an even multiple of the LCM.
So find the least even multiple of the LCM that is at least 100, and calculate the number of trips John and Mark will have to take from that.
For John:
CEILING(100/(2*LCM(15; 25));1)*LCM(15;25)/15
For Mark:
CEILING(100/(2*LCM(15; 25));1)*LCM(15;25)/25
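The two CEILING formulas above amount to the following closed-form calculation, sketched here in Python. The function name is illustrative, and integer capacities are assumed.

```python
from math import ceil, gcd

def trips_closed_form(total, cap_john, cap_mark):
    """Each person carries the smallest multiple of the LCM such that
    the two loads together reach at least `total`."""
    lcm = cap_john * cap_mark // gcd(cap_john, cap_mark)
    per_person = ceil(total / (2 * lcm)) * lcm  # load carried by each person
    return per_person // cap_john, per_person // cap_mark

print(trips_closed_form(100, 15, 25))  # (5, 3)
```

With capacities of 15 and 25 the LCM is 75, so each person carries 75 lbs: 5 trips for John and 3 for Mark.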
I have a set of data with over 15,000 records in Excel that comes from a measurement tool that finds trends over large areas. I'm not interested in trends within the data as a whole, but rather in the records closest to each other, to get a sense of how noisy the data is (how much each record varies from its neighboring records). It's almost as if I want the average standard deviation of the 15,000 or so records, looking at only 20 records at a time. The hope is that the data values trend gradually, rather than changing suddenly from record to record and therefore looking noisy. If I add a chart and use the "Moving Average" trendline, it visually shows roughly how noisy the data looks across the 15,000+ records. However, I was hoping to get a numeric value to rate how noisy the data is compared with other datasets. Any ideas on what I could do here with Excel's built-in formulas or with an add-in? Let me know if I need to explain this any better.
Could you calculate your moving average for your 20 sample window, then use the difference between each point and the expected value to calculate a variance?
Hard to do tables here, but here is a sample of what I mean
Actual  Measured  Expected  Variance
5       5.44      4.49      0.91
6       4.34      5.84      2.26
7       8.45      7.07      1.90
8       6.18      7.84      2.75
9       8.89      9.10      0.04
10      11.98     10.01     3.89
The "measured" values were determined as
measured = actual + (rand() - 0.5) * 4
The "expected" values were calculated from a moving average (the table was pulled from the middle of the data set).
The "variance" column is simply the square of (expected minus measured).
Then you could calculate an average variance as a summary statistic.
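A rough Python sketch of that summary statistic follows. It assumes a centered window of `2 * half_width + 1` points (the answer does not say whether the window is centered or trailing) and skips points too close to either end; `noise_score` is a made-up name, and the demo data is synthetic, built the same way as the "measured" column above.

```python
import random
from statistics import mean

def noise_score(values, half_width=10):
    """Mean squared deviation of each point from its centered moving
    average. Points within `half_width` of either end are skipped, so
    `values` must contain more than 2 * half_width items."""
    squared = []
    for i in range(half_width, len(values) - half_width):
        expected = mean(values[i - half_width : i + half_width + 1])
        squared.append((values[i] - expected) ** 2)
    return mean(squared)

# Synthetic demo: a smooth ramp, and the same ramp with added noise.
rng = random.Random(0)
actual = [i * 0.01 for i in range(2000)]
measured = [a + (rng.random() - 0.5) * 4 for a in actual]
print(noise_score(actual), noise_score(measured))
```

A smooth series scores close to zero, while a noisier series scores higher, so the number can be compared across datasets that use the same window.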
A moving average is the correct approach, but you need one critical element: order. Do you have a date/time variable or a sequence number?
Use the OFFSET function to set up your window. If you want about 20 records, your formula will look something like AVERAGE(OFFSET(C15,-10,0,21)), which averages the 21 rows centered on C15. This is your moving average.
Compare that to C15, either as a difference (additive) or as a ratio (multiplicative), and you'll have your distance. All we need now is your tolerance.
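The same idea can be sketched in Python for a list of values already sorted in sequence order. The helper names and the `tolerance` parameter are hypothetical; the answer leaves the tolerance for the poster to choose.

```python
from statistics import mean

def centered_average(values, i, half_width=10):
    """Average of the 2 * half_width + 1 values centered on index i,
    similar to AVERAGE(OFFSET(C15,-10,0,21)) for the row at index i."""
    if i < half_width or i + half_width >= len(values):
        raise IndexError("window would run past the ends of the data")
    return mean(values[i - half_width : i + half_width + 1])

def additive_distance(values, i, half_width=10):
    """Value minus its moving average."""
    return values[i] - centered_average(values, i, half_width)

def multiplicative_distance(values, i, half_width=10):
    """Value divided by its moving average (average must be nonzero)."""
    return values[i] / centered_average(values, i, half_width)

def exceeds_tolerance(values, i, tolerance, half_width=10):
    """True if the point is further from its moving average than allowed."""
    return abs(additive_distance(values, i, half_width)) > tolerance

# Tiny demo: a flat series with one spike in the middle.
data = [10] * 21
data[10] = 31
print(additive_distance(data, 10), exceeds_tolerance(data, 10, tolerance=5))
```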