Combining Statistical Sets

I am doing an analysis of employee performance and have two questions regarding sample size. (All calculations come from Survey Monkey's Sample Size Calculator.)
Say an employee performs 500 tasks per month and I want to review this employee’s performance.
If I want a confidence level of 95% and can accept a 10% margin of error, then I would need to review 81 tasks. However, if the population size were 1500, the sample size required for the same confidence level (95%) and margin of error (10%) would be 91.
If I'm willing to do quarterly rather than monthly reviews, the number of reviews drops precipitously.
QUESTION 1: Can I then assume that if I do my reviews quarterly rather than monthly, a sample size of 91 would suffice?
QUESTION 2: Can one aggregate confidence level and margin of error? For example, in month one I accept a lower confidence level and a wider margin of error than I would otherwise want, but over time, as the samples are aggregated, I get a higher and higher confidence level.
The conceptual model is that combining them over time is identical to doing a “quarterly”, “semi-annual”, and “annual” analysis.
Monthly
Population: 500
Confidence Level: 80%
Margin of Error: 10%
Sample Size: 38
Over three months the aggregate would be a population of 1500 and a sample size of 114. This would give me a confidence level of 90% and a margin of error of 7.4%. See below:
Quarterly
Population: 1500
Confidence Level: 90%
Margin of Error: 7.4%
Sample Size: 114
Over six months the aggregate would be a population of 3000 and a sample size of 228.
Semi-Annually
Population: 3000
Confidence Level: 90%
Margin of Error: 5.23%
Sample Size: 228
Thus, if my assumptions are correct, as I aggregate these reviews I increase my confidence level and decrease my margin of error.
The purpose of this would be to allow me to do more thorough weekly and monthly analyses of new employees by reducing the review load on long-term employees.
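For reference, here is a short Python sketch of the sample-size arithmetic behind these figures. It assumes the common formula n0 = z^2 * p * (1 - p) / e^2 with p = 0.5, reduced by a finite population correction n = n0 / (1 + n0 / N); that assumption appears to reproduce the numbers quoted above, although the calculator's exact rounding may differ slightly.

import math
from scipy.stats import norm

def sample_size(population, confidence, margin_of_error, p=0.5):
    # Infinite-population sample size, worst-case p = 0.5.
    z = norm.ppf(1 - (1 - confidence) / 2)
    n0 = (z ** 2) * p * (1 - p) / margin_of_error ** 2
    # Finite population correction.
    n = n0 / (1 + n0 / population)
    return math.ceil(n)

print(sample_size(500, 0.95, 0.10))    # 81 -> monthly population of 500
print(sample_size(1500, 0.95, 0.10))   # 91 -> quarterly population of 1500
print(sample_size(500, 0.80, 0.10))    # 38 -> monthly at 80% confidence

Note that a quoted pair such as "90% confidence with a 7.4% margin of error" is not unique: for a fixed sample size and population there is a whole trade-off curve of confidence/margin combinations, so reporting one pair is a presentation choice rather than a property of the data.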

Related

estimate rolling percentile of a population from windowed sample sets

Calculating a percentile (95th, 99th) in my data set is expensive due to the large number of time series and time ranges spanning weeks to months. The cost incurred is proportional to the number of samples fetched from the data store and the computational overhead of processing the calculations. I am attempting to optimize the solution by calculating the statistics for smaller time ranges in parts, in stream as data points are ingested, and then estimating the population metrics from those samples. This approach works accurately for the mean and the peak (max), but requires a good approximation for percentiles.
population_mean = mean(sample_mean_t0, sample_mean_t1, ... ,sample_mean_tn)
population_max = max(sample_max_t0, sample_max_t1 , ... , sample_max_tn)
To estimate p95, I am calculating the 95th percentile over the 95th percentiles of all samples. Is this a reasonable approximation of the population's 95th percentile? (We are not attempting to solve the problem when there is a high degree of skewness.) Is there a better approximation I can use for calculating percentiles?
population_p95 = p95(sample_p95_t0, sample_p95_t1, ... , sample_p95_tn)
Does taking the average over the sample p95s make more sense? Any reference for approximating this and estimating the error would be helpful.
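One way to get a feel for the two candidates is a small synthetic check. The sketch below assumes all windows are the same size and are drawn from the same mildly skewed distribution (the friendliest case for these approximations); with real data whose distribution shifts over time, both aggregates can drift further from the pooled value.

import numpy as np

rng = np.random.default_rng(0)

# 30 equally sized windows of mildly skewed synthetic data.
windows = [rng.lognormal(mean=3.0, sigma=0.5, size=10_000) for _ in range(30)]

pooled_p95 = np.percentile(np.concatenate(windows), 95)   # "ground truth" over all points
window_p95s = [np.percentile(w, 95) for w in windows]

p95_of_p95s = np.percentile(window_p95s, 95)   # tends to sit above the pooled p95
mean_of_p95s = np.mean(window_p95s)            # roughly unbiased when windows are comparable

print(f"pooled p95:          {pooled_p95:.2f}")
print(f"p95 of window p95s:  {p95_of_p95s:.2f}")
print(f"mean of window p95s: {mean_of_p95s:.2f}")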

Can t-test be calculated on large samples with non-normal distribution?

For example, the number of users in group A is 100K and the number of users in group B is 100K. I want to test whether the difference in average session duration between these two groups is statistically significant.
1st method) We calculated the average session duration of these users on the day after the AB test (DAY1) as
31.2 min for group A
30.2 min for group B.
We know that users in groups A and B have a non-normal distribution of DAY1 session values.
In such a case, would it be correct to use a two-sample t-test to compare the DAY1 avg session durations of the two groups? (We will accept n=100K.)
(Some sources say that calculating t-scores for large samples will give accurate results even with non-normal distribution.)
2nd method) Would it be correct to calculate the t-score over the daily average session durations for the days the AB test is open?
E.g., in the scenario below, the daily average session durations of the 100K users in groups A and B are calculated. We will accept the number of days as the number of observations, giving n=30.
We will then calculate the two-sample t-test over n=30.
Group   day0 avg duration   day1 avg duration   day2 avg duration   ...   day30 avg duration
A       30.2                31.2                32.4                ...   33.2
B       29.1                30.2                30.4                ...   30.1
Do these methods give correct results, or is it necessary to apply another method in such scenarios?
Would it make sense to calculate a t-test on large samples in an AB test?
The t-test assumes that the means of different samples taken from a population are normally distributed. It doesn't assume that the population itself is normally distributed.
For a population with finite variance, the central limit theorem implies that the means of samples from the population are approximately normally distributed. However, the sample size needed for the distribution of means to be approximately normal depends on the degree of non-normality of the population. The t-test is invalid for small samples from non-normal population distributions, but is valid for large samples from non-normal distributions.
Method 1 works for this reason (large sample size, ~100K), and you are correct that calculating t-scores for large samples will give accurate results even with a non-normal distribution. [You may also consider using a z-test for the sample sizes you're working with (100K); the distinction between the t-test and z-test only matters for smaller sample sizes, such as n < 30.]
Method 2 also works: each daily average is computed over roughly 100K users, so by the central limit theorem the daily averages themselves should be approximately normally distributed. Session-duration data tend to be skewed, but averaging over that many users generally handles this well.
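As a quick illustration of the first point, here is a small Python sketch. The data are synthetic lognormal "session durations" for two hypothetical groups of 100K users (not real data), and Welch's two-sample t-test is applied directly to the raw, clearly non-normal values:

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical, heavily skewed session durations for two groups of 100K users.
group_a = rng.lognormal(mean=3.40, sigma=0.6, size=100_000)
group_b = rng.lognormal(mean=3.37, sigma=0.6, size=100_000)

# Welch's two-sample t-test on the raw values; with n this large the
# sampling distribution of the mean difference is effectively normal.
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"mean A = {group_a.mean():.2f}, mean B = {group_b.mean():.2f}")
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")

For Method 2, the same stats.ttest_ind call would simply be applied to the two lists of daily averages (n = 30 per group) instead of the raw user-level values.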

One sided proportion test for significantly high values

I have a dataset where I know how many units of each product I have in starting inventory. I also know how many units of a given product were sold, and how many units of all other products were sold. The question I'm trying to answer is whether the total number of units sold of a particular product was significantly higher than I would expect based on the product's percentage of starting inventory. I've read the documentation on proportions_ztest. It talks about numbers of observations, so I want to check whether I'm using it correctly for units sold. With the code below I'm trying to get the p-value.
sold = total number of units sold of product1
tot_sld = total number of units sold including all products
perc_strt = (total number of units of product1 in starting inventory) / (total number of units from all products in starting inventory)
code:
import statsmodels.api as sm

# One-sample proportion z-test: is product1's share of units sold
# significantly larger than its share of starting inventory?
p_value = sm.stats.proportions_ztest(
    x['sold'],            # count: units of product1 sold
    x['tot_sld'],         # nobs: total units sold across all products
    x['perc_strt'],       # value: hypothesized proportion under H0
    alternative='larger',
)[1]                      # [1] selects the p-value from (z-stat, p-value)
Update Example:
product1 start inventory=20 units
product2 start inventory=30 units
product3 start inventory=50 units
product1 perc_strt=20%
number of units sold of product1=10 units
number of units sold of product2=10 units
number of units sold of product3=20 units
tot_sld=40 units
so
x['sold']=10
x['tot_sld']=40
x['perc_strt']=0.2
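For what it's worth, plugging the example numbers above straight into the same call gives a quick sanity check (literal values in place of the x[...] lookups; with these numbers the one-sided p-value comes out around 0.23, i.e. not significant):

import statsmodels.api as sm

# 10 of the 40 units sold were product1, versus an expected 20% share
# based on starting inventory.
z_stat, p_value = sm.stats.proportions_ztest(10, 40, 0.2, alternative='larger')
print(z_stat, p_value)   # roughly z = 0.73, p = 0.23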
Update:
the one population proportion test from this post seems to confirm my original approach
https://towardsdatascience.com/demystifying-hypothesis-testing-with-simple-python-examples-4997ad3c5294

Tests to Compare Sales Mix Percent between Periods

Background
I wish to compare menu sales mix ratios for two periods.
A menu is defined as a collection of products. (i.e., a hamburger, a club sandwich, etc.)
A sales mix ratio is defined as a product's sales volume in units (i.e., 20 hamburgers) relative to the total number of menu units sold (i.e., 100 menu items were sold). In the hamburger example, the sales mix ratio for hamburgers is 20% (20 burgers / 100 menu items). This represents the share of total menu unit sales.
A period is defined as a time range used for comparative purposes (i.e., lunch versus dinner, Mondays versus Fridays, etc.).
I am not interested in overall changes in the volume (I don't care whether I sold 20 hamburgers in one period and 25 in another). I am only interested in changes in the distribution of the ratios (20% of my units sold were hamburgers in one period and 25% were hamburgers in another period).
Because the sales mix represents a share of the whole, the mean for each period will be the same; the mean difference between the periods will always be 0%; and the sum total for each set of data will always be 100%.
Objective:
Test whether the sales distribution (sales mix percentage of each menu item relative to other menu items) changed significantly from one period to another.
Null Hypothesis: the purchase patterns and preferences of customers in period A are the same as those for customers in period B.
Example of potential data input:
[Menu Item]      [Period A]   [Period B]
Hamburger        25%          28%
Cheeseburger     25%          20%
Salad            20%          25%
Club Sandwich    30%          27%
Question:
Do common methods exist to test whether the distribution of share-of-total is significantly different between two sets of data?
A paired T-Test would have worked if I was measuring a change in the number of actual units sold, but not (I believe) for a change in share of total units.
I've been searching online and a few text books for a while with no luck. I may be looking for the wrong terminology.
Any direction, be it search terms or (preferably) the names of appropriate tests, would be appreciated.
Thanks,
Andrew
EDIT: I am considering a Pearson correlation test as a possible solution: forgetting that each row of data is an independent menu item, the math shouldn't care. A perfect match (identical sales mix) would receive a coefficient of 1, and the greater the change, the lower the coefficient would be. One potential issue is that, unlike a regular correlation test, the changes may be amplified because any change to one number automatically impacts the others. Is this a viable solution? If so, is there a way to temper the amplification issue?
Consider using a Chi-Squared Goodness-of-Fit test as a simple solution to this problem:
H0: the proportions of menu items for month B are the same as for month A
Ha: at least one of the proportions of menu items for month B is different from month A
There is a nice tutorial here.
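To make that concrete, here is a minimal Python sketch. The unit counts are hypothetical (100 menu items per period, so the counts match the percentages above); the chi-squared test needs raw counts, not percentages, so in practice the actual units sold in each period should be plugged in:

import numpy as np
from scipy.stats import chisquare

# Hypothetical counts: hamburger, cheeseburger, salad, club sandwich.
period_a_counts = np.array([25, 25, 20, 30])
period_b_counts = np.array([28, 20, 25, 27])

# Period A defines the expected distribution; period B supplies the observed counts.
period_a_props = period_a_counts / period_a_counts.sum()
expected_b = period_a_props * period_b_counts.sum()

chi2, p_value = chisquare(f_obs=period_b_counts, f_exp=expected_b)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}")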

Monte Carlo Simulation using Excel Solver

I am trying to figure out the optimal number of products to make per day: simulate the outcomes, display the values in a chart, and then use the chart to find the optimal number of products to make per day.
Cost of production: $4
Sold for: $12
Leftovers sold for $1
So the ideal profit for a product is $8, but it could be -$3 if it's left over at the end of the day.
The daily demand of sales has a mean of 150 and a standard deviation of 30.
I have been able to generate a list of random demand values using NORMINV(RAND(), mean, std_dev), but I don't know where to go from here to figure out the amount sold from the number of products made that day.
The number sold on a given day is min(# produced, daily demand).
ADDENDUM
The decision variable is a choice you make: "I will produce 150 each day", or "I will produce 145 each day". You told us in the problem statement that daily demand is a random outcome with a mean of 150 and a SD of 30. Let's say you go with producing 150, the mean of demand. Since it's the mean of a symmetric distribution, half the time you will sell everything you made and have no losses, but in most of those cases you actually could have sold more and made more money. You can't sell products you didn't make, so your profit is capped at selling 150 on those days. The other half of the time, you won't sell all 150 and will take a loss on the unsold items, reducing your profit a bit. The actual profit on any given day is a random variable, because it is determined by random demand.
Since profit is random, you can calculate your average earnings across many days based on the assumption that you produce 150. You can also average earnings based on the assumption that you produce 140 per day, or 160 per day, or any other number. It sounds like you've been asked to plot those average earnings versus how many you decided to produce, and choose a production level that results in the highest long-term average earnings.
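If it helps to see the whole loop in one place, here is a minimal Python sketch of that simulation (the candidate production levels, the number of simulated days, and the random seed are arbitrary choices, not part of the original problem):

import numpy as np

rng = np.random.default_rng(1)

COST, PRICE, SALVAGE = 4, 12, 1          # per-unit cost, sale price, leftover value
MEAN_DEMAND, SD_DEMAND = 150, 30
N_DAYS = 100_000                         # simulated days per candidate production level

# Simulated daily demand, the Python analogue of NORMINV(RAND(), 150, 30).
demand = np.maximum(np.round(rng.normal(MEAN_DEMAND, SD_DEMAND, N_DAYS)), 0)

avg_profit = {}
for produced in range(100, 201, 5):      # candidate production levels
    sold = np.minimum(produced, demand)  # number sold = min(# produced, daily demand)
    leftover = produced - sold
    profit = PRICE * sold + SALVAGE * leftover - COST * produced
    avg_profit[produced] = profit.mean() # long-run average daily profit

best = max(avg_profit, key=avg_profit.get)
print(f"best production level tried: {best}, average daily profit: {avg_profit[best]:.2f}")

Plotting avg_profit (production level on the x-axis, average daily profit on the y-axis) gives exactly the chart described above.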
