How to calculate confidence interval and sample size

My company has 1000 locations. We will be conducting a survey (let's say to ask "yes" or "no" about something) using a sample of about 250 locations. Based on the results, we hope to estimate the proportion of all locations that is "yes". After surveying 250, say for example the proportion of "yes" is 70% and "no" is 30%. I would like to construct a 95% confidence interval to estimate the proportion of all locations that is "yes".
Question 1 - Do I still use the regular confidence interval calculation for a population proportion, i.e. p_hat +/- z*SQRT(p_hat*(1-p_hat)/n), or is there another formula since my population "N" in this case is 1000?
Question 2 - Is there a statistical calculation/guidance to determine the correct number of locations to survey in the first place?
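One possible way to approach both questions, sketched in Python rather than as a definitive answer: when the sample is a large fraction of a finite population (250 of 1000 here), the usual standard error is commonly multiplied by the finite population correction sqrt((N-n)/(N-1)). The function names below are hypothetical, and the 5% target margin and the worst-case p = 0.5 used in the sample-size call are assumptions, not values from the question.

```python
import math

def proportion_ci_fpc(p_hat, n, N, z=1.96):
    """Normal-approximation CI for a proportion with finite population correction."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)   # usual standard error
    fpc = math.sqrt((N - n) / (N - 1))        # shrinks the SE when n is a large share of N
    margin = z * se * fpc
    return p_hat - margin, p_hat + margin

def sample_size_fpc(margin, N, p=0.5, z=1.96):
    """Sample size for a target margin of error, adjusted for population size N."""
    n0 = z ** 2 * p * (1 - p) / margin ** 2   # size needed for an infinite population
    n = n0 / (1 + (n0 - 1) / N)               # finite population adjustment
    return math.ceil(n)

# Values from the question: 70% "yes" in a sample of 250 out of 1000 locations.
low, high = proportion_ci_fpc(0.70, 250, 1000)
print(f"95% CI: {low:.4f} to {high:.4f}")

# Assumed target: +/-5% margin, worst-case p = 0.5.
needed = sample_size_fpc(0.05, 1000)
print("Locations to survey:", needed)
```

With the correction, the interval is somewhat narrower than the plain formula would give, because surveying a quarter of all locations leaves less uncertainty about the rest.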

Related

Weighted Average Entry/Exit Cost

Hello, I have a question regarding some data that I have extracted into Excel, in particular the Market Depth of Aluminium. I am trying to calculate what the average entry/exit cost would be for a set number of contracts.
I would be very appreciative of any help one could provide.
Example:
I have set the amount to 20. How could I calculate the weighted average entry price? (Assuming we lift the offer ("Buy") and hit the bid ("Sell").)
I have pasted the format of my data below. I have been using this to limit the Vol ("Volume") to the amount I entered, where "R4=20 ATM".
=MIN(SUM($G$3:$G$12),$R$4)
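The data itself is not shown above, so the following is only a sketch of the general idea in Python: walk the price levels from best to worst, take as much volume as is available at each level until the requested amount is filled, and divide the total cost by the filled quantity. The offer prices and volumes are made up for illustration, and `weighted_average_price` is a hypothetical helper name.

```python
def weighted_average_price(levels, quantity):
    """Walk (price, volume) levels, best price first, and fill up to `quantity`.

    Returns (average fill price, quantity actually filled).
    """
    remaining = quantity
    cost = 0.0
    for price, volume in levels:
        take = min(volume, remaining)   # cannot take more than the level offers
        cost += take * price
        remaining -= take
        if remaining == 0:
            break
    filled = quantity - remaining
    if filled == 0:
        raise ValueError("no volume available")
    return cost / filled, filled

# Hypothetical offer side of the book: (price, volume), best offer first.
offers = [(100.0, 5), (101.0, 10), (102.0, 20)]
avg_price, filled = weighted_average_price(offers, 20)
print(avg_price, filled)
```

The same function works for the exit side by passing the bid levels, best bid first.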

In Excel, how do I find the margin of error and confidence intervals for surveys with different sample sizes and population sizes?

I'm calculating the NPS (Net Promoter Scores) for about 50 different sessions at a recent event. Each session was attended by about 50-500 people, and the number of survey responses for each session ranges between 15-400.
If I know:
The number of respondents for each session (sample size)
The number of attendees for each session (population size)
The NPS score for each session (average rating, basically—more info below)
How can I figure out the margin of error and/or confidence intervals for each session in Excel?
What formula would I use where, for example, X = sample size, Y = population size, and Z = avg rating?
I don't need this to be incredibly correct as long as I'm in the ballpark—so you can ignore the NPS part which might throw things off slightly:
This is slightly complicated by the fact that NPS is a weird metric.
It asks "How likely would you be to recommend X to a friend or
colleague?" with a scale from 0-10 (10 = extremely likely, 0 = not at
all likely). You then count every 10 and 9 as a "promoter," count
every 8 and 7 as a "neutral" or "passive," and count everything
between 6 and 0 as a "detractor."
You then get the NPS by subtracting the detractors from the promoters, dividing that number by the total responses, then multiplying it by 100, so: ((Promoters - Detractors)/(Total responses))*100. NPS sort of flattens every response to a +1, 0, or -1, so it might complicate the calculations.
Assume I've already calculated the NPS for each session. I'm trying to figure out the margin of errors and/or confidence intervals for each session using Excel.
So, for example, my data would look like this:
Again, you can ignore the NPS stuff if it makes it easier and just assume it's an average rating where people were asked to rate each session from -100 to +100. What function(s) would I use in Excel to find the margin of error and/or confidence intervals for each session, given the sample size and target population size, and the average rating?
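A rough sketch of one way to get a ballpark margin of error, written in Python rather than Excel: treat each response as +1 (promoter), 0 (passive) or -1 (detractor), estimate the variance of that score, and apply a finite population correction because the respondents are a large share of the attendees. This assumes the promoter and detractor counts are available for each session (the NPS alone does not determine the variance), and the example numbers are invented.

```python
import math

def nps_margin_of_error(promoters, detractors, n, N, z=1.96):
    """Return (NPS, margin of error), both on the -100..+100 scale.

    n = number of respondents, N = number of attendees (population size).
    """
    p_pro = promoters / n
    p_det = detractors / n
    nps = p_pro - p_det
    # Variance of a score that is +1, 0 or -1: E[x^2] - (E[x])^2
    variance = p_pro + p_det - nps ** 2
    se = math.sqrt(variance / n)
    fpc = math.sqrt((N - n) / (N - 1)) if N > 1 else 0.0  # finite population correction
    return 100 * nps, 100 * z * se * fpc

# Hypothetical session: 300 attendees, 100 responses, 60 promoters, 20 detractors.
nps, moe = nps_margin_of_error(60, 20, 100, 300)
print(f"NPS = {nps:.0f} +/- {moe:.1f}")
```

The approximate 95% confidence interval for the session is then NPS minus the margin to NPS plus the margin. With very small samples (for example 15 responses) the normal approximation is rough, so treat those intervals as indicative only.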

Calculate Average from Groups

I'm trying to take a table of web data (average % of page viewed) and create an average.
This is what my table looks like:
0-25% 954,353
26-50% 58,569
76-100% 73,653
51-75% 31,011
I'm looking to calculate in a cell that the average across all is XX %.

I guess this is what you are looking for:
Due to a lack of more information, we do not know what the actual distribution of the items in the range from 0 - 25% is. Hence, I am assuming that they all average out at 12.5% (the midpoint of the range). If you continue this line of thought, the overall average is nothing but a weighted average of the midpoints or (looking at the formula) a SumProduct divided by the Sum of all items.
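The same calculation can be sketched in Python using the counts from the question. The midpoints for the other three ranges (38, 63 and 88) are my assumption, taken as the middle of each range as labelled.

```python
# (midpoint of range in %, number of page views) for each group in the question
groups = [
    (12.5, 954_353),  # 0-25%
    (38.0, 58_569),   # 26-50%
    (88.0, 73_653),   # 76-100%
    (63.0, 31_011),   # 51-75%
]

# Equivalent of SUMPRODUCT(midpoints, counts) / SUM(counts)
weighted_sum = sum(mid * count for mid, count in groups)
total = sum(count for _, count in groups)
average = weighted_sum / total
print(f"Estimated average % of page viewed: {average:.2f}")
```

Because most views fall in the 0-25% group, the estimate depends heavily on the 12.5% assumption for that group.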

Tests to Compare Sales Mix Percent between Periods

Background
I wish to compare menu sales mix ratios for two periods.
A menu is defined as a collection of products. (i.e., a hamburger, a club sandwich, etc.)
A sales mix ratio is defined as a product's sales volume in units (i.e., 20 hamburgers) relative to the total number of menu units sold (i.e., 100 menu items were sold). In the hamburger example, the sales mix ratio for hamburgers is 20% (20 burgers / 100 menu items). This represents the share of total menu unit sales.
A period is defined as a time range used for comparative purposes (i.e., lunch versus dinner, Mondays versus Fridays, etc.).
I am not interested in overall changes in the volume (I don't care whether I sold 20 hamburgers in one period and 25 in another). I am only interested in changes in the distribution of the ratios (20% of my units sold were hamburgers in one period and 25% were hamburgers in another period).
Because the sales mix represents a share of the whole, the mean average for each period will be the same; the mean difference between the periods will always be 0%; and, the sum total for each set of data will always be 100%.
Objective:
Test whether the sales distribution (sales mix percentage of each menu item relative to other menu items) changed significantly from one period to another.
Null Hypothesis: the purchase patterns and preferences of customers in period A are the same as those for customers in period B.
Example of potential data input:
[Menu Item] [Period A] [Period B]
Hamburger 25% 28%
Cheeseburger 25% 20%
Salad 20% 25%
Club Sandwich 30% 27%
Question:
Do common methods exist to test whether the distribution of share-of-total is significantly different between two sets of data?
A paired T-Test would have worked if I was measuring a change in the number of actual units sold, but not (I believe) for a change in share of total units.
I've been searching online and a few text books for a while with no luck. I may be looking for the wrong terminology.
Any direction, be it search terms or (preferably) the actual names of appropriate tests, is appreciated.
Thanks,
Andrew
EDIT: I am considering a Pearson Correlation test as a possible solution - forgetting that each row of data are independent menu items, the math shouldn't care. A perfect match (identical sales mix) would receive a coefficient of 1 and the greater the change the lower the coefficient would be. One potential issue is that unlike a regular correlation test, the changes may be amplified because any change to one number automatically impacts the others. Is this a viable solution? If so, is there a way to temper the amplification issue?

Consider using a Chi-Squared Goodness-of-Fit test as a simple solution to this problem:
H0: the proportions of menu items for period B are the same as for period A
Ha: at least one of the proportions of menu items for period B is different from period A
There is a nice tutorial here.
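As a minimal sketch of how the test works, here it is computed by hand in Python. The test needs unit counts rather than percentages, so I am assuming 200 units were sold in period B, which is not a figure from the question. The period A shares are treated as the fixed expected proportions, and 7.815 is the chi-squared critical value for 3 degrees of freedom at the 5% level.

```python
# Period A sales mix from the question, used as the expected proportions.
expected_share = {"Hamburger": 0.25, "Cheeseburger": 0.25, "Salad": 0.20, "Club Sandwich": 0.30}

# Hypothetical period B unit counts (28%, 20%, 25%, 27% of an assumed 200 units).
observed = {"Hamburger": 56, "Cheeseburger": 40, "Salad": 50, "Club Sandwich": 54}

total_units = sum(observed.values())

# Chi-squared statistic: sum over items of (observed - expected)^2 / expected
chi_squared = 0.0
for item, count in observed.items():
    expected_count = expected_share[item] * total_units
    chi_squared += (count - expected_count) ** 2 / expected_count

degrees_of_freedom = len(observed) - 1
critical_value = 7.815  # chi-squared, df = 3, alpha = 0.05
reject_h0 = chi_squared > critical_value
print(f"chi-squared = {chi_squared:.2f}, df = {degrees_of_freedom}, reject H0: {reject_h0}")
```

With these assumed counts the statistic is below the critical value, so the change in mix would not be significant at the 5% level. The result depends strongly on the number of units sold, which is why percentages alone are not enough. If period A is itself a sample rather than a known baseline, a chi-squared test of homogeneity on the two columns of counts may be the better fit.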

Monte Carlo Simulation using Excel Solver

I am trying to figure out the optimal number of products I should make per day, by displaying the values in a chart and then using the chart to find the optimal number of products to make per day.
Cost of production: $4
Sold for: $12
Leftovers sold for $1
So the ideal profit for a product is $8, but it could be -$3 if it's left over at the end of the day.
The daily demand of sales has a mean of 150 and a standard deviation of 30.
I have been able to generate a list of random daily demand values using: NORMINV(RAND(),mean,std_dev)
but I don't know where to go from here to figure out the amount sold from the amount of products made that day.

The number sold on a given day is min(# produced, daily demand).
ADDENDUM
The decision variable is a choice you make: "I will produce 150 each day", or "I will produce 145 each day". You told us in the problem statement that daily demand is a random outcome with a mean of 150 and a SD of 30. Let's say you go with producing 150, the mean of demand. Since it's the mean of a symmetric distribution, half the time you will sell everything you made and have no losses, but in most of those cases you actually could have sold more and made more money. You can't sell products you didn't make, so your profit is capped at selling 150 on those days. The other half of the time, you won't sell all 150 and will take a loss on the unsold items, reducing your profit a bit. The actual profit on any given day is a random variable, because it is determined by random demand.
Since profit is random, you can calculate your average earnings across many days based on the assumption that you produce 150. You can also average earnings based on the assumption that you produce 140 per day, or 160 per day, or any other number. It sounds like you've been asked to plot those average earnings versus how many you decided to produce, and choose a production level that results in the highest long-term average earnings.
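The procedure described above can be sketched in Python as follows. The cost, price, salvage value, mean and standard deviation come from the question. The number of simulated days, the candidate production levels, the fixed seed and the clipping of negative demand to zero are my assumptions.

```python
import random

def average_profit(produced, n_days=20_000, mean=150, sd=30,
                   cost=4, price=12, salvage=1, seed=0):
    """Average daily profit over simulated days for a fixed production level."""
    rng = random.Random(seed)  # same seed for every level, so levels see the same demand
    total = 0.0
    for _ in range(n_days):
        demand = max(0.0, rng.gauss(mean, sd))  # random daily demand, never negative
        sold = min(produced, demand)            # cannot sell more than was made
        leftover = produced - sold
        total += price * sold + salvage * leftover - cost * produced
    return total / n_days

# Try a range of production levels and keep the one with the best average profit.
candidates = range(100, 221, 10)
results = {q: average_profit(q) for q in candidates}
best = max(results, key=results.get)

for q in candidates:
    print(q, round(results[q], 2))
print("Best production level among candidates:", best)
```

Plotting `results` (production level on the x-axis, average profit on the y-axis) gives the chart the question asks for. Using the same random demand stream for every production level keeps the comparison between levels from being dominated by simulation noise.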
