How to remove price outliers from dataset so they don't impact average price

At the bank we use the average price from six vendors. Now and then we encounter wrong prices, because one or more vendors sometimes publish an incorrect price, and this distorts the average. I am looking for suggestions on the most efficient and correct way, and which formula or logic, to exclude these price outliers. Below you see an example of the problem. The prices in red text are considered outliers.
A remaining challenge is the case where only three vendors publish a price and one of these is an outlier. How do I exclude this outlier?
Note: SAND = TRUE and FALSK = FALSE.
Hope someone can help us out. Many thanks in advance ;-)
Kind regards
Soren Sig Mikkelsen

Solution1:
The easiest way to remove them is to define the interquartile range (IQR).
The screenshot below shows the formulas with their calculated values.
The logic is simple: calculate the interquartile range of your data and check whether any of the six values falls outside the upper or lower limit. If so, it is an outlier.
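Since the screenshot is not reproduced here, the same logic can be sketched in Python. This is only an illustration: the prices are made up, the 1.5 multiplier is the conventional default rather than something the answer specifies, and the "inclusive" quartile method is chosen because it should correspond to Excel's QUARTILE.INC.

```python
import statistics

def iqr_filter(prices, k=1.5):
    """Keep only prices inside [Q1 - k*IQR, Q3 + k*IQR]."""
    # "inclusive" interpolates between the min and max of the data
    q1, _, q3 = statistics.quantiles(prices, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [p for p in prices if lower <= p <= upper]

# Hypothetical quotes from six vendors; the last one is clearly off
prices = [100.2, 100.4, 100.3, 100.5, 100.1, 120.0]
kept = iqr_filter(prices)
average = sum(kept) / len(kept)  # average of the remaining prices
```

With so few data points the quartiles are rough estimates, so the multiplier `k` may need tuning against real vendor data.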
Solution2:
Using Average & Std Deviation
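The answer gives no formula for this approach, so the following Python sketch is one possible reading: drop any price that lies more than a chosen number of sample standard deviations from the mean. The prices and the 1.5 threshold are assumptions for illustration only.

```python
import statistics

def zscore_filter(prices, max_z=1.5):
    """Keep prices within max_z sample standard deviations of the mean."""
    mean = statistics.mean(prices)
    sd = statistics.stdev(prices)  # sample standard deviation (n - 1)
    if sd == 0:
        return list(prices)  # all prices identical, nothing to exclude
    return [p for p in prices if abs(p - mean) / sd <= max_z]

# Hypothetical quotes from six vendors; the last one is clearly off
prices = [100.2, 100.4, 100.3, 100.5, 100.1, 120.0]
kept = zscore_filter(prices)
average = statistics.mean(kept)
```

One caveat with small samples: using the sample standard deviation, no point among n values can be more than (n-1)/sqrt(n) deviations from the mean, which is about 2.04 for six prices and about 1.15 for three. A threshold of 2 or 3 would therefore rarely or never flag anything here, and with only three vendors a median-based comparison may be a better fit.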

Related

Number of days for delivery and number of orders delivered in two separate columns. Is there a way to get summary statistics about orders?

I've had a bit of trouble explaining this, so please bear with me. I'm also very new to Excel, so if there's a simple fix, I apologize in advance!
I have two columns, one listing number of days starting from 0 and increasing consecutively. The other column has the number of orders delivered. The two correspond to each other. For example, I've typed out how it would look below. It would mean that there were 100 orders delivered in 1 day, 150 orders delivered in 2 days, 800 orders delivered in 3 days, etc.
Is there a way to get summary statistics (mean, median, mode, upper and lower quartiles) for the number of days it took for the average order to get delivered? The only way I can think of solving this is to manually punch in "1" 100 times, "2" 150 times, etc. into a new column and take median, mean, and upper & lower quartile from that, but that seems extremely inefficient. Would I use a pivot table for this? Thank you in advance!
I tried using the data analysis add-on and doing summary statistics that way, but it didn't work. It just gave me the mean, median, mode, and quartiles of each individual column. It would have given me 3 for median number of days for delivery and 300 for median number of orders.

Method 1
The mean is just
=SUMPRODUCT(A2:A6,B2:B6)/SUM(B2:B6)
Mode is the value with highest frequency
=INDEX(A2:A6,MATCH(MAX(B2:B6),B2:B6,0))
The quartiles and median (or any other quantile, by varying the value of p) can be computed from first principles, following this reference:
=LET(p,0.25,
values,A2:A6,
freq,B2:B6,
N,SUM(freq),
h,(N+1)*p,
floorh,FLOOR(h,1),
ceilh,CEILING(h,1),
frac,h-floorh,
cusum,SCAN(0,SEQUENCE(ROWS(values)),LAMBDA(a,c,IF(c=1,0,a+INDEX(freq,c-1)))),
xlower,XLOOKUP(floorh-1,cusum,values,,-1),
xupper,XLOOKUP(ceilh-1,cusum,values,,-1),
xlower+(xupper-xlower)*frac)
Method 2
If you don't like doing it this way, you can always expand the data like this:
=AVERAGE(XLOOKUP(SEQUENCE(SUM(B2:B6),1,0),SCAN(0,SEQUENCE(ROWS(A2:A6)),LAMBDA(a,c,IF(c=1,0,INDEX(B2:B6,c-1)+a))),A2:A6,,-1))
=MODE(XLOOKUP(SEQUENCE(SUM(B2:B6),1,0),SCAN(0,SEQUENCE(ROWS(A2:A6)),LAMBDA(a,c,IF(c=1,0,INDEX(B2:B6,c-1)+a))),A2:A6,,-1))
=QUARTILE.EXC(XLOOKUP(SEQUENCE(SUM(B2:B6),1,0),SCAN(0,SEQUENCE(ROWS(A2:A6)),LAMBDA(a,c,IF(c=1,0,INDEX(B2:B6,c-1)+a))),A2:A6,,-1),1)
=MEDIAN(XLOOKUP(SEQUENCE(SUM(B2:B6),1,0),SCAN(0,SEQUENCE(ROWS(A2:A6)),LAMBDA(a,c,IF(c=1,0,INDEX(B2:B6,c-1)+a))),A2:A6,,-1))
and
=QUARTILE.EXC(XLOOKUP(SEQUENCE(SUM(B2:B6),1,0),SCAN(0,SEQUENCE(ROWS(A2:A6)),LAMBDA(a,c,IF(c=1,0,INDEX(B2:B6,c-1)+a))),A2:A6,,-1),3)
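As a cross-check on both methods, here is a small Python sketch of the same calculations. The first three order counts come from the question; the last two are made up. The "exclusive" quantile method uses the (N+1)*p position, which should line up with QUARTILE.EXC and with the LET formula in Method 1.

```python
import statistics

days = [1, 2, 3, 4, 5]
orders = [100, 150, 800, 300, 400]  # last two counts are hypothetical

# Method 1 style: weighted mean and mode straight from the frequency table
mean = sum(d * n for d, n in zip(days, orders)) / sum(orders)
mode = days[orders.index(max(orders))]

# Method 2 style: expand to one entry per order, then use ordinary statistics
expanded = [d for d, n in zip(days, orders) for _ in range(n)]
q1, median, q3 = statistics.quantiles(expanded, n=4, method="exclusive")
```

Expanding is the simpler option when the total order count is modest; for very large counts the cumulative-frequency approach of Method 1 avoids building the long list.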

Multiple Criteria If Condition - Identify highest number based upon preceding criteria

I am trying to build a bit of a tracker within Excel/Google Sheets to identify which 'home loan' has the combination of highest interest rate PLUS loan/mortgage size.
The formula I've built so far works for the below conditions;
There is only 1 loan with the highest interest rate, or;
If there are multiple loans that have the highest interest rate, the loan that is the largest is included within that condition.
The formula doesn't work when:
There are multiple loans with the highest interest rate, but these loans are NOT the largest loan size. I believe the issue is that I'm including a 'max' statement as part of the match condition. Please see cells I8:L10 as an example.
I'm unsure how to achieve this within a formula to identify;
The loan with the highest interest rate
Of the loans with the highest interest rate, which has the largest mortgage/loan size.
Please note that the interest rates vary within the data set.
Formula Used:
=if(countif($E3:$H3,max($E3:$H3))>1,
if(iserror(index($E$1:$H$1,match(1,(max($E3:$H3)=E3)*(max($A3:$D3)=A3),0))=I$1),"Inactive","Active"),
if(index($E$1:$H$1,match(max($E3:$H3),$E3:$H3,0))=I$1,"Active","Inactive"))
Link to spreadsheet
https://docs.google.com/spreadsheets/d/1GHN8-uX4RdkMz0IIvTTxB0TjAKtUMSdCTFHqXNgHVm4/edit?usp=sharing
Thanks in Advance!

You can use =FILTER($E3:$H3,$E3:$H3=MAX($E3:$H3),"") to create an array of all loans with the highest interest rate. It ought to be possible to create a matching array of loan sizes and then extract the largest of these, but the task looks intimidating.
Meanwhile I wonder if this much simpler approach will allow you to get to your desired result using simple IF functions.
=IF(A3=MAX($A3:$D3),"Hi loan","Low loan") & " / " & IF(E3=MAX($E3:$H3),"Hi Rate","Low rate")
You chose "Active" and "Inactive". I chose "Hi" and "Low". Both of these are translations of True and False and may be used in conditional formatting, for example, where you simply highlight loans with the highest interest rate in one colour, and choose a different colour where the highest rate is applied to the largest loan. The simple logic I express here can be expressed in words, too. It's just a matter of interpreting the two results provided by my formula to best suit your needs.
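The selection rule the question asks for (highest interest rate first, then largest loan size among the ties) can be sketched in Python as a single comparison on a (rate, size) pair. The function name and the sample figures are hypothetical; the sample row mirrors the failing case where the highest-rate loans are not the largest loans overall.

```python
def active_loan(sizes, rates):
    """Index of the loan with the highest rate; ties broken by loan size."""
    # Tuples compare element by element: rate first, then size
    return max(range(len(rates)), key=lambda i: (rates[i], sizes[i]))

# Hypothetical row: four loans, two of which share the top rate
sizes = [500_000, 200_000, 300_000, 100_000]
rates = [3.5, 4.2, 4.2, 3.9]

winner = active_loan(sizes, rates)
flags = ["Active" if i == winner else "Inactive" for i in range(len(rates))]
```

In spreadsheet terms this corresponds to taking the maximum loan size only among the columns whose rate equals the row maximum, instead of the maximum over all four sizes.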

Question regarding RANDBETWEEN in Excel and revenue

The Doobie Brothers garage band is planning a concert. Tickets are set at $20. Based on what other bands have done, they figure they should sell 350 tickets, but that could fluctuate. They figure the standard deviation of sales at 50 tickets. No shows are uniformly distributed between 1 and 10. Fixed costs are 5000.
How profitable is the concert likely to be?
So I am able to enter the excel formula for revenue 50*20 and subtract 5000 for FC, but I am having trouble deciphering how to account for the no show costs. I know that I have to use RANDBETWEEN(1,10) formula, but I am not sure if it gets multiplied or divided by something. Again, I am looking for what to do with the formula in the context of a profit equation.
If it helps, the mean for the number of tickets sold is 350 and stdev is 50, so I used that to get the number of attendees in a simulated sense...That is, NORM.INV(RAND(),350,50)
Of course, this problem may not be realistic in real life because promoters keep the money, but for the purposes of the problem...just assume that no promoters exist here.

Excel Index Function

I just want to thank you guys in advance. I think you guys are doing a great job in helping people out with programming stuff. Pats on the back for all of you.
Here is what I've been working on: I have daily stock price return data on about 4000 stocks. I want to add them to my portfolio after observing their performance for 12 months. I will choose the top 10% best performers and bottom 10% worst performers. I will create multiple portfolios over a period of time. I have done that with no problem.
I want to use the INDEX function to calculate the daily return of my portfolio. Not all 4000 stocks are in my portfolio, about 300 stocks are in my portfolio at any given time. The daily portfolio returns will be calculated by multiplying the weights (they are equal weighted, so 1/300) to that stock's return on the specific date. I assume it has to do with a combination of INDEX, SUMPRODUCT, and IF or MATCH functions.
I have been thinking this for a long time and I just can't get to the bottom of it. I have attached pictures for a portion of what I was working on. I think will give you a good picture of what I'm trying to do. I bet this is such an easy thing for you guys. I hope you can help me out! Thanks again!
PICTURES:IN or OUT portfolio & Stock's individual returns
Charles

Not sure I understood your problem, but here is a trial suggestion:
You get data for 4000 stocks while you are monitoring 300. So, you need to find the correct one within your sheet (there will be 3700 that will not match anything).
If you have your stocks listed in, say, column "A", you could use the LOOKUP function (well explained on the web). If you need the row of your stock, you can use the MATCH function.
If this is not what you are looking for, it means that I (at least) did not understand you, so you would need to add details to your question.
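For what it's worth, the calculation described in the question (equal weights over the stocks currently held) amounts to averaging the returns of the flagged stocks for each date. A minimal Python sketch, assuming one list of returns and one list of IN/OUT flags per date; the names and numbers are made up:

```python
def portfolio_return(returns, in_portfolio):
    """Equal-weighted return for one date over the stocks flagged as held."""
    held = [r for r, flag in zip(returns, in_portfolio) if flag]
    if not held:
        return 0.0  # nothing held on this date
    return sum(held) / len(held)  # weight 1/len(held) on each held stock

# Hypothetical single date with four stocks, two of them in the portfolio
returns = [0.01, -0.02, 0.03, 0.005]
in_portfolio = [1, 0, 1, 0]

daily = portfolio_return(returns, in_portfolio)
```

In a sheet, the same idea is a SUMPRODUCT of the flag row and the return row divided by the sum of the flags, provided both rows list the stocks in the same order.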

Statistically removing erroneous values

We have an application where users enter prices all day. These prices are recorded in a table with a timestamp and then used for producing charts of how the price has moved. Every now and then a user enters a price wrongly (e.g. puts in one zero too many or too few), which somewhat ruins the chart (you get big spikes). We've even put in an extra confirmation dialogue if the price moves by more than 20%, but this doesn't stop them entering wrong values.
What statistical method can I use to analyse the values before I chart them to exclude any values that are way different from the rest?
EDIT: To add some meat to the bone. Say the prices are share prices (they are not but they behave in the same way). You could see prices moving significantly up or down during the day. On an average day we record about 150 prices and sometimes one or two are way wrong. Other times they are all good...

Calculate and track the standard deviation for a while. After you have a decent backlog, you can disregard the outliers by seeing how many standard deviations away they are from the mean. Even better, if you've got the time, you could use the info to do some naive Bayesian classification.
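A rough Python sketch of the first suggestion, tracking a backlog and rejecting values too many standard deviations from its mean. The window size, the threshold of 3 and the minimum backlog length are all assumptions to be tuned, and the prices are invented.

```python
import statistics
from collections import deque

def filter_stream(prices, window=20, max_sd=3.0, min_history=5):
    """Accept prices one by one, skipping those far from the recent backlog."""
    history = deque(maxlen=window)  # most recent accepted prices
    accepted = []
    for p in prices:
        if len(history) >= min_history:
            mean = statistics.mean(history)
            sd = statistics.stdev(history)
            if sd > 0 and abs(p - mean) > max_sd * sd:
                continue  # treat as outlier; keep it out of the backlog
        history.append(p)
        accepted.append(p)
    return accepted

# Hypothetical entries with one zero too many and one zero too few
prices = [10.0, 10.1, 9.9, 10.2, 10.0, 100.0, 10.1, 1.0, 10.3]
clean = filter_stream(prices)
```

Because rejected values never enter the backlog, a genuine jump to a new price level would keep being rejected, so some manual review of the skipped values is probably still needed.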

That's a great question but may lead to quite a bit of discussion as the answers could be very varied. It depends on
how much effort are you willing to put into this?
could some answers genuinely differ by +/-20% or whatever test you invent? so will there always be need for some human intervention?
and to invent a relevant test I'd need to know far more about the subject matter.
That being said the following are possible alternatives.
A simple test against the previous value (or the mean/mode of the previous 10 or 20 values) would be straightforward to implement.
The next level of complexity would involve some statistical measurement of all values (or the previous x values, or the values of the last 3 months). A normal (Gaussian) distribution would let you give each value a degree of certainty as to whether it is a mistake or accurate. This degree of certainty would typically be expressed as a percentage.
See http://en.wikipedia.org/wiki/Normal_distribution and http://en.wikipedia.org/wiki/Gaussian_function; there are adequate links from these pages to help in programming these. Depending on the language you're using, there are also likely to be functions and/or plugins available to help with this.
A more advanced method could be some sort of learning algorithm that takes other parameters into account on top of the last x values, for instance the product type or manufacturer, or even the time of day or the user who entered the figure. This option seems way over the top for what you need, however; it would require a lot of work to code it and to train the learning algorithm.
I think the second option is the correct one for you. Using the standard deviation (a lot of languages contain a function for this) may be a simpler alternative: it is simply a measure of how far the value has deviated from the mean of the x previous values. I'd put the standard deviation option somewhere between options 1 and 2.

You could measure the standard deviation in your existing population and exclude those that are greater than 1 or 2 standard deviations from the mean?
It's going to depend on what your data looks like to give a more precise answer...

Or graph a moving average of prices instead of the actual prices.
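A minimal Python sketch of a trailing moving average, assuming a fixed window measured in number of entries (the window length is an arbitrary choice here):

```python
def moving_average(prices, window=5):
    """Trailing moving average; early points use the values available so far."""
    out = []
    running = 0.0
    for i, p in enumerate(prices):
        running += p
        if i >= window:
            running -= prices[i - window]  # drop the value leaving the window
        out.append(running / min(i + 1, window))
    return out

smoothed = moving_average([1, 2, 3, 4, 5], window=3)
```

Note that a moving average only dampens a spike and spreads it over the window; it does not remove it. A moving median over the same window would be less sensitive to a single wrong entry.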

Quoting from here:
Statisticians have devised several methods for detecting outliers. All the methods first quantify how far the outlier is from the other values. This can be the difference between the outlier and the mean of all points, the difference between the outlier and the mean of the remaining values, or the difference between the outlier and the next closest value. Next, standardize this value by dividing by some measure of scatter, such as the SD of all values, the SD of the remaining values, or the range of the data. Finally, compute a P value answering this question: If all the values were really sampled from a Gaussian population, what is the chance of randomly obtaining an outlier so far from the other values? If the P value is small, you conclude that the deviation of the outlier from the other values is statistically significant.
Google is your friend, you know. ;)

For your specific question of plotting, and your specific scenario of an average of 1-2 errors per day out of 150, the simplest thing might be to plot trimmed means, or the range of the middle 95% of values, or something like that. It really depends on what value you want out of the plot.
If you are really concerned with the true max and true min of a day's prices, then you have to deal with the outliers as outliers and properly exclude them, probably using one of the outlier tests previously proposed (data point is x% more than the next point, or the last n points, or more than 5 standard deviations away from the daily mean). Another approach is to look at what happens after the outlier. If it is an outlier, then it will show a sharp upturn followed by a sharp downturn.
If however you care about overall trend, plotting daily trimmed mean, median, 5% and 95% percentiles will portray history well.
Choose your display methods and how much outlier detection you need to do based on the analysis question. If you care about medians or percentiles, the outliers are probably irrelevant.
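The daily summary described above could be sketched in Python as follows. The 5% trim and the function names are assumptions for illustration, and the sample day is invented (18 correct entries plus one entry with a zero too many and one with a zero too few, scaled down for readability).

```python
import statistics

def trimmed_mean(values, trim=0.05):
    """Mean after dropping the lowest and highest `trim` fraction of values."""
    data = sorted(values)
    k = int(len(data) * trim)  # number of values to drop at each end
    return statistics.mean(data[k:len(data) - k])

def daily_summary(values):
    """5th percentile, median, 95th percentile and trimmed mean for one day."""
    cuts = statistics.quantiles(values, n=20, method="inclusive")
    return {
        "p5": cuts[0],
        "median": statistics.median(values),
        "p95": cuts[-1],
        "trimmed_mean": trimmed_mean(values),
    }

# Hypothetical day: 18 good prices and two mistyped ones
day = [10.0] * 18 + [0.1, 100.0]
summary = daily_summary(day)
```

With roughly 150 prices a day and one or two errors, a 5% trim removes about seven values from each end, which comfortably covers the mistakes at the cost of also discarding a few genuine extremes.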
