Statistics: Identify outlier values


Good day,
I am new to programming in R. I want to identify the actual outlier values in my data. I know that an outlier is a value below Q1 - 1.5*IQR or above Q3 + 1.5*IQR. I need help finding those values; I don't know how to approach the problem.
Thank you in advance for your help.
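
A minimal sketch of that rule, written in Python rather than R purely for illustration (the function name iqr_outliers and the sample data are made up here). In R, the built-in quantile() and IQR() functions provide the same ingredients. The sketch computes Q1 and Q3, builds the two fences, and returns the values that fall outside them:

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # method="inclusive" interpolates linearly between the sorted data points
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

data = [10, 12, 11, 13, 12, 14, 11, 95, 12, 13]  # hypothetical sample
print(iqr_outliers(data))  # [95]
```

Quartile conventions differ slightly between tools, so points sitting right on a fence may be classified differently elsewhere.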

Related

How would I write a temporal forecast code in Microsoft excel

For example, I have the time someone left home and the time someone called 9-1-1, and I want to use these points to predict the likely time of the incident in Excel. I can put a time in column A and another in column B, but all I get is the halfway point between the two: if column A says 12:00 and column B says 1:00, the result is 12:30. Ideally, I'm looking for something more predictive using this approach.

I used some of the standard functions in Excel to predict time-based series.
We were looking at predicting data points for 1mis, 3mis and 6mis (mis = Months In Service).
We found that the forecast() function with some "fiddle" factors (sorry, finely tuned polynomial assumptions) gave a reasonable prediction for our needs. We fed it steps of historical data and checked its performance until it was suitable for what we needed.

Finding Percentile from Data set and eliminating outliers

Disclaimer: I am completely new to Excel, VBA coding, etc., and I am having a tough time even figuring out what I have to do.
In short, I have a large amount of raw data, and I need to determine the 95th and 5th percentiles for each day (e.g. 14/3/2017) and then remove the rows with values above the 95th or below the 5th percentile. It is tedious to find the percentile values for each day and use the OR function to remove the outliers. I appreciate all help; please understand that I am very new to this, so I may not understand your reply or where to put any code.
Thank you for all your help.
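
One way to see the logic is a short Python sketch (not VBA; the names percentile and trim_by_day and the (day, value) row layout are assumptions made for illustration). It groups the rows by day, computes each day's 5th and 95th percentiles by linear interpolation, and keeps only the rows inside that day's range:

```python
from collections import defaultdict

def percentile(sorted_vals, p):
    """Linearly interpolated percentile of a sorted list, p in [0, 1]."""
    rank = p * (len(sorted_vals) - 1)
    lo = int(rank)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (rank - lo) * (sorted_vals[hi] - sorted_vals[lo])

def trim_by_day(rows, low=0.05, high=0.95):
    """Keep (day, value) rows whose value lies within that day's percentile range."""
    by_day = defaultdict(list)
    for day, value in rows:
        by_day[day].append(value)
    bounds = {}
    for day, vals in by_day.items():
        vals.sort()
        bounds[day] = (percentile(vals, low), percentile(vals, high))
    return [(d, v) for d, v in rows if bounds[d][0] <= v <= bounds[d][1]]

# Hypothetical data: one low and one high outlier on a single day
rows = [("14/3/2017", v) for v in [1] + [50] * 18 + [1000]]
print(len(trim_by_day(rows)))  # 18
```

Values exactly equal to a bound are kept here; whether to keep or drop them is a choice the question leaves open.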

Probability of 6 independent events in excel

I want to calculate the probability of 6 independent events in Excel. I know the general formula to do this; however, it is very cumbersome to implement in Excel.
Is there a better way?

The equation in your question suggests that you are asking for the probability of one or more of the events happening. This is the same as asking for the probability of "not none of them" happening. In Excel, you can calculate this using the following formula:
=1-PRODUCT(1-A1,1-A2,1-A3,1-A4,1-A5,1-A6)
(Here I am assuming that the probabilities of the events are given in cells A1:A6.)
However, if instead you are asking for the probability of all of the events happening, it's just the product of their individual probabilities:
=PRODUCT(A1,A2,A3,A4,A5,A6)
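
For anyone who wants to check the two formulas outside Excel, here is a small Python sketch of the same arithmetic (the function names are made up for illustration, and independence of the events is assumed, as in the question):

```python
import math

def prob_at_least_one(probs):
    """P(one or more events occur) = 1 - P(none occur)."""
    return 1 - math.prod(1 - p for p in probs)

def prob_all(probs):
    """P(every event occurs) = product of the individual probabilities."""
    return math.prod(probs)

probs = [0.5] * 6  # hypothetical: six events, each with probability 0.5
print(prob_at_least_one(probs))  # 0.984375
print(prob_all(probs))           # 0.015625
```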

Simple formula or full function

First of all, let me say that my knowledge of Excel is somewhat basic. I know about formulas, but not in depth. I know about functions, but nothing about programming. That being said, on with my question.
I'm currently building an office hockey pool spreadsheet. I have my main sheet with the results of the games and the differential (see link above for reference; the text is in French, sorry about that). I will have another sheet, the prediction sheet, that participants will fill in with their predictions of who is going to win each game and by how many points (the differential).
Now, I need the prediction sheet to calculate the points attribution depending on the prediction.
Here's how it's supposed to calculate:
1 point for winning team prediction (per game)
1 point for good differential (no matter what team won)
2 points if differential is 3 or higher.
User predictions go on "prediction" sheet, they input what they think the diff. will be on the same side as the team they pick to win.
So what I want to know is, what would be the best way to go about this, with a formula or with a custom function in VBA? I need to determine 1: if the prediction is in the same cell as the other sheet and 2: if the differential is the same as the game result.
OK, re-reading this I know it's kind of confusing, but it's clear in my head... sorry about that. If any of you can make sense of my problem, please help by pointing me in the right direction. Thank you very much.

I found part of my solution.
I used this formula (I know it can be shorter and better, but that's the extent of my knowledge of Excel):
=IF(OR(ISBLANK(COMPILATION!A4);ISBLANK(COMPILATION!A5));0;IF(OR(A4=ABS(COMPILATION!A4-COMPILATION!A5);A5=ABS(COMPILATION!A4-COMPILATION!A5));IF(OR(A4>=3;A5>=3);2;1);0))
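
To make the scoring rules easier to reason about before committing to a formula or VBA, here is one possible reading of them as a Python sketch. The function name, its arguments, and the interpretation are assumptions: 1 point for the right winner, plus 1 point for the right differential, or 2 instead of 1 when that differential is 3 or higher. Unplayed games score 0, mirroring the ISBLANK check in the formula.

```python
def score_prediction(pred_winner, pred_diff, actual_winner, actual_diff):
    """Points for one game under the assumed reading of the pool rules."""
    if actual_winner is None or actual_diff is None:
        return 0  # game not played yet
    points = 0
    if pred_winner == actual_winner:
        points += 1  # picked the winning team
    if pred_diff == actual_diff:
        # correct differential counts even if the wrong team was picked
        points += 2 if actual_diff >= 3 else 1
    return points

print(score_prediction("MTL", 3, "MTL", 3))  # 3
print(score_prediction("TOR", 1, "MTL", 1))  # 1
```

If the 2 points are meant to replace everything else rather than just the differential point, only the inner expression would need to change.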

Statistically removing erroneous values

We have an application where users enter prices all day. These prices are recorded in a table with a timestamp and then used to produce charts of how the price has moved... Every now and then a user enters a price wrongly (e.g. puts in a zero too many or too few), which somewhat ruins the chart (you get big spikes). We've even put in an extra confirmation dialogue if the price moves by more than 20%, but this doesn't stop them entering wrong values...
What statistical method can I use to analyse the values before I chart them to exclude any values that are way different from the rest?
EDIT: To add some meat to the bone. Say the prices are share prices (they are not but they behave in the same way). You could see prices moving significantly up or down during the day. On an average day we record about 150 prices and sometimes one or two are way wrong. Other times they are all good...

Calculate and track the standard deviation for a while. After you have a decent backlog, you can disregard the outliers by seeing how many standard deviations away they are from the mean. Even better, if you've got the time, you could use the info to do some naive Bayesian classification.
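
A rough sketch of the standard-deviation part of this idea in Python (the name is_outlier, the cutoff of 3 standard deviations, and the sample backlog are assumptions, not something the answer specifies):

```python
import statistics

def is_outlier(value, backlog, max_sd=3.0):
    """True if value is more than max_sd standard deviations from the backlog mean."""
    mean = statistics.fmean(backlog)
    sd = statistics.stdev(backlog)  # needs at least two backlog values
    if sd == 0:
        return value != mean
    return abs(value - mean) > max_sd * sd

backlog = [100, 101, 99, 102, 98, 100, 101, 99]  # hypothetical accepted prices
print(is_outlier(1000, backlog))  # True
print(is_outlier(101, backlog))   # False
```

Only values that pass the check should be appended to the backlog, otherwise a single bad entry inflates the standard deviation and hides later errors.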

That's a great question, but it may lead to quite a bit of discussion, as the answers could vary widely. It depends on:
How much effort are you willing to put into this?
Could some answers genuinely differ by +/-20%, or by whatever test you invent? If so, will there always be a need for some human intervention?
To invent a relevant test I'd need to know far more about the subject matter.
That being said, the following are possible alternatives.
A simple test against the previous value (or the mean/mode of the previous 10 or 20 values) would be straightforward to implement.
The next level of complexity would involve some statistical measurement of all values (or the previous x values, or the values of the last 3 months). Assuming a normal (Gaussian) distribution would let you give each value a degree of certainty as to whether it is a mistake or accurate. This degree of certainty would typically be expressed as a percentage.
See http://en.wikipedia.org/wiki/Normal_distribution and http://en.wikipedia.org/wiki/Gaussian_function; there are adequate links from these pages to help in programming these. Depending on the language you're using, there are also likely to be functions and/or plugins available to help with this.
A more advanced method could be some sort of learning algorithm that takes other parameters into account on top of the last x values, for instance the product type or manufacturer, or even the time of day or the user who entered the figure. This option seems way over the top for what you need, however; it would require a lot of work to code it and also to train the learning algorithm.
I think the second option is the correct one for you. Using the standard deviation (a lot of languages contain a function for this) may be a simpler alternative; it gives a measure of how far a value has deviated from the mean of the x previous values. I'd put the standard deviation option somewhere between options 1 and 2.
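
The first and simplest alternative, a test against the mean of the previous few values, could look roughly like this in Python. The function name, the window of 10, and the 50% threshold are assumptions chosen for illustration; a misplaced zero changes a price by about a factor of ten, so a fairly loose threshold still catches it.

```python
import statistics
from collections import deque

def filter_against_recent(prices, window=10, max_rel_change=0.5):
    """Drop prices that differ from the mean of recent accepted prices by too much."""
    recent = deque(maxlen=window)  # last accepted prices only
    kept = []
    for price in prices:
        if recent:
            ref = statistics.fmean(recent)
            if ref != 0 and abs(price - ref) / abs(ref) > max_rel_change:
                continue  # suspected typo: skip it and keep it out of the window
        recent.append(price)
        kept.append(price)
    return kept

prices = [100, 101, 1010, 102, 10.3, 103]  # hypothetical, with two typos
print(filter_against_recent(prices))  # [100, 101, 102, 103]
```

One weakness of this sketch is that the very first price is always trusted, so a wrong opening value would cause the good ones after it to be rejected.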

You could measure the standard deviation of your existing population and exclude the values that are more than 1 or 2 standard deviations from the mean.
It's going to depend on what your data looks like to give a more precise answer...

Or graph a moving average of prices instead of the actual prices.
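
A simple moving average is only a few lines in most languages. Here is a Python sketch (the function name and the window size are assumptions):

```python
from collections import deque

def moving_average(prices, window=5):
    """Average of the last `window` prices at each point (fewer at the start)."""
    buf = deque(maxlen=window)
    total = 0.0
    out = []
    for p in prices:
        if len(buf) == window:
            total -= buf[0]  # the oldest value is about to drop out of the window
        buf.append(p)
        total += p
        out.append(total / len(buf))
    return out

print(moving_average([1, 2, 3, 4, 5], window=2))  # [1.0, 1.5, 2.5, 3.5, 4.5]
```

Note that averaging only dampens a spike rather than removing it, so a price that is ten times too large will still show up as a bump in the chart.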

Quoting from here:
Statisticians have devised several methods for detecting outliers. All the methods first quantify how far the outlier is from the other values. This can be the difference between the outlier and the mean of all points, the difference between the outlier and the mean of the remaining values, or the difference between the outlier and the next closest value. Next, standardize this value by dividing by some measure of scatter, such as the SD of all values, the SD of the remaining values, or the range of the data. Finally, compute a P value answering this question: If all the values were really sampled from a Gaussian population, what is the chance of randomly obtaining an outlier so far from the other values? If the P value is small, you conclude that the deviation of the outlier from the other values is statistically significant.
Google is your friend, you know. ;)
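
The first two steps in the quoted passage (measure the distance from the mean, then standardize by the SD of all values) can be sketched in Python as below. The function name and the sample data are assumptions, and the last step, turning the statistic into a P value, is deliberately left out because it needs a reference distribution or table such as the one used by Grubbs' test.

```python
import statistics

def most_extreme(values):
    """Return (value, z): the point farthest from the mean and its distance in SDs."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # SD of all values, suspect included
    suspect = max(values, key=lambda v: abs(v - mean))
    if sd == 0:
        return suspect, 0.0
    return suspect, abs(suspect - mean) / sd

value, z = most_extreme([10, 11, 9, 10, 12, 10, 50])  # hypothetical data
print(value, round(z, 2))  # 50 2.26
```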

For your specific question of plotting, and your specific scenario of an average of 1-2 errors per day out of 150, the simplest thing might be to plot trimmed means, or the range of the middle 95% of values, or something like that. It really depends on what value you want out of the plot.
If you are really concerned with the true max and true min of a day's prices, then you have to deal with the outliers as outliers and properly exclude them, probably using one of the outlier tests previously proposed (data point is x% more than the next point, or the last n points, or more than 5 standard deviations away from the daily mean). Another approach is to look at what happens after the suspect value. If it is an outlier, there will be a sharp upturn followed by a sharp downturn (or the reverse).
If however you care about overall trend, plotting daily trimmed mean, median, 5% and 95% percentiles will portray history well.
Choose your display methods and how much outlier detection you need to do based on the analysis question. If you care about medians or percentiles, the outliers are probably irrelevant.
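
As an illustration of why the robust summaries shrug off a couple of bad entries, here is a Python sketch of a daily median and trimmed mean (the function name, the 5% trim per end, and the sample prices are assumptions):

```python
import statistics

def daily_summary(prices, trim=0.05):
    """Median and trimmed mean of one day's prices; trim is the fraction cut from each end."""
    ordered = sorted(prices)
    k = int(len(ordered) * trim)  # number of values dropped at each end
    core = ordered[k:len(ordered) - k] if k else ordered
    return {"median": statistics.median(ordered),
            "trimmed_mean": statistics.fmean(core)}

day = [100] * 18 + [1, 10000]  # hypothetical day with two mistyped prices
print(daily_summary(day))  # {'median': 100.0, 'trimmed_mean': 100.0}
```

With about 150 prices a day and a 5% trim, seven values are cut from each end, which comfortably covers one or two typos.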
