DAX Normal Distribution - Excel

I am looking for a function in DAX that acts similarly to the NORM.DIST function in Excel. I have an X value, the mean, and the standard deviation, and I am looking to find the cumulative distribution value (with an accuracy of at least six sigma).
I've searched the official list of DAX statistical functions, but I could not find any function that does this. I'm looking for the most economical way to perform this calculation. Is a bell-curve approximation formula the best way to go? The calculation would be iterated over a table of about 10,000 rows.

SIGN([X])*0.5*(1-(1/30)*(7*EXP(-([X]^2)/2)+16*EXP(-([X]^2)*(2-SQRT(2)))+(7+0.25*PI()*[X]^2)*EXP(-([X]^2))))^0.5+0.5
This assumes [X] is already standardized (i.e. (value - mean) / standard deviation). Max error = 0.0000304
Error plot and comparison against the previous solution omitted. Reference: http://mathworld.wolfram.com/NormalDistributionFunction.html, equation (14)

=0.5*(1+SIGN([X])*(1-EXP(-2*([X]/[Sigma])^2/PI()))^0.5)
Worked for me. It may or may not be the fastest way.
EDIT: Max error = 0.0031458
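For anyone who wants to sanity-check both approximations outside DAX, here is a rough Python sketch (not DAX); it compares the two formulas against the exact standard normal CDF computed with math.erf, and the scan range and step are arbitrary choices:

import math

def exact_cdf(z):
    """Exact standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def approx_three_term(z):
    """The three-exponential-term approximation (first formula above), z already standardized."""
    inner = 1.0 - (1.0 / 30.0) * (
        7.0 * math.exp(-z**2 / 2.0)
        + 16.0 * math.exp(-z**2 * (2.0 - math.sqrt(2.0)))
        + (7.0 + 0.25 * math.pi * z**2) * math.exp(-z**2)
    )
    return math.copysign(0.5 * math.sqrt(inner), z) + 0.5

def approx_one_term(z):
    """The simpler one-term approximation (second formula above), with [Sigma] = 1."""
    return 0.5 * (1.0 + math.copysign(math.sqrt(1.0 - math.exp(-2.0 * z**2 / math.pi)), z))

# Scan z in [-6, 6] and report the worst absolute error of each formula.
grid = [i / 1000.0 for i in range(-6000, 6001)]
print(max(abs(approx_three_term(z) - exact_cdf(z)) for z in grid))  # ~3.0e-5, matching the first quoted max error
print(max(abs(approx_one_term(z) - exact_cdf(z)) for z in grid))    # ~3.1e-3, matching the second quoted max error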

Related

Replicating Excel AVERAGEIFS in Power BI

I'm having difficulty replicating the following Excel calculation in Power BI:
=IFERROR(AVERAGEIFS(Data!$I:$I,Data!$A:$A,Tables!$C$2,Data!$B:$B,Tables!$E$2,Data!$E:$E,Tables!$B5), "N/A")
I am trying to calculate an average across three fields: area, period and metric. In Power BI, using the quick measure, it returns either the count of the metric title or the average of the metric, with an additional row for the values that are marked as N/A.
Count of Raw_Score average per metric_ref =
AVERAGEX(
    KEEPFILTERS(VALUES('Data'[metric_ref])),
    CALCULATE(COUNTA('Data'[Raw_Score]))
)
Maybe I understood the question wrong, so feel free to correct me, but you are simply trying to calculate an average for different groups, is that so?
First, when working with Power BI, do yourself a favor and forget how Excel works; your life will be much easier.
Now for the solution.
The trouble is that your score column is not the correct data type for an average calculation. In Edit Queries, change the data type to a number (a prior step of replacing "N/A" with blanks might be required).
(Optional) I would also recommend fixing the data type of all relevant columns.
With the data in the correct format, you simply create a visualization and slice it by the grouping label, something like this (screenshot omitted):
Notice the small arrow next to the Value-theme_ref field (in your case you would probably substitute the Raw_Score column). You simply change the calculation from Sum to Average, which should do the trick.
Once again, I apologize if I misunderstood the question. Feel free to specify.
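If it helps to see the equivalent grouping logic outside Power BI, here is a small pandas sketch of the same idea; the column names and sample rows are assumptions based on the question, not the real data:

import pandas as pd

# Toy stand-in for the Data sheet; column names are assumed from the post.
data = pd.DataFrame({
    "area":       ["North", "North", "South", "North"],
    "period":     ["Q1", "Q1", "Q1", "Q2"],
    "metric_ref": ["M1", "M1", "M1", "M1"],
    "Raw_Score":  ["4", "N/A", "5", "3"],   # scores arrive as text with "N/A" gaps
})

# Equivalent of the data-type fix in Edit Queries: coerce to numbers, "N/A" becomes NaN.
data["Raw_Score"] = pd.to_numeric(data["Raw_Score"], errors="coerce")

# Equivalent of AVERAGEIFS / the Average aggregation: mean of Raw_Score per group,
# NaN rows are ignored automatically.
print(data.groupby(["area", "period", "metric_ref"])["Raw_Score"].mean())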

Calculating percentile - Excel vs online

I have a set of data, say {4, 7, 7, 10, 10, 12, 12, 14, 15, 67}, and I want to know the 95th percentile. I used Excel and an online calculator.
Both gave different answers.
In Excel, the formula I used was =PERCENTILE.INC(A1:A10,0.95), and the result was 43.6.
But the online percentile calculator yielded a result of 67.
Which one is right?
First of all, both methods are "right" in the sense that both implement a standard algorithm for computing percentiles. Unlike the mean or median (where all sources use the same approach), there are many different approaches to calculating percentiles. The fundamental issue is that there is no obvious solution to the problem of what to do with percentiles which fall between observations. Do you take the observed value which is closest? Do you interpolate between the two? If so, with what weighting factors do you do the interpolation? Wikipedia discusses nine (!) different methods, with both the Excel approach and the approach from that online percentile calculator making the list. See this paper for a very nice discussion of these algorithms.
You can replicate the functionality of that online percentile calculator like this:
=SMALL(A1:A10,CEILING.MATH(COUNT(A1:A10)*0.95))
The point of using the function SMALL rather than a direct numerical index is that this approach works even if the data isn't sorted.
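To see the two conventions side by side, here is a short Python sketch; numpy's default linear interpolation happens to match PERCENTILE.INC, and the nearest-rank rule matches both the online calculator and the SMALL/CEILING.MATH formula above:

import math
import numpy as np

data = sorted([4, 7, 7, 10, 10, 12, 12, 14, 15, 67])

# Excel PERCENTILE.INC: linear interpolation at rank p*(n-1); numpy's default does the same.
print(np.percentile(data, 95))       # 43.6, i.e. 15 + 0.55 * (67 - 15)

# Nearest-rank rule: take the ceil(n*p)-th smallest value.
k = math.ceil(len(data) * 0.95)      # 10th smallest value here
print(data[k - 1])                   # 67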

Excel/Statistics Issue

I have a homework assignment where I need to run 1000 simulations in Excel using an exponential distribution. I'm not sure how to get Excel to give me the data I need. This is the question (the first part, at least; not reproduced here):
I've figured out how to use the exponential distribution formula in Excel, but I can't get it to return any values greater than 1. I think it's just the nature of that function, but I can't figure out how to get Excel to display the simulated lifetimes of the components. Any help at all would be much appreciated.
With λ in cell N4, the formula =-1/$N$4*LN(1-RAND()) yields a random number from the exponential distribution with rate parameter λ, so the simulated lifetimes can be arbitrarily large rather than capped at 1.
From http://www.tushar-mehta.com/publish_train/xl_vba_cases/0806%20generate%20random%20numbers.shtml
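The same inverse-transform idea sketched in Python, mainly to confirm that the simulated lifetimes are not capped at 1; the rate value 0.25 is just an assumed example, not taken from the assignment:

import math
import random

lam = 0.25   # assumed example rate; mean lifetime = 1/lam = 4

def exponential_lifetime(lam):
    """Inverse-transform sample from Exponential(lam): -ln(1 - U) / lam, same as the Excel formula."""
    return -math.log(1.0 - random.random()) / lam

samples = [exponential_lifetime(lam) for _ in range(1000)]
print(sum(samples) / len(samples))   # close to 1/lam = 4
print(max(samples))                  # comfortably greater than 1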

Removing the lower/upper fence of outliers from input data before evaluating it

What I have attempted:
AVERAGEIF(B11:V11,">+MEDIAN(B11:V11)")
What I am trying to do:
I would like to take the average of the upper half of the given data. To elaborate: I would like a formula that will allow me to remove a given lower fence of outliers and then work with the data that remains. I would greatly prefer to keep this formula within one cell, not pulling intermediate results from formulas in multiple cells.
Update:
Following through, I found the solution... I think.
One thing I should have explained further:
The incoming data roughly follows a typical square-root curve.
What I wanted to achieve is to capture the mean of the "plateau" of the data.
The equation I used was:
=AVERAGEIF(B3:B62,(">"&+TRIMMEAN(B3:B62,0.8)),B3:B62)
This was something I just copied and pasted; of course, "B3" and "B62" are specific to my application.
My rough explanation of the equation:
TRIMMEAN(range, 0.8) trims 80% of the points (40% from each tail) and returns the mean of the remaining middle 20%; AVERAGEIF (">") then averages only the values above that trimmed mean. So for my application, this SHOULD give me a rough mean of the "plateau" of the data I would like to find the mean for.
This formula calculates the MEDIAN() of the range; AVERAGEIF() then only includes values that are greater than or equal to (>=) the median, giving you the average of the 'top half' of your values.
AVERAGEIF(A1:A10,">="&MEDIAN(A1:A10))
Hope this helps!
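For anyone wanting to check the two Excel approaches side by side, here is a rough Python translation; the sample values are made up:

import statistics

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]   # stand-in data

# AVERAGEIF(range, ">=" & MEDIAN(range)): mean of the values at or above the median.
median = statistics.median(values)
print(statistics.mean(v for v in values if v >= median))      # 8.0 here

# AVERAGEIF(range, ">" & TRIMMEAN(range, 0.8)): TRIMMEAN(., 0.8) drops 40% of the points
# from each tail, then the values above that trimmed mean are averaged.
def trimmean(data, proportion):
    """Excel-style TRIMMEAN: drop proportion/2 of the points from each end (count rounded down)."""
    data = sorted(data)
    k = int(len(data) * proportion / 2)
    trimmed = data[k:len(data) - k] if k else data
    return statistics.mean(trimmed)

threshold = trimmean(values, 0.8)
print(statistics.mean(v for v in values if v > threshold))    # mean of the values above the trimmed mean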

Statistically removing erroneous values

We have an application where users enter prices all day. These prices are recorded in a table with a timestamp and then used for producing charts of how the price has moved... Every now and then a user enters a price wrongly (e.g. puts in one zero too many or too few), which somewhat ruins the chart (you get big spikes). We've even put in an extra confirmation dialogue if the price moves by more than 20%, but this doesn't stop them entering wrong values...
What statistical method can I use to analyse the values before I chart them to exclude any values that are way different from the rest?
EDIT: To add some meat to the bones: say the prices are share prices (they are not, but they behave in the same way). You could see prices moving significantly up or down during the day. On an average day we record about 150 prices, and sometimes one or two are way wrong. Other times they are all good...
Calculate and track the standard deviation for a while. After you have a decent backlog, you can disregard the outliers by seeing how many standard deviations away they are from the mean. Even better, if you've got the time, you could use the info to do some naive Bayesian classification.
That's a great question, but it may lead to quite a bit of discussion, as the answers could be very varied. It depends on:
how much effort you are willing to put into this;
whether some entries could genuinely differ by +/-20% (or whatever test you invent), so that there will always be a need for some human intervention;
and to invent a relevant test I'd need to know far more about the subject matter.
That being said the following are possible alternatives.
A simple test against the previous value (or the mean/mode of the previous 10 or 20 values) would be straightforward to implement.
The next level of complexity would involve some statistical measurement of all values (or the previous x values, or the values from the last 3 months); a normal (Gaussian) distribution would enable you to give each value a degree of certainty as to whether it is a mistake or accurate. This degree of certainty would typically be expressed as a percentage.
See http://en.wikipedia.org/wiki/Normal_distribution and http://en.wikipedia.org/wiki/Gaussian_function; there are adequate links from these pages to help with programming this, and depending on the language you're using there are likely to be functions and/or plugins available to help as well.
A more advanced method could be some sort of learning algorithm that takes other parameters into account on top of the last x values; it could consider the product type or manufacturer, for instance, or even the time of day or the user who entered the figure. This option seems way over the top for what you need, however; it would require a lot of work to code and also to train the learning algorithm.
I think the second option is the correct one for you. Using the standard deviation (a lot of languages contain a function for this) may be a simpler alternative; it is simply a measure of how far a value has deviated from the mean of the previous x values. I'd put the standard-deviation option somewhere between options 1 and 2.
You could measure the standard deviation of your existing population and exclude values that are more than 1 or 2 standard deviations from the mean?
Giving a more precise answer is going to depend on what your data looks like...
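A minimal sketch of that standard-deviation filter in Python; the 3-sigma threshold and the sample prices are invented for illustration, and note that with only a handful of points a single typo inflates the SD enough that it may slip through, which is why a decent backlog (as suggested above) helps:

from statistics import mean, stdev

def drop_outliers(prices, z_threshold=3.0):
    """Keep only prices within z_threshold standard deviations of the mean."""
    if len(prices) < 3:
        return prices
    mu, sigma = mean(prices), stdev(prices)
    if sigma == 0:
        return prices
    return [p for p in prices if abs(p - mu) <= z_threshold * sigma]

# ~150 plausible prices plus one "extra zero" typo (105.0 entered instead of 10.50).
prices = [10.5 + 0.01 * (i % 20) for i in range(149)] + [105.0]
cleaned = drop_outliers(prices)
print(len(prices), len(cleaned))   # 150 149: the typo is excluded, the genuine prices survive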
Or graph a moving average of prices instead of the actual prices.
Quoting from here:
Statisticians have devised several methods for detecting outliers. All the methods first quantify how far the outlier is from the other values. This can be the difference between the outlier and the mean of all points, the difference between the outlier and the mean of the remaining values, or the difference between the outlier and the next closest value. Next, standardize this value by dividing by some measure of scatter, such as the SD of all values, the SD of the remaining values, or the range of the data. Finally, compute a P value answering this question: If all the values were really sampled from a Gaussian population, what is the chance of randomly obtaining an outlier so far from the other values? If the P value is small, you conclude that the deviation of the outlier from the other values is statistically significant.
Google is your friend, you know. ;)
For your specific question of plotting, and your specific scenario of an average of 1-2 errors per day out of 150, the simplest thing might be to plot trimmed means, or the range of the middle 95% of values, or something like that. It really depends on what value you want out of the plot.
If you are really concerned with the true max and true min of a day's prices, then you have to deal with the outliers as outliers and properly exclude them, probably using one of the outlier tests previously proposed (a data point is x% more than the next point, or than the last n points, or more than 5 standard deviations away from the daily mean). Another approach is to look at what happens after the outlier: if it really is an outlier, it will show up as a sharp upturn followed by a sharp downturn.
If, however, you care about the overall trend, plotting the daily trimmed mean, median, and 5th and 95th percentiles will portray the history well.
Choose your display methods, and how much outlier detection you need to do, based on the analysis question. If you care about medians or percentiles, the outliers are probably irrelevant.
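A rough Python sketch of the plotting-side approach: a per-day trimmed mean plus a 5th/95th percentile band; the trim fraction and the example prices are arbitrary:

import numpy as np

def daily_summary(prices, trim=0.05):
    """Per-day summaries that shrug off a couple of bad entries:
    a trimmed mean (dropping `trim` of the points from each tail)
    and the 5th/95th percentiles."""
    p = np.sort(np.asarray(prices, dtype=float))
    k = int(len(p) * trim)              # points dropped from each end
    trimmed = p[k:len(p) - k] if k else p
    return {
        "trimmed_mean": trimmed.mean(),
        "p05": np.percentile(p, 5),
        "p95": np.percentile(p, 95),
    }

# One day's ~150 prices with two mistyped entries; the summary barely notices them.
day = list(np.linspace(10.0, 11.0, 148)) + [1.05, 110.0]
print(daily_summary(day))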
