Algorithm to match two histograms using only the x-axis variable - statistics

I have two histograms that should be very similar (typically a data/simulation comparison). At first sight, the simulated distribution appears shifted relative to the data distribution. A naive correction is to shift the simulation histogram so that its mean matches the mean of the data. The histograms are filled per event in a loop, so for each event I fill (x-axis value + shift value) to correct the simulation. However, this only improves the agreement in the core of the distribution.
I was wondering if there is a smarter approach/algorithm to correct the simulated distribution so that it matches the data as closely as possible, using only a correction to the x-axis value (i.e. no rescaling of event counts). The correction needs to be applied at the event level.
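For reference, the naive mean shift described above might look like the following minimal sketch. The arrays `data` and `sim` are hypothetical Gaussian stand-ins for the real per-event values, and `correct` is an illustrative helper name, not something from the original setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the per-event x values (assumption: Gaussian toys).
data = rng.normal(loc=10.0, scale=2.0, size=10_000)
sim = rng.normal(loc=9.5, scale=1.5, size=10_000)

# Constant shift that makes the simulated mean equal to the data mean.
shift = data.mean() - sim.mean()


def correct(x_sim):
    """Per-event correction: add the same shift to every simulated value."""
    return x_sim + shift


sim_corrected = correct(sim)

# The means now agree, but the width of the simulation is unchanged.
print("mean difference:", abs(sim_corrected.mean() - data.mean()))
print("widths (sim, data):", sim_corrected.std(), data.std())
```

Because a constant shift leaves the shape of the distribution untouched, any difference in width or tails remains, which is consistent with the agreement improving only in the core.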
Many thanks!

Related

Create Normal Distribution curve in Excel

I am trying to draw a bell curve (normal distribution curve) from the data set provided, but I cannot get Excel to create it. Can anyone help me create it?
https://docs.google.com/spreadsheets/d/1ipDo6WlbmDUBZuuS4ya3ZGD7mkP_vnbByK3KvyLbJ88/edit?usp=sharing
The above file can be used as the data set for creating the curve. Can someone explain the procedure for making a curve from the above data set in Excel?
If your data is normally distributed, it should resemble a bell curve.
By "Trying to draw a Bell Curve/Normal Distribution curve", are you referring to a line diagram?
Remember, if your data is normally distributed, a histogram of it already has the bell shape. If you inserted a histogram of your data, would that be enough?
If not, what you could do is calculate the mean and the standard deviation of your data, then make a column of x values at different numbers of standard deviations from the mean, together with the value the normal curve is expected to take at each of them.
We could then incorporate that into your old histogram. You could use a "Combo" chart and plot the histogram on one axis and a line for your calculated values on the other (you can make the line smooth if you think it's too sharp). Also, you could decrease the distance between your calculated values, e.g. steps of 0.1 standard deviations (1.1, 1.2, ...) instead of, let's say, halves of a standard deviation.
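The calculation behind that column of expected values can be sketched as follows. This is a Python illustration rather than an Excel recipe; `values` is hypothetical data, and the choice of 20 bins and a grid from -3 to +3 standard deviations are assumptions. In Excel, the same quantities would come from AVERAGE, STDEV.S and NORM.DIST.

```python
import numpy as np

rng = np.random.default_rng(1)
values = rng.normal(50.0, 10.0, size=500)  # hypothetical data set

mean = values.mean()
std = values.std(ddof=1)  # sample standard deviation

# Grid of x values in steps of 0.1 standard deviations, from -3 to +3.
z = np.linspace(-3.0, 3.0, 61)
x = mean + z * std

# Normal density at each grid point.
pdf = np.exp(-0.5 * z**2) / (std * np.sqrt(2.0 * np.pi))

# Scale the density to histogram counts so both can share one chart.
counts, edges = np.histogram(values, bins=20)
bin_width = edges[1] - edges[0]
expected_counts = pdf * values.size * bin_width

print(np.column_stack((x, expected_counts))[:5])
```

Multiplying the density by the number of points and the bin width puts the line on the same scale as the histogram bars.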
Unfortunately, the data you provided is not at all normally distributed.
So you can't create a bell curve based on this data, no.

Normalisation or Standardisation for detecting outlier?

When should I use min-max scaling (normalisation) and when should I use standardisation (the z-score) for data pre-processing?
I know that normalisation brings the range of a feature down to 0 to 1, and that the z-score brings it down to roughly -3 to 3, but I am unsure which of the two techniques to use for detecting outliers in data.
Let us briefly agree on the terms:
The z-score tells us how many standard deviations a given element of a sample is away from the mean.
Min-max scaling is the method of rescaling a range of measurements to the interval [0, 1].
By those definitions, the z-score usually spans an interval much larger than [-3,3] if your data follows a long-tailed distribution. On the other hand, a plain normalization does indeed limit the range of the possible outcomes, but will not help you to find outliers, since it just bounds the data.
What you need for outlier detection are thresholds above or below which you consider a data point to be an outlier. Many programming languages offer violin plots or box plots, which nicely show your data distribution. The methods behind these plots implement a common choice of thresholds:
Box and whisker [of the box plot] plots quartiles, and the band inside the box is always the second quartile (the median). But the ends of the whiskers can represent several possible alternative values, among them:
the minimum and maximum of all of the data [...]
one standard deviation above and below the mean of the data
the 9th percentile and the 91st percentile
the 2nd percentile and the 98th percentile.
All data points outside the whiskers of the box plots are plotted as points and considered outliers.
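The difference between the two rescalings and a threshold-based outlier rule can be sketched as below. The sample `values` is hypothetical, and both the 2nd/98th percentile whiskers and the z-score cut of 2.5 are example thresholds chosen for illustration, not recommendations.

```python
import numpy as np

# Hypothetical sample with one obviously extreme value.
values = np.array([10.0, 11.0, 9.5, 10.5, 10.2, 9.8, 10.1, 9.9, 10.3, 30.0])

# z-score: distance from the mean in units of standard deviations.
z_scores = (values - values.mean()) / values.std()

# Min-max scaling: maps the sample onto [0, 1]; the extreme value becomes 1.
min_max = (values - values.min()) / (values.max() - values.min())

# Threshold choice 1: whiskers at the 2nd and 98th percentiles.
low, high = np.percentile(values, [2, 98])
outliers_pct = values[(values < low) | (values > high)]

# Threshold choice 2: a cut on |z| (2.5 is an assumption for this sketch).
z_threshold = 2.5
outliers_z = values[np.abs(z_scores) > z_threshold]

print("min-max range:", min_max.min(), min_max.max())
print("percentile outliers:", outliers_pct)
print("z-score outliers:", outliers_z)
```

Min-max scaling only relabels the extreme point as 1, whereas the thresholds are what actually flag it. Percentile whiskers also always leave some points outside them, so the choice of rule matters.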

Generate positive only distribution based on array

I have an array of data, for example:
[1000,800,700,650,630,500,370,350,310,250,210,180,150,100,80,50,30,20,15,12,10,8,6,3]
From this data, I want to generate random numbers that fit the same distribution.
I can generate a random number using code like the following:
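import numpy as np
import scipy.stats
dist = scipy.stats.gaussian_kde(data)  # kernel density estimate of the data
randomVar = np.floor(dist.resample()[0])  # draw samples from the KDE and round down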
This results in random number generation that includes negative numbers, which I believe I can dump fairly easily without changing the overall shape of the rest of the curve (I just generate sufficient resamples that I still have enough for purpose after dumping the negatives).
However, because the original data was positive values only - and heaped up against that boundary, I end up with a kde that is highest a short distance before it gets to zero, but then drops off sharply from there as it approaches zero; and that downward tick in the KDE is preventing me from generating appropriate numbers.
I can set the bandwidth lower, in order to get a sharper corner, closer to zero, but then due to the low quantity of the original data it ends up sawtoothing elsewhere. Higher bandwidths unfortunately hide the shape of the curve before they remove the downward tick.
As broadly suggested in the comments by Hilbert's Drinking Problem, the real solution was to find a better distribution that fit the parameters. In my case Chi-Squared, which fit both the shape of the curve, and also the fact that it only took positive values.
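A minimal sketch of the parametric route, using the example array from the question: the Chi-Squared choice comes from the text above, but pinning the location at zero with `floc=0` and the sample size of 1000 are assumptions made here, and the fitted parameters are only as good as the fit to this small sample.

```python
import numpy as np
from scipy import stats

data = np.array([1000, 800, 700, 650, 630, 500, 370, 350, 310, 250, 210, 180,
                 150, 100, 80, 50, 30, 20, 15, 12, 10, 8, 6, 3], dtype=float)

# Fit a chi-squared distribution; floc=0 fixes the lower bound of the
# support at zero (assumption), so no negative values can be generated.
df, loc, scale = stats.chi2.fit(data, floc=0)
fitted = stats.chi2(df, loc=loc, scale=scale)

# Draw random numbers from the fitted distribution and round down,
# mirroring the np.floor step in the question.
samples = np.floor(fitted.rvs(size=1000, random_state=0))

print("fitted df and scale:", df, scale)
print("smallest sample:", samples.min())
```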
However in the comments Stelios made the good suggestion of using scipy.stats.rv_histogram, which I used and was satisfied with for a while. This enabled me to fit a curve to the data exactly, though it had two problems:
1) It assumes zero value in the absence of data. I.e. if you set the settings to fit too closely to the data, then during gaps in your data it will drop to zero rather than interpolate.
2) As an extension to point 1, it won't extrapolate beyond the seed data's maximum and minimum (those data ranges are effectively giant gaps, so everything eventually zeroes out).
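Both limitations can be seen in a short sketch with the example array from the question. The choice of 10 bins is an assumption; with it, the bins covering roughly 500 to 600 and 800 to 900 contain no data.

```python
import numpy as np
from scipy import stats

data = np.array([1000, 800, 700, 650, 630, 500, 370, 350, 310, 250, 210, 180,
                 150, 100, 80, 50, 30, 20, 15, 12, 10, 8, 6, 3], dtype=float)

# Build the distribution from a histogram of the data (10 bins is an assumption).
hist = np.histogram(data, bins=10)
dist = stats.rv_histogram(hist)

samples = dist.rvs(size=1000, random_state=0)

# Point 1: the density is exactly zero inside an empty bin (a gap in the data).
print("pdf at 550:", dist.pdf(550.0))

# Point 2: nothing is ever generated outside the observed range.
print("pdf at 1500:", dist.pdf(1500.0))
print("sample range:", samples.min(), samples.max())
```

Using fewer, wider bins closes the gaps at the cost of a coarser shape, which is the same trade-off as the KDE bandwidth described earlier.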

Why do Excel and Matlab give different results?

I have 352k values and I want to find the most frequent values from all of them.
Numbers are rounded to two decimal places.
I use the commands mode(a) in Matlab and mode(B1:B352000) in Excel, but the results are different.
Where did I make a mistake, or which one can I believe?
Thanks
//edit: When I use other commands like average, the results are the same.
From Wikipedia:
For a sample from a continuous distribution, such as [0.935..., 1.211..., 2.430..., 3.668..., 3.874...], the concept is unusable in its raw form, since no two values will be exactly the same, so each value will occur precisely once. In order to estimate the mode of the underlying distribution, the usual practice is to discretize the data by assigning frequency values to intervals of equal distance, as for making a histogram, effectively replacing the values by the midpoints of the intervals they are assigned to. The mode is then the value where the histogram reaches its peak. For small or middle-sized samples the outcome of this procedure is sensitive to the choice of interval width if chosen too narrow or too wide
Thus, it is likely that the two programs use different interval sizes, yielding different answers. You can (I presume) believe both, keeping in mind that the value returned is only an approximation to the true mode of the underlying distribution.
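The sensitivity to interval width described in the quote can be illustrated with a small sketch. The sample `values` and the helper `histogram_mode` are hypothetical, and this does not reproduce what either Excel or Matlab does internally.

```python
import numpy as np

rng = np.random.default_rng(2)
values = rng.normal(5.0, 1.0, size=2000)  # hypothetical continuous sample


def histogram_mode(x, bin_width):
    """Estimate the mode as the midpoint of the most populated bin."""
    edges = np.arange(x.min(), x.max() + bin_width, bin_width)
    counts, edges = np.histogram(x, bins=edges)
    peak = counts.argmax()
    return 0.5 * (edges[peak] + edges[peak + 1])


# The estimate changes with the interval width used for binning.
for width in (0.05, 0.2, 1.0):
    print(f"bin width {width}: mode estimate {histogram_mode(values, width):.3f}")
```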

Averaging many curves with different x and y values

I have several curves that contain many data points. The x-axis is time and let's say I have n curves with data points corresponding to times on the x-axis.
Is there a way to get an "average" of the n curves, despite the fact that the data points are located at different x-points?
I was thinking maybe something like using a histogram to bin the values, but I am not sure which code to start with that could accomplish something like this.
Can Excel or MATLAB do this?
I would also like to plot the standard deviation of the averaged curve.
One concern is: The distribution amongst the x-values is not uniform. There are many more values closer to t=0, but at t=5 (for example), the frequency of data points is much less.
Another concern. What happens if two values fall within 1 bin? I assume I would need the average of these values before calculating the averaged curve.
I hope this conveys what I would like to do.
Any ideas on what code I could use (MATLAB, EXCEL etc) to accomplish my goal?
Since your series are not uniformly sampled in time, interpolating onto a common time grid before computing the mean is one way to avoid biasing the result towards times where you have more frequent samples. Note that interpolation will likely reduce the range of your values, because the interpolated points aren't likely to fall exactly at the times of your measured points. This has a greater effect on the extreme statistics (e.g. 5th and 95th percentiles) than on the mean. If you plan on going this route, you'll need the interp1 and mean functions in MATLAB.
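The interpolate-then-average route can be sketched in Python, with `np.interp` playing the role of interp1. The three curves are hypothetical straight lines sampled at different times, and the common grid `t_common` with a step of 0.5 is an assumption.

```python
import numpy as np

# Hypothetical curves as (time, value) pairs, each sampled at its own times.
curves = [
    (np.array([0.0, 0.5, 1.0, 3.0, 5.0]), np.array([0.0, 1.0, 2.0, 6.0, 10.0])),
    (np.array([0.0, 0.2, 2.0, 5.0]), np.array([1.0, 1.4, 5.0, 11.0])),
    (np.array([0.0, 1.0, 4.0, 5.0]), np.array([2.0, 4.0, 10.0, 12.0])),
]

# Common time grid covering the range shared by all curves.
t_common = np.linspace(0.0, 5.0, 11)

# Linearly interpolate every curve onto the common grid (one row per curve).
resampled = np.array([np.interp(t_common, t, y) for t, y in curves])

# Point-wise average and standard deviation across the curves.
mean_curve = resampled.mean(axis=0)
std_curve = resampled.std(axis=0, ddof=1)

print(np.column_stack((t_common, mean_curve, std_curve)))
```

The `std_curve` array gives the spread around the averaged curve at each grid time, which is what would be plotted as the standard deviation band.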
An alternative is to do a weighted mean. This way you avoid truncating the range of your measured values. Assuming x is a vector of measured values and t is a vector of measurement times in seconds from some reference time then you can compute the weighted mean by:
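timeStep = diff(t);  % duration between consecutive measurements
weightedMean = sum(timeStep .* x(1:end-1)) / sum(timeStep);  % weight each value by the interval that follows it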
As mentioned in the comments above, a sample of your data would help a lot in suggesting the appropriate method for calculating the "average".
