Normal distribution shapes: symmetric or skewed

I'm having a hard time differentiating between the shapes: symmetric & skewed. Some graphs are clear and you don't need to think twice, but here, for example, the histogram really confuses me. Is it right skewed? Is it symmetric? Totally lost. :(
I have tried many ways to get the right answer:
1- comparing the mean = 26.75 and median = 25.5 values
2- calculating the following distances:
a) The distance from the min to the median is (less/equal or greater) than the one from the median to the max.
b) The distance from the min value to Q1 is (less/equal or greater) than the one from Q3 to the max.
c) The distance from Q1 to the median is (less/equal or greater) than the one from the median to Q3.
None of this leads me to any conclusion:
Right skewed: everything follows the right-skewed rules except c).
Symmetric: looking at the graph it seems symmetric, but according to the calculations it is not.
Help please. Thank you in advance.
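If the raw observations behind the histogram are available, a quick numeric check can back up the visual impression. Below is a minimal MATLAB sketch (the vector x is a placeholder for the data, which are not given in the post); both coefficients are near zero for a roughly symmetric sample and positive for a right-skewed one:
% Sketch: two simple skewness checks for a data vector x (MATLAB).
q = quantile(x, [0.25 0.50 0.75]);                 % Q1, median, Q3
bowley  = (q(3) + q(1) - 2*q(2)) / (q(3) - q(1));  % quartile (Bowley) skewness
pearson = 3 * (mean(x) - median(x)) / std(x);      % Pearson's mean-median skewness
With the quoted mean (26.75) above the median (25.5), the mean-median measure would come out slightly positive, which is usually read as a mild right skew even when the histogram looks nearly symmetric.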

Related

Weighted Least Squares vs Monte Carlo comparison

I have an experimental dataset of the following values (y, x1, x2, w), where y is the measured quantity, x1 and x2 are the two independent variables, and w is the error of each measurement.
The function I've chosen to describe my data is
These are my tasks:
1) Estimate values of bi
2) Estimate their standard errors
3) Calculate predicted values of f(x1, x2) on a mesh grid and estimate their confidence intervals
4) Calculate predicted values of
and definite integral
and their confidence intervals on a mesh grid
I have several questions:
1) Can all of my tasks be solved by weighted least squares? I've solved tasks 1-3 using WLS in matrix form by linearising the chosen function (see the sketch after this list of questions), but I have no idea how to solve step 4.
2) I've performed Monte Carlo simulations to estimate the bi and their standard errors. I generated perturbed values y'i from a normal distribution with mean yi and standard deviation wi, and repeated this N = 5000 times. For each perturbed dataset I estimated b'i, and from the 5000 values of b'i I calculated their means and standard deviations. The bi estimated from the Monte Carlo simulation coincide with those found by WLS. Am I correct that the standard deviations of the b'i must be divided by the number of degrees of freedom to obtain the standard errors?
3) How can I estimate confidence bands for the predicted values of y using the Monte Carlo approach? I generated a batch of perturbed bi values from normal distributions, using their BLUEs as means together with their standard deviations. Then I calculated many predicted values of f(x1, x2) and found their means and standard deviations. The values of f(x1, x2) found by WLS and MC coincide, but the s.d. from MC are 5-45 orders higher than those from WLS. What scaling factor am I missing here?
4) It seems that some of the parameters b are not independent of each other, since there are only 2 independent variables. Should I take this into account in question 3 when I generate the bi values? If yes, how can this be done? (See the second sketch below.) Should I use a chi-squared test to decide whether generated values of bi are suitable for further calculations, or whether they should be rejected?
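On questions 1 and 2, the procedure described above can be written compactly. The following is only a sketch of the standard weighted normal equations and the Monte Carlo loop (MATLAB); X, y, w and N are placeholder names for the design matrix of the linearised model, the measurements, the measurement errors and the number of replicates, and the w are treated as known standard deviations. Note that the spread of the refitted parameters across replicates is normally taken directly as the standard-error estimate, without dividing by the number of degrees of freedom.
% Sketch: WLS in matrix form, plus a parametric Monte Carlo check (MATLAB).
W    = diag(1 ./ w.^2);                  % weight matrix, w_i = std. dev. of y_i
A    = X' * W * X;
beta = A \ (X' * W * y);                 % WLS estimates of the b_i
covB = inv(A);                           % covariance of the estimates (rescale by the
                                         % reduced chi-square if the w are only relative)
seB  = sqrt(diag(covB));                 % standard errors of the b_i
N    = 5000;                             % Monte Carlo replicates
bMC  = zeros(N, numel(beta));
for k = 1:N
    yPert     = y + w .* randn(size(y)); % perturb each y_i by its own error w_i
    bMC(k, :) = (A \ (X' * W * yPert))';
end
bSE = std(bMC);                          % spread across replicates ~ standard errors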
In fact, I not only want to solve tasks I've mentioned earlier, but also I want to compare the two methods for regression analysys. I would appreciate any help and suggestions!
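On questions 3 and 4: one common reason for Monte Carlo bands coming out far too wide is drawing each bi independently from its marginal distribution when the estimates are strongly correlated. Drawing whole parameter vectors from the joint (multivariate) normal distribution with the full WLS covariance matrix preserves those correlations, and no chi-squared accept/reject step should then be needed. Below is a sketch reusing beta and covB from the previous block, with modelF(b, X1, X2) a placeholder for a function that evaluates the fitted model on the mesh grid:
% Sketch: confidence bands for f(x1, x2) from correlated parameter draws (MATLAB).
N     = 5000;
bDraw = mvnrnd(beta', covB, N);          % N joint draws of the whole parameter vector
fGrid = zeros(N, numel(X1));             % X1, X2: mesh-grid coordinates (placeholders)
for k = 1:N
    fGrid(k, :) = reshape(modelF(bDraw(k, :), X1, X2), 1, []);  % model on the grid
end
fMean = mean(fGrid);                     % predicted surface
fStd  = std(fGrid);                      % pointwise spread, i.e. the band half-width scale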

Normalisation or Standardisation for detecting outlier?

When should I use min-max scaling (normalisation) and when should I use standardisation (z-scores) for data pre-processing?
I know that normalisation brings the range of a feature down to 0 to 1, and the z-score brings it down to -3 to 3, but I am unsure which of the two techniques to use for detecting outliers in the data.
Let us briefly agree on the terms:
The z-score tells us how many standard deviations a given element of a sample is away from the mean.
The min-max scaling is the method of rescaling a range of measurements to the interval [0, 1].
By those definitions, the z-score usually spans an interval much larger than [-3, 3] if your data follow a long-tailed distribution. On the other hand, a plain normalisation does indeed limit the range of the possible outcomes, but it will not help you find outliers, since it just bounds the data.
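In code, the two rescalings are one-liners. A minimal MATLAB sketch for a data vector x (a placeholder name):
% Sketch: z-score vs. min-max scaling for a data vector x (MATLAB).
z   = (x - mean(x)) ./ std(x);            % z-score: unbounded, easily beyond +/-3 for long tails
x01 = (x - min(x)) ./ (max(x) - min(x));  % min-max scaling: always inside [0, 1]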
What you need for outlier detection are thresholds above or below which you consider a data point to be an outlier. Many programming languages offer violin plots or box plots, which nicely show your data distribution. The methods behind these plots implement a common choice of thresholds:
The bottom and top of the box [of the box plot] are the first and third quartiles, and the band inside the box is always the second quartile (the median). But the ends of the whiskers can represent several possible alternative values, among them:
the minimum and maximum of all of the data [...]
one standard deviation above and below the mean of the data
the 9th percentile and the 91st percentile
the 2nd percentile and the 98th percentile.
All data points outside the whiskers of the box plots are plotted as points and considered outliers.
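One very common whisker convention is Tukey's rule of 1.5 times the interquartile range. As a MATLAB sketch for a data vector x, it gives explicit outlier thresholds:
% Sketch: outlier thresholds from the 1.5*IQR whisker rule (MATLAB).
q    = quantile(x, [0.25 0.75]);          % Q1 and Q3
iqrX = q(2) - q(1);                       % interquartile range
lo   = q(1) - 1.5 * iqrX;
hi   = q(2) + 1.5 * iqrX;
outliers = x(x < lo | x > hi);            % points a box plot would draw individually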

Averaging many curves with different x and y values

I have several curves that contain many data points. The x-axis is time and let's say I have n curves with data points corresponding to times on the x-axis.
Is there a way to get an "average" of the n curves, despite the fact that the data points are located at different x-points?
I was thinking maybe something like using a histogram to bin the values, but I am not sure which code to start with that could accomplish something like this.
Can Excel or MATLAB do this?
I would also like to plot the standard deviation of the averaged curve.
One concern is: The distribution amongst the x-values is not uniform. There are many more values closer to t=0, but at t=5 (for example), the frequency of data points is much less.
Another concern: what happens if two values fall within one bin? I assume I would need to average these values before calculating the averaged curve.
I hope this conveys what I would like to do.
Any ideas on what code I could use (MATLAB, EXCEL etc) to accomplish my goal?
Since your series are not uniformly sampled, interpolating onto a common time grid prior to computing the mean is one way to avoid biasing towards times where you have more frequent samples. Note that, by definition, interpolation will likely reduce the range of your values, i.e. the interpolated points are unlikely to fall exactly at the times of your measured points. This has a greater effect on the extreme statistics (e.g. the 5th and 95th percentiles) than on the mean. If you plan on going this route, you'll need the interp1 and mean functions.
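A minimal sketch of that interpolation route, assuming the n curves are stored in cell arrays T{i} (times) and Y{i} (values); the cell-array names and the grid limits are placeholders, not part of the original question:
% Sketch: average n curves by interpolating onto a common time grid (MATLAB).
tGrid = linspace(0, 5, 200);               % common grid; choose limits to suit the data
n     = numel(Y);
Yi    = nan(n, numel(tGrid));
for i = 1:n
    Yi(i, :) = interp1(T{i}, Y{i}, tGrid); % NaN outside each curve's measured range
end
yMean = mean(Yi, 1, 'omitnan');            % averaged curve
yStd  = std(Yi, 0, 1, 'omitnan');          % its standard deviation at each grid point
Plotting yMean together with yMean +/- yStd then gives the averaged curve with its spread.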
An alternative is to compute a weighted mean. This way you avoid truncating the range of your measured values. Assuming x is a vector of measured values and t is a vector of measurement times in seconds from some reference time, you can compute the time-weighted mean by:
timeStep = diff(t);                                          % time interval covered by each sample
weightedMean = sum(timeStep .* x(1:end-1)) / sum(timeStep);  % time-weighted average of x
As mentioned in the comments above, a sample of your data would help a lot in suggesting the appropriate method for calculating the "average".

Algorithm for drawing box plot for given data

I have a sorted array of real values, say X, drawn from some unknown distribution, and I would like to draw a box plot for this data.
In the simplest case, I need to know five values: min, Q1, median, Q3, and max.
Trivially, min = X[0], max = X[length(X)-1], and possibly median = X[ceil(length(X)/2)]. But I'm wondering how to determine the lower quartile Q1 and the upper quartile Q3.
When I plot X = [1,2,4] using MATLAB, I obtain the following result:
It seems to me like there is some magic behind the values Q1 = 1.25 and Q3 = 3.5, but I don't know what that magic is. Does anybody have experience with this?
If you go to the original definition of box plots (look up John Tukey), you use the median for the midpoint (i.e., 2 in your data set of 1, 2, 4). The endpoints are the min and max.
The top and bottom of the box are not exactly defined by quartiles, instead they are called "hinges". Hinges are the medians of the top and bottom halves of the data. If there is an odd number of observations, the median of the entire set is used in determining both hinges. The lower hinge is the median of (1,2), or 1.5. The top hinge is the median of (2,4), or 3.
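A small MATLAB sketch of the hinge computation just described, splitting the sorted data at the median and taking the median of each half (with the overall median included in both halves when the count is odd):
% Sketch: Tukey hinges for a data vector X (MATLAB).
X = sort(X);
n = numel(X);
if mod(n, 2) == 1                          % odd count: the median belongs to both halves
    lowerHalf = X(1 : (n+1)/2);
    upperHalf = X((n+1)/2 : n);
else                                       % even count: split cleanly in two
    lowerHalf = X(1 : n/2);
    upperHalf = X(n/2+1 : n);
end
lowerHinge = median(lowerHalf);            % 1.5 for X = [1 2 4]
upperHinge = median(upperHalf);            % 3   for X = [1 2 4]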
There are actually dozens of definitions of a box plot's quartiles (Wikipedia: "There is no universal agreement on choosing the quartile values"). If you want to rationalize MatLab's box plot, you'll have to check its documentation. Otherwise, you could Google your brains out to try to find a method that matches the results.
Minitab gives 1 and 4 for the hinges in your data set. Excel's PERCENTILE function gives 1.5 and 3, which incidentally matches Tukey's algorithm at least in this case.
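For what it's worth, MATLAB's quantile function places the k-th sorted value at probability (k - 0.5)/n and interpolates linearly in between, and that convention does reproduce the values in the question; whether boxplot uses exactly the same rule is something its documentation would have to confirm, but the arithmetic works out:
% Sketch: reproducing 1.25 and 3.5 with MATLAB's default quantile convention.
X  = [1 2 4];                              % sorted values sit at probabilities 1/6, 3/6, 5/6
q1 = quantile(X, 0.25);                    % = 1.25 (interpolated between 1/6 and 3/6)
q3 = quantile(X, 0.75);                    % = 3.50 (interpolated between 3/6 and 5/6)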
The median divides the data into two halves: the median of the first half is Q1, and the median of the second half is Q3.
More info: http://www.purplemath.com/modules/boxwhisk.htm
Note on the MATLAB boxplot: Q1 and Q3 may be calculated in a different way in MATLAB; I'd try with a larger amount of test data. With my method, Q1 should be 1 and Q3 should be 4.
EDIT:
The calculation MATLAB possibly does is: take the difference between the median and the first number (the minimum), take a quarter of that, and add it to the first number to get Q1.
The same (roughly) applies to Q3: take the difference between the median and the highest number, and subtract a quarter of that from the highest number. That gives Q3.
