Normal distribution shapes - statistics

Hi everyone
I'm having a hard time telling symmetric and skewed shapes apart.
Some graphs are clear and you don't need to think twice.
But this histogram, for example, really confuses me.
Is it right-skewed? Is it symmetric? I'm totally lost.
I have tried several ways to get the right answer:
1- comparing the mean (26.75) with the median (25.5)
2- comparing the following distances:
a) whether the distance from the min to the median is less than, equal to or
greater than the distance from the median to the max
b) whether the distance from the min to Q1 is less than, equal to or
greater than the distance from Q3 to the max
c) whether the distance from Q1 to the median is less than, equal to or
greater than the distance from the median to Q3
These checks do not lead me to any conclusion.
Right skewed: every check follows the right-skewed rules except c)
Symmetric: the graph looks symmetric, but according to the calculations it is not
Help please.
Thank you in advance.
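The checks listed in the question can be written down compactly. Below is a minimal Python sketch, assuming the raw values are available as a list; the sample data, the function name `skew_checks` and the choice of the "inclusive" quartile rule are illustrative assumptions, not taken from the question. A positive number means the upper side is longer (a hint of right skew), a negative one means the lower side is longer, and values near zero suggest symmetry.

```python
import statistics

def skew_checks(values):
    """Return the mean/median gap and the distance comparisons a), b), c).

    Each entry is (upper distance) - (lower distance), so a positive value
    points towards right skew and a negative value towards left skew.
    """
    data = sorted(values)
    # Quartile rule is an assumption; other rules give slightly different Q1/Q3.
    q1, med, q3 = statistics.quantiles(data, n=4, method="inclusive")
    lo, hi = data[0], data[-1]
    return {
        "mean_minus_median": statistics.mean(data) - med,
        "a_max_side_minus_min_side": (hi - med) - (med - lo),
        "b_upper_tail_minus_lower_tail": (hi - q3) - (q1 - lo),
        "c_upper_box_minus_lower_box": (q3 - med) - (med - q1),
    }

# Hypothetical sample, only to show the output format.
sample = [18, 20, 22, 24, 25, 26, 28, 31, 35, 42]
for name, value in skew_checks(sample).items():
    print(f"{name}: {value:+.2f}")
```

When the four numbers disagree in sign, as in the question, the sample is usually only mildly skewed, and the checks that involve the tails (a and b) are the ones most affected by a few extreme values.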

Related

Weighted Least Squares vs Monte Carlo comparison

I have an experimental dataset of the following values (y, x1, x2, w), where y is the measured quantity, x1 and x2 are the two independent variables and w is the error of each measurement.
The function I've chosen to describe my data is
These are my tasks:
1) Estimate values of bi
2) Estimate their standard errors
3) Calculate predicted values of f(x1, x2) on a mesh grid and estimate their confidence intervals
4) Calculate predicted values of
and definite integral
and their confidence intervals on a mesh grid
I have several questions:
1) Can all of my tasks be solved by weighted least squares? I've solved tasks 1-3 using WLS in matrix form by linearising the chosen function, but I have no idea how to solve task 4.
2) I've performed Monte Carlo simulations to estimate bi and their s.e. I generated perturbed values y'i from a normal distribution with mean yi and standard deviation wi, and repeated this N=5000 times. For each perturbed dataset I estimated b'i, and from the 5000 values of b'i I calculated their means and standard deviations. In the end, the bi estimated from the Monte Carlo simulation coincide with those found by WLS. Am I correct that the standard deviations of b'i must be divided by the number of degrees of freedom to obtain the standard errors?
3) How do I estimate confidence bands for predicted values of y using the Monte Carlo approach? I've generated a set of perturbed bi values from normal distributions, using their BLUE values as means and their standard deviations. Then I calculated many predicted values of f(x1,x2) and found their means and standard deviations. The values of f(x1,x2) found by WLS and MC coincide, but the s.d. found from MC are 5-45 order higher than those from WLS. What is the scaling factor that I'm missing here?
4) It seems that some of the parameters b are not independent of each other, since there are only 2 independent variables. Should I take this into account in question 3, when I generate the bi values? If yes, how can this be done? Should I use a chi-squared test to decide whether the generated values of bi are suitable for further calculations, or should they be rejected?
In fact, I not only want to solve the tasks I've mentioned, but I also want to compare the two methods of regression analysis. I would appreciate any help and suggestions!
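The comparison described in question 2 can be tried on a toy problem. Below is a minimal Python sketch, assuming a hypothetical straight-line model y = b0 + b1*x (the actual function is not shown in the question), made-up data, and measurement errors w that are true absolute standard deviations. The names `wls_line` and `monte_carlo` are illustrative. It fits the line by WLS, then refits 5000 perturbed datasets and compares the spread of the refitted parameters with the WLS standard errors.

```python
import math
import random
import statistics

def wls_line(x, y, sigma):
    """WLS fit of y = b0 + b1*x with weights 1/sigma^2.

    Returns (b0, b1, se_b0, se_b1, cov_b0_b1) from the closed-form
    normal equations.
    """
    w = [1.0 / s**2 for s in sigma]
    S = sum(w)
    Sx = sum(wi * xi for wi, xi in zip(w, x))
    Sy = sum(wi * yi for wi, yi in zip(w, y))
    Sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    Sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    det = S * Sxx - Sx**2
    b0 = (Sxx * Sy - Sx * Sxy) / det
    b1 = (S * Sxy - Sx * Sy) / det
    return b0, b1, math.sqrt(Sxx / det), math.sqrt(S / det), -Sx / det

def monte_carlo(x, y, sigma, n_rep=5000, seed=0):
    """Refit n_rep datasets perturbed with N(y_i, sigma_i) noise.

    Returns (mean_b0, mean_b1, sd_b0, sd_b1) over the replicates.
    """
    rng = random.Random(seed)
    b0s, b1s = [], []
    for _ in range(n_rep):
        y_pert = [rng.gauss(yi, si) for yi, si in zip(y, sigma)]
        b0, b1, *_ = wls_line(x, y_pert, sigma)
        b0s.append(b0)
        b1s.append(b1)
    return (statistics.mean(b0s), statistics.mean(b1s),
            statistics.stdev(b0s), statistics.stdev(b1s))

# Hypothetical data, for illustration only.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 2.9, 5.2, 6.8, 9.1, 11.0]
sigma = [0.2, 0.2, 0.3, 0.3, 0.4, 0.4]

fit = wls_line(x, y, sigma)
mc = monte_carlo(x, y, sigma)
print("WLS  b0, b1, se_b0, se_b1:", fit[:4])
print("MC   b0, b1, sd_b0, sd_b1:", mc)
print("WLS  cov(b0, b1):", fit[4])
```

In this toy setting the standard deviation of the replicated estimates already sits close to the WLS standard error, with no further division. The covariance is returned as well because b0 and b1 are generally correlated; drawing them independently when building prediction bands ignores that correlation.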

Distance between straight lines

I work in the oil & gas industry and I'm seeking advice about how to calculate the minimum distance between a set of wells (the wells are drawn as straight lines on a map). My goal is for each individual well to have a unique "spacing" value (measured in feet) which is basically the straight-line horizontal distance to the closest wellbore on a map. Below is a simple example of what I'm trying to accomplish (assume the pipe | symbol is a wellbore and the dashes are the distance between the wells)
|--|---|-|
In the drawing above we have 4 wells. The 1st well (starting from the far left) would have a spacing value of 2 (since there are 2 dashes to the closest well), the 2nd well would also have a value of 2 (since the closest well is the one to the far left which is two spaces away), the 3rd well would have a value of 1, and the 4th well would have a value of 1.
Now imagine that I have hundreds of these wells (each with latitude/longitude points that describe the start & end points of each well) and I have them all mapped in TIBCO Spotfire (scattered across Texas). Do you guys know if it would even be possible to automate a calculation like the above? I would also like to build in a rule that says the max distance between wells is 2640 ft (half of a mile).
Any ideas are appreciated!
I think you should be able to do this without any R or IronPython.
Within Spotfire, you can calculate the distance in miles between 2 points using the formula below (substitute 6371 for 3958.756 to get the answer in kilometres).
GreatCircleDistance([Lat 1],[Lon 1],[Lat 2],[Lon 2]) * 3958.756
For your use case, you could cross join your table of locations, so that you have a row for every possible combination of locations, then calculate the distance between them using the formula above. After that, it should be pretty straightforward to find each well's closest neighbour.
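Outside Spotfire, the same cross-join idea can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: each well is reduced to a single representative point (for example its midpoint), as the point-to-point formula above implies, so true segment-to-segment distances are not handled. The well names, coordinates and function names are hypothetical; the Earth radius is the one used in the expression above and the 2640 ft cap comes from the question.

```python
import math

EARTH_RADIUS_MILES = 3958.756   # same radius as in the Spotfire expression
FEET_PER_MILE = 5280
MAX_SPACING_FT = 2640           # cap requested in the question

def great_circle_ft(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance between two points, in feet."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a)) * FEET_PER_MILE

def spacing_ft(wells):
    """Map each well name to the capped distance to its nearest other well.

    wells: dict of name -> (lat, lon) for one representative point per well.
    """
    result = {}
    for name, (lat, lon) in wells.items():
        nearest = min(
            (great_circle_ft(lat, lon, olat, olon)
             for other, (olat, olon) in wells.items() if other != name),
            default=float("inf"),
        )
        result[name] = min(nearest, MAX_SPACING_FT)
    return result

# Hypothetical wells along one meridian.
wells = {
    "A": (31.000, -102.0),
    "B": (31.002, -102.0),
    "C": (31.003, -102.0),
    "D": (32.000, -102.0),
}
for name, feet in spacing_ft(wells).items():
    print(name, round(feet), "ft")
```

Comparing every well with every other well is quadratic in the number of wells, which is fine for a few hundred rows.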

Excel formula to calculate the distance between multiple points using lat/lon coordinates

I'm currently drawing up a mock database schema with two tables: Booking and Waypoint.
Booking stores the taxi booking information.
Waypoint stores the pickup and drop off points during the journey, along with the lat lon position. Each sequence is a stop in the journey.
How would I calculate the distance between the different stops in each journey (using the lat/lon data) in Excel?
Is there a way to programmatically define this in Excel, i.e. so that a formula can be placed into the mileage column (Booking table), lookup the matching sequence (via bookingId) for that journey in the Waypoint table and return a result?
Example 1:
A journey with 2 stops:
1 1 1 MK4 4FL, 2, Levens Hall Drive, Westcroft, Milton Keynes 52.002529 -0.797623
2 1 2 MK2 2RD, 55, Westfield Road, Bletchley, Milton Keynes 51.992571 -0.72753
4.1 miles according to Google, entry made in mileage column in Booking table where id = 1
Example 2:
A journey with 3 stops:
6 3 1 MK7 7DT, 2, Spearmint Close, Walnut Tree, Milton Keynes 52.017486 -0.690113
7 3 2 MK18 1JL, H S B C, Market Hill, Buckingham 52.000674 -0.987062
8 3 1 MK17 0FE, 1, Maids Close, Mursley, Milton Keynes 52.040622 -0.759417
27.7 miles according to Google, entry made in mileage column in Booking table where id = 3
To find the distance between two points, use the formula below. The result is in km; convert to miles if needed.
Point A: Lat1, Long1
Point B: Lat2, Long2
=ACOS(COS(RADIANS(90-Lat1))*COS(RADIANS(90-Lat2)) + SIN(RADIANS(90-Lat1))*SIN(RADIANS(90-Lat2))*COS(RADIANS(Long1-Long2)))*6371
Regards
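For checking the spreadsheet result, the same formula can be written in Python. The sketch below is a direct transcription of the Excel expression above (spherical law of cosines with a 6371 km radius), plus a hypothetical helper `journey_km` that sums consecutive legs of one booking. The function names are illustrative; the coordinates are the ones given in the question, with the stops of journey 2 taken in ID order 6, 7, 8.

```python
import math

EARTH_RADIUS_KM = 6371.0

def leg_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km, same form as the Excel formula."""
    colat1 = math.radians(90 - lat1)
    colat2 = math.radians(90 - lat2)
    cos_angle = (math.cos(colat1) * math.cos(colat2)
                 + math.sin(colat1) * math.sin(colat2)
                 * math.cos(math.radians(lon1 - lon2)))
    # Guard against rounding pushing the value just outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return math.acos(cos_angle) * EARTH_RADIUS_KM

def journey_km(stops):
    """Sum the legs between consecutive (lat, lon) stops of one journey."""
    return sum(leg_km(*a, *b) for a, b in zip(stops, stops[1:]))

journey_1 = [(52.002529, -0.797623), (51.992571, -0.72753)]
journey_3 = [(52.017486, -0.690113), (52.000674, -0.987062),
             (52.040622, -0.759417)]

KM_PER_MILE = 1.609344
for label, stops in (("booking 1", journey_1), ("booking 3", journey_3)):
    km = journey_km(stops)
    print(label, round(km, 2), "km =", round(km / KM_PER_MILE, 2), "miles")
```

These are straight-line distances, so they come out shorter than the road mileage reported by Google.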
Until quite recently, accurate maps were constructed by triangulation, which in essence is an application of Pythagoras's theorem. For the distance between any pair of co-ordinates, take the square root of the sum of the squared difference in x co-ordinates and the squared difference in y co-ordinates. The x and y co-ordinates must, however, be in the same units (e.g. miles), which means applying scale factors to the latitude and longitude values. This can be complicated because the factor for longitude depends on latitude (walking all the way round the North Pole is less far than walking round the Equator), but in your case a factor for 52° North should serve. The results obtained this way (which might be checked here) are around 20% different from the examples you give (in the second case, pairing IDs 6 and 7 and adding that result to the result from pairing IDs 7 and 8).
Since you say accuracy is not important, and assuming distances are small (say less than 1000 miles) you can use the loxodromic distance.
For this, compute the difference of latitudes (dlat) and the difference of longitudes (dlon), both in degrees. If there is any chance (unlikely) that you're crossing the 180º meridian, take the difference modulo 360º so that it lies between -180º and 180º. Also compute the average latitude (alat).
Then compute:
distance= 60*sqrt(dlat^2 + (dlon*cos(alat))^2)
This distance is in nautical miles. Apply conversions as needed.
EXPLANATION: This takes advantage of the fact that one nautical mile is, by definition, always equal to one arc-minute of latitude. The cosine accounts for the fact that meridians get closer to each other as they approach the poles. The rest is just an application of Pythagoras's theorem, which requires the relevant portion of the globe to be flat; that is of course only a good approximation for small distances.
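The formula above translates almost line by line into code. Here is a minimal Python sketch, assuming inputs in decimal degrees; the function name `flat_distance_nm` and the conversion constant are illustrative additions. The longitude difference is wrapped into the range -180º to 180º as described.

```python
import math

KM_PER_NAUTICAL_MILE = 1.852

def flat_distance_nm(lat1, lon1, lat2, lon2):
    """Flat-earth approximation of the distance in nautical miles.

    Inputs are decimal degrees; only suitable for short distances.
    """
    dlat = lat2 - lat1
    # Wrap the longitude difference into [-180, 180).
    dlon = (lon2 - lon1 + 180.0) % 360.0 - 180.0
    alat = math.radians((lat1 + lat2) / 2.0)
    # 60 nautical miles per degree of latitude.
    return 60.0 * math.sqrt(dlat**2 + (dlon * math.cos(alat))**2)

# First example journey from the question.
nm = flat_distance_nm(52.002529, -0.797623, 51.992571, -0.72753)
print(round(nm, 3), "nm =", round(nm * KM_PER_NAUTICAL_MILE, 2), "km")
```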
It all depends on what the distance is and what accuracy you require. Calculations based on "Earth locally flat" model will not provide great results for long distances but for short distance they may be ok. Models assuming Earth is a perfect sphere (e.g. Haversine formula) give better accuracy but they still do not produce geodesic grade results.
See Geodesics on an ellipsoid for more details.
One of the high accuracy (fraction of a mm) solutions is known as Vincenty's formulae. For my Excel VBA implementation look here https://github.com/tdjastrzebski/Vincenty-Excel

Algorithm for drawing box plot for given data

I have a sorted array of real values, say X, drawn from some unknown distribution. I would like to draw a box plot for this data.
In the simplest case, I need to know five values: min, Q1, median, Q3, and max.
Trivially, min = X[0] and max = X[length(X)-1], and, for an odd number of values, median = X[floor(length(X)/2)] (with zero-based indexing). But I'm wondering how to determine the lower quartile Q1 and the upper quartile Q3.
When I plot X = [1,2,4] using MATLAB, I obtain the following result:
It seems to me like there is some magic how to obtain the values Q1 = 1.25 and Q3 = 3.5, but I don't know what the magic is. Does anybody have experience with this?
If you go to the original definition of box plots (look up John Tukey), you use the median for the midpoint (i.e., 2 in your data set of 1, 2, 4). The endpoints are the min and max.
The top and bottom of the box are not exactly defined by quartiles, instead they are called "hinges". Hinges are the medians of the top and bottom halves of the data. If there is an odd number of observations, the median of the entire set is used in determining both hinges. The lower hinge is the median of (1,2), or 1.5. The top hinge is the median of (2,4), or 3.
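The hinge rule described above is short enough to write out. A minimal Python sketch, assuming the input is already sorted; the function name `tukey_hinges` is illustrative.

```python
from statistics import median

def tukey_hinges(sorted_values):
    """Return (lower hinge, upper hinge) of an already sorted sequence.

    With an odd number of observations the overall median is included
    in both halves, as in the description above.
    """
    n = len(sorted_values)
    half = (n + 1) // 2          # size of each half, median included if n is odd
    lower = sorted_values[:half]
    upper = sorted_values[n - half:]
    return median(lower), median(upper)

print(tukey_hinges([1, 2, 4]))   # lower half (1, 2), upper half (2, 4)
```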
There are actually dozens of definitions of a box plot's quartiles (Wikipedia: "There is no universal agreement on choosing the quartile values"). If you want to rationalize MATLAB's box plot, you'll have to check its documentation. Otherwise, you could Google your brains out trying to find a method that matches the results.
Minitab gives 1 and 4 for the hinges in your data set. Excel's PERCENTILE function gives 1.5 and 3, which incidentally matches Tukey's algorithm at least in this case.
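The different results quoted in this thread can be reproduced with three common interpolation rules. The sketch below is in Python and rests on assumptions: the rule labels and function names are illustrative, and attributing the "midpoint" rule (the i-th sorted value is treated as the (i - 0.5)/n quantile) to MATLAB is an assumption based on its percentile documentation, not something stated in the answers here. Each rule maps a probability p to a 1-based fractional position in the sorted data and then interpolates linearly.

```python
import math

def quantile_at(sorted_values, position):
    """Linear interpolation at a 1-based fractional position, clamped to the data."""
    n = len(sorted_values)
    position = min(max(position, 1.0), float(n))
    lower = int(math.floor(position))
    upper = min(lower + 1, n)
    frac = position - lower
    lo_val = sorted_values[lower - 1]
    hi_val = sorted_values[upper - 1]
    return lo_val + frac * (hi_val - lo_val)

# Position rules (labels are illustrative, not official names).
RULES = {
    "midpoint, position n*p + 0.5": lambda n, p: n * p + 0.5,
    "inclusive, position (n - 1)*p + 1": lambda n, p: (n - 1) * p + 1,
    "exclusive, position (n + 1)*p": lambda n, p: (n + 1) * p,
}

def quartiles(sorted_values, rule):
    """Return (Q1, Q3) of sorted data under the named position rule."""
    n = len(sorted_values)
    pos = RULES[rule]
    return (quantile_at(sorted_values, pos(n, 0.25)),
            quantile_at(sorted_values, pos(n, 0.75)))

for rule in RULES:
    print(rule, quartiles([1, 2, 4], rule))
```

For X = [1, 2, 4] the midpoint rule gives the 1.25 and 3.5 seen in the question, the inclusive rule gives the 1.5 and 3 reported for Excel's PERCENTILE, and the exclusive rule gives the 1 and 4 reported for Minitab.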
The median divides the data into two halves. The median of the first half is Q1, and the median of the second half is Q3.
More info: http://www.purplemath.com/modules/boxwhisk.htm
Note on the MATLAB boxplot: Q1 and Q3 may be calculated in a different way in MATLAB; I'd try it with a larger set of test data. With my method, Q1 should be 1 and Q3 should be 4.
EDIT:
A possible calculation that MATLAB performs is to take the difference between the median and the first number of the first half, and take a quarter of that. Add that to the first number to get Q1.
The same (roughly) applies to Q3: take the difference between the median and the highest number, and subtract a quarter of that from the highest number. That is Q3.
