Algorithm for drawing box plot for given data - statistics

I have a sorted array of real values, say X, drawn from some unknown distribution. I would like to draw a box plot for this data.
In the simplest case, I need to know five values: min, Q1, median, Q3, and max.
Trivially, min = X[0], max = X[length(X)-1], and possibly median = X[floor(length(X)/2)] (when the length is odd). But I'm wondering how to determine the lower quartile Q1 and the upper quartile Q3.
When I plot X = [1,2,4] using MATLAB, I obtain the following result:
It seems to me like there is some magic in how the values Q1 = 1.25 and Q3 = 3.5 are obtained, but I don't know what the magic is. Does anybody have experience with this?

If you go to the original definition of box plots (look up John Tukey), you use the median for the midpoint (i.e., 2 in your data set of 1, 2, 4). The endpoints are the min and max.
The top and bottom of the box are not exactly defined by quartiles, instead they are called "hinges". Hinges are the medians of the top and bottom halves of the data. If there is an odd number of observations, the median of the entire set is used in determining both hinges. The lower hinge is the median of (1,2), or 1.5. The top hinge is the median of (2,4), or 3.
There are actually dozens of definitions of a box plot's quartiles (Wikipedia: "There is no universal agreement on choosing the quartile values"). If you want to rationalize MATLAB's box plot, you'll have to check its documentation. Otherwise, you could Google your brains out to try to find a method that matches the results.
Minitab gives 1 and 4 for the hinges in your data set. Excel's PERCENTILE function gives 1.5 and 3, which incidentally matches Tukey's algorithm at least in this case.
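To make the differences concrete, here is a minimal Python sketch that reproduces the numbers quoted in this thread for the data set 1, 2, 4. The rule names (`excel_inclusive`, `n_plus_1`, `midpoint`) are labels chosen for this illustration, not official names. The `midpoint` rule happens to give the 1.25 and 3.5 from the question, but whether it is exactly what MATLAB does should be confirmed against MATLAB's documentation.

```python
import math
from statistics import median


def quantile_at(sorted_x, h):
    """Linearly interpolate at 1-based fractional position h, clamped to the data."""
    n = len(sorted_x)
    h = min(max(h, 1.0), float(n))
    lo = math.floor(h)
    hi = min(lo + 1, n)
    frac = h - lo
    return sorted_x[lo - 1] + frac * (sorted_x[hi - 1] - sorted_x[lo - 1])


def quartiles(sorted_x, rule):
    """Return (Q1, Q3) under one of three position rules (names are illustrative)."""
    n = len(sorted_x)
    position = {
        "excel_inclusive": lambda p: (n - 1) * p + 1,  # gives 1.5 and 3 for 1, 2, 4
        "n_plus_1": lambda p: (n + 1) * p,             # gives 1 and 4 for 1, 2, 4
        "midpoint": lambda p: n * p + 0.5,             # gives 1.25 and 3.5 for 1, 2, 4
    }[rule]
    return quantile_at(sorted_x, position(0.25)), quantile_at(sorted_x, position(0.75))


def tukey_hinges(sorted_x):
    """Medians of the lower and upper halves; the middle value is shared when n is odd."""
    n = len(sorted_x)
    half = (n + 1) // 2
    return median(sorted_x[:half]), median(sorted_x[n - half:])


data = [1, 2, 4]
for rule in ("excel_inclusive", "n_plus_1", "midpoint"):
    print(rule, quartiles(data, rule))
print("tukey_hinges", tukey_hinges(data))
```

All three position rules use the same linear interpolation; they differ only in where on the sorted data the 25% and 75% points are placed, which is why small samples show such different quartiles.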

The median divides the data into two halves. The median of the first half = Q1, and the median of the second half = Q3.
More info: http://www.purplemath.com/modules/boxwhisk.htm
Note on the MATLAB boxplot: Q1 and Q3 may be calculated in a different way in MATLAB; I'd try with a larger amount of test data. With my method, Q1 should be 1 and Q3 should be 4 (once the median is excluded, the two halves are just 1 and 4).
EDIT:
One possible calculation that MATLAB does: take the difference between the median and the lowest number, take a quarter of that, and add it to the lowest number to get Q1. Here that is 1 + (2 - 1)/4 = 1.25.
The same (roughly) applies to Q3: take the difference between the highest number and the median, and subtract a quarter of that from the highest number. Here that is 4 - (4 - 2)/4 = 3.5.

Related

Normalisation or Standardisation for detecting outlier?

When should I use min-max scaling (normalisation), and when standardisation (z-scores), for data pre-processing?
I know that normalisation brings the range of a feature down to 0 to 1, and that z-scores bring it down to roughly -3 to 3, but I am unsure which of the two techniques to use for detecting outliers in data.
Let us briefly agree on the terms:
The z-score tells us how many standard deviations a given element of a sample is away from the mean.
Min-max scaling is the method of rescaling a range of measurements to the interval [0, 1].
By those definitions, the z-score usually spans an interval much larger than [-3, 3] if your data follows a long-tailed distribution. On the other hand, plain normalization does indeed limit the range of possible outcomes, but it will not help you find outliers, since it just bounds the data.
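A minimal Python sketch of the difference, using a hypothetical sample with one planted outlier and an assumed cut-off of 2.5 standard deviations (the cut-off is a choice for this illustration, not a standard):

```python
from statistics import mean, pstdev


def z_scores(values):
    """Distance of each value from the mean, in (population) standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def min_max(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 60]  # hypothetical sample; 60 is the planted outlier
threshold = 2.5                                 # assumed cut-off, adjust to taste

flagged = [v for v, z in zip(data, z_scores(data)) if abs(z) > threshold]
scaled = min_max(data)

print(flagged)                    # values whose |z| exceeds the threshold
print(min(scaled), max(scaled))   # always 0.0 and 1.0, whatever the data look like
```

The scaled values always run from 0 to 1, so by themselves they say nothing about which points are unusual; the z-scores at least give a scale against which a threshold can be set.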
What you need for outlier detection are thresholds above or below which you consider a data point to be an outlier. Many programming languages offer violin plots or box plots, which nicely show your data distribution. The methods behind these plots implement a common choice of thresholds:
Box and whisker [of the box plot] plots quartiles, and the band inside the box is always the second quartile (the median). But the ends of the whiskers can represent several possible alternative values, among them:
the minimum and maximum of all of the data [...]
one standard deviation above and below the mean of the data
the 9th percentile and the 91st percentile
the 2nd percentile and the 98th percentile.
All data points outside the whiskers of the box plots are plotted as points and considered outliers.

Distance between straight lines

I work in the oil & gas industry and I'm seeking advice about how to calculate the minimum distance between a set of wells (the wells are drawn as straight lines on a map). My goal is for each individual well to have a unique "spacing" value (measured in feet) which is basically the straight-line horizontal distance to the closest wellbore on a map. Below is a simple example of what I'm trying to accomplish (assume the pipe | symbol is a wellbore and the dashes are the distance between the wells)
|--|---|-|
In the drawing above we have 4 wells. The 1st well (starting from the far left) would have a spacing value of 2 (since there are 2 dashes to the closest well), the 2nd well would also have a value of 2 (since the closest well is the one to the far left which is two spaces away), the 3rd well would have a value of 1, and the 4th well would have a value of 1.
Now imagine that I have hundreds of these wells (each with latitude/longitude points that describe the start & end points of each well) and I have them all mapped in TIBCO Spotfire (scattered across Texas). Do you guys know if it would even be possible to automate a calculation like the above? I would also like to build in a rule that says the max distance between wells is 2640 ft (half of a mile).
Any ideas are appreciated!
I think you should be able to do this without any R or IronPython.
Within Spotfire, you can calculate the distance in miles between 2 points using the formula below (replace 3958.756 with 6371 to get the answer in kilometres).
GreatCircleDistance([Lat 1],[Lon 1],[Lat 2],[Lon 2]) * 3958.756
For your use case, you could cross join your table of locations, so that you have a row for every possible pair of locations, then calculate the distance between them using the formula above. After that, it should be pretty straightforward to find each well's closest neighbour.
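Outside Spotfire, the same cross-join idea can be sketched in Python. This is only an illustration under two assumptions: each well is reduced to a single representative point (for example its midpoint), and the 2640 ft rule is read as a cap on the reported spacing. The well names and coordinates are made up, and true line-to-line distance between wellbores would need more work.

```python
import math

EARTH_RADIUS_FT = 3958.756 * 5280  # radius in miles from the formula above, converted to feet
MAX_SPACING_FT = 2640              # cap from the question (half a mile)


def great_circle_ft(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance between two lat/lon points, in feet."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_FT * math.asin(math.sqrt(a))


def spacing(wells):
    """Map each well name to the distance to its nearest neighbour, capped at MAX_SPACING_FT."""
    result = {}
    for name, (lat, lon) in wells.items():
        nearest = min(
            (great_circle_ft(lat, lon, lat2, lon2)
             for other, (lat2, lon2) in wells.items() if other != name),
            default=MAX_SPACING_FT,  # a lone well has no neighbour
        )
        result[name] = min(nearest, MAX_SPACING_FT)
    return result


# Hypothetical wells: name -> (latitude, longitude) of one representative point
wells = {
    "A": (31.0, -102.000),
    "B": (31.0, -102.002),
    "C": (31.0, -102.100),
}
print(spacing(wells))
```

Comparing every well with every other is quadratic in the number of wells, which is fine for hundreds of wells; for much larger sets a spatial index would be the usual next step.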

Envelope plot in excel

I am trying to plot the envelope (maximum) values of a series of data. What I need is not the maximum value of the y-axis as the value of x-axis increase but an envelope or spectrum which joins only the maximum points as the values of x-axis increase.
My data look like:
If I ask for the maximum y-values as the values of the x-axis increase, I will get this one (the black line is the maximum of all data as x is ascending):
But I need a line which joins only the successive maximum points up to x=30, and then the maximum values as they descend (from x=30 to x=100). The curve I need should be smooth and should not follow the values of the data, but only join each maximum to the next.
The next curve is the envelope, but only after the absolute maximum point. To the left of the absolute maximum point the envelope is not the desired one:
After posting my questions (as comments), I think the following will do what you want (here I'm assuming I understood what you need):
1) At any point along the X axis, you already know how to recognize a maximum,
2) If (1) is correct, you will take into account a maximum (i.e. make it part of the envelope curve) if and only if:
a) All the points to the right are lower than the current maximum, and/or
b) All the points to the left are lower than the current maximum.
Intuitively, this should work.
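Rules (a) and (b) can be sketched in Python as two passes over the data, one from each end. This assumes the points are already sorted by x and that there is a single y value per x (with several series, first take the maximum across series at each x); the function name and sample data are illustrative only.

```python
def envelope(xs, ys):
    """Keep points that are a running maximum seen from the left or from the right."""
    n = len(ys)
    keep = [False] * n

    best = float("-inf")
    for i in range(n):               # rule (b): higher than everything to the left
        if ys[i] > best:
            keep[i] = True
            best = ys[i]

    best = float("-inf")
    for i in range(n - 1, -1, -1):   # rule (a): higher than everything to the right
        if ys[i] > best:
            keep[i] = True
            best = ys[i]

    return [(x, y) for x, y, k in zip(xs, ys, keep) if k]


xs = [0, 10, 20, 30, 40, 50, 60]
ys = [1, 3, 2, 5, 4, 2, 3]
print(envelope(xs, ys))
```

The kept points rise monotonically up to the absolute maximum and fall monotonically after it, so joining them gives the envelope shape described in the question; smoothing the joined line is a separate charting step.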
EDIT:
Assuming that the data is arranged in columns, say B to D, and in rows 10 to 100, enter the following in cell E10:
=IF(AND(MAX(B10:D10)>MAX(B9:D9),MAX(B10:D10)>MAX(B11:D11)),MAX(B10:D10),"")
This formula returns the row's maximum when that row is a local maximum (higher than the rows immediately above and below it), and a blank otherwise. Then drag the formula down to row 100 and voilà!
Note that the first and last points (i.e. rows 10 and 100) might yield a wrong result, because they are compared with rows outside the data. To prevent that, just alter the formula in those two rows.
Hope this is what you were looking for.

What is median split? Is it same as median?

In one of the research papers I follow, the authors say "Classes have been derived from scores with a median split".
Can anyone please explain if this median split is same as median? Thank you :)
A median split is when a set of elements is dichotomised (i.e. split into two) according to the statistical median (50th percentile). One group will contain all elements greater than the median and the other group will contain all elements less than the median.
So if you have a series of numbers (e.g. from 1 to 6) and you do a median split (with the median being 3.5 on this occasion) you will essentially split the series of numbers into two groups:
Group a would be 1, 2, 3
and
Group b would be 4, 5, 6
You can see another example here:
For example, we administer a scale of Optimism, and then use a median-split to label the people above the median as Optimists, and those below the median as Pessimists.
Essentially, the median split is not the median itself but it is a technique that uses the median to perform a split on a set of elements.
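As a small illustration, the 1-to-6 example can be reproduced in Python. The function name is hypothetical, and how scores exactly equal to the median are assigned is a choice each study has to make; here they are simply returned separately.

```python
from statistics import median


def median_split(scores):
    """Split scores into those below and those above the median; ties are returned separately."""
    m = median(scores)
    low = [s for s in scores if s < m]
    high = [s for s in scores if s > m]
    tied = [s for s in scores if s == m]
    return m, low, high, tied


m, low, high, tied = median_split([1, 2, 3, 4, 5, 6])
print(m, low, high, tied)
```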

Excel formula to calculate the distance between multiple points using lat/lon coordinates

I'm currently drawing up a mock database schema with two tables: Booking and Waypoint.
Booking stores the taxi booking information.
Waypoint stores the pickup and drop off points during the journey, along with the lat lon position. Each sequence is a stop in the journey.
How would I calculate the distance between the different stops in each journey (using the lat/lon data) in Excel?
Is there a way to programmatically define this in Excel, i.e. so that a formula can be placed into the mileage column (Booking table), lookup the matching sequence (via bookingId) for that journey in the Waypoint table and return a result?
Example 1:
A journey with 2 stops:
1 1 1 MK4 4FL, 2, Levens Hall Drive, Westcroft, Milton Keynes 52.002529 -0.797623
2 1 2 MK2 2RD, 55, Westfield Road, Bletchley, Milton Keynes 51.992571 -0.72753
4.1 miles according to Google, entry made in mileage column in Booking table where id = 1
Example 2:
A journey with 3 stops:
6 3 1 MK7 7DT, 2, Spearmint Close, Walnut Tree, Milton Keynes 52.017486 -0.690113
7 3 2 MK18 1JL, H S B C, Market Hill, Buckingham 52.000674 -0.987062
8 3 1 MK17 0FE, 1, Maids Close, Mursley, Milton Keynes 52.040622 -0.759417
27.7 miles according to Google, entry made in mileage column in Booking table where id = 3
To find the distance between two points, use this formula. The result is in km; divide by 1.609344 to convert to miles if needed.
Point A: LAT1, LONG1
Point B: LAT2, LONG2
=ACOS(COS(RADIANS(90-Lat1))*COS(RADIANS(90-Lat2))+SIN(RADIANS(90-Lat1))*SIN(RADIANS(90-Lat2))*COS(RADIANS(Long1-Long2)))*6371
Regards
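For checking the spreadsheet result, the same spherical law of cosines can be written as a short Python sketch. The function name is illustrative, and the coordinates are the two stops from Example 1 in the question.

```python
import math

EARTH_RADIUS_KM = 6371.0


def distance_km(lat1, lon1, lat2, lon2):
    """Spherical law of cosines, written with colatitudes (90 - lat) like the Excel formula."""
    c1 = math.radians(90 - lat1)
    c2 = math.radians(90 - lat2)
    dlon = math.radians(lon1 - lon2)
    cos_angle = math.cos(c1) * math.cos(c2) + math.sin(c1) * math.sin(c2) * math.cos(dlon)
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding just outside [-1, 1]
    return math.acos(cos_angle) * EARTH_RADIUS_KM


km = distance_km(52.002529, -0.797623, 51.992571, -0.72753)
miles = km / 1.609344
print(round(km, 2), round(miles, 2))
```

This is a straight-line (great-circle) distance, so it comes out shorter than the 4.1 road miles that Google reports for the same journey.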
Until quite recently, accurate maps were constructed by triangulation, which in essence is the application of Pythagoras's Theorem. For the distance between any pair of co-ordinates, take the square root of the sum of the square of the difference in x co-ordinates and the square of the difference in y co-ordinates. The x and y co-ordinates must however be in the same units (e.g. miles), which involves applying a scale factor to the latitude and longitude values. This can be complicated because the factor for longitude depends upon latitude (walking all round the North Pole is less far than walking around the Equator), but in your case a factor for 52° North should serve. From this the results (which might be checked here) are around 20% different from the examples you give (in the second case, with pairing IDs 6 and 7 and adding that result to the result from pairing IDs 7 and 8).
Since you say accuracy is not important, and assuming distances are small (say less than 1000 miles) you can use the loxodromic distance.
For this, compute the difference of latitudes (dlat) and the difference of longitudes (dlon), both in degrees. If there were any chance (unlikely) that you're crossing meridian 180º, take modulo 360º to ensure the difference of longitudes is between -180º and 180º. Also compute the average latitude (alat).
Then compute:
distance= 60*sqrt(dlat^2 + (dlon*cos(alat))^2)
This distance is in nautical miles. Apply conversions as needed.
EXPLANATION: This takes advantage of the fact that one nautical mile is very nearly one arc-minute of latitude (that is how it was originally defined; it is now fixed at 1852 m), so one degree of latitude is about 60 nautical miles. The cosine accounts for the meridians getting closer to each other as they approach the poles. The rest is just an application of Pythagoras' theorem, which requires that the relevant portion of the globe be flat; that is of course only a good approximation for small distances.
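A minimal Python sketch of this approximation, extended to sum the legs of a multi-stop journey. The function names are illustrative, the stops are assumed to be already ordered by sequence, and the coordinates are the three stops from Example 2 in the question.

```python
import math

NM_TO_MILES = 1852 / 1609.344  # statute miles per nautical mile


def flat_distance_nm(lat1, lon1, lat2, lon2):
    """Flat-earth distance in nautical miles; inputs in degrees."""
    dlat = lat2 - lat1
    dlon = (lon2 - lon1 + 180) % 360 - 180    # wrap the difference into [-180, 180)
    alat = math.radians((lat1 + lat2) / 2)    # average latitude, in radians for cos()
    return 60 * math.hypot(dlat, dlon * math.cos(alat))


def journey_miles(stops):
    """Sum the legs between consecutive (lat, lon) stops and convert to statute miles."""
    legs = zip(stops, stops[1:])
    return sum(flat_distance_nm(*a, *b) for a, b in legs) * NM_TO_MILES


example_2 = [
    (52.017486, -0.690113),  # Spearmint Close
    (52.000674, -0.987062),  # Market Hill, Buckingham
    (52.040622, -0.759417),  # Maids Close, Mursley
]
print(round(journey_miles(example_2), 1))
```

The total is a sum of straight-line legs, so it falls short of the 27.7 road miles quoted for this journey; a fixed uplift factor is one rough way to compensate if only an estimate is needed.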
It all depends on what the distance is and what accuracy you require. Calculations based on an "Earth is locally flat" model will not give good results over long distances, but for short distances they may be OK. Models that assume the Earth is a perfect sphere (e.g. the Haversine formula) give better accuracy, but they still do not produce geodesic-grade results.
See Geodesics on an ellipsoid for more details.
One of the high accuracy (fraction of a mm) solutions is known as Vincenty's formulae. For my Excel VBA implementation look here https://github.com/tdjastrzebski/Vincenty-Excel
