One of the research papers I follow says: "Classes have been derived from scores with a median split".
Can anyone please explain whether this median split is the same as the median? Thank you :)
A median split is when a set of elements is dichotomised (i.e. split into two) according to the statistical median (50th percentile). One group will contain all elements greater than the median and the other group will contain all elements less than the median.
So if you have a series of numbers (e.g. from 1 to 6) and you do a median split (with the median being 3.5 on this occasion) you will essentially split the series of numbers into two groups:
Group a would be 1, 2, 3
and
Group b would be 4, 5, 6
You can see another example here:
For example, we administer a scale of Optimism, and then use a median-split to label the people above the median as Optimists, and those below the median as Pessimists.
Essentially, the median split is not the median itself but it is a technique that uses the median to perform a split on a set of elements.
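As a minimal sketch of the split described above, using only Python's standard library: the function name `median_split` is hypothetical, and putting values equal to the median in the lower group is an assumption, since conventions for ties differ.

```python
from statistics import median

def median_split(values):
    """Return the median and the two groups on either side of it.

    Assumption: values exactly equal to the median go to the lower group.
    """
    m = median(values)
    low = [v for v in values if v <= m]
    high = [v for v in values if v > m]
    return m, low, high

m, low, high = median_split([1, 2, 3, 4, 5, 6])
print(m, low, high)  # 3.5 [1, 2, 3] [4, 5, 6]
```

The median (3.5) is only the cut point; the result of the split is the pair of groups.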
When should I use min-max scaling (normalisation) and when should I use standardisation (z-scores) for data pre-processing?
I know that normalisation brings the range of a feature down to 0 to 1, and that the z-score brings it down to roughly -3 to 3, but I am unsure which of the two techniques to use for detecting outliers in data.
Let us briefly agree on the terms:
The z-score tells us how many standard deviations a given element of a sample is away from the mean.
Min-max scaling is the method of rescaling a range of measurements to the interval [0, 1].
By those definitions, z-scores can span an interval much larger than [-3, 3] if your data follows a long-tailed distribution. On the other hand, plain normalization does indeed limit the range of the possible outcomes, but it will not help you find outliers, since it just bounds the data.
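A small sketch of both definitions, with hypothetical function names and made-up data (19 ordinary values and one extreme value). It uses the population standard deviation, which is an assumption; the sample standard deviation would give slightly different z-scores.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Number of standard deviations each value lies from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    return [(v - mu) / sigma for v in values]  # fails if all values are equal

def min_max(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]  # fails if all values are equal

data = [1.0] * 19 + [100.0]  # made-up data with one extreme value

print(max(z_scores(data)))  # about 4.36, i.e. outside [-3, 3]
print(max(min_max(data)))   # 1.0: the extreme value is simply mapped to the bound
```

The z-score of the extreme value stands out, whereas after min-max scaling it is just the value 1.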
What you need for outlier detection are thresholds above or below which you consider a data point to be an outlier. Many programming languages offer violin plots or box plots, which nicely show your data distribution. The methods behind these plots implement a common choice of thresholds:
Box and whisker [of the box plot] plots quartiles, and the band inside the box is always the second quartile (the median). But the ends of the whiskers can represent several possible alternative values, among them:
the minimum and maximum of all of the data [...]
one standard deviation above and below the mean of the data
the 9th percentile and the 91st percentile
the 2nd percentile and the 98th percentile.
All data points outside the whiskers of the box plots are plotted as points and considered outliers.
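As a hedged illustration of one of the whisker choices listed above (the 2nd and 98th percentiles), the sketch below flags everything outside those two cut points. The function name is hypothetical, and `statistics.quantiles` with `method="inclusive"` is just one of several percentile definitions. Note that a percentile rule flags a fixed share of the data by construction, whether or not those points are unusual.

```python
from statistics import quantiles

def percentile_outliers(values, lower_pct=2, upper_pct=98):
    """Return the values lying outside the given lower and upper percentiles."""
    # n=100 yields 99 cut points; cut point i (1-based) is the i-th percentile.
    cuts = quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [v for v in values if v < low or v > high]

sample = list(range(1, 101))  # made-up data: the integers 1..100
print(percentile_outliers(sample))
```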
I work in the oil & gas industry and I'm seeking advice about how to calculate the minimum distance between a set of wells (the wells are drawn as straight lines on a map). My goal is for each individual well to have a unique "spacing" value (measured in feet) which is basically the straight-line horizontal distance to the closest wellbore on a map. Below is a simple example of what I'm trying to accomplish (assume the pipe | symbol is a wellbore and the dashes are the distance between the wells)
|--|---|-|
In the drawing above we have 4 wells. The 1st well (starting from the far left) would have a spacing value of 2 (since there are 2 dashes to the closest well), the 2nd well would also have a value of 2 (since the closest well is the one to the far left which is two spaces away), the 3rd well would have a value of 1, and the 4th well would have a value of 1.
Now imagine that I have hundreds of these wells (each with latitude/longitude points that describe the start & end points of each well) and I have them all mapped in TIBCO Spotfire (scattered across Texas). Do you guys know if it would even be possible to automate a calculation like the above? I would also like to build in a rule that says the max distance between wells is 2640 ft (half of a mile).
Any ideas are appreciated!
I think you should be able to do this without any R or IronPython.
Within Spotfire, you can calculate the distance in miles between 2 points using the formula below (replace 3958.756 with 6371 to get the answer in kilometres).
GreatCircleDistance([Lat 1],[Lon 1],[Lat 2],[Lon 2]) * 3958.756
For your use case, you could cross join your table of locations, so that you have a row for every possible pair of locations, then calculate the distance between them using the formula above. After that, it should be pretty straightforward to find each well's closest neighbour.
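Outside Spotfire, the same idea (compare every pair, keep the minimum per well) can be sketched in Python. This is an illustration under assumptions, not the Spotfire implementation: each well is reduced to a single point such as its midpoint, the haversine formula stands in for `GreatCircleDistance`, and the well names and coordinates are made up. The 2640 ft cap comes from the question.

```python
import math

EARTH_RADIUS_MILES = 3958.756  # same radius as in the Spotfire formula
FEET_PER_MILE = 5280
MAX_SPACING_FT = 2640          # cap requested in the question (half a mile)

def great_circle_feet(lat1, lon1, lat2, lon2):
    """Haversine distance in feet between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a)) * FEET_PER_MILE

def nearest_spacing(wells):
    """Map each well name to the capped distance (ft) to its closest other well."""
    spacing = {}
    for name, (lat, lon) in wells.items():
        nearest = min(
            (great_circle_feet(lat, lon, olat, olon)
             for other, (olat, olon) in wells.items() if other != name),
            default=float("inf"),
        )
        spacing[name] = min(nearest, MAX_SPACING_FT)
    return spacing

# Hypothetical wells, each represented by one (lat, lon) point.
wells = {"A": (31.0, -102.0), "B": (31.0, -101.999), "C": (31.0, -101.0)}
print(nearest_spacing(wells))
```

Comparing every pair is quadratic in the number of wells, which should be fine for a few hundred. Because the wells in the question are line segments rather than points, a segment-to-segment distance would be needed for exact spacing.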
I have been looking around for a while and have not been able to find answers to the following issue that relate to subsetting.
I currently have many lines on a line graph and would only like 4 lines which demonstrate:
the lower quartile
the median
upper quartile
one additional line of the category that I choose
The amounts need to be dynamic and change according to other filters that I put on the graph
A box-and-whisker plot gives the median, upper quartile and lower quartile values.
You can also drag reference lines and bands into the view from the analysis tab.
I have a scatterplot with values that range from 2 to -2. The catch is that 1 is the "zero-point". In other words, the minimum positive value is 1.01 and the negative value closest to zero is -1.01. How can I edit the axis of the graph so that 0 is replaced with 1?
If you have a version of Excel later than 2007 I'd suggest splitting the positives and negatives into separate series and plotting one on a secondary axis (not that I know whether or not that would work!), but with 2007 I have not been able to place one vertical axis above the horizontal and another below. Instead the best I could manage was to use two separate charts:
by again splitting up the series, careful positioning and judicious use of a text box for 0.
At least this way you are not constrained by the outer limits.
Under some very specific conditions you can print the zero-point as 1 using a custom number format: you have to set the axis options to be fixed at -2 (minimum) and 2 (maximum), with a major unit of 2 as well. This ensures that you only have the three values -2, 0 and 2 on the vertical/y-axis. Why is this important? Well, custom number formats can easily distinguish between positive, negative and zero values, which is exactly what you have with -2, 0 and 2.
Here's a visual of the input/output:
The custom number format is set to 2;-2;1, thereby formatting all positive numbers to 2, all negative numbers to -2 and zero to 1.
If all you want is to replace the axis label "0" with "1" (as in the answer by Werner), then you can use the following (similar to this):
1. Add X and Y values for a dummy series, with 3 points. If the minimum value in your X-axis is xm, your points are (xm, -2), (xm, 0), (xm, 2).
2. Add cells with the 3 labels that you will use for the dummy series: "-2", "1", "2".
3. Go to the chart, and remove the tick labels of the Y-axis.
4. Add a series with the 3 dummy data points.
5. Add the labels to the data points. You can use references to the cells of item 2, or enter explicit labels. Entering each label (either a reference or an explicit label) is tedious when you have many data points. Check this, and in particular Rob Bovey's add-in. It is excellent.
6. Format the dummy series so it is visually ok (e.g., small, hairline crosses, no line).
You can use variations on this. For instance, you can add extra points to your dummy series, with corresponding labels. Gridlines would match the dummy series.
But I think this is not appropriate, as the locations of your data points will be inconsistent with the scales.
What is appropriate is having an interrupted axis, where the interval (-1,1) is eliminated.
The answer by pnuts aims at that.
I propose something different, with the advantage of using only one chart:
Create a column where you add 2, only to negative Y-values. Use that column as your new Y-values.
Use the same trick as above, with your dummy series now being (xm, 0), (xm, 1), (xm, 2), and the labels the same as above.
You can use additional points in your dummy series.
You can use this technique to create an arbitrary number of axis interruptions. The formula for the "fake" Y-values would be more complicated, with IFs to detect the interval corresponding to each point, and suitable linear transformations to account for the change in scale for each interval (assuming linear scales; no mixing linear-log). But that is all.
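The "fake" Y-value transformation for the single interruption described above can be sketched as follows. This is only an illustration in Python of the spreadsheet formula (in Excel it would be an IF on each Y-value); the names `fake_y` and `axis_labels` are hypothetical.

```python
def fake_y(y):
    """Shift negative values up by 2 so the interval (-1, 1) is removed.

    Real values -2..-1 land on 0..1; positive values 1..2 stay where they are.
    """
    return y + 2 if y < 0 else y

# Position of each dummy-series point on the plotted axis and the label shown there.
axis_labels = {0: "-2", 1: "1", 2: "2"}

for y in (-2, -1.5, 1.5, 2):
    print(y, "->", fake_y(y))
```

With more interruptions, `fake_y` would need one branch per interval, each with its own linear shift.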
PS: see also the links below. I still think my alternative is better.
http://peltiertech.com/broken-y-axis-in-excel-chart/
http://ksrowell.com/blog-visualizing-data/2013/08/12/how-to-simulate-a-broken-axis-value-axis/
http://www.tushar-mehta.com/excel/newsgroups/broken_y_axis/tutorial/index.html#Rescale%20and%20hide%20the%20y-axis
I have a sorted array of real values, say X, drawn from some unknown distribution. I would like to draw a box plot for this data.
In the simplest case, I need to know five values: min, Q1, median, Q3, and max.
Trivially, min = X[0], max = X[length(X)-1], and possibly median = X[floor(length(X)/2)] (with zero-based indexing and an odd number of elements). But I'm wondering how to determine the lower quartile Q1 and the upper quartile Q3.
When I plot X = [1,2,4] using MATLAB, I obtain the following result:
It seems to me that there is some magic behind the values Q1 = 1.25 and Q3 = 3.5, but I don't know what the magic is. Does anybody have experience with this?
If you go to the original definition of box plots (look up John Tukey), you use the median for the midpoint (i.e., 2 in your data set of 1, 2, 4). The endpoints are the min and max.
The top and bottom of the box are not exactly defined by quartiles, instead they are called "hinges". Hinges are the medians of the top and bottom halves of the data. If there is an odd number of observations, the median of the entire set is used in determining both hinges. The lower hinge is the median of (1,2), or 1.5. The top hinge is the median of (2,4), or 3.
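The hinge rule just described can be sketched in a few lines of Python. The function name `tukey_hinges` is hypothetical, and the sketch follows the description above: when the number of observations is odd, the overall median is included in both halves.

```python
from statistics import median

def tukey_hinges(values):
    """Return (lower hinge, upper hinge): the medians of the two halves."""
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2  # for odd n this includes the overall median
    return median(data[:half]), median(data[n - half:])

print(tukey_hinges([1, 2, 4]))  # (1.5, 3.0)
```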
There are actually dozens of definitions of a box plot's quartiles (Wikipedia: "There is no universal agreement on choosing the quartile values"). If you want to rationalize MATLAB's box plot, you'll have to check its documentation. Otherwise, you could Google your brains out to try to find a method that matches the results.
Minitab gives 1 and 4 for the hinges in your data set. Excel's PERCENTILE function gives 1.5 and 3, which incidentally matches Tukey's algorithm at least in this case.
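The two results quoted above can be reproduced with linear interpolation at a fractional position in the sorted data. The position formulas below are assumptions chosen because they match the quoted numbers: (n - 1)p + 1 for the Excel-style result and (n + 1)p for the Minitab-style result. The function names are hypothetical.

```python
def quantile_at_position(sorted_data, pos):
    """Interpolate linearly at a 1-based, possibly fractional, position."""
    n = len(sorted_data)
    pos = min(max(pos, 1), n)  # clamp to the first/last observation
    k = int(pos)               # 1-based index of the lower neighbour
    frac = pos - k
    if k >= n:
        return sorted_data[-1]
    return sorted_data[k - 1] + frac * (sorted_data[k] - sorted_data[k - 1])

def excel_style(data, p):
    """Assumed rule: position (n - 1) * p + 1."""
    data = sorted(data)
    return quantile_at_position(data, (len(data) - 1) * p + 1)

def minitab_style(data, p):
    """Assumed rule: position (n + 1) * p."""
    data = sorted(data)
    return quantile_at_position(data, (len(data) + 1) * p)

x = [1, 2, 4]
print(excel_style(x, 0.25), excel_style(x, 0.75))      # 1.5 3.0
print(minitab_style(x, 0.25), minitab_style(x, 0.75))  # 1.0 4
```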
The median divides the data into two halves. The median of the first half is Q1, and the median of the second half is Q3.
More info: http://www.purplemath.com/modules/boxwhisk.htm
Note on the MATLAB boxplot: Q1 and Q3 may be calculated in a different way in MATLAB; I'd try with a larger set of test data. With my method, Q1 should be 1 and Q3 should be 4.
EDIT:
One possible calculation that MATLAB performs: take the difference between the median and the lowest number, take a quarter of that, and add it to the lowest number to get Q1.
The same (roughly) applies to Q3: take the difference between the median and the highest number, and subtract a quarter of that from the highest number. That is Q3.
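A more general rule that also yields 1.25 and 3.5 for [1, 2, 4] is to treat the i-th sorted value as the (i - 0.5)/n quantile and interpolate linearly in between. To my understanding this is how MATLAB's documentation describes its percentile calculation, but treat that as an assumption and check the documentation for your version. The function name below is hypothetical.

```python
def midpoint_quantile(values, p):
    """Quantile where the i-th sorted value (1-based) sits at (i - 0.5) / n."""
    data = sorted(values)
    n = len(data)
    pos = n * p + 0.5          # 1-based fractional position in the sorted data
    pos = min(max(pos, 1), n)  # below/above the end points, use min/max
    k = int(pos)               # 1-based index of the lower neighbour
    frac = pos - k
    if k >= n:
        return data[-1]
    return data[k - 1] + frac * (data[k] - data[k - 1])

x = [1, 2, 4]
print(midpoint_quantile(x, 0.25), midpoint_quantile(x, 0.75))  # 1.25 3.5
```

For three values, this coincides with the "quarter of the distance to the median" description above; for larger data sets the two need not agree.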