box plot: whisker definition in pandas and matplotlib - python-3.x

From https://en.wikipedia.org/wiki/Box_plot
The whisker of the box plot has the following possible definitions:
the minimum and maximum of all of the data[1]
the lowest datum still within 1.5 IQR of the lower quartile, and the highest datum still within 1.5 IQR of the upper quartile
one standard deviation above and below the mean of the data
the 9th percentile and the 91st percentile
the 2nd percentile and the 98th percentile.
I am wondering, for the pandas call:
df['data'].plot(kind = 'box', sym='bD')
which definition is the whisker using?
Also, for the matplotlib library:
ax.boxplot(dfa.duration)
which definition is the whisker using?
Thanks!

The boxplot documentation says about the whiskers:
whis : float, sequence, or string (default = 1.5)
As a float, determines the reach of the whiskers beyond the first and third quartiles. In other words, where IQR is the interquartile range (Q3 - Q1), the upper whisker will extend to the last datum less than Q3 + whis*IQR. Similarly, the lower whisker will extend to the first datum greater than Q1 - whis*IQR. Beyond the whiskers, data are considered outliers and are plotted as individual points. Set this to an unreasonably high value to force the whiskers to show the min and max values. Alternatively, set this to an ascending sequence of percentiles (e.g., [5, 95]) to set the whiskers at specific percentiles of the data. Finally, whis can be the string 'range' to force the whiskers to the min and max of the data.
The only definition from the question's list which cannot be easily implemented is the "one standard deviation" one; all the others are readily set with this argument. The default is the 1.5 IQR definition.
The pandas.DataFrame.boxplot calls the matplotlib function. Hence they should be identical.
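As an illustration, here is a minimal sketch of how the definitions from the question map onto the whis argument; the column name 'data' and the random sample values are placeholders, not from the question, and whis is simply passed through to matplotlib's boxplot:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'data': np.random.randn(200)})
fig, axes = plt.subplots(1, 4, figsize=(12, 3))

# default: whiskers reach the last datum within 1.5*IQR of Q1/Q3
df['data'].plot(kind='box', sym='bD', ax=axes[0])

# 9th and 91st percentiles
df['data'].plot(kind='box', sym='bD', whis=[9, 91], ax=axes[1])

# 2nd and 98th percentiles
df['data'].plot(kind='box', sym='bD', whis=[2, 98], ax=axes[2])

# min and max of the data (older matplotlib also accepted whis='range')
df['data'].plot(kind='box', sym='bD', whis=[0, 100], ax=axes[3])

plt.show()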

Related

How to plot 2 arrays against each other according to their Manhattan distance?

I have 4 arrays, namely a, b, c and d. The plotting has to be done between each pair of values from a,b and from c,d in the same plot, preferably with different markers. The color scheme has to vary as per the Manhattan distance. I did the following:
a=[1,2,3]
b=[5,6,7]
c=[9,10,11]
d=[13,14,15]
distance1=[30,40,50]
distance2=[10,20,60]
The pairs to be plotted are (1,5),(2,6),(3,7) for a and b, colored according to the distance array distance1. The other pairs are (9,13),(10,14),(11,15) for c and d, colored according to distance2.
I have written the following code
import matplotlib.pyplot as plt
plt.scatter(a,b,c=distance1,marker='^',cmap='jet', lw=0,label='ab')
plt.scatter(c,d,c=distance2,marker='*',cmap='jet', lw=0,label='cd')
Though the plots are drawn correctly, the colors are improperly assigned or get overwritten when plt.scatter is called the second time. How do I address this issue?
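One likely explanation (an assumption, since the question is left unanswered here) is that each scatter call normalizes its c values to the colormap independently, so the same distance maps to different colors in the two calls. Sharing explicit color limits keeps both calls on one scale:
import matplotlib.pyplot as plt

a = [1, 2, 3]
b = [5, 6, 7]
c = [9, 10, 11]
d = [13, 14, 15]
distance1 = [30, 40, 50]
distance2 = [10, 20, 60]

# shared color limits so both scatter calls map distances to colors identically
vmin = min(distance1 + distance2)
vmax = max(distance1 + distance2)

plt.scatter(a, b, c=distance1, marker='^', cmap='jet', lw=0, label='ab', vmin=vmin, vmax=vmax)
plt.scatter(c, d, c=distance2, marker='*', cmap='jet', lw=0, label='cd', vmin=vmin, vmax=vmax)
plt.colorbar(label='Manhattan distance')
plt.legend()
plt.show()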

How to normalize samples of an ongoing cumulative sum?

For simplicity, let's assume we have the function sin(x) and have calculated 1000 samples with it, which lie between -1 and 1. We can plot those samples. In the next step we want to plot the integral of sin(x), which would be -cos(x) + C. I can calculate the integral with my existing samples like this:
y[n] = x[n] + y[n-1]
Because it's a cumulative sum we will need to normalize it to get samples between -1 and 1 on the y axis.
y = 2 * (x - min(x)) / (max(x) - min(x)) - 1
To normalize we need a maximum and a minimum.
Now we want to calculate the next 1000 samples of sin(x) and calculate the integral again. Because it's a cumulative sum, we will have a new maximum, which means we will need to re-normalize all 2000 of our samples.
Now my question basically is:
How can I normalize samples in this context without knowing the maximum and minimum?
How can I avoid re-normalizing all previous samples whenever a new set of samples brings a new maximum/minimum?
I've found a solution :)
I also want to mention: this is about periodic functions like sine, so basically the maximum and minimum should always be the same, right?
In one special case this isn't true:
if your samples don't contain a full period of the function (and hence its global maximum and minimum). This can happen when you choose a very low frequency.
What you can do (see the sketch after these steps):
Simply calculate the samples for a function like sin(x) with a frequency of 1. They will contain the global maximum and minimum of the function (it's important that y varies between -1 and 1, not 0 and 1!).
Then calculate the integral with the cumulative sum.
Get the maximum and minimum of those samples.
You can scale them up or down: maximum/frequency, minimum/frequency.
These can now be used to normalize samples which were calculated with any other frequency.
This only needs to be calculated once at the beginning.
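A rough sketch of that idea in Python (assuming NumPy; the function name normalize_chunk, the sample count and the frequency value are illustrative, not from the original post):
import numpy as np

# reference: one full period of sin at frequency 1, so the global
# extrema of the cumulative sum are captured once
n = 1000
x_ref = np.linspace(0, 2 * np.pi, n)
ref_cumsum = np.cumsum(np.sin(x_ref))
ref_min, ref_max = ref_cumsum.min(), ref_cumsum.max()

def normalize_chunk(samples, frequency):
    # normalize a cumulative-sum chunk to [-1, 1] using the
    # reference extrema scaled by the chunk's frequency
    lo = ref_min / frequency
    hi = ref_max / frequency
    return 2 * (samples - lo) / (hi - lo) - 1

# example: cumulative sum of sin at frequency 3, normalized without
# ever recomputing min/max over all previously seen samples
freq = 3
chunk = np.cumsum(np.sin(freq * x_ref))
normalized = normalize_chunk(chunk, freq)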

Finding parameters values of growth function?

Number of days before vaccination (x)    Bacteria count in thousands (y)
1                                        112
2                                        148
3                                        241
4                                        363
5                                        585
I need to find 2 things.
First, calculate the third day's count with the GROWTH function, which I have already done:
=GROWTH(I3:I4;H3:H4;H5)
But I also need to calculate the parameters of the growth function (Y = a*b^X).
So how do I calculate a and b? I tried to use Excel Solver but couldn't solve it.
Seems like LOGEST is designed for what you want:
the LOGEST function calculates an exponential curve that fits your data and returns an array of values that describes the curve. Because this function returns an array of values, it must be entered as an array formula.
Note that there is a difference in how the equation is expressed on an x-y chart with an exponential trendline, and by the function. On the chart, m is expressed as a power of e, so to convert the value returned by the formula to be the same as what is seen on the chart, you would do something like:
=LN(INDEX(LOGEST(known_y,known_x),1))
You are dealing with exponential growth, which you want to describe. The basic way to handle this is to take the logarithm of the whole thing and apply linear regression to that, using the LINEST() function.
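For the Y = a*b^X form specifically, one way to pull the two parameters out of LOGEST is with INDEX; LOGEST returns {m, b} for the model y = b*m^x, so m corresponds to the question's b and LOGEST's b to the question's a. The ranges I3:I7 and H3:H7 are an assumption about the sheet layout, extending the GROWTH formula above:
b (growth factor per day):   =INDEX(LOGEST(I3:I7;H3:H7);1;1)
a (fitted value at X = 0):   =INDEX(LOGEST(I3:I7;H3:H7);1;2)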

Fitting the axes of Y and X in Excel

I have the following graph: the Y values are located at X = 32, 64, 128, 256, 512 and 1024. However, the graph shows different values on the X-axis. I would like to show only the relevant values (i.e. 32, 64, 128, 256, 512 and 1024) as X-axis labels.
In addition, I would like to add the maximal value of 1 to the Y-axis. As can be seen, I set the maximum to 1, but the graph doesn't show it.
How can I fix these 2 issues, on both the X-axis and the Y-axis?
For the X-axis: tick the check box "Logarithmic scale" and set the Base to 2.
For the Y-axis: set the Minimum to a value that is divisible by the Major unit 0.1, for example to 0.4.
Thanks to Hans Vogelaar (http://www.eileenslounge.com) for the answer.

Statistical correlation: Pearson or Spearman?

I have 2 series of 45 values in the interval [0,1]. The first series is a human-generated standard, the second one is computer-generated (full series here http://www.copypastecode.com/74844/).
The first series is sorted decreasingly.
0.909090909 0.216196598
0.909090909 0.111282099
0.9 0.021432587
0.9 0.033901106
...
0.1 0.003099256
0 0.001084533
0 0.008882249
0 0.006501463
Now what I want to assess is the degree to which the order is preserved in the second series, given that the first series is monotonic.
The Pearson correlation is 0.454763067, but I think that the relationship is not linear so this value is difficult to interpret.
A natural approach would be to use the Spearman rank correlation, which in this case is 0.670556181.
I noticed that with random values, while Pearson is very close to 0, the Spearman rank correlation goes up to 0.5, so a value of 0.67 seems very low.
What would you use to assess the order similarity between these 2 series?
I want to assess is the degree to which the order is preserved
Since it's the order (rank) that you care about, Spearman rank correlation is the more meaningful metric here.
I noticed that with random values [...] the Spearman rank correlation goes up to 0.5
How do you generate those random values? I've just conducted a simple experiment with some random numbers generated using numpy, and I am not seeing that:
In [1]: import numpy as np
In [2]: import scipy.stats
In [3]: x = np.random.randn(1000)
In [4]: y = np.random.randn(1000)
In [5]: print(scipy.stats.spearmanr(x, y))
(-0.013847401847401847, 0.66184551507218536)
The first number (-0.01) is the rank correlation coefficient; the second number (0.66) is the associated p-value.
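To illustrate why rank correlation is the better fit when only the ordering matters, here is a small made-up example (not the asker's data): a perfectly monotone but nonlinear relationship gives a Spearman coefficient of 1, while Pearson stays below 1 because it measures linear association.
import numpy as np
import scipy.stats

x = np.arange(1, 46)          # 45 points, like the series in the question
y = np.exp(x / 10.0)          # monotone in x, but strongly nonlinear

print(scipy.stats.pearsonr(x, y)[0])   # < 1, penalizes the nonlinearity
print(scipy.stats.spearmanr(x, y)[0])  # 1.0, the order is perfectly preserved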
