matplotlib boxplot with split y-axis - python-3.x

I would like to make a box plot with data similar to this
d = {'Education': [1,1,1,1,2,2,2,2,2,3,3,3,3,4,4,4,4],
'Hours absent': [3, 100,5,7,2,128,4,6,7,1,2,118,2,4,136,1,1]}
df = pd.DataFrame(data=d)
df.head()
This works beautifully:
df.boxplot(column=['Hours absent'] , by=['Education'])
plt.ylim(0, 140)
plt.show()
But the outliers are far away, therefore I would like to split the y-axis.
But here the boxplot commands "column" and "by" are not accepted anymore. So instead of splitting the data by education, I only get one merged data point.
This is my code:
dfnew = df[['Hours absent', 'Education']] # In reality I take the different columns from a much bigger dataset
fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
ax1.boxplot(dfnew['Hours absent'])
ax1.set_ylim(40, 140)
ax2.boxplot(dfnew['Hours absent'])
ax2.set_ylim(0, 40)
ax1.spines['bottom'].set_visible(False)
ax2.spines['top'].set_visible(False)
ax1.xaxis.tick_top()
ax1.tick_params(labeltop=False) # don't put tick labels at the top
ax2.xaxis.tick_bottom()
d = .015 # how big to make the diagonal lines in axes coordinates
# arguments to pass to plot, just so we don't keep repeating them
kwargs = dict(transform=ax1.transAxes, color='k', clip_on=False)
ax1.plot((-d, +d), (-d, +d), **kwargs) # top-left diagonal
ax1.plot((1 - d, 1 + d), (-d, +d), **kwargs) # top-right diagonal
kwargs.update(transform=ax2.transAxes) # switch to the bottom axes
ax2.plot((-d, +d), (1 - d, 1 + d), **kwargs) # bottom-left diagonal
ax2.plot((1 - d, 1 + d), (1 - d, 1 + d), **kwargs) # bottom-right diagonal
plt.show()
These are the things I tried (I always changed this both for the first and second subplot) and the errors I got.
ax1.boxplot(dfnew['Hours absent'],dfnew['Education'])
#The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(),
#a.any() or a.all().
ax1.boxplot(column=dfnew['Hours absent'], by=dfnew['Education'])#boxplot()
#got an unexpected keyword argument 'column'
ax1.boxplot(dfnew['Hours absent'], by=dfnew['Education']) #boxplot() got an
#unexpected keyword argument 'by'
I also tried to convert data into array for y axis and list for x axis:
data = df[['Hours absent']].as_matrix()
labels= list(df['Education'])
print(labels)
print(len(data))
print(len(labels))
print(type(data))
print(type(labels))
And I substituted in the plot command like this:
ax1.boxplot(x=data, labels=labels)
ax2.boxplot(x=data, labels=labels)
Now the error is ValueError: Dimensions of labels and X must be compatible.
But they are both 17 long, I don't understand what is going wrong here.

You are overcomplicating this: the code that breaks the y-axis is independent of the code that draws the boxplot. Nothing keeps you from using df.boxplot and passing it the target axes through ax; it adds some labels and titles you do not want, but that is easy to fix.
df.boxplot(column='Hours absent', by='Education', ax=ax1)
ax1.set_xlabel('')
ax1.set_ylim(ymin=90)
df.boxplot(column='Hours absent', by='Education', ax=ax2)
ax2.set_title('')
ax2.set_ylim(ymax=50)
fig.subplots_adjust(top=0.87)
Of course you can also use matplotlib's boxplot, as long as you provide the parameters it needs. According to the docstring it will make
a box and whisker plot for each column of x or each vector in
sequence x
Which means you have to do the "by" part yourself.
grouper = df.groupby('Education')['Hours absent']
x = [grouper.get_group(k) for k in grouper.groups]
ax1.boxplot(x)
ax1.set_ylim(ymin=90)
ax2.boxplot(x)
ax2.set_ylim(ymax=50)
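Putting the pieces together, here is a minimal self-contained sketch of the matplotlib-only variant using the data from the question. The Agg backend, the variable names groups and labels, and the explicit tick-label calls are assumptions added for illustration; they are not part of the original answer.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    'Education': [1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4],
    'Hours absent': [3, 100, 5, 7, 2, 128, 4, 6, 7, 1, 2, 118, 2, 4, 136, 1, 1],
})

# Do the "by" part by hand: one array of values per education level.
labels, groups = [], []
for key, values in df.groupby('Education')['Hours absent']:
    labels.append(key)
    groups.append(values.to_numpy())

fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)

# Draw the same boxes on both axes; only the visible y-range differs.
for ax in (ax1, ax2):
    ax.boxplot(groups)

ax1.set_ylim(bottom=90)  # upper panel shows the outliers
ax2.set_ylim(top=50)     # lower panel shows the boxes

# Hide the spines between the two panels so they read as one broken axis.
ax1.spines['bottom'].set_visible(False)
ax2.spines['top'].set_visible(False)
ax1.tick_params(bottom=False, labelbottom=False)

# Boxes sit at positions 1..n; label them with the education levels.
ax2.set_xticks(range(1, len(groups) + 1))
ax2.set_xticklabels(labels)

fig.savefig('split_boxplot.png')
```

Passing a list with one array per group is what makes matplotlib draw one box per education level; a single Series or a one-column array yields a single box.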

Related

How to set axis ticks with non-periodic increment in matplotlib

I have a 2D array representing the efficiency of a process for a given set of parameters A and B. The parameter A along the columns changes periodically, starting from 0 to 225 with increment one. The problem is with the rows where the parameter was changed in the following order:
[16 ,18 ,20 ,21 ,22 ,23 ,24 ,25 ,26 ,27 ,28 ,29 ,30 ,31 ,32 ,33 ,35 ,40 ,45 ,50 ,55 ,60 ,65 ,70 ,75 ,80 ,85 ,90 ,95 ,100 ,105 ,110 ,115 ,120 ,125]
So even though the rows increase with increment one, they represent a non-uniform increment of the parameter B. What I need is to showcase the values of the parameter B on the y-axis. Using axes.set_yticks() does not give me what I am looking for, and I do understand why but I do not know how to solve it.
A minimum example:
# Define parameter B values
parb_increment = [16, 18, 20] + list(range(21,34)) + list(range(35,126,5))
print(len(parb_increment))
print(x.shape)
# Figure and axes
figure, axes = plt.subplots(figsize=(10, 8))
# Plotting
im = axes.imshow(x, aspect='auto',
origin="lower",
cmap='Blues',
interpolation='none',
extent=(0, x.shape[1], 0, parb_increment[-1]))
# Unsuccessful trial for yticks
axes.set_yticks(parb_increment, labels=parb_increment)
# Colorbar
cb = figure.colorbar(im, ax=axes)
The previous code gives the figure and output below, and you can see how the ticks are not only misplaced but also start from an incorrect position.
35
(35, 225)
The item that controls the width/height of each pixel is aspect. Unfortunately you can't make it variable. The aspect won't change even if you modify/update y-axis ticks. That's why in your example ticks are mis-aligned with the rows of pixels.
Therefore, the solution to your problem is to duplicate those rows that increment non-uniformly.
See example below:
import numpy as np
import matplotlib.pyplot as plt
# Generate fake data
x = np.random.random((3, 4))
# Create uniform x-ticks and non-uniform y-ticks
x_increment = np.arange(0, x.shape[1]+1, 1)
y_increment = np.arange(0, x.shape[0]+1, 1) * np.arange(0, x.shape[0]+1, 1)
# Plot the data
fig, ax = plt.subplots(figsize=(6, 10))
img = ax.imshow(
x,
extent=(
0, x.shape[1], 0, y_increment[-1]
)
)
fig.colorbar(img, ax=ax)
ax.set_xlim(0, x.shape[1])
ax.set_xticks(x_increment)
ax.set_ylim(0, y_increment[-1])
ax.set_yticks(y_increment);
This replicates your problem and produces the following outcome.
The solution
First, determine the number of repeats of each row in the array:
nr_of_repeats_per_row =np.diff(y_increment)
nr_of_repeats_per_row = nr_of_repeats_per_row[::-1]
You need to reverse the order because the top row in the image is the first row in the array, whereas np.diff(y_increment) lists the row heights starting from the bottom of the image (the last row in the array).
Now you can repeat each row in the array a specific number of times:
x_extended = np.repeat(x, nr_of_repeats_per_row, axis=0)
Replot with the x_extended:
fig, ax = plt.subplots(figsize=(6, 10))
img = ax.imshow(
x_extended,
extent=(
0, x.shape[1], 0, y_increment[-1]
),
interpolation="none"
)
fig.colorbar(img, ax=ax)
ax.set_xlim(0, x.shape[1])
ax.set_xticks(x_increment)
ax.set_ylim(0, y_increment[-1])
ax.set_yticks(y_increment);
And you should get this.
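As an alternative to repeating rows, pcolormesh accepts explicit cell edges, so rows of unequal height can be drawn directly. This is a different technique from the one in the answer above, shown as a sketch with the same toy data; the names x_edges and y_edges and the Agg backend are assumptions for illustration.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.random((3, 4))  # fake data: 3 rows, 4 columns

# Cell edges: one more edge than cells along each dimension.
x_edges = np.arange(x.shape[1] + 1)        # 0, 1, 2, 3, 4 (uniform)
y_edges = np.arange(x.shape[0] + 1) ** 2   # 0, 1, 4, 9 (non-uniform)

fig, ax = plt.subplots(figsize=(6, 10))

# pcolormesh places row 0 of the array at the bottom (between y_edges[0]
# and y_edges[1]); pass x[::-1] to put the first row on top instead,
# as imshow does by default.
mesh = ax.pcolormesh(x_edges, y_edges, x)
fig.colorbar(mesh, ax=ax)

ax.set_xticks(x_edges)
ax.set_yticks(y_edges)
fig.savefig('nonuniform_rows.png')
```

For the data in the question, this would require 36 edges for the 35 rows, so one extra boundary value for parameter B has to be chosen.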

Modify position of colorbar so that extend triangle is above plot

So, I have to make a bunch of contourf plots for different days that need to share colorbar ranges. That was easy to do, but sometimes the maximum value for a given date is above the colorbar range, and that changes the look of the plot in a way I don't need. When that happens, I want the extend triangle to be added above the "original colorbar". It's clear in the attached picture.
I need the code to run things automatically, right now I only feed the data and the color bar range and it outputs the images, so the fitting of the colorbar in the code needs to be automatic, I can't add padding in numbers because the figure sizes changes depending on the area that is being asked to be plotted.
The reason why I need this behavior is because eventually I would want to make a .gif and I can't have the colorbar to move in that short video. I need for the triangle to be added, when needed, to the top (and below) without messing with the "main" colorbar.
Thanks!
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import Normalize, BoundaryNorm
from matplotlib import cm
###############
## Finds the appropriate option for variable "extend" in fig colorbar
def find_extend(vmin, vmax, datamin, datamax):
    #extend{'neither', 'both', 'min', 'max'}
    if datamin >= vmin:
        if datamax <= vmax:
            extend="neither"
        else:
            extend="max"
    else:
        if datamax <= vmax:
            extend="min"
        else:
            extend="both"
    return extend
###########
vmin=0
vmax=30
nlevels=8
colormap=plt.get_cmap("rainbow")  # cm.get_cmap was removed in recent matplotlib releases
### Creating data
z_1=30*abs(np.random.rand(5, 5))
z_2=37*abs(np.random.rand(5, 5))
data={1:z_1, 2:z_2}
x=range(5)
y=range(5)
## Plot
for day in [1, 2]:
    fig = plt.figure(figsize=(4,4))
    ## Normally figsize=get_figsize(bounds) and bounds is retrieved from gdf.total_bounds
    ## The function creates the figure size based on the x/y ratio of the bounds
    ax = fig.add_subplot(1, 1, 1)
    norm=BoundaryNorm(np.linspace(vmin, vmax, nlevels+1), ncolors=colormap.N)
    z=data[day]
    cs=ax.contourf(x, y, z, cmap=colormap, norm=norm, vmin=vmin, vmax=vmax)
    extend=find_extend(vmin, vmax, np.nanmin(z), np.nanmax(z))
    fig.colorbar(cm.ScalarMappable(norm=norm, cmap=colormap), ax=ax, extend=extend)
    plt.close(fig)
You can do something like this: putting a triangle on top of the colorbar manually:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
fig, ax = plt.subplots()
pc = ax.pcolormesh(np.random.randn(20, 20))
cb = fig.colorbar(pc)
trixy = np.array([[0, 1], [1, 1], [0.5, 1.05]])
p = mpatches.Polygon(trixy, transform=cb.ax.transAxes,
clip_on=False, edgecolor='k', linewidth=0.7,
facecolor='m', zorder=4, snap=True)
cb.ax.add_patch(p)
plt.show()
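To make this automatic, the manual triangle can be drawn only when the data actually leave the colorbar range, while the colorbar itself is always created without extend so its size never changes. The sketch below assumes a hypothetical helper add_extend_triangles and picks the end colors of the colormap for the triangles; neither comes from the original answer.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from matplotlib.cm import ScalarMappable
from matplotlib.colors import BoundaryNorm


def add_extend_triangles(cbar, cmap, data, vmin, vmax, height=0.05):
    """Hypothetical helper: draw triangles outside a vertical colorbar
    when data fall outside [vmin, vmax]. Returns the patches it added."""
    corners = []
    if np.nanmax(data) > vmax:
        # Triangle sitting on the top edge, in colorbar axes coordinates.
        corners.append(([[0, 1], [1, 1], [0.5, 1 + height]], cmap(1.0)))
    if np.nanmin(data) < vmin:
        # Triangle hanging from the bottom edge.
        corners.append(([[0, 0], [1, 0], [0.5, -height]], cmap(0.0)))
    patches = []
    for xy, color in corners:
        patch = mpatches.Polygon(np.array(xy), transform=cbar.ax.transAxes,
                                 clip_on=False, edgecolor='k', linewidth=0.7,
                                 facecolor=color, zorder=4)
        cbar.ax.add_patch(patch)
        patches.append(patch)
    return patches


vmin, vmax, nlevels = 0, 30, 8
cmap = plt.get_cmap("rainbow")
norm = BoundaryNorm(np.linspace(vmin, vmax, nlevels + 1), ncolors=cmap.N)

rng = np.random.default_rng(0)
z = 25 * rng.random((5, 5))
z[0, 0] = 37.0  # force one value above vmax for the demonstration

fig, ax = plt.subplots(figsize=(4, 4))
ax.contourf(range(5), range(5), z, cmap=cmap, norm=norm)
cbar = fig.colorbar(ScalarMappable(norm=norm, cmap=cmap), ax=ax)  # no extend
patches = add_extend_triangles(cbar, cmap, z, vmin, vmax)
fig.savefig('colorbar_triangle.png')
```

Because the triangles are drawn in the colorbar's own axes coordinates with clipping turned off, they do not take part in the layout, which is what keeps the main colorbar in the same place from frame to frame.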

Legend overwritten by plot - matplotlib

I have a plot that looks as follows:
I want to put labels for both the line plot and the red markers. However, the legend is not appearing because the plot is taking up its space.
Update
it turns out I cannot put several strings in plt.legend()
I made the figure bigger by using the following:
fig = plt.gcf()
fig.set_size_inches(18.5, 10.5)
However now I have only one label in the legend, with the marker appearing on the lineplot while I rather want two: one for the marker alone and another for the line alone:
Updated code:
plt.plot(range(len(y)), y, '-bD', c='blue', markerfacecolor='red', markeredgecolor='k', markevery=rare_cases, label='%s' % target_var_name)
fig = plt.gcf()
fig.set_size_inches(18.5, 10.5)
# changed this over here
plt.legend()
plt.savefig(output_folder + fig_name)
plt.close()
What you want to do (have two labels for a single object) is not completely impossible but it's MUCH easier to plot separately the line and the rare values, e.g.
# boilerplate
import numpy as np
import matplotlib.pyplot as plt
# synthesize some data
N = 501
t = np.linspace(0, 10, N)
s = np.sin(np.pi*t)
rare = np.zeros(N, dtype=bool); rare[:20]=True; np.random.shuffle(rare)
plt.plot(t, s, label='Curve')
plt.scatter(t[rare], s[rare], label='rare')
plt.legend()
plt.show()
Update
[...] it turns out I cannot put several strings in plt.legend()
Well, you can, as long as ① the several strings are in an iterable (a tuple or a list) and ② the number of strings (i.e., labels) equals the number of artists (i.e., thingies) in the plot.
plt.legend(('a', 'b', 'c'))
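A short sketch of that rule in practice: keep a reference to each artist and hand the legend matching tuples of handles and labels. The choice of every 50th sample as a "rare" point and the Agg backend are assumptions for illustration only.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

t = np.linspace(0, 10, 501)
s = np.sin(np.pi * t)

# Hypothetical selection of rare cases: every 50th sample.
rare = np.zeros(t.size, dtype=bool)
rare[::50] = True

fig, ax = plt.subplots()
line, = ax.plot(t, s, color='blue')                  # artist 1: the curve
points = ax.scatter(t[rare], s[rare], marker='D',
                    c='red', edgecolors='k', zorder=3)  # artist 2: the markers

# Two artists, two labels, both passed as tuples in the same order.
legend = ax.legend((line, points), ('Curve', 'rare'))
fig.savefig('legend_two_entries.png')
```

Passing the handles explicitly avoids relying on the order in which the artists were added to the axes.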

How to subplot two alternate x scales and two alternate y scales for more than one subplot?

I am trying to make a 2x2 subplot, with each of the inner subplots consisting of two x axes and two y axes; the first xy correspond to a linear scale and the second xy correspond to a logarithmic scale. Before assuming this question has been asked before, the matplotlib docs and examples show how to do multiple scales for either x or y but not both. This post on stackoverflow is the closest thing to my question, and I have attempted to use this idea to implement what I want. My attempt is below.
Firstly, we initialize data, ticks, and ticklabels. The idea is that the alternate scaling will have the same tick positions with altered ticklabels to reflect the alternate scaling.
import numpy as np
import matplotlib.pyplot as plt
# xy data (global)
X = np.linspace(5, 13, 9, dtype=int)
Y = np.linspace(7, 12, 9)
# xy ticks for linear scale (global)
dtick = dict(X=X, Y=np.linspace(7, 12, 6, dtype=int))
# xy ticklabels for linear and logarithmic scales (global)
init_xt = 2**dtick['X']
dticklabel = dict(X1=dtick['X'], Y1=dtick['Y']) # linear scale
dticklabel['X2'] = ['{}'.format(init_xt[idx]) if idx % 2 == 0 else '' for idx in range(len(init_xt))] # log_2 scale
dticklabel['Y2'] = 2**dticklabel['Y1'] # log_2 scale
Borrowing from the linked SO post, I will plot the same thing in each of the 4 subplots. Since similar methods are used for both scalings in each subplot, the method is thrown into a for-loop. But we need the row number, column number, and plot number for each.
# 2x2 subplot
# fig.add_subplot(row, col, pnum); corresponding iterables = (irows, icols, iplts)
irows = (1, 1, 2, 2)
icols = (1, 2, 1, 2)
iplts = (1, 2, 1, 2)
ncolors = ('red', 'blue', 'green', 'black')
Putting all of this together, the function to output the plot is below:
def initialize_figure(irows, icols, iplts, ncolors, figsize=None):
    """ """
    fig = plt.figure(figsize=figsize)
    for row, col, pnum, color in zip(irows, icols, iplts, ncolors):
        ax1 = fig.add_subplot(row, col, pnum) # linear scale
        ax2 = fig.add_subplot(row, col, pnum, frame_on=False) # logarithmic scale ticklabels
        ax1.plot(X, Y, '-', color=color)
        # ticks in same positions
        for ax in (ax1, ax2):
            ax.set_xticks(dtick['X'])
            ax.set_yticks(dtick['Y'])
        # remove xaxis xtick_labels and labels from top row
        if row == 1:
            ax1.set_xticklabels([])
            ax2.set_xticklabels(dticklabel['X2'])
            ax1.set_xlabel('')
            ax2.set_xlabel('X2', color='gray')
        # initialize xaxis xtick_labels and labels for bottom row
        else:
            ax1.set_xticklabels(dticklabel['X1'])
            ax2.set_xticklabels([])
            ax1.set_xlabel('X1', color='black')
            ax2.set_xlabel('')
        # linear scale on left
        if col == 1:
            ax1.set_yticklabels(dticklabel['Y1'])
            ax1.set_ylabel('Y1', color='black')
            ax2.set_yticklabels([])
            ax2.set_ylabel('')
        # logarithmic scale on right
        else:
            ax1.set_yticklabels([])
            ax1.set_ylabel('')
            ax2.set_yticklabels(dticklabel['Y2'])
            ax2.set_ylabel('Y2', color='black')
        ax1.tick_params(axis='x', colors='black')
        ax1.tick_params(axis='y', colors='black')
        ax2.tick_params(axis='x', colors='gray')
        ax2.tick_params(axis='y', colors='gray')
        ax1.xaxis.tick_bottom()
        ax1.yaxis.tick_left()
        ax1.xaxis.set_label_position('top')
        ax1.yaxis.set_label_position('right')
        ax2.xaxis.tick_top()
        ax2.yaxis.tick_right()
        ax2.xaxis.set_label_position('top')
        ax2.yaxis.set_label_position('right')
        for ax in (ax1, ax2):
            ax.set_xlim([4, 14])
            ax.set_ylim([6, 13])
    fig.tight_layout()
    plt.show()
    plt.close(fig)
Calling initialize_figure(irows, icols, iplts, ncolors) produces the figure below.
I am applying the same xlim and ylim, so I do not understand why the subplots are all different sizes. Also, the axis labels and axis ticklabels are not in the specified positions (since fig.add_subplot(...) indexing starts from 1 instead of 0).
What is my mistake and how can I achieve the desired result?
(In case it isn't clear, I am trying to put the xticklabels and xlabels for the linear scale on the bottom row, the xticklabels and xlabels for the logarithmic scale on the top row, the yticklabels and ylabels for the linear scale on the left side of the left column, and the yticklabels and ylabels for the logarithmic scale on the right side of the right column. The color='black' kwarg corresponds to the linear scale and the color='gray' kwarg corresponds to the logarithmic scale.)
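The irows and icols lists in the code do not serve any purpose. In fig.add_subplot(nrows, ncols, index), the first two arguments describe the size of the whole grid and only the third selects the cell. To create 4 subplots in a 2x2 grid, you would loop over range(1,5):
for pnum in range(1,5):
    ax1 = fig.add_subplot(2, 2, pnum)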
This might not be the only problem in the code, but as long as the subplots aren't created correctly it's not worth looking further down.
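As a sketch of what the corrected grid creation looks like with the data from the question, the loop below builds the four panels and applies the same limits to each. It only covers the subplot creation discussed here, not the secondary tick labels; the Agg backend and the axes list are assumptions for illustration.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

X = np.linspace(5, 13, 9, dtype=int)
Y = np.linspace(7, 12, 9)
colors = ('red', 'blue', 'green', 'black')

fig = plt.figure()
axes = []
# The grid size (2, 2) stays fixed; only the plot number changes.
for pnum, color in zip(range(1, 5), colors):
    ax = fig.add_subplot(2, 2, pnum)
    ax.plot(X, Y, '-', color=color)
    ax.set_xlim(4, 14)
    ax.set_ylim(6, 13)
    axes.append(ax)

fig.tight_layout()
fig.savefig('grid_2x2.png')
```

Plot numbers run row by row, so 1 and 2 form the top row and 3 and 4 the bottom row; tests such as pnum <= 2 (top row) or pnum % 2 == 1 (left column) can then replace the row and col checks in the original function.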

Make subplots of the histogram in pandas dataframe using matplotlib library?

I have the following tab-separated data:
CHROM ms02g:PI num_Vars_by_PI range_of_PI total_haplotypes total_Vars
1 1,2 60,6 2820,81 2 66
2 9,8,10,7,11 94,78,10,69,25 89910,1102167,600,1621365,636 5 276
3 5,3,4,6 6,12,14,17 908,394,759,115656 4 49
4 17,18,22,16,19,21,20 22,11,3,16,7,12,6 1463,171,149,256,157,388,195 7 77
5 13,15,12,14 56,25,96,107 2600821,858,5666,1792 4 284
7 24,26,29,25,27,23,30,28,31 12,31,19,6,12,23,9,37,25 968,3353,489,116,523,1933,823,2655,331 9 174
8 33,32 53,35 1603,2991338 2 88
I am using this code to build a histogram plots with subplots for each CHROM:
with open(outputdir + '/' + 'hap_size_byVar_'+ soi +'_'+ prefix+'.png', 'wb') as fig_initial:
    fig, ax = plt.subplots(nrows=len(hap_stats), sharex=True)
    for i, data in hap_stats.iterrows():
        # first convert data to list of integers
        data_i = [int(x) for x in data['num_Vars_by_PI'].split(',')]
        ax[i].hist(data_i, label=str(data['CHROM']), alpha=0.5)
        ax[i].legend()
    plt.xlabel('size of the haplotype (number of variants)')
    plt.ylabel('frequency of the haplotypes')
    plt.suptitle('histogram of size of the haplotype (number of variants) \n'
                 'for each chromosome')
    plt.savefig(fig_initial)
Everything is fine except two problems:
The Y-label frequency of the haplotypes is not adjusted properly in this output plot.
When the data contain only one row (see data below), the subplots cannot be created and I get a TypeError, even though it should be possible to make the plot with only one index.
Dataframe with only one line of data:
CHROM ms02g:PI num_Vars_by_PI range_of_PI total_haplotypes total_Vars
2 9,8,10,7,11 94,78,10,69,25 89910,1102167,600,1621365,636 5 276
TypeError :
Traceback (most recent call last):
File "phase-Extender.py", line 1806, in <module>
main()
File "phase-Extender.py", line 502, in main
compute_haplotype_stats(initial_haplotype, soi, prefix='initial')
File "phase-Extender.py", line 1719, in compute_haplotype_stats
ax[i].hist(data_i, label=str(data['CHROM']), alpha=0.5)
TypeError: 'AxesSubplot' object does not support indexing
How can I fix these two issues ?
Your first problem comes from the fact that you are using plt.ylabel() at the end of your loop. pyplot functions act on the current active axes object, which, in this case, is the last one created by subplots(). If you want your label to be centered over your subplots, the easiest might be to create a text object centered vertically in the figure.
# replace plt.ylabel('frequency of the haplotypes') with:
fig.text(.02, .5, 'frequency of the haplotypes', ha='center', va='center', rotation='vertical')
You can play around with the x-position (0.02) until you find a position you're happy with. The coordinates are figure coordinates: (0,0) is the bottom left and (1,1) is the top right. Using 0.5 as the y-position ensures the label is centered vertically in the figure.
The second problem arises because, when nrows=1, plt.subplots() returns the Axes object directly instead of an array of axes. There are two ways around this:
1 - test whether you have only one line, and then replace ax with a list:
fig, ax = plt.subplots(nrows=len(hap_stats), sharex=True)
if len(hap_stats)==1:
    ax = [ax]
(...)
2 - use the option squeeze=False in your call to plt.subplots(). As explained in the documentation, this option forces subplots() to always return a 2D array, so you have to adjust how you index your axes:
fig, ax = plt.subplots(nrows=len(hap_stats), sharex=True, squeeze=False)
for i, data in hap_stats.iterrows():
    (...)
    ax[i,0].hist(data_i, label=str(data['CHROM']), alpha=0.5)
    (...)
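Both fixes can be seen together in the following sketch, which uses the single-row case from the question. The small DataFrame is rebuilt by hand with only the two columns the plot needs, the Agg backend is assumed so the code runs headless, and enumerate is used instead of the DataFrame index as a precaution in case the index does not start at 0.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Single-row case from the question (only the columns needed here).
hap_stats = pd.DataFrame({
    'CHROM': [2],
    'num_Vars_by_PI': ['94,78,10,69,25'],
})

# squeeze=False always returns a 2D array of axes, even for one row.
fig, ax = plt.subplots(nrows=len(hap_stats), sharex=True, squeeze=False)

for i, (_, row) in enumerate(hap_stats.iterrows()):
    # Convert the comma-separated string to a list of integers.
    sizes = [int(v) for v in row['num_Vars_by_PI'].split(',')]
    ax[i, 0].hist(sizes, label=str(row['CHROM']), alpha=0.5)
    ax[i, 0].legend()

ax[-1, 0].set_xlabel('size of the haplotype (number of variants)')
# One y-label for the whole figure, centered vertically.
fig.text(0.02, 0.5, 'frequency of the haplotypes',
         ha='center', va='center', rotation='vertical')
fig.savefig('hap_size_hist.png')
```

The same code works unchanged when hap_stats has several rows, since ax then simply has shape (n, 1).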

Resources