Make subplots of the histogram in pandas dataframe using matplotlib library? - python-3.x

I have the following data, separated by tabs:
CHROM ms02g:PI num_Vars_by_PI range_of_PI total_haplotypes total_Vars
1 1,2 60,6 2820,81 2 66
2 9,8,10,7,11 94,78,10,69,25 89910,1102167,600,1621365,636 5 276
3 5,3,4,6 6,12,14,17 908,394,759,115656 4 49
4 17,18,22,16,19,21,20 22,11,3,16,7,12,6 1463,171,149,256,157,388,195 7 77
5 13,15,12,14 56,25,96,107 2600821,858,5666,1792 4 284
7 24,26,29,25,27,23,30,28,31 12,31,19,6,12,23,9,37,25 968,3353,489,116,523,1933,823,2655,331 9 174
8 33,32 53,35 1603,2991338 2 88
I am using this code to build histogram plots, with one subplot for each CHROM:
with open(outputdir + '/' + 'hap_size_byVar_'+ soi +'_'+ prefix+'.png', 'wb') as fig_initial:
    fig, ax = plt.subplots(nrows=len(hap_stats), sharex=True)
    for i, data in hap_stats.iterrows():
        # first convert data to list of integers
        data_i = [int(x) for x in data['num_Vars_by_PI'].split(',')]
        ax[i].hist(data_i, label=str(data['CHROM']), alpha=0.5)
        ax[i].legend()
    plt.xlabel('size of the haplotype (number of variants)')
    plt.ylabel('frequency of the haplotypes')
    plt.suptitle('histogram of size of the haplotype (number of variants) \n'
                 'for each chromosome')
    plt.savefig(fig_initial)
Everything works except for two problems:
The y-label 'frequency of the haplotypes' is not positioned properly in the output plot.
When the data contain only one row (see data below), the subplots cannot be created and I get a TypeError, even though it should be possible to make a subplot with only one index.
Dataframe with only one line of data:
CHROM ms02g:PI num_Vars_by_PI range_of_PI total_haplotypes total_Vars
2 9,8,10,7,11 94,78,10,69,25 89910,1102167,600,1621365,636 5 276
TypeError :
Traceback (most recent call last):
File "phase-Extender.py", line 1806, in <module>
main()
File "phase-Extender.py", line 502, in main
compute_haplotype_stats(initial_haplotype, soi, prefix='initial')
File "phase-Extender.py", line 1719, in compute_haplotype_stats
ax[i].hist(data_i, label=str(data['CHROM']), alpha=0.5)
TypeError: 'AxesSubplot' object does not support indexing
How can I fix these two issues ?

Your first problem comes from the fact that you are using plt.ylabel() at the end of your loop. pyplot functions act on the current active axes object, which, in this case, is the last one created by subplots(). If you want your label to be centered over your subplots, the easiest might be to create a text object centered vertically in the figure.
# replace plt.ylabel('frequency of the haplotypes') with:
fig.text(.02, .5, 'frequency of the haplotypes', ha='center', va='center', rotation='vertical')
You can play around with the x-position (0.02) until you find a position you're happy with. The coordinates are figure coordinates: (0,0) is bottom left and (1,1) is top right. Using 0.5 as the y-position ensures the label is centered vertically in the figure.
The second problem is due to the fact that, when nrows=1, plt.subplots() returns the axes object directly instead of an array of axes. There are two options to circumvent this problem:
1 - test whether you have only one line, and then replace ax with a list:
fig, ax = plt.subplots(nrows=len(hap_stats), sharex=True)
if len(hap_stats)==1:
    ax = [ax]
(...)
2 - use the option squeeze=False in your call to plt.subplots(). As explained in the documentation, using this option will force subplots() to always return a 2D array. Therefore you'll have to modify a bit how you are indexing your axes:
fig, ax = plt.subplots(nrows=len(hap_stats), sharex=True, squeeze=False)
for i, data in hap_stats.iterrows():
    (...)
    ax[i,0].hist(data_i, label=str(data['CHROM']), alpha=0.5)
    (...)
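Putting both fixes together, a minimal self-contained sketch might look like the following. The one-row DataFrame is rebuilt by hand from the sample in the question, and the Agg backend and the in-memory buffer are assumptions made only so the example runs without a display or an output directory. It also uses a positional counter from enumerate() rather than the DataFrame index, which is a small deviation from the original loop.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# One-row frame, copied by hand from the sample data in the question
hap_stats = pd.DataFrame({
    "CHROM": [2],
    "num_Vars_by_PI": ["94,78,10,69,25"],
})

# squeeze=False always returns a 2D array of Axes, even for a single row
fig, ax = plt.subplots(nrows=len(hap_stats), sharex=True, squeeze=False)

# enumerate() gives a positional index, which stays valid even when the
# DataFrame index does not start at 0
for pos, (_, row) in enumerate(hap_stats.iterrows()):
    sizes = [int(x) for x in row["num_Vars_by_PI"].split(",")]
    ax[pos, 0].hist(sizes, label=str(row["CHROM"]), alpha=0.5)
    ax[pos, 0].legend()

ax[-1, 0].set_xlabel("size of the haplotype (number of variants)")
# One shared y-label, centered vertically in figure coordinates
ylabel = fig.text(0.02, 0.5, "frequency of the haplotypes",
                  ha="center", va="center", rotation="vertical")
fig.suptitle("histogram of size of the haplotype (number of variants)\n"
             "for each chromosome")

# Stand-in for the file opened in the question
buf = io.BytesIO()
fig.savefig(buf, format="png")
```

The same code should work unchanged for the seven-row frame, since ax is then simply a 7x1 array.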

Related

How to add count on top of bars in seaborn catplot? [duplicate]

I have a dataframe that looks like:
User A B C
ABC 100 121 OPEN
BCD 200 255 CLOSE
BCD 500 134 OPEN
DEF 600 125 CLOSE
ABC 900 632 OPEN
ABC 150 875 CLOSE
DEF 690 146 OPEN
I am trying to display a countplot on column 'User'. The code is as follows:
fig, ax1 = plt.subplots(figsize=(20,10))
graph = sns.countplot(ax=ax1,x='User', data=df)
graph.set_xticklabels(graph.get_xticklabels(),rotation=90)
for p in graph.patches:
    height = p.get_height()
    graph.text(p.get_x()+p.get_width()/2., height + 0.1,
               'Hello',ha="center")
The output looks like:
However, I want to replace the string 'Hello' with the value_counts of column 'User'. When I change the code to add that label to the graph:
for p in graph.patches:
    height = p.get_height()
    graph.text(p.get_x()+p.get_width()/2., height + 0.1,
               df['User'].value_counts(),ha="center")
I get the output as:
New in matplotlib 3.4.0
We can now automatically annotate bar plots with the built-in Axes.bar_label, so all we need to do is access/extract the seaborn plot's Axes.
Seaborn offers several ways to plot counts, each with slightly different count aggregation and Axes handling:
seaborn.countplot (most straightforward)
This automatically aggregates counts and returns an Axes, so just directly label ax.containers[0]:
ax = sns.countplot(x='User', data=df)
ax.bar_label(ax.containers[0])
seaborn.catplot (kind='count')
This plots a countplot onto a facet grid, so extract the Axes from the grid before labeling ax.containers[0]:
g = sns.catplot(x='User', kind='count', data=df)
for ax in g.axes.flat:
    ax.bar_label(ax.containers[0])
seaborn.barplot
This returns an Axes but does not aggregate counts, so first compute Series.value_counts before labeling ax.containers[0]:
counts = df['User'].value_counts().rename_axis('user').reset_index(name='count')
ax = sns.barplot(x='user', y='count', data=counts)
ax.bar_label(ax.containers[0])
If you are using hue:
hue plots will contain multiple bar containers, so ax.containers will need to be iterated:
ax = sns.countplot(x='User', hue='C', data=df)
for container in ax.containers:
    ax.bar_label(container)
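For reference, the same idea also works without seaborn. The sketch below is an illustration rather than part of the original answer: it rebuilds the sample 'User' column by hand, counts with value_counts, and labels a plain matplotlib bar chart. It assumes matplotlib 3.4.0 or later (for bar_label) and uses the Agg backend only so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend
import matplotlib.pyplot as plt
import pandas as pd

# 'User' column copied by hand from the sample dataframe
df = pd.DataFrame({"User": ["ABC", "BCD", "BCD", "DEF", "ABC", "ABC", "DEF"]})
counts = df["User"].value_counts()

fig, ax = plt.subplots()
# One bar per unique user; the bar height is the count
ax.bar(counts.index.to_list(), counts.to_numpy())
# bar_label reads the heights from the container and returns the text objects
labels = ax.bar_label(ax.containers[0])
```

Because the labels are taken from the bar heights themselves, they cannot drift out of order relative to the bars.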
df['User'].value_counts() will return a Series containing counts of unique values of the column User.
Without analyzing your code in much detail, you could correct it by indexing the result of value_counts with a counter:
fig, ax1 = plt.subplots(figsize=(20,10))
graph = sns.countplot(ax=ax1,x='User', data=df)
graph.set_xticklabels(graph.get_xticklabels(),rotation=90)
i=0
for p in graph.patches:
    height = p.get_height()
    # .iloc selects by position; plain [i] on a string-indexed Series is deprecated in recent pandas
    graph.text(p.get_x()+p.get_width()/2., height + 0.1,
               df['User'].value_counts().iloc[i],ha="center")
    i += 1
With your sample data, it produces the following plot:
As suggested by @ImportanceOfBeingErnest, the following produces the same output with simpler code, using the height variable itself instead of the indexed value_counts:
fig, ax1 = plt.subplots(figsize=(20,10))
graph = sns.countplot(ax=ax1,x='User', data=df)
graph.set_xticklabels(graph.get_xticklabels(),rotation=90)
for p in graph.patches:
    height = p.get_height()
    graph.text(p.get_x()+p.get_width()/2., height + 0.1,height ,ha="center")
Another solution:
#data
labels=data['Sistema Operativo'].value_counts().index
values=data['Sistema Operativo'].value_counts().values
plt.figure(figsize = (15, 8))
ax = sns.barplot(x=labels, y=values)
for i, p in enumerate(ax.patches):
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2., height + 0.1, values[i],ha="center")
Note: This solution does not try to show the count on top of the bar. Instead, this simple solution will print the values inside the bar. This may be an elegant solution for some occasions.
import seaborn as sns
ax = sns.countplot(x=df['category'], data=df)
for p in ax.patches:
    ax.annotate(f'\n{p.get_height()}', (p.get_x()+0.2, p.get_height()), ha='center', va='top', color='white', size=18)

Legend overwritten by plot - matplotlib

I have a plot that looks as follows:
I want to put labels for both the line plot and the markers in red. However, the legend is not appearing because the plot is taking up its space.
Update
it turns out I cannot put several strings in plt.legend()
I made the figure bigger by using the following:
fig = plt.gcf()
fig.set_size_inches(18.5, 10.5)
However, now I have only one label in the legend, with the marker drawn on top of the line, whereas I would rather have two: one for the marker alone and another for the line alone:
Updated code:
plt.plot(range(len(y)), y, '-bD', c='blue', markerfacecolor='red', markeredgecolor='k', markevery=rare_cases, label='%s' % target_var_name)
fig = plt.gcf()
fig.set_size_inches(18.5, 10.5)
# changed this over here
plt.legend()
plt.savefig(output_folder + fig_name)
plt.close()
What you want to do (have two labels for a single object) is not completely impossible, but it's MUCH easier to plot the line and the rare values separately, e.g.
# boilerplate
import numpy as np
import matplotlib.pyplot as plt
# synthesize some data
N = 501
t = np.linspace(0, 10, N)
s = np.sin(np.pi*t)
rare = np.zeros(N, dtype=bool); rare[:20]=True; np.random.shuffle(rare)
plt.plot(t, s, label='Curve')
plt.scatter(t[rare], s[rare], label='rare')
plt.legend()
plt.show()
Update
[...] it turns out I cannot put several strings in plt.legend()
Well, you can, as long as ① the several strings are in an iterable (a tuple or a list) and ② the number of strings (i.e., labels) equals the number of artists (i.e., thingies) in the plot.
plt.legend(('a', 'b', 'c'))
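As a quick check of that rule, the sketch below draws three lines and passes a tuple of three labels. The data are made up for illustration, and the Agg backend is an assumption so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend
import matplotlib.pyplot as plt
import numpy as np

t = np.linspace(0, 10, 101)

# Three artists in the plot ...
for shift in range(3):
    plt.plot(t, np.sin(t + shift))

# ... so the legend gets exactly three labels, in plotting order
leg = plt.legend(('a', 'b', 'c'))
legend_texts = [text.get_text() for text in leg.get_texts()]
```

Labels are matched to artists in the order the artists were added, so the first string belongs to the first plotted line.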

How to modify scatter-plot figure legend to show different formats for the same types of handles?

I am trying to modify the legend of a figure that contains two overlaid scatter plots. More specifically, I want two legend handles and labels: the first handle will contain multiple points (each colored differently), while the other handle consists of a single point.
As per this related question, I can modify the legend handle to show multiple points, each one being a different color.
As per this similar question, I am aware that I can change the number of points shown by a specified handle. However, this applies the change to all handles in the legend. Can it be applied to one handle only?
My goal is to combine both approaches. Is there a way to do this?
In case it isn't clear, I would like to modify the embedded figure (see below) such that the Z vs X handle shows only one point next to the corresponding legend label, while leaving the Y vs X handle unchanged.
My failed attempt at producing such a figure is below:
To replicate this figure, one can run the code below:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.legend_handler import HandlerTuple, HandlerRegularPolyCollection
class ScatterHandler(HandlerRegularPolyCollection):
    def update_prop(self, legend_handle, orig_handle, legend):
        """ """
        legend._set_artist_props(legend_handle)
        legend_handle.set_clip_box(None)
        legend_handle.set_clip_path(None)
    def create_collection(self, orig_handle, sizes, offsets, transOffset):
        """ """
        p = type(orig_handle)([orig_handle.get_paths()[0]], sizes=sizes, offsets=offsets, transOffset=transOffset, cmap=orig_handle.get_cmap(), norm=orig_handle.norm)
        a = orig_handle.get_array()
        if type(a) != type(None):
            p.set_array(np.linspace(a.min(), a.max(), len(offsets)))
        else:
            self._update_prop(p, orig_handle)
        return p
x = np.arange(10)
y = np.sin(x)
z = np.cos(x)
fig, ax = plt.subplots()
hy = ax.scatter(x, y, cmap='plasma', c=y, label='Y vs X')
hz = ax.scatter(x, z, color='k', label='Z vs X')
ax.grid(color='k', linestyle=':', alpha=0.3)
fig.subplots_adjust(bottom=0.2)
handler_map = {type(hz) : ScatterHandler()}
fig.legend(mode='expand', ncol=2, loc='lower center', handler_map=handler_map, scatterpoints=5)
plt.show()
plt.close(fig)
One solution that I do not like is to create two legends - one for Z vs X and one for Y vs X. But, my actual use case involves an optional number of handles (which can exceed two) and I would prefer not having to calculate the optimal width/height of each legend box. How else can this problem be approached?
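This is a dirty trick and not an elegant solution, but you can set the sizes of the other points in the Z vs X legend entry to 0. Just change your last two lines to the following (on Matplotlib 3.7 or later, the attribute leg.legendHandles has been renamed to leg.legend_handles).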
leg = fig.legend(mode='expand', ncol=2, loc='lower center', handler_map=handler_map, scatterpoints=5)
# The third dot of the second legend stays the same size, others are set to 0
leg.legendHandles[1].set_sizes([0,0,leg.legendHandles[1].get_sizes()[2],0,0])
The result is as shown.

matplotlib: controlling position of y axis label with multiple twinx subplots

I wrote a Python script based on matplotlib that generates curves based on a common timeline. The number of curves sharing the same x axis in my plot can vary from 1 to 6 depending on user options.
Each data series plotted uses a different y scale and requires its own axis. As a result, I may need to draw up to 5 different y axes on the right of my plot. I found a way in another post to offset the position of the axes as I add new ones, but I still have two issues:
How to control the position of the multiple axes so that the tick labels don't overlap?
How to control the position of each axis label so that it is placed vertically at the bottom of each axis? And how to preserve this alignment when the display window is resized, zoomed in, etc.?
I probably need to write some code that will first query the position of the axis and then a directive that will place the label relative to that position but I really have no idea how to do that.
I cannot share my entire code because it is too big, but I derived it from the code in this example. I modified that example by adding one extra plot and one extra axis to more closely match what I intend to do in my script.
import matplotlib.pyplot as plt
def make_patch_spines_invisible(ax):
    ax.set_frame_on(True)
    ax.patch.set_visible(False)
    for sp in ax.spines.values():
        sp.set_visible(False)
fig, host = plt.subplots()
fig.subplots_adjust(right=0.75)
par1 = host.twinx()
par2 = host.twinx()
par3 = host.twinx()
# Offset the right spine of par2. The ticks and label have already been
# placed on the right by twinx above.
par2.spines["right"].set_position(("axes", 1.2))
# Having been created by twinx, par2 has its frame off, so the line of its
# detached spine is invisible. First, activate the frame but make the patch
# and spines invisible.
make_patch_spines_invisible(par2)
# Second, show the right spine.
par2.spines["right"].set_visible(True)
par3.spines["right"].set_position(("axes", 1.4))
make_patch_spines_invisible(par3)
par3.spines["right"].set_visible(True)
p1, = host.plot([0, 1, 2], [0, 1, 2], "b-", label="Density")
p2, = par1.plot([0, 1, 2], [0, 3, 2], "r-", label="Temperature")
p3, = par2.plot([0, 1, 2], [50, 30, 15], "g-", label="Velocity")
p4, = par3.plot([0,0.5,1,1.44,2],[100, 102, 104, 108, 110], "m-", label="Acceleration")
host.set_xlim(0, 2)
host.set_ylim(0, 2)
par1.set_ylim(0, 4)
par2.set_ylim(1, 65)
host.set_xlabel("Distance")
host.set_ylabel("Density")
par1.set_ylabel("Temperature")
par2.set_ylabel("Velocity")
par3.set_ylabel("Acceleration")
host.yaxis.label.set_color(p1.get_color())
par1.yaxis.label.set_color(p2.get_color())
par2.yaxis.label.set_color(p3.get_color())
par3.yaxis.label.set_color(p4.get_color())
tkw = dict(size=4, width=1.5)
host.tick_params(axis='y', colors=p1.get_color(), **tkw)
par1.tick_params(axis='y', colors=p2.get_color(), **tkw)
par2.tick_params(axis='y', colors=p3.get_color(), **tkw)
par3.tick_params(axis='y', colors=p4.get_color(), **tkw)
host.tick_params(axis='x', **tkw)
lines = [p1, p2, p3, p4]
host.legend(lines, [l.get_label() for l in lines])
# fourth y axis is not shown unless I add this line
plt.tight_layout()
plt.show()
When I run this, I obtain the following plot:
output from above script
In this image, question 2 above means that I would want the y-axis labels 'Temperature', 'Velocity', 'Acceleration' to be drawn directly below each of the corresponding axis.
Thanks in advance for any help.
Regards,
L.
What worked for me was ImportanceOfBeingErnest's suggestion of using text (with a line like
host.text(1.2, 0, "Velocity" , ha="left", va="top", rotation=90,
transform=host.transAxes))
instead of trying to control the label position.
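A minimal sketch of that approach, generalized to several offset axes, could look like this. The axis names and offsets mirror the question's example, the plotted data are placeholders, and the Agg backend is an assumption so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend
import matplotlib.pyplot as plt

fig, host = plt.subplots()
# Leave room on the right for the extra axes and below for the labels
fig.subplots_adjust(right=0.6, bottom=0.3)

# (label, position of the right spine in host axes coordinates)
extra_axes = [("Temperature", 1.0), ("Velocity", 1.2), ("Acceleration", 1.4)]
texts = []
for name, offset in extra_axes:
    par = host.twinx()
    par.spines["right"].set_position(("axes", offset))
    par.plot([0, 1, 2], [0, 3, 2])  # placeholder data
    # Put the text at the bottom end of the spine. Because the position is
    # given in host axes coordinates, it follows the axes on resize or zoom.
    texts.append(host.text(offset, 0, name, ha="left", va="top",
                           rotation=90, transform=host.transAxes))
```

Since the text position uses the same axes-fraction value as the spine offset, label and spine stay aligned without querying any positions at draw time.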

matplotlib boxplot with split y-axis

I would like to make a box plot with data similar to this
d = {'Education': [1,1,1,1,2,2,2,2,2,3,3,3,3,4,4,4,4],
'Hours absent': [3, 100,5,7,2,128,4,6,7,1,2,118,2,4,136,1,1]}
df = pd.DataFrame(data=d)
df.head()
This works beautifully:
df.boxplot(column=['Hours absent'] , by=['Education'])
plt.ylim(0, 140)
plt.show()
But the outliers are far away, so I would like to split the y-axis.
However, with the split axes the boxplot arguments "column" and "by" are no longer accepted. So instead of the data being split by education, I only get one merged box.
This is my code:
# In reality I take the different columns from a much bigger dataset
dfnew = df[['Hours absent', 'Education']]
fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
ax1.boxplot(dfnew['Hours absent'])
ax1.set_ylim(40, 140)
ax2.boxplot(dfnew['Hours absent'])
ax2.set_ylim(0, 40)
ax1.spines['bottom'].set_visible(False)
ax2.spines['top'].set_visible(False)
ax1.xaxis.tick_top()
ax1.tick_params(labeltop=False) # don't put tick labels at the top
ax2.xaxis.tick_bottom()
d = .015 # how big to make the diagonal lines in axes coordinates
# arguments to pass to plot, just so we don't keep repeating them
kwargs = dict(transform=ax1.transAxes, color='k', clip_on=False)
ax1.plot((-d, +d), (-d, +d), **kwargs) # top-left diagonal
ax1.plot((1 - d, 1 + d), (-d, +d), **kwargs) # top-right diagonal
kwargs.update(transform=ax2.transAxes) # switch to the bottom axes
ax2.plot((-d, +d), (1 - d, 1 + d), **kwargs) # bottom-left diagonal
ax2.plot((1 - d, 1 + d), (1 - d, 1 + d), **kwargs) # bottom-right diagonal
plt.show()
These are the things I tried (I always changed this both for the first and second subplot) and the errors I got.
ax1.boxplot(dfnew['Hours absent'],dfnew['Education'])
#The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(),
#a.any() or a.all().
ax1.boxplot(column=dfnew['Hours absent'], by=dfnew['Education'])#boxplot()
#got an unexpected keyword argument 'column'
ax1.boxplot(dfnew['Hours absent'], by=dfnew['Education']) #boxplot() got an
#unexpected keyword argument 'by'
I also tried to convert data into array for y axis and list for x axis:
data = df[['Hours absent']].as_matrix()
labels= list(df['Education'])
print(labels)
print(len(data))
print(len(labels))
print(type(data))
print(type(labels))
And I substituted in the plot command like this:
ax1.boxplot(x=data, labels=labels)
ax2.boxplot(x=data, labels=labels)
Now the error is ValueError: Dimensions of labels and X must be compatible.
But they are both 17 long, I don't understand what is going wrong here.
You are overcomplicating this: the code for breaking the y-axis is independent of the code for plotting the boxplot. Nothing keeps you from using df.boxplot. It will add some labels and titles you do not want, but that is easy to fix.
df.boxplot(column='Hours absent', by='Education', ax=ax1)
ax1.set_xlabel('')
ax1.set_ylim(ymin=90)
df.boxplot(column='Hours absent', by='Education', ax=ax2)
ax2.set_title('')
ax2.set_ylim(ymax=50)
fig.subplots_adjust(top=0.87)
Of course you can also use matplotlib's boxplot, as long as you provide the parameters it needs. According to the docstring, it will make "a box and whisker plot for each column of x or each vector in sequence x", which means you have to do the "by" part yourself.
grouper = df.groupby('Education')['Hours absent']
x = [grouper.get_group(k) for k in grouper.groups]
ax1.boxplot(x)
ax1.set_ylim(ymin=90)
ax2.boxplot(x)
ax2.set_ylim(ymax=50)
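For completeness, here is a self-contained sketch of the second variant using the sample data from the question. The y-limits (90 to 140 and 0 to 50) and the Agg backend are assumptions chosen for illustration, and the groups are converted to NumPy arrays before plotting as a precaution.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "Education": [1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4],
    "Hours absent": [3, 100, 5, 7, 2, 128, 4, 6, 7, 1, 2, 118, 2, 4, 136, 1, 1],
})

# Do the "by" step by hand: one array of values per education level
grouper = df.groupby("Education")["Hours absent"]
keys = list(grouper.groups)
x = [grouper.get_group(k).to_numpy() for k in keys]

fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
top = ax1.boxplot(x)      # same data on both axes ...
bottom = ax2.boxplot(x)
ax1.set_ylim(90, 140)     # ... upper axes show only the outliers
ax2.set_ylim(0, 50)       # ... lower axes show the boxes

# Label the box positions (1..n) with the education levels
ax2.set_xticks(range(1, len(keys) + 1))
ax2.set_xticklabels([str(k) for k in keys])

# Hide the spines and ticks between the two axes
ax1.spines["bottom"].set_visible(False)
ax2.spines["top"].set_visible(False)
ax1.tick_params(labeltop=False, bottom=False)
```

The diagonal break marks from the question can be added afterwards unchanged, since they only depend on ax1 and ax2.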
