Annotate seaborn clustermap with Pandas Dataframe - python-3.x

I am using seaborn (v0.7.0) to plot a heatmap. Here is my code (updated after fixing the problem):
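### Imports
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
### Get Data
sns.set(style="white")
adata = pd.read_csv("Test.txt", sep="\t", index_col=0)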
adata_log = np.log2(adata)
e = adata.iloc[0:7,0:3]
e_log = adata_log.iloc[0:7,0:3]
#### Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
#### Set color
hmcol = ["#ffffff","#ffffff","#fbe576","#c06e36","#9a2651"]
cmap = sns.blend_palette(hmcol,as_cmap=True)
#### Plot clustermap
sns.set(font_scale=0.8) ## 0.8 for normal use
aplot = sns.clustermap(e_log,cmap=cmap,method='average', metric='euclidean',standard_scale=None,row_cluster=False,col_cluster=False,row_linkage=None,col_linkage=None,linewidths=.05,square=True,annot=e,annot_kws={"size": 15},fmt='.2f')
aplot.cax.set_visible(False) #remove color bar
plt.setp(aplot.ax_heatmap.xaxis.get_majorticklabels(), rotation=90) ## X-axis label rotation
plt.setp(aplot.ax_heatmap.yaxis.get_majorticklabels(), rotation=0) ## Y-axis label rotation
## Save figure
aplot.savefig("Test-Fig1.0.pdf", orientation='portrait', dpi=600)
Is there any way I can use the values in the dataframe 'e' as annotations? I tried
annot=e
in clustermap, but it gives me an error:
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Also, is there any way I can plot the figure in landscape mode? Here is the data and the figure from the above code:
print(e)
        X       Y      Z
A  100.72   90.20  13.58
B  160.98  162.24  12.85
C    6.76    8.03   0.66
D  241.49  277.89  29.43
E  156.78  145.54  30.72
F    6.09    5.96   0.93
G    4.57    1.16   0.74

Upgrading seaborn to v0.7.1 solved the annotation problem. I have updated the code above with the fix.
For plotting in landscape, I guess the easiest way would be to change my data from a tall to a wide layout, i.e. change the input test file by copying the data and pasting it transposed.
Bade
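As an alternative to editing the input file, the transposition could probably be done in pandas. The following is a minimal sketch that rebuilds the printed frame `e` by hand and transposes it with `.T`; the name `e_wide` is only illustrative. Passing the transposed frames to `clustermap` (for example `e_log.T` as data and `annot=e.T`) is an assumption that has not been checked against the seaborn version used above.

```python
import pandas as pd

# Same values as the printed frame `e` in the question.
e = pd.DataFrame(
    {
        "X": [100.72, 160.98, 6.76, 241.49, 156.78, 6.09, 4.57],
        "Y": [90.20, 162.24, 8.03, 277.89, 145.54, 5.96, 1.16],
        "Z": [13.58, 12.85, 0.66, 29.43, 30.72, 0.93, 0.74],
    },
    index=list("ABCDEFG"),
)

# Transpose: rows become columns, so the 7x3 (tall) frame becomes 3x7 (wide).
e_wide = e.T
print(e_wide.shape)
```

The data and the annotation frame would both need to be transposed so that their shapes still match.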

Related

Control marker properties in seaborn pairwise boxplot

I'm trying to plot a boxplot for two different datasets on the same plot. The x axis shows the hours in a day, while the y axis goes from 0 to 1 (let's call it Efficiency). I would like to have different markers for the means of each dataset's boxes. I use 'meanprops' in seaborn, but that changes the marker style for both datasets at the same time. I've added 2000 lines of data in the Excel file that can be downloaded here. The values might not coincide with the ones in the picture but should be enough.
Basically I want the red squares to be blue on the orange boxplot, and red on the blue boxplot. Here is what I managed to do so far:
I tried changing the meanprops by using a dictionary with the labels as keys, but it seems to be entering a loop (in PyCharm it says Evaluating...)
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
#make sure you have your path sorted out
group1 = pd.read_excel('group1.xls')
fig, ax = plt.subplots(figsize=(20, 10))
#does not work
#ax = sns.boxplot(data=group1, x='hour', y='M1_eff', hue='labels',showfliers=False, showmeans=True,\
# meanprops={"marker":{'7':"s",'8':'s'},"markerfacecolor":{'7':"white",'8':'white'},
#"markeredgecolor":{'7':"blue",'8':'red'})
#works but produces similar markers
ax = sns.boxplot(data=group1, x='hour', y='M1_eff', hue='labels',showfliers=False, showmeans=True,\
meanprops={"marker":"s","markerfacecolor":"white", "markeredgecolor":"blue"})
plt.legend(title='Groups', loc=2, bbox_to_anchor=(1, 1),borderaxespad=0.5)
# Add transparency to colors
# note: on newer matplotlib versions the boxes may be found in ax.patches instead
for patch in ax.artists:
    r, g, b, a = patch.get_facecolor()
    patch.set_facecolor((r, g, b, .4))
ax.set_xlabel("Hours",fontsize=14)
ax.set_ylabel("M1 Efficiency",fontsize=14)
ax.tick_params(labelsize=10)
plt.show()
I also tried the FacetGrid but to no avail (Stops at 'Evaluating...'):
g = sns.FacetGrid(group1, col="M1_eff", hue="labels",hue_kws=dict(marker=["^", "v"]))
g = (g.map(plt.boxplot, "hour", "M1_eff")
     .add_legend())
g.show()
Any help is appreciated!
I don't think you can do this using sns.boxplot() directly. I think you'll have to draw the means "by hand":
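import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# random demo data with the same column names as in the question
N = 100
df = pd.DataFrame({'hour': np.random.randint(0, 3, size=(N,)),
                   'M1_eff': np.random.random(size=(N,)),
                   'labels': np.random.choice([7, 8], size=(N,))})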
x_col = 'hour'
y_col = 'M1_eff'
hue_col = 'labels'
width = 0.8
hue_order=[7,8]
marker_colors = ['red','blue']
# get the offsets used by boxplot when hue-nesting is used
# https://github.com/mwaskom/seaborn/blob/c73055b2a9d9830c6fbbace07127c370389d04dd/seaborn/categorical.py#L367
n_levels = len(hue_order)
each_width = width / n_levels
offsets = np.linspace(0, width - each_width, n_levels)
offsets -= offsets.mean()
fig, ax = plt.subplots()
ax = sns.boxplot(data=df, x=x_col, y=y_col, hue=hue_col, hue_order=hue_order, showfliers=False, showmeans=False)
means = df.groupby([hue_col,x_col])[y_col].mean()
# one series of means per hue level, shifted by that level's offset
for (gr, temp), o, c in zip(means.groupby(level=0), offsets, marker_colors):
    ax.plot(np.arange(temp.values.size) + o, temp.values, 's', c=c)
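To see what the offset computation in the answer produces, here is a small standalone sketch. The helper name `hue_offsets` is hypothetical. It only mirrors the arithmetic shown above, which follows the seaborn source linked in the comment, so other seaborn versions may place the nested boxes differently.

```python
import numpy as np

def hue_offsets(n_levels, width=0.8):
    """Centre offsets of the nested boxes around each category position."""
    each_width = width / n_levels
    offsets = np.linspace(0, width - each_width, n_levels)
    return offsets - offsets.mean()

print(hue_offsets(2))
print(hue_offsets(3))
```

With two hue levels and the default width of 0.8, the means are drawn 0.2 to the left and 0.2 to the right of each integer category position.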

Curve fitting for large datasets in Python

I have a very large data set (around 100k points) and I want to fit a curve to it.
I tried the filters suggested by answers to another question, but that led to overfitting.
I am using numpy and matplotlib as of now.
This is the type of scatter plot I am trying to fit.
Edit 1:
Please ignore the data points to the side of the central main set of data points (thus only a single curve can fit this).
Here is the dataset; download the file as a text file to separate the columns. Consider columns 3 and 9 (1-based indexing): the y-axis has column 3, while the x-axis plots the difference of column 3 and column 9.
Edit 2: Ignore the negative values
Edit 3: As there appears to be a lot of noise, consider column 33, which gives a probability, and consider only stars that have >90% probability.
Here are comparison scatterplots using the data in your link, along with the Python code I used to read, parse, and plot the data. Note that my plot also has an inverted y axis for direct comparison. This shows me that the data in the posted link, parsed per your directions, cannot be fit as described in your question. My hope is that you can find some error in my work, and that a model can in fact be made.
import matplotlib.pyplot as plt
dataFileName = 'temp.dat'
dataCount = 0
xlist = []
ylist = []
with open(dataFileName) as f:
    for line in f:
        if line[0] == '#': # comments
            continue
        spl = line.split()
        col3 = float(spl[2])
        col9 = float(spl[8])
        if col3 < 0.0 or col9 < 0.0:
            continue
        x = abs(col3 - col9)
        y = col3
        xlist.append(x)
        ylist.append(y)
f = plt.figure()
axes = f.add_subplot(111)
axes.invert_yaxis()
axes.scatter(xlist, ylist,color='black', marker='o', lw=0, s=1)
plt.show()
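The parsing code above does not apply the probability cut from Edit 3. A minimal sketch of how that filter might be added is shown below. The helper `parse_line` and its `prob_threshold` parameter are hypothetical, and it assumes that column 33 stores the probability as a percentage. If the file stores a fraction instead, the threshold would be 0.9.

```python
def parse_line(line, prob_threshold=90.0):
    """Return (x, y) for a usable data line, or None if it should be skipped."""
    if not line.strip() or line.startswith('#'):  # blank line or comment
        return None
    spl = line.split()
    col3 = float(spl[2])    # column 3 (1-based), plotted on the y axis
    col9 = float(spl[8])    # column 9 (1-based)
    prob = float(spl[32])   # column 33 (1-based), assumed to be a percentage
    if col3 < 0.0 or col9 < 0.0:   # Edit 2: ignore negative values
        return None
    if prob <= prob_threshold:     # Edit 3: keep only >90% probability
        return None
    return abs(col3 - col9), col3

# Synthetic 33-column line, for illustration only (not taken from the dataset).
fields = ['0'] * 33
fields[2], fields[8], fields[32] = '12.5', '11.0', '95'
print(parse_line(' '.join(fields)))
```

In the reading loop, each line would be passed through this function and the result appended to `xlist` and `ylist` only when it is not `None`.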

seaborn:stripplot how to color individual point based on value

I am trying to color specific points on a seaborn stripplot based on a defined value. For example,
value = (df['x'] == 0.0)
I know you can do this with regplot, etc. using:
df['color'] = np.where(value == True, "#9b59b6") and the
scatter_kws={'facecolors':df['color']}
Is there a way to do it for the pair grid and stripplot? Specifically, to color a specified value in t1 or t2 below?
I have also tried passing the match var in hue. However, this produced image #2 below and is not what I am looking for.
Here is the df:
    par    t1    t2 found
30000.0  0.50  0.45   yes
10000.0  0.30  0.12   yes
 3000.0  0.40  0.00    no
Here is my code:
# Import dependencies
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv("data.csv")
# Make the PairGrid
g = sns.PairGrid(df.sort_values("par", ascending=True),
                 x_vars=df.columns[1:3], y_vars=["par"],
                 height=5, aspect=.65)
# Draw a dot plot using the stripplot function
g.map(sns.stripplot, size=16, orient="h",
      linewidth=1, edgecolor="gray", palette="ch:2.5,-.2,dark=.3")
sns.set_style("darkgrid")
# Use the same x axis limits on all columns and add better labels
g.set(xlim=(-0.1, 1.1), xlabel="% AF", ylabel="")
# Use semantically meaningful titles for the columns
titles = ["Test 1", "Test 2"]
for ax, title in zip(g.axes.flat, titles):
    # Set a different title for each axes
    ax.set(title=title)
    # Make the grid horizontal instead of vertical
    ax.xaxis.grid(False)
    ax.yaxis.grid(True)
sns.despine(left=True, bottom=True)
I am trying to color the point with the value=0.0 a different color:
Passing the match var into hue produces the plot below: it removes the third value (par=3000) and collapses the plot. I can categorize the outliers I want highlighted in a different color. However, the outlier value is then removed from the y-axis and the plot collapses.
Are you sure you want a stripplot? It looks to me like you are actually trying to plot a scatterplot?
Also, I think you want to use a FacetGrid instead of a PairGrid? In that case, you need to transform your dataframe into "long-form".
Here is what I got:
df2 = df.melt(id_vars=['par','found'], var_name='Test', value_name='AF')
g = sns.FacetGrid(data=df2, col='Test', col_order=['t1','t2'], hue='found',
                  height=5, aspect=.65)
g.map(sns.scatterplot, 'AF','par',
      s=100, linewidth=1, edgecolor="gray", palette="ch:2.5,-.2,dark=.3")
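To make the "long-form" step concrete, here is a small sketch that rebuilds the three-row frame from the question, melts it, and adds a per-point color column with the three-argument form of `np.where`. The second color `"#3498db"` and the `color` column are illustrative assumptions and are not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "par": [30000.0, 10000.0, 3000.0],
    "t1": [0.50, 0.30, 0.40],
    "t2": [0.45, 0.12, 0.00],
    "found": ["yes", "yes", "no"],
})

# Wide -> long: one row per (par, test) pair.
df2 = df.melt(id_vars=["par", "found"], var_name="Test", value_name="AF")

# Hypothetical colors: highlight exact zeros, use a default color otherwise.
df2["color"] = np.where(df2["AF"] == 0.0, "#9b59b6", "#3498db")
print(df2)
```

Each original row now appears twice, once per test, so a single value such as the 0.00 in t2 can be styled on its own.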

matplotlib boxplot with split y-axis

I would like to make a box plot with data similar to this
d = {'Education': [1,1,1,1,2,2,2,2,2,3,3,3,3,4,4,4,4],
'Hours absent': [3, 100,5,7,2,128,4,6,7,1,2,118,2,4,136,1,1]}
df = pd.DataFrame(data=d)
df.head()
This works beautifully:
df.boxplot(column=['Hours absent'] , by=['Education'])
plt.ylim(0, 140)
plt.show()
But the outliers are far away, therefore I would like to split the y-axis.
But here the boxplot commands "column" and "by" are not accepted anymore. So instead of splitting the data by education, I only get one merged data point.
This is my code:
dfnew = df[['Hours absent', 'Education']] # In reality I take the different columns from a much bigger dataset
fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
ax1.boxplot(dfnew['Hours absent'])
ax1.set_ylim(40, 140)
ax2.boxplot(dfnew['Hours absent'])
ax2.set_ylim(0, 40)
ax1.spines['bottom'].set_visible(False)
ax2.spines['top'].set_visible(False)
ax1.xaxis.tick_top()
ax1.tick_params(labeltop=False) # don't put tick labels at the top
ax2.xaxis.tick_bottom()
d = .015 # how big to make the diagonal lines in axes coordinates
# arguments to pass to plot, just so we don't keep repeating them
kwargs = dict(transform=ax1.transAxes, color='k', clip_on=False)
ax1.plot((-d, +d), (-d, +d), **kwargs) # top-left diagonal
ax1.plot((1 - d, 1 + d), (-d, +d), **kwargs) # top-right diagonal
kwargs.update(transform=ax2.transAxes) # switch to the bottom axes
ax2.plot((-d, +d), (1 - d, 1 + d), **kwargs) # bottom-left diagonal
ax2.plot((1 - d, 1 + d), (1 - d, 1 + d), **kwargs) # bottom-right diagonal
plt.show()
These are the things I tried (I always changed this both for the first and second subplot) and the errors I got.
ax1.boxplot(dfnew['Hours absent'], dfnew['Education'])
# The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
ax1.boxplot(column=dfnew['Hours absent'], by=dfnew['Education'])
# boxplot() got an unexpected keyword argument 'column'
ax1.boxplot(dfnew['Hours absent'], by=dfnew['Education'])
# boxplot() got an unexpected keyword argument 'by'
I also tried to convert data into array for y axis and list for x axis:
data = df[['Hours absent']].to_numpy() # originally .as_matrix(), which was removed in pandas 1.0
labels= list(df['Education'])
print(labels)
print(len(data))
print(len(labels))
print(type(data))
print(type(labels))
And I substituted in the plot command like this:
ax1.boxplot(x=data, labels=labels)
ax2.boxplot(x=data, labels=labels)
Now the error is ValueError: Dimensions of labels and X must be compatible.
But they are both 17 long, so I don't understand what is going wrong here.
You are overcomplicating this: the code for breaking the y-axis is independent of the code for plotting the boxplot. Nothing keeps you from using df.boxplot. It will add some labels and titles you do not want, but that is easy to fix.
df.boxplot(column='Hours absent', by='Education', ax=ax1)
ax1.set_xlabel('')
ax1.set_ylim(ymin=90)
df.boxplot(column='Hours absent', by='Education', ax=ax2)
ax2.set_title('')
ax2.set_ylim(ymax=50)
fig.subplots_adjust(top=0.87)
Of course you can also use matplotlib's boxplot, as long as you provide the parameters it needs. According to the docstring, it will make
"a box and whisker plot for each column of x or each vector in sequence x"
which means you have to do the "by" part yourself.
grouper = df.groupby('Education')['Hours absent']
x = [grouper.get_group(k) for k in grouper.groups]
ax1.boxplot(x)
ax1.set_ylim(ymin=90)
ax2.boxplot(x)
ax2.set_ylim(ymax=50)
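Putting the pieces together, a self-contained version of the matplotlib variant might look like the sketch below. It reuses the example data from the question. The Agg backend, the sorted `levels` list, and the axis limits of 90 and 50 (taken from the snippets above) are choices made for this illustration, not requirements.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd

d = {'Education': [1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4],
     'Hours absent': [3, 100, 5, 7, 2, 128, 4, 6, 7, 1, 2, 118, 2, 4, 136, 1, 1]}
df = pd.DataFrame(data=d)

# Do the "by" part by hand: one array of values per education level.
grouped = df.groupby('Education')['Hours absent']
levels = sorted(grouped.groups)
x = [grouped.get_group(k).to_numpy() for k in levels]

# Draw the same boxes on both axes and show a different y range on each.
fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
for ax in (ax1, ax2):
    ax.boxplot(x)
ax1.set_ylim(90, 140)   # upper part: outliers only
ax2.set_ylim(0, 50)     # lower part: the boxes themselves
ax1.spines['bottom'].set_visible(False)
ax2.spines['top'].set_visible(False)
ax1.tick_params(labeltop=False, bottom=False)
ax2.set_xticks(range(1, len(levels) + 1))
ax2.set_xticklabels(levels)
```

Calling `fig.savefig(...)` or switching to an interactive backend and calling `plt.show()` would display the result. The diagonal break marks from the question can be added afterwards unchanged.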

Seaborn Complex Heatmap---drawing circles within tiles to denote complex annotation

I have two data-frames in python.
data_A
Name X Y
A 1 0
B 1 1
C 0 0
data_B
Name X Y
A 0 1
B 1 1
C 0 1
I would like to overlap these heatmaps, where if it is a 1 in data_frame A, then the tile is colored purple (or any color), but if it's a 1 in data_frame B, then a circle is drawn (preferably the first one).
So for example, the heatmap would show A[,X][1] colored purple, but those with 1 in both data frames would be purple with a dot. C[,Y][3] would have just a dot, while C[,X][3] would have nothing.
I can mask with seaborn and plot two heatmaps with different colors, but the color difference isn't clear enough for a user to see at a glance whether a tile is positive in only one matrix or in both. I think having a circle to denote a positive in one matrix would be better.
Does anyone have an idea of how to plot circles onto a heatmap using seaborn?
To show a heatmap you may use an imshow plot. To show some dots, you may use a scatter plot. Then just plot both in the same axes.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
dfA = pd.DataFrame([[1,0],[1,1],[0,0]], columns=list("XY"), index=list("ABC"))
dfB = pd.DataFrame([[0,1],[1,1],[0,1]], columns=list("XY"), index=list("ABC"))
assert dfA.shape == dfB.shape
x = np.arange(0,len(dfA.columns))
y = np.arange(0,len(dfB.index))
X,Y=np.meshgrid(x,y)
fig, ax = plt.subplots(figsize=(2.6,3))
ax.invert_yaxis()
ax.imshow(dfA.values, aspect="auto", cmap="Purples")
cond = dfB.values == 1
ax.scatter(X[cond], Y[cond], c="crimson", s=100)
ax.set_xticks(x)
ax.set_yticks(y)
ax.set_xticklabels(dfA.columns)
ax.set_yticklabels(dfA.index)
plt.show()
Alternatives to using a dot to show several datasets on the same heatmap can also be found in:
Plotting two distance matrices together on same plot?
something like plt.matshow but with triangles
You can now plot complex heatmaps directly using the Python package PyComplexHeatmap: https://github.com/DingWB/PyComplexHeatmap
Examples: https://github.com/DingWB/PyComplexHeatmap/blob/main/examples.ipynb
