Python seaborn heatmap grid - Not taking expected columns - python-3.x

I have the following pandas DataFrame. It has 7 different action categories and 5 different targets; each category has one or more unique endpoints, and each endpoint has a score for each target.
There are 250 endpoints in total.
action,target,endpoint,score
Category1,target1,endpoint1,813.0
Category1,target2,endpoint1,757.0
Category1,target3,endpoint1,155.0
Category1,target4,endpoint1,126.0
Category1,target5,endpoint1,75.5
Category2,target1,endpoint2,106.0
Category2,target1,endpoint3,101.0
Category2,target1,endpoint4,499.0
Category2,target1,endpoint5,207.0
Category2,target2,endpoint2,316.0
Category2,target2,endpoint3,208.0
Category2,target2,endpoint4,161.0
Category2,target2,endpoint5,198.0
<omit>
Category3,target1,endpoint8,193.0
Category3,target1,endpoint9,193.0
Category3,target1,endpoint10,193.0
Category3,target1,endpoint11,193.0
Category3,target2,endpoint8,193.0
Category3,target2,endpoint9,193.0
<List goes on...>
Now, I wanted to plot this dataframe as one heatmap per category.
So, I used a seaborn FacetGrid of heatmaps with the following code.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv('rawData.csv')
data = data.drop('Unnamed: 0', axis=1)
def facet_heatmap(data, **kwargs):
    data2 = data.pivot(index="target", columns='endpoint', values='score')
    ax1 = sns.heatmap(data2, cmap="YlGnBu", linewidths=2)
    for item in ax1.get_yticklabels():
        item.set_rotation(0)
    for item in ax1.get_xticklabels():
        item.set_rotation(70)
with sns.plotting_context(font_scale=5.5):
    g = sns.FacetGrid(data, col="action", col_wrap=7, size=5, aspect=0.5)
    cbar_ax = g.fig.add_axes([.92, .3, .02, .4])
    g = g.map_dataframe(facet_heatmap, cbar=cbar_ax, min=0, vmax=2000)
    # <-- Specify the colorbar axes and limits
    g.set_titles(col_template="{col_name}", fontweight='bold', fontsize=18)
    g.fig.subplots_adjust(right=3) # <-- Add space so the colorbar doesn't overlap the plot
plt.savefig('seabornPandas.png', dpi=400)
plt.show()
It does generate a heatmap grid. However, the problem is that each heatmap uses the same columns for some reason. See the attached screenshot below.
(Please ignore the color bar and limits.)
This is quite odd. First, the index is not in order. Second, each heatmap only shows the last three endpoints (endpoints 248, 249 and 250). This is incorrect. For Category1, it should show endpoint 1 only. I don't expect a gray box there.
For Category2, it should show endpoints 2, 3, 4 and 5, not endpoints 248, 249 and 250.
How can I fix these two issues? Any suggestions or comments are welcome.

As mwaskom suggested, use the sharex parameter to fix your issues:
...
with sns.plotting_context(font_scale=5.5):
    g = sns.FacetGrid(data, col="action", col_wrap=7, size=5, aspect=0.5,
                      sharex=False)
...
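A rough illustration of why the facets need independent x axes: each category has its own set of endpoints, so no single list of column labels fits every panel. As far as I understand, with a shared x axis the tick labels set by the last heatmap drawn end up on all panels. The sketch below uses made-up rows in the same layout as the question's data (the rows list and its values are assumptions, not the real file) and only collects the endpoints per category, without plotting:

```python
from collections import defaultdict

# Hypothetical rows in the same layout as the question's CSV:
# (action, target, endpoint, score)
rows = [
    ("Category1", "target1", "endpoint1", 813.0),
    ("Category1", "target2", "endpoint1", 757.0),
    ("Category2", "target1", "endpoint2", 106.0),
    ("Category2", "target1", "endpoint3", 101.0),
    ("Category2", "target2", "endpoint2", 316.0),
    ("Category2", "target2", "endpoint3", 208.0),
]

# Collect the distinct endpoints (the heatmap columns) of every category
endpoints_per_action = defaultdict(set)
for action, target, endpoint, score in rows:
    endpoints_per_action[action].add(endpoint)

# Each category ends up with a different list of column labels
for action in sorted(endpoints_per_action):
    print(action, sorted(endpoints_per_action[action]))
```

Since the column labels differ per category, each panel has to keep its own x tick labels, which is what sharex=False allows.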

Related

How to plot a histogram with plot.hist for continous data in a dataframe in pandas?

In this data set I need to plot pH on the x-axis, which holds continuous data, group the pH values by the quality value, and plot the histogram. Most of the resources I referred to only show solutions that use randomly generated data. I tried this piece of code:
plt.hist(, density=True, bins=1)
plt.ylabel('quality')
plt.xlabel('pH');
Here I removed the randomly generated data, but I received an error:
File "<ipython-input-16-9afc718b5558>", line 1
plt.hist(, density=True, bins=1)
^
SyntaxError: invalid syntax
What is the proper way to plot my data? I want to feed the histogram with data from the data set, not with randomly generated data.
Your Error
The immediate problem in your code is that no data is passed to the plt.hist() command.
plt.hist(, density=True, bins=1)
should be something like:
plt.hist(data_table['pH'], density=True, bins=1)
Seaborn histplot
But this doesn't get the plot broken down by quality. The answer by Mr.T looks correct, but I'd also suggest seaborn which works with "melted" data like you have. The histplot command should give you what you want:
import seaborn as sns
sns.histplot(data=df, x="pH", hue="quality", palette="Dark2", element='step')
Assuming the table you posted is in a pandas.DataFrame named df with columns "pH" and "quality", you get something like:
The palette (Dark2) can be any matplotlib colormap.
Subplots
If the overlaid histograms are too hard to see, an option is to do facets or small multiples. To do this with pandas and matplotlib:
# group dataframe by quality values
data_by_qual = df.groupby('quality')
# create a sub plot for each quality group
fig, axes = plt.subplots(nrows=len(data_by_qual),
                         figsize=[6,12],
                         sharex=True)
fig.subplots_adjust(hspace=.5)
# loop over axes and quality groups together
for ax, (quality, qual_data) in zip(axes, data_by_qual):
    ax.hist(qual_data['pH'], bins=10)
    ax.set_title(f"quality = {quality}")
    ax.set_xlabel('pH')
Altair Facets
The plotting library altair can do this for you:
import altair as alt
alt.Chart(df).mark_bar().encode(
    alt.X("pH:Q", bin=True),
    y='count()',
).facet(row='quality')
Several possibilities here to represent multiple histograms. All have in common that the data have to be transformed from long to wide format - meaning, each category is in its own column:
import matplotlib.pyplot as plt
import pandas as pd
#test data generation
import numpy as np
np.random.seed(123)
n=300
df = pd.DataFrame({"A": np.random.randint(1, 100, n), "pH": 3*np.random.rand(n), "quality": np.random.choice([3, 4, 5, 6], n)})
df.pH += df.quality
#instead of this block you have to read here your stored data, e.g.,
#df = pd.read_csv("my_data_file.csv")
#check that it read the correct data
#print(df.dtypes)
#print(df.head(10))
#bringing the columns in the required wide format
plot_df = df.pivot(columns="quality")["pH"]
bin_nr=5
#creating three subplots for different ways to present the same histograms
fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(6, 12))
ax1.hist(plot_df, bins=bin_nr, density=True, histtype="bar", label=plot_df.columns)
ax1.legend()
ax1.set_title("Basically bar graphs")
plot_df.plot.hist(stacked=True, bins=bin_nr, density=True, ax=ax2)
ax2.set_title("Stacked histograms")
plot_df.plot.hist(alpha=0.5, bins=bin_nr, density=True, ax=ax3)
ax3.set_title("Overlay histograms")
plt.show()
Sample output:
It is not clear, though, what you intended to do with just one bin and why your y-axis was labeled "quality" when this axis represents the frequency in a histogram.

MatPlotLib Plot last few items differently

I'm exploring Matplotlib and would like to know if it is possible to show the last few items in a dataset differently.
Example: if my dataset contains 100 numbers, I want to display the last 5 items in a different color.
So far I could do it for the single last record using annotate, but I want to show the last few items as red dots, as opposed to the blue line.
I could finally achieve this by changing few things in my code.
Below is what I have done.
Let me know in case there is a better way. :)
import pandas as pd
import matplotlib.pyplot as plt
series_df = pd.read_csv('my_data.csv')
series_df = series_df.fillna(0)
series_df = series_df.sort_values(['Date'], ascending=True)
# Created a new DataFrame for last 5 items series_df2
series_df2 = series_df.tail(5)  # assumed: the original post does not show this line
plt.plot(series_df["Date"],series_df["Values"],color="red", marker='+')
plt.plot(series_df2["Date"],series_df2["Values"],color="blue", marker='+')
You should add some minimal code example or a figure with the desired output to make your question clear. It seems you want to highlight some of the last few points with a marker. You can achieve this by calling plot() twice:
import numpy as np
import matplotlib.pyplot as plt
N = 50
x = np.arange(N)
y = np.random.rand(N)
plt.figure()
plt.plot(x, y)
plt.plot(x[-5:], y[-5:], ls='', c='tab:red', marker='.', ms=10)

Seaborn barplot with two y-axis

Considering the following pandas DataFrame:
  labels  values_a  values_b  values_x  values_y
0  date1         1         3       150       170
1  date2         2         6       200       180
It is easy to plot this with Seaborn (see example code below). However, because of the big difference between values_a/values_b and values_x/values_y, the bars for values_a and values_b are not easily visible (the dataset given above is just a sample; in my real dataset the difference is even bigger). Therefore, I would like to use two y-axes, i.e., one y-axis for values_a/values_b and one for values_x/values_y. I tried to use plt.twinx() to get a second axis, but unfortunately the plot shows only two bars, for values_x and values_y, even though there are now two y-axes with the right scaling. :) Do you have an idea how to fix that and get four bars for each label, where the values_a/values_b bars relate to the left y-axis and the values_x/values_y bars relate to the right y-axis?
Thanks in advance!
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
columns = ["labels", "values_a", "values_b", "values_x", "values_y"]
test_data = pd.DataFrame.from_records([("date1", 1, 3, 150, 170),\
("date2", 2, 6, 200, 180)],\
columns=columns)
# working example but with unreadable values_a and values_b
test_data_melted = pd.melt(test_data, id_vars=columns[0],\
var_name="source", value_name="value_numbers")
g = sns.barplot(x=columns[0], y="value_numbers", hue="source",\
data=test_data_melted)
plt.show()
# values_a and values_b are not displayed
values1_melted = pd.melt(test_data, id_vars=columns[0],\
value_vars=["values_a", "values_b"],\
var_name="source1", value_name="value_numbers1")
values2_melted = pd.melt(test_data, id_vars=columns[0],\
value_vars=["values_x", "values_y"],\
var_name="source2", value_name="value_numbers2")
g1 = sns.barplot(x=columns[0], y="value_numbers1", hue="source1",\
data=values1_melted)
ax2 = plt.twinx()
g2 = sns.barplot(x=columns[0], y="value_numbers2", hue="source2",\
data=values2_melted, ax=ax2)
plt.show()
This is probably best suited for multiple sub-plots, but if you are truly set on a single plot, you can scale the data before plotting, create another axis and then modify the tick values.
Sample Data
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
columns = ["labels", "values_a", "values_b", "values_x", "values_y"]
test_data = pd.DataFrame.from_records([("date1", 1, 3, 150, 170),\
("date2", 2, 6, 200, 180)],\
columns=columns)
test_data_melted = pd.melt(test_data, id_vars=columns[0],\
var_name="source", value_name="value_numbers")
Code:
# Scale the data, just a simple example of how you might determine the scaling
mask = test_data_melted.source.isin(['values_a', 'values_b'])
scale = int(test_data_melted[~mask].value_numbers.mean()
/test_data_melted[mask].value_numbers.mean())
test_data_melted.loc[mask, 'value_numbers'] = test_data_melted.loc[mask, 'value_numbers']*scale
# Plot
fig, ax1 = plt.subplots()
g = sns.barplot(x=columns[0], y="value_numbers", hue="source",\
data=test_data_melted, ax=ax1)
# Create a second y-axis with the scaled ticks
ax1.set_ylabel('X and Y')
ax2 = ax1.twinx()
# Ensure ticks occur at the same positions, then modify labels
ax2.set_yticks(ax1.get_yticks())
ax2.set_ylim(ax1.get_ylim())
ax2.set_yticklabels(np.round(ax1.get_yticks()/scale,1))
ax2.set_ylabel('A and B')
plt.show()
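To make the scaling step concrete, here is the same arithmetic worked through in plain Python for the sample data above. The helper name to_secondary_label is made up for this illustration; it only mirrors the division by scale that the code above uses for the tick labels.

```python
from statistics import mean

# Values taken from the sample DataFrame above
small = [1, 2, 3, 6]          # values_a and values_b
large = [150, 200, 170, 180]  # values_x and values_y

# Same rule as in the answer: ratio of the two means, truncated to an int
scale = int(mean(large) / mean(small))

# The bars for values_a/values_b are drawn at value * scale
scaled_small = [v * scale for v in small]

def to_secondary_label(tick):
    """Convert a tick position on the left axis back to A/B units."""
    return round(tick / scale, 1)

print(scale)
print(scaled_small)
print([to_secondary_label(t) for t in (0, 58, 116, 174)])
```

With these numbers the means are 175 and 3, so scale is 58, and a left-axis tick at 116 is labelled 2.0 on the right axis.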

Multiple heatmaps with fixed grid size

I am using seaborn (v0.7.1) together with matplotlib (v1.5.1) and pandas (v0.18.1) to plot clusters of data of different sizes as heat maps within a for loop, as shown in the following code.
My issue is that, since each cluster contains a different number of rows, the final figures are of different sizes (i.e. the height and width of each box in the heat map differ across heat maps) (see figures). Eventually, I would like to have figures of the same size (as explained above).
I have checked some parts of the seaborn and matplotlib documentation as well as Stack Overflow, but since I do not know the exact keywords to look for (as is evident from the question title itself) I have not been able to find an answer. [EDIT: I have now updated the title based on a suggestion from @ImportanceOfBeingErnest. Previously the title read: "Enforcing the same width across multiple plots".]
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
clusters = pd.DataFrame([(1,'aaaaaaaaaaaaaaaaa'),(1,'b'), (1,'c'), (1,'d'), (2,'e'), (2,'f')])
clusters.columns = ['c', 'p']
clusters.set_index('c', inplace=True)
g = pd.DataFrame(np.ones((6,4)))
c= pd.DataFrame([(1,'aaaaaaaaaaaaaaaaa'),(2,'b'), (3,'c'), (4,'d'), (5,'e'), (6,'f')])
c.columns = ['i', 'R']
for i in range(1,3,1):
    ee = clusters[clusters.index==i].p
    inds = []
    # look up the row position of every cluster member
    for v in ee:
        inds.append(np.where(c.R.values == v)[0][0])
    f, ax = plt.subplots(1, figsize=(13, 15))
    ax = sns.heatmap(g.iloc[inds], square=True, ax=ax, cbar=True, linewidths=2, linecolor='k', cmap="Reds", cbar_kws={"shrink": .5},
                     vmin = math.floor(g.values.min()), vmax =math.ceil(g.values.max()))
    null = ax.set_xticklabels(['a', 'b', 'c', 'd'], fontsize=15)
    null = ax.set_yticklabels(c.R.values[inds][::-1], fontsize=15, rotation=0)
    plt.tight_layout(pad=3)
[EDIT]: I have now added some code to create a minimal, functional example, as suggested by @Brian. I have also noticed that the issue might have been caused by the text!
Under the following conditions:
only the squares in the saved images should have the same size, and we don't care about the plot on screen, and
we can omit the colorbar,
the solution is rather straightforward.
One would define the size that one square should have in the final image, e.g. squaresize = 50 pixels, find out the number of squares to draw in each dimension (n rows, m columns) and adjust the figure size as
figwidth = m*squaresize/float(dpi)
figheight = n*squaresize/float(dpi)
where dpi denotes the pixels per inch.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
dpi=100
squaresize = 50 # pixels
n = 3
m = 4
data = np.random.rand(n,m)
figwidth = m*squaresize/float(dpi)
figheight = n*squaresize/float(dpi)
f, ax = plt.subplots(1, figsize=(figwidth, figheight), dpi=dpi)
f.subplots_adjust(left=0, right=1, bottom=0, top=1)
ax = sns.heatmap(data, square=True, ax=ax, cbar=False)
plt.savefig(__file__+".png", dpi=dpi, bbox_inches="tight")
The bbox_inches="tight" makes sure that the labels etc. are still drawn (i.e. the final figure size will be larger than the one calculated here, depending on how much space the labels need).
To apply this example to your case you'd still need to find out how many rows and columns you have in the heatmap depending on the dataframe, but as I don't have its structure, it's hard to provide a general solution.
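As a sketch of that last step, the figure size could be derived from the shape of whatever is passed to the heatmap. The function name figsize_for_grid and its default values are assumptions made for this example; with a pandas DataFrame, n_rows and n_cols would come from its shape attribute.

```python
def figsize_for_grid(n_rows, n_cols, squaresize=50, dpi=100):
    """Return (width, height) in inches so that each cell is squaresize pixels."""
    figwidth = n_cols * squaresize / float(dpi)
    figheight = n_rows * squaresize / float(dpi)
    return figwidth, figheight

# Two clusters with a different number of rows but the same 4 columns
for n_rows in (4, 2):
    print(n_rows, figsize_for_grid(n_rows, 4))
```

The returned tuple can be passed as figsize to plt.subplots together with the same dpi, so that every cluster gets a figure scaled to its own number of rows.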

Bokeh secondary y range affecting primary y range

I'm working on building a Bokeh plot using bokeh.plotting. I have two series with a shared index, and I want to plot a vertical bar for each of them. When I use a single bar everything works fine, but when I add a second y range and the second bar, it seems to affect the primary y range (the values change from 0 to 4), and my second vbar() overlays the first. Any help on why the bars overlap instead of sitting side by side, and why the second series/y-axis seems to affect the first even though they are separate, would be appreciated.
import pandas as pd
import bokeh.plotting as bp
from bokeh.models import NumeralTickFormatter, HoverTool, Range1d, LinearAxis
df_x_series = ['a','b','c']
fig = bp.figure(title='WIP',x_range=df_x_series,plot_width=1200,plot_height=600,toolbar_location='below',toolbar_sticky=False,tools=['reset','save'],active_scroll=None,active_drag=None,active_tap=None)
fig.title.align= 'center'
fig.extra_y_ranges = {'c_count':Range1d(start=0, end=10)}
fig.add_layout(LinearAxis(y_range_name='c_count'), 'right')
fig.vbar(bottom=0, top=[1,2,3], x=['a','b','c'], color='blue', legend='Amt', width=0.3, alpha=0.5)
fig.vbar(bottom=0, top=[5,7,8], x=['a','b','c'], color='green', legend='Ct', width=0.3, alpha=0.8, y_range_name='c_count')
fig.yaxis[0].formatter = NumeralTickFormatter(format='0.0')
bp.output_file('bar.html')
bp.show(fig)
Here's the plot I believe you want:
And here's the code:
import bokeh.plotting as bp
from bokeh.models import NumeralTickFormatter, Range1d, LinearAxis
df_x_series = ['a', 'b', 'c']
fig = bp.figure(
    title='WIP',
    x_range=df_x_series,
    y_range=Range1d(start=0, end=4),
    plot_width=1200, plot_height=600,
    toolbar_location='below',
    toolbar_sticky=False,
    tools=['reset', 'save'],
    active_scroll=None, active_drag=None, active_tap=None
)
fig.title.align = 'center'
fig.extra_y_ranges = {'c_count': Range1d(start=0, end=10)}
fig.add_layout(LinearAxis(y_range_name='c_count'), 'right')
fig.vbar(bottom=0, top=[1, 2, 3], x=['a:0.35', 'b:0.35', 'c:0.35'], color='blue', legend='Amt', width=0.3, alpha=0.5)
fig.vbar(bottom=0, top=[5, 7, 8], x=['a:0.65', 'b:0.65', 'c:0.65'], color='green', legend='Ct', width=0.3, alpha=0.8, y_range_name='c_count')
fig.yaxis[0].formatter = NumeralTickFormatter(format='0.0')
bp.output_file('bar.html')
bp.show(fig)
A couple of notes:
Categorical axes are currently a bit (ahem) ugly in Bokeh. We hope to address this in the coming months. Each one has a scale of 0 - 1 after a colon which allows you to move things left and right. So I move the first bar to the left by 0.3/2 and the second bar to the right by 0.3/2 (0.3 because that's the width you had used)
The y_range changed because you were using the default y_range for your initial y_range which is a DataRange1d. DataRange uses all the data for the plot to pick its values and adds some padding which is why it was starting at below 0 and going up to the max of your new data. By manually specifying a range in the figure call you get around this.
Thanks for providing a code sample to work from :D
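The offset arithmetic from the first note can be spelled out with a small helper. This is only a sketch that builds the 'category:offset' strings used in the code above; the function name side_by_side_positions is hypothetical, and the colon syntax is taken from this answer rather than checked against other Bokeh versions.

```python
def side_by_side_positions(categories, width=0.3):
    """Return x positions for two bars placed left and right of each category centre."""
    left = 0.5 - width / 2   # centre of the first bar
    right = 0.5 + width / 2  # centre of the second bar
    first = [f"{c}:{left:.2f}" for c in categories]
    second = [f"{c}:{right:.2f}" for c in categories]
    return first, second

first, second = side_by_side_positions(["a", "b", "c"])
print(first)
print(second)
```

With width=0.3 the two lists contain the 0.35 and 0.65 offsets passed as x to the two vbar() calls.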
