Need to force overlapping for seaborn's heatmap and kdeplot - python-3.x

I'm trying to combine seaborn's heatmap and kdeplot in one figure, but so far the result is not very promising since I cannot find a way to make them overlap. As a result, the heatmap is just squeezed to the left side of the figure.
I think the reason is that seaborn doesn't seem to recognize the x-axis as the same one in two charts (see picture below), although the data points are exactly the same. The only difference is that for heatmap I needed to pivot them, while for the kdeplot pivoting is not needed.
Therefore, data for the axis are coming from the same dataset, but in the different forms as it can be seen in the code below.
The dataset sample looks something like this:
X Y Z
7,75 280 52,73
3,25 340 54,19
5,75 340 53,61
2,5 180 54,67
3 340 53,66
1,75 340 54,81
4,5 380 55,18
4 240 56,49
4,75 380 55,17
4,25 180 55,40
2 420 56,42
2,25 380 54,90
My code:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
f, ax = plt.subplots(figsize=(11, 9), dpi=300)
plt.tick_params(bottom='on')
# dataset is just a pandas frame with data
X1 = dataset.iloc[:, :3].pivot("X", "Y", "Z")
X2 = dataset.iloc[:, :2]
ax = sns.heatmap(X1, cmap="Spectral")
ax.invert_yaxis()
ax2 = plt.twinx()
sns.kdeplot(X2.iloc[:, 1], X2.iloc[:, 0], ax=ax2, zorder=2)
ax.axis('tight')
plt.show()
Please help me with placing kdeplot on top of the heatmap. Ideally, I would like my final plot to look something like this:
Any tips or hints will be greatly appreciated!

The question can be a bit hard to understand, because the dataset can't be "just some data". The X and Y values need to lie on a very regular grid. No X,Y combination can be repeated, but not all values appear. The kdeplot will then show where the used values of X,Y are concentrated.
Such a dataset can be simulated by first generating dummy data for a full grid, and then take a subset.
Now, a seaborn heatmap uses categorical X and Y axes. Such axes are very hard to align with the kdeplot. To obtain a similar heatmap with numerical axes, ax.pcolor() can be used.
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
xs = np.arange(2, 10, 0.25)
ys = np.arange(150, 400, 10)
# first create a dummy dataset over a full grid
dataset = pd.DataFrame({'X': np.repeat(xs, len(ys)),
'Y': np.tile(ys, len(xs)),
'Z': np.random.uniform(50, 60, len(xs) * len(ys))})
# take a random subset of the rows
dataset = dataset.sample(200)
fig, ax = plt.subplots(figsize=(11, 9), dpi=300)
X1 = dataset.pivot("X", "Y", "Z")
collection = ax.pcolor(X1.columns, X1.index, X1, shading='nearest', cmap="Spectral")
plt.colorbar(collection, ax=ax, pad=0.02)
# default, cut=3, which causes a lot of surrounding whitespace
sns.kdeplot(x=dataset["Y"], y=dataset["X"], cut=1.5, ax=ax)
fig.tight_layout()
plt.show()

Related

How to plot a histogram with plot.hist for continous data in a dataframe in pandas?

In this data set I need to plot,pH as the x-column which is having continuous data and need to group it together the pH axis as per the quality value and plot the histogram. In many of the resources I referred I found solutions for using random data generated. I tried this piece of code.
plt.hist(, density=True, bins=1)
plt.ylabel('quality')
plt.xlabel('pH');
Where I eliminated the random generated data, but I received and error
File "<ipython-input-16-9afc718b5558>", line 1
plt.hist(, density=True, bins=1)
^
SyntaxError: invalid syntax
What is the proper way to plot my data?I want to feed into the histogram not randomly generated data, but data found in the data set.
Your Error
The immediate problem in your code is the missing data to the plt.hist() command.
plt.hist(, density=True, bins=1)
should be something like:
plt.hist(data_table['pH'], density=True, bins=1)
Seaborn histplot
But this doesn't get the plot broken down by quality. The answer by Mr.T looks correct, but I'd also suggest seaborn which works with "melted" data like you have. The histplot command should give you what you want:
import seaborn as sns
sns.histplot(data=df, x="pH", hue="quality", palette="Dark2", element='step')
Assuming the table you posted is in a pandas.DataFrame named df with columns "pH" and "quality", you get something like:
The palette (Dark2) can can be any matplotlib colormap.
Subplots
If the overlaid histograms are too hard to see, an option is to do facets or small multiples. To do this with pandas and matplotlib:
# group dataframe by quality values
data_by_qual = df.groupby('quality')
# create a sub plot for each quality group
fig, axes = plt.subplots(nrows=len(data_by_qual),
figsize=[6,12],
sharex=True)
fig.subplots_adjust(hspace=.5)
# loop over axes and quality groups together
for ax, (quality, qual_data) in zip(axes, data_by_qual):
ax.hist(qual_data['pH'], bins=10)
ax.set_title(f"quality = {quality}")
ax.set_xlabel('pH')
Altair Facets
The plotting library altair can do this for you:
import altair as alt
alt.Chart(df).mark_bar().encode(
alt.X("pH:Q", bin=True),
y='count()',
).facet(row='quality')
Several possibilities here to represent multiple histograms. All have in common that the data have to be transformed from long to wide format - meaning, each category is in its own column:
import matplotlib.pyplot as plt
import pandas as pd
#test data generation
import numpy as np
np.random.seed(123)
n=300
df = pd.DataFrame({"A": np.random.randint(1, 100, n), "pH": 3*np.random.rand(n), "quality": np.random.choice([3, 4, 5, 6], n)})
df.pH += df.quality
#instead of this block you have to read here your stored data, e.g.,
#df = pd.read_csv("my_data_file.csv")
#check that it read the correct data
#print(df.dtypes)
#print(df.head(10))
#bringing the columns in the required wide format
plot_df = df.pivot(columns="quality")["pH"]
bin_nr=5
#creating three subplots for different ways to present the same histograms
fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(6, 12))
ax1.hist(plot_df, bins=bin_nr, density=True, histtype="bar", label=plot_df.columns)
ax1.legend()
ax1.set_title("Basically bar graphs")
plot_df.plot.hist(stacked=True, bins=bin_nr, density=True, ax=ax2)
ax2.set_title("Stacked histograms")
plot_df.plot.hist(alpha=0.5, bins=bin_nr, density=True, ax=ax3)
ax3.set_title("Overlay histograms")
plt.show()
Sample output:
It is not clear, though, what you intended to do with just one bin and why your y-axis was labeled "quality" when this axis represents the frequency in a histogram.

Python: how to create a smoothed version of a 2D binned "color map"?

I would like to create a version of this 2D binned "color map" with smoothed colors.
I am not even sure this would be the correct nomenclature for the plot, but, essentially, I want my figure to be color coded by the median values of a third variable for points that reside in each defined bin of my (X, Y) space.
Even though I am able to accomplish that to a certain degree (see example), I would like to find a way to create a version of the same plot with a smoothed color gradient. That would allow me to visualize the overall behavior of my distribution.
I tried ideas described here: Smoothing 2D map in python
and here: Python: binned_statistic_2d mean calculation ignoring NaNs in data
as well as links therein, but could not find a clear solution to the problem.
This is what I have so far:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from scipy.stats import binned_statistic_2d
import random
random.seed(999)
x = np.random.normal (0,10,5000)
y = np.random.normal (0,10,5000)
z = np.random.uniform(0,10,5000)
fig = plt.figure(figsize=(20, 20))
plt.rcParams.update({'font.size': 10})
ax = fig.add_subplot(3,3,1)
ax.set_axisbelow(True)
plt.grid(b=True, lw=0.5, zorder=-1)
x_bins = np.arange(-50., 50.5, 1.)
y_bins = np.arange(-50., 50.5, 1.)
cmap = plt.cm.get_cmap('jet_r',1000) #just a colormap
ret = binned_statistic_2d(x, y, z, statistic=np.median, bins=[x_bins, y_bins]) # Bin (X, Y) and create a map of the medians of "Colors"
plt.imshow(ret.statistic.T, origin='bottom', extent=(-50, 50, -50, 50), cmap=cmap)
plt.xlim(-40,40)
plt.ylim(-40,40)
plt.xlabel("X", fontsize=15)
plt.ylabel("Y", fontsize=15)
ax.set_yticks([-40,-30,-20,-10,0,10,20,30,40])
bounds = np.arange(2.0, 20.0, 1.0)
plt.colorbar(ticks=bounds, label="Color", fraction=0.046, pad=0.04)
# save plots
plt.savefig("Whatever_name.png", bbox_inches='tight')
Which produces the following image (from random data):
Therefore, the simple question would be: how to smooth these colors?
Thanks in advance!
PS: sorry for excessive coding, but I believe a clear visualization is crucial for this particular problem.
Thanks to everyone who viewed this issue and tried to help!
I ended up being able to solve my own problem. In the end, it was all about image smoothing with Gaussian Kernel.
This link: Gaussian filtering a image with Nan in Python gave me the insight for the solution.
I, basically, implemented the exactly same code, but, in the end, mapped the previously known NaN pixels from the original 2D array to the resulting smoothed version. Unlike the solution from the link, my version does NOT fill NaN pixels with some value derived from the pixels around. Or, it does, but then I erase those again.
Here is the final figure produced for the example I provided:
Final code, for reference, for those who might need in the future:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from scipy.stats import binned_statistic_2d
import scipy.stats as st
import scipy.ndimage
import scipy as sp
import random
random.seed(999)
x = np.random.normal (0,10,5000)
y = np.random.normal (0,10,5000)
z = np.random.uniform(0,10,5000)
fig = plt.figure(figsize=(20, 20))
plt.rcParams.update({'font.size': 10})
ax = fig.add_subplot(3,3,1)
ax.set_axisbelow(True)
plt.grid(b=True, lw=0.5, zorder=-1)
x_bins = np.arange(-50., 50.5, 1.)
y_bins = np.arange(-50., 50.5, 1.)
cmap = plt.cm.get_cmap('jet_r',1000) #just a colormap
ret = binned_statistic_2d(x, y, z, statistic=np.median, bins=[x_bins, y_bins]) # Bin (X, Y) and create a map of the medians of "Colors"
sigma=1 # standard deviation for Gaussian kernel
truncate=5.0 # truncate filter at this many sigmas
U = ret.statistic.T.copy()
V=U.copy()
V[np.isnan(U)]=0
VV=sp.ndimage.gaussian_filter(V,sigma=sigma)
W=0*U.copy()+1
W[np.isnan(U)]=0
WW=sp.ndimage.gaussian_filter(W,sigma=sigma)
np.seterr(divide='ignore', invalid='ignore')
Z=VV/WW
for i in range(len(Z)):
for j in range(len(Z[0])):
if np.isnan(U[i][j]):
Z[i][j] = np.nan
plt.imshow(Z, origin='bottom', extent=(-50, 50, -50, 50), cmap=cmap)
plt.xlim(-40,40)
plt.ylim(-40,40)
plt.xlabel("X", fontsize=15)
plt.ylabel("Y", fontsize=15)
ax.set_yticks([-40,-30,-20,-10,0,10,20,30,40])
bounds = np.arange(2.0, 20.0, 1.0)
plt.colorbar(ticks=bounds, label="Color", fraction=0.046, pad=0.04)
# save plots
plt.savefig("Whatever_name.png", bbox_inches='tight')

Why is Python matplot not starting from the point where my Data starts [duplicate]

So currently learning how to import data and work with it in matplotlib and I am having trouble even tho I have the exact code from the book.
This is what the plot looks like, but my question is how can I get it where there is no white space between the start and the end of the x-axis.
Here is the code:
import csv
from matplotlib import pyplot as plt
from datetime import datetime
# Get dates and high temperatures from file.
filename = 'sitka_weather_07-2014.csv'
with open(filename) as f:
reader = csv.reader(f)
header_row = next(reader)
#for index, column_header in enumerate(header_row):
#print(index, column_header)
dates, highs = [], []
for row in reader:
current_date = datetime.strptime(row[0], "%Y-%m-%d")
dates.append(current_date)
high = int(row[1])
highs.append(high)
# Plot data.
fig = plt.figure(dpi=128, figsize=(10,6))
plt.plot(dates, highs, c='red')
# Format plot.
plt.title("Daily high temperatures, July 2014", fontsize=24)
plt.xlabel('', fontsize=16)
fig.autofmt_xdate()
plt.ylabel("Temperature (F)", fontsize=16)
plt.tick_params(axis='both', which='major', labelsize=16)
plt.show()
There is an automatic margin set at the edges, which ensures the data to be nicely fitting within the axis spines. In this case such a margin is probably desired on the y axis. By default it is set to 0.05 in units of axis span.
To set the margin to 0 on the x axis, use
plt.margins(x=0)
or
ax.margins(x=0)
depending on the context. Also see the documentation.
In case you want to get rid of the margin in the whole script, you can use
plt.rcParams['axes.xmargin'] = 0
at the beginning of your script (same for y of course). If you want to get rid of the margin entirely and forever, you might want to change the according line in the matplotlib rc file:
axes.xmargin : 0
axes.ymargin : 0
Example
import seaborn as sns
import matplotlib.pyplot as plt
tips = sns.load_dataset('tips')
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
tips.plot(ax=ax1, title='Default Margin')
tips.plot(ax=ax2, title='Margins: x=0')
ax2.margins(x=0)
Alternatively, use plt.xlim(..) or ax.set_xlim(..) to manually set the limits of the axes such that there is no white space left.
If you only want to remove the margin on one side but not the other, e.g. remove the margin from the right but not from the left, you can use set_xlim() on a matplotlib axes object.
import seaborn as sns
import matplotlib.pyplot as plt
import math
max_x_value = 100
x_values = [i for i in range (1, max_x_value + 1)]
y_values = [math.log(i) for i in x_values]
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
sn.lineplot(ax=ax1, x=x_values, y=y_values)
sn.lineplot(ax=ax2, x=x_values, y=y_values)
ax2.set_xlim(-5, max_x_value) # tune the -5 to your needs

Seaborn barplot with two y-axis

considering the following pandas DataFrame:
labels values_a values_b values_x values_y
0 date1 1 3 150 170
1 date2 2 6 200 180
It is easy to plot this with Seaborn (see example code below). However, due to the big difference between values_a/values_b and values_x/values_y, the bars for values_a and values_b are not easily visible (actually, the dataset given above is just a sample and in my real dataset the difference is even bigger). Therefore, I would like to use two y-axis, i.e., one y-axis for values_a/values_b and one for values_x/values_y. I tried to use plt.twinx() to get a second axis but unfortunately, the plot shows only two bars for values_x and values_y, even though there are at least two y-axis with the right scaling. :) Do you have an idea how to fix that and get four bars for each label whereas the values_a/values_b bars relate to the left y-axis and the values_x/values_y bars relate to the right y-axis?
Thanks in advance!
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
columns = ["labels", "values_a", "values_b", "values_x", "values_y"]
test_data = pd.DataFrame.from_records([("date1", 1, 3, 150, 170),\
("date2", 2, 6, 200, 180)],\
columns=columns)
# working example but with unreadable values_a and values_b
test_data_melted = pd.melt(test_data, id_vars=columns[0],\
var_name="source", value_name="value_numbers")
g = sns.barplot(x=columns[0], y="value_numbers", hue="source",\
data=test_data_melted)
plt.show()
# values_a and values_b are not displayed
values1_melted = pd.melt(test_data, id_vars=columns[0],\
value_vars=["values_a", "values_b"],\
var_name="source1", value_name="value_numbers1")
values2_melted = pd.melt(test_data, id_vars=columns[0],\
value_vars=["values_x", "values_y"],\
var_name="source2", value_name="value_numbers2")
g1 = sns.barplot(x=columns[0], y="value_numbers1", hue="source1",\
data=values1_melted)
ax2 = plt.twinx()
g2 = sns.barplot(x=columns[0], y="value_numbers2", hue="source2",\
data=values2_melted, ax=ax2)
plt.show()
This is probably best suited for multiple sub-plots, but if you are truly set on a single plot, you can scale the data before plotting, create another axis and then modify the tick values.
Sample Data
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
columns = ["labels", "values_a", "values_b", "values_x", "values_y"]
test_data = pd.DataFrame.from_records([("date1", 1, 3, 150, 170),\
("date2", 2, 6, 200, 180)],\
columns=columns)
test_data_melted = pd.melt(test_data, id_vars=columns[0],\
var_name="source", value_name="value_numbers")
Code:
# Scale the data, just a simple example of how you might determine the scaling
mask = test_data_melted.source.isin(['values_a', 'values_b'])
scale = int(test_data_melted[~mask].value_numbers.mean()
/test_data_melted[mask].value_numbers.mean())
test_data_melted.loc[mask, 'value_numbers'] = test_data_melted.loc[mask, 'value_numbers']*scale
# Plot
fig, ax1 = plt.subplots()
g = sns.barplot(x=columns[0], y="value_numbers", hue="source",\
data=test_data_melted, ax=ax1)
# Create a second y-axis with the scaled ticks
ax1.set_ylabel('X and Y')
ax2 = ax1.twinx()
# Ensure ticks occur at the same positions, then modify labels
ax2.set_ylim(ax1.get_ylim())
ax2.set_yticklabels(np.round(ax1.get_yticks()/scale,1))
ax2.set_ylabel('A and B')
plt.show()

Seaborn right ytick [duplicate]

This question already has answers here:
multiple axis in matplotlib with different scales [duplicate]
(3 answers)
Closed 5 years ago.
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
d = ['d1','d2','d3','d4','d5','d6']
value = [111111, 222222, 333333, 444444, 555555, 666666]
y_cumsum = np.cumsum(value)
sns.barplot(d, value)
sns.pointplot(d, y_cumsum)
plt.show()
I'm trying to make pareto diagram with barplot and pointplot. But I can't print percentages to the right side ytick. By the way, if I manuplate yticks it overlaps itself.
plt.yticks([1,2,3,4,5])
overlaps like in the image.
Edit: I mean that I want to quarter percentages (0, 25%, 50%, 75%, 100%) on the right hand side of the graphic, as well.
From what I understood, you want to show the percentages on the right hand side of your figure. To do that, we can create a second y axis using twinx(). All we need to do then is to set the limits of this second axis appropriately, and set some custom labels:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
d = ['d1','d2','d3','d4','d5','d6']
value = [111111, 222222, 333333, 444444, 555555, 666666]
fig, ax = plt.subplots()
ax2 = ax.twinx() # create a second y axis
y_cumsum = np.cumsum(value)
sns.barplot(d, value, ax=ax)
sns.pointplot(d, y_cumsum, ax=ax)
y_max = y_cumsum.max() # maximum of the array
# find the percentages of the max y values.
# This will be where the "0%, 25%" labels will be placed
ticks = [0, 0.25*y_max, 0.5*y_max, 0.75*y_max, y_max]
ax2.set_ylim(ax.get_ylim()) # set second y axis to have the same limits as the first y axis
ax2.set_yticks(ticks)
ax2.set_yticklabels(["0%", "25%","50%","75%","100%"]) # set the labels
ax2.grid("off")
plt.show()
This produces the following figure:

Resources