How to plot a heatmap for a high-dimensional dataset? - python-3.x

I would greatly appreciate it if you could let me know how to plot a high-resolution heatmap for a large dataset with approximately 150 features.
My code is as follows:
XX = pd.read_csv('Financial Distress.csv')
y = np.array(XX['Financial Distress'].values.tolist())
y = np.array([0 if i > -0.50 else 1 for i in y])
XX = XX.iloc[:, 3:87]
df=XX
df["target_var"]=y.tolist()
target_var=["target_var"]
fig, ax = plt.subplots(figsize=(8, 6))
correlation = df.select_dtypes(include=['float64', 'int64']).iloc[:, 1:].corr()
sns.heatmap(correlation, ax=ax, vmax=1, square=True)
plt.xticks(rotation=90)
plt.yticks(rotation=360)
plt.title('Correlation matrix')
plt.tight_layout()
plt.show()
k = df.shape[1] # number of variables for heatmap
fig, ax = plt.subplots(figsize=(9, 9))
corrmat = df.corr()
# Generate a mask for the upper triangle
mask = np.zeros_like(corrmat, dtype=bool)  # the builtin bool works across NumPy versions
mask[np.triu_indices_from(mask)] = True
cols = corrmat.nlargest(k, target_var)[target_var].index
cm = np.corrcoef(df[cols].values.T)
sns.set(font_scale=1.0)
hm = sns.heatmap(cm, mask=mask, cbar=True, annot=True,
                 square=True, fmt='.2f', annot_kws={'size': 7},
                 yticklabels=cols.values,
                 xticklabels=cols.values)
plt.xticks(rotation=90)
plt.yticks(rotation=360)
plt.title('Annotated heatmap matrix')
plt.tight_layout()
plt.show()
It works fine, but the heatmap plotted for a dataset with more than 40 features is too small.
Thanks in advance,

Adjusting the figsize and dpi worked for me.
I adapted your code and doubled the size of the heatmap to 165 x 165. The rendering takes a while, but the PNG looks fine. My backend is "module://ipykernel.pylab.backend_inline".
As noted in my original answer, I'm pretty sure you forgot to close the figure object before creating a new one. Try plt.close("all") before fig, ax = plt.subplots() if you get weird effects.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
print(plt.get_backend())
# close any existing plots
plt.close("all")
df = pd.read_csv("Financial Distress.csv")
# select out the desired columns
df = df.iloc[:, 3:].select_dtypes(include=['float64','int64'])
# copy columns to double size of dataframe
df2 = df.copy()
df2.columns = "c_" + df2.columns
df3 = pd.concat([df, df2], axis=1)
# get the correlation coefficient between the different columns
corr = df3.iloc[:, 1:].corr()
arr_corr = corr.to_numpy(copy=True)  # DataFrame.as_matrix() was removed in pandas 1.0
# mask out the top triangle
arr_corr[np.triu_indices_from(arr_corr)] = np.nan
fig, ax = plt.subplots(figsize=(24, 18))
hm = sns.heatmap(arr_corr, cbar=True, vmin=-0.5, vmax=0.5,
fmt='.2f', annot_kws={'size': 3}, annot=True,
square=True, cmap=plt.cm.Blues)
ticks = np.arange(corr.shape[0]) + 0.5
ax.set_xticks(ticks)
ax.set_xticklabels(corr.columns, rotation=90, fontsize=8)
ax.set_yticks(ticks)
ax.set_yticklabels(corr.index, rotation=360, fontsize=8)
ax.set_title('correlation matrix')
plt.tight_layout()
plt.savefig("corr_matrix_incl_anno_double.png", dpi=300)
[Figure: full figure]
[Figure: zoom of the top-left section]
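The pixel dimensions of the saved image follow from figsize (in inches) multiplied by dpi. A minimal sketch of that arithmetic, reusing the figsize and dpi values from the answer above; the variable names are illustrative only:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

figsize_inches = (24, 18)   # same values as in the answer above
dpi = 300

# Saved image size in pixels = size in inches * dots per inch
width_px = int(figsize_inches[0] * dpi)
height_px = int(figsize_inches[1] * dpi)

fig, ax = plt.subplots(figsize=figsize_inches)
size_in = fig.get_size_inches()   # the figure itself only knows its size in inches
plt.close(fig)

print(width_px, height_px)
```

With roughly 165 rows and columns, this leaves a few dozen pixels per cell, which is why the small annotations stay legible when zooming into the PNG.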

If I understand your problem correctly, I think all you have to do is increase your figure size:
f, ax = plt.subplots(figsize=(20, 20))
instead of
f, ax = plt.subplots(figsize=(9, 9))
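Rather than hard-coding the size, one option is to let the figure grow with the number of features. The sketch below is only an illustration: heatmap_figsize and its inches_per_cell default are hypothetical, the data is random, and plain matplotlib imshow stands in for sns.heatmap to keep the example self-contained.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np


def heatmap_figsize(n_features, inches_per_cell=0.25, min_inches=6.0):
    """Return a square (width, height) in inches that grows with the feature count."""
    side = max(min_inches, n_features * inches_per_cell)
    return (side, side)


rng = np.random.default_rng(0)
data = rng.normal(size=(200, 150))        # random stand-in for 150 numeric features
corr = np.corrcoef(data, rowvar=False)    # 150 x 150 correlation matrix

figsize = heatmap_figsize(corr.shape[0])
fig, ax = plt.subplots(figsize=figsize)
im = ax.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
fig.colorbar(im, ax=ax)
plt.close(fig)
```

The same figsize tuple can be passed to plt.subplots before calling sns.heatmap(..., ax=ax).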

Related

How to reduce the scale of a scatter plot of row coordinates to merge it with a correlation circle and make a biplot?

I have a dataset composed of data with the same unit of measurement. Before making my pca, I centered my data using sklearn.preprocessing.StandardScaler(with_std=False).
I don't understand why, but when I use sklearn.decomposition.PCA.fit_transform(<my_dataframe>) and display a correlation circle, two of my variables appear perfectly orthogonal, which would indicate that they are independent, but they are not. The correlation matrix clearly shows that they are anti-correlated.
After some research I came across the "prince" package, which manages to get the correct coordinates of my centered but unscaled variables.
When I do my PCA with it, I can display the projection of my rows correctly. It also has the advantage of being able to display ellipses. The only problem is that there is no function for a biplot.
I managed to display a circle of correlations using the column_correlations() method to get the coordinates of the variables. By tinkering here is what I managed to get:
When I try to put my two graphs together to form a biplot, my scatter plot is displayed in a scale that is way too large compared to the correlation circle.
I would just like to merge the two charts together using this package.
Here is the code that allowed me to get the graph showing row principal coordinates:
Note: to provide a reproducible example, I use the iris dataset, which is similar in form to my dataset.
import pandas as pd
import prince
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as np
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
dataset = pd.read_csv(url, names=names)
dataset = dataset.set_index('Class')
sc = StandardScaler(with_std=False)
dataset = pd.DataFrame(sc.fit_transform(dataset),
index=dataset.index,
columns=dataset.columns)
prince_pca = prince.PCA(n_components=2,
n_iter=3,
rescale_with_mean=True,
rescale_with_std=False,
copy=True,
check_input=True,
engine='auto',
random_state=42)
prince_pca = prince_pca.fit(dataset)
ax = prince_pca.plot_row_coordinates(dataset,
ax=None,
figsize=(10, 10),
x_component=0,
y_component=1,
labels=None,
color_labels=dataset.index,
ellipse_outline=True,
ellipse_fill=True,
show_points=True)
plt.show()
Here's the one I tinkered with to get my circle of correlations:
pcs = prince_pca.column_correlations(dataset)
pcs_0=pcs[0].to_numpy()
pcs_1=pcs[1].to_numpy()
pcs_coord = np.concatenate((pcs_0, pcs_1))
fig, ax = plt.subplots(figsize=(10, 10))  # plt.subplots returns a (figure, axes) pair
plt.xlim(-1,1)
plt.ylim(-1,1)
plt.quiver(np.zeros(pcs_0.shape[0]), np.zeros(pcs_1.shape[0]),
pcs_coord[:4], pcs_coord[4:], angles='xy', scale_units='xy', scale=1, color='r', width= 0.003)
for i, (x, y) in enumerate(zip(pcs_coord[:4], pcs_coord[4:])):
    plt.text(x, y, pcs.index[i], fontsize=12)
circle = plt.Circle((0,0), 1, facecolor='none', edgecolor='b')
plt.gca().add_artist(circle)
plt.plot([-1,1],[0,0],color='silver',linestyle='--',linewidth=1)
plt.plot([0,0],[-1,1],color='silver',linestyle='--',linewidth=1)
plt.title("Correlation circle of variable", fontsize=22)
plt.xlabel('F{} ({}%)'.format(1, round(100*prince_pca.explained_inertia_[0],1)),
fontsize=14)
plt.ylabel('F{} ({}%)'.format(2, round(100*prince_pca.explained_inertia_[1],1)),
fontsize=14)
plt.show()
And finally here is the one that tries to bring together the circle of correlations as well as the main row coordinates graph from the "prince" package:
pcs = prince_pca.column_correlations(dataset)
pcs_0 = pcs[0].to_numpy()
pcs_1 = pcs[1].to_numpy()
pcs_coord = np.concatenate((pcs_0, pcs_1))
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, aspect="equal")
plt.xlim(-1, 1)
plt.ylim(-1, 1)
plt.quiver(np.zeros(pcs_0.shape[0]),
np.zeros(pcs_1.shape[0]),
pcs_coord[:4],
pcs_coord[4:],
angles='xy',
scale_units='xy',
scale=1,
color='r',
width=0.003)
for i, (x, y) in enumerate(zip(pcs_coord[:4], pcs_coord[4:])):
    plt.text(x, y, pcs.index[i], fontsize=12)
plt.scatter(
x=prince_pca.row_coordinates(dataset)[0],
y=prince_pca.row_coordinates(dataset)[1])
circle = plt.Circle((0, 0), 1, facecolor='none', edgecolor='b')
plt.gca().add_artist(circle)
plt.plot([-1, 1], [0, 0], color='silver', linestyle='--', linewidth=1)
plt.plot([0, 0], [-1, 1], color='silver', linestyle='--', linewidth=1)
plt.title("Correlation circle of variable", fontsize=22)
plt.xlabel('F{} ({}%)'.format(1,
round(100 * prince_pca.explained_inertia_[0],
1)),
fontsize=14)
plt.ylabel('F{} ({}%)'.format(2,
round(100 * prince_pca.explained_inertia_[1],
1)),
fontsize=14)
plt.show()
Bonus question: why does the PCA class of sklearn not compute the correct coordinates for my variables when they are centered but not scaled? Is there a way to work around this?
Here is the circle of correlations obtained by creating the pca object with sklearn where the "length" and "margin_low" variables appear as orthogonal:
Here is the correlation matrix demonstrating the negative correlation between the "length" and "margin_low" variables:
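One possible explanation for the bonus question, which the source does not confirm: the rows of sklearn's components_ are unit-length eigenvectors, not correlations. For centered but unscaled data, the correlation between variable j and component k is the loading multiplied by the component's standard deviation and divided by the variable's standard deviation. The sketch below checks this with plain NumPy on synthetic data (an assumption, not the original dataset).

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic stand-in: two anti-correlated columns with very different variances
a = rng.normal(size=500)
X = np.column_stack([10.0 * a + rng.normal(size=500),
                     -0.5 * a + 0.2 * rng.normal(size=500),
                     rng.normal(size=500)])
Xc = X - X.mean(axis=0)                    # centered, not scaled

# PCA via SVD of the centered data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * s                             # row coordinates on each component
eigvals = s**2 / (Xc.shape[0] - 1)         # variance carried by each component
stds = Xc.std(axis=0, ddof=1)              # standard deviation of each variable

# corr(variable j, component k) = V[j, k] * sqrt(eigval_k) / std_j
circle = Vt.T * np.sqrt(eigvals) / stds[:, None]

# Cross-check against correlations computed directly from the data
n_vars = Xc.shape[1]
direct = np.array([[np.corrcoef(Xc[:, j], scores[:, k])[0, 1]
                    for k in range(n_vars)]
                   for j in range(n_vars)])
print(np.allclose(circle, direct))
```

If the raw eigenvectors are plotted instead, variables with very different variances can end up in misleading positions on the circle, which may be what happened in the sklearn plot.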
I managed to mix the two graphs.
Here is the code to display the graph combining the circle of correlations and the scatter with the rows:
import pandas as pd
import prince
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as np
# Import dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
# Preparing the dataset
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
dataset = pd.read_csv(url, names=names)
dataset = dataset.set_index('Class')
# Preprocessing: centered but not scaled
sc = StandardScaler(with_std=False)
dataset = pd.DataFrame(sc.fit_transform(dataset),
index=dataset.index,
columns=dataset.columns)
# PCA setting
prince_pca = prince.PCA(n_components=2,
n_iter=3,
rescale_with_mean=True,
rescale_with_std=False,
copy=True,
check_input=True,
engine='auto',
random_state=42)
# PCA fitting
prince_pca = prince_pca.fit(dataset)
# Component coordinates
pcs = prince_pca.column_correlations(dataset)
# Row coordinates
pca_row_coord = prince_pca.row_coordinates(dataset).to_numpy()
# Preparing the colors for parameter 'c'
colors = dataset.T
# Display row coordinates
ax = prince_pca.plot_row_coordinates(dataset,
figsize=(12, 12),
x_component=0,
y_component=1,
labels=None,
color_labels=dataset.index,
ellipse_outline=True,
ellipse_fill=True,
show_points=True)
# We plot the vectors
plt.quiver(np.zeros(pcs.to_numpy().shape[0]),
np.zeros(pcs.to_numpy().shape[0]),
pcs[0],
pcs[1],
angles='xy',
scale_units='xy',
scale=1,
color='r',
width=0.003)
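# Display the names of the variables that fall inside the plotted area
# (xmin/xmax/ymin/ymax were undefined in the original; the current axis limits are assumed here)
xmin, xmax = ax.get_xlim()
ymin, ymax = ax.get_ylim()
for i, (x, y) in enumerate(zip(pcs[0], pcs[1])):
    if xmin <= x <= xmax and ymin <= y <= ymax:
        plt.text(x,
                 y,
                 pcs.index[i],
                 fontsize=16,
                 ha="center",
                 va="bottom",
                 color="red")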
# Display a circle
circle = plt.Circle((0, 0),
1,
facecolor='none',
edgecolor='orange',
linewidth=1)
plt.gca().add_artist(circle)
# Title
plt.title("Row principal coordinates and circle of correlations", fontsize=22)
# Display the percentage of inertia on each axis
plt.xlabel('F{} ({}%)'.format(1,
round(100 * prince_pca.explained_inertia_[0],
1)),
fontsize=14)
plt.ylabel('F{} ({}%)'.format(2,
round(100 * prince_pca.explained_inertia_[1],
1)),
fontsize=14)
# Display the grid to better read the values of the correlation circle
plt.grid(visible=True)
plt.show()
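The answer above keeps the row coordinates at their natural scale and draws the unit circle inside them. If you would rather shrink the scatter plot to fit the circle, as the question title asks, one option is to divide the row coordinates by their largest distance from the origin. This is a minimal sketch with a hypothetical helper, scale_to_unit_circle, and made-up coordinates; it is not part of the prince API.

```python
import numpy as np


def scale_to_unit_circle(coords):
    """Divide 2-D coordinates by the largest distance from the origin."""
    coords = np.asarray(coords, dtype=float)
    max_radius = np.linalg.norm(coords, axis=1).max()
    if max_radius == 0:
        return coords
    return coords / max_radius


# Made-up row coordinates standing in for prince_pca.row_coordinates(dataset)
rows = np.array([[3.0, 4.0], [-1.5, 2.0], [0.5, -0.5]])
scaled = scale_to_unit_circle(rows)
print(scaled)
```

Because both axes are divided by the same factor, the relative positions of the rows are preserved; only the axis units change, so the tick values no longer read as principal coordinates.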

Is there a library that will help me fit data easily? I found fitter and will provide my code, but it shows some errors

So, here is my code:
import pandas as pd
import scipy.stats as st
import matplotlib.pyplot as plt
from matplotlib.ticker import AutoMinorLocator
from fitter import Fitter, get_common_distributions
df = pd.read_csv("project3.csv")
bins = [282.33, 594.33, 906.33, 1281.33, 15030.33, 1842.33, 2154.33, 2466.33, 2778.33, 3090.33, 3402.33]
#declaring
facecolor = '#EAEAEA'
color_bars = '#3475D0'
txt_color1 = '#252525'
txt_color2 = '#004C74'
fig, ax = plt.subplots(1, figsize=(16, 6), facecolor=facecolor)
ax.set_facecolor(facecolor)
n, bins, patches = plt.hist(df.City1, color=color_bars, bins=10)
#grid
minor_locator = AutoMinorLocator(2)
plt.gca().xaxis.set_minor_locator(minor_locator)
plt.grid(which='minor', color=facecolor, lw = 0.5)
xticks = [(bins[idx+1] + value)/2 for idx, value in enumerate(bins[:-1])]
xticks_labels = [ "{:.0f}-{:.0f}".format(value, bins[idx+1]) for idx, value in enumerate(bins[:-1])]
plt.xticks(xticks, labels=xticks_labels, c=txt_color1, fontsize=13)
#beautify
ax.tick_params(axis='x', which='both',length=0)
plt.yticks([])
ax.spines['bottom'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
for idx, value in enumerate(n):
    if value > 0:
        plt.text(xticks[idx], value+5, int(value), ha='center', fontsize=16, c=txt_color1)
plt.title('Histogram of rainfall in City1\n', loc = 'right', fontsize = 20, c=txt_color1)
plt.xlabel('\nCentimeters of rainfall', c=txt_color2, fontsize=14)
plt.ylabel('Frequency of occurrence', c=txt_color2, fontsize=14)
plt.tight_layout()
#plt.savefig('City1_Raw.png', facecolor=facecolor)
plt.show()
city1 = df['City1'].values
f = Fitter(city1, distributions=get_common_distributions())
f.fit()
fig = f.plot_pdf(names=None, Nbest=4, lw=1, method='sumsquare_error')
plt.show()
print(f.get_best(method = 'sumsquare_error'))
The issue is with the plots it shows. The first histogram it generates is
Next I get another graph with best fitted distributions which is
Then an output statement
{'chi2': {'df': 10.692966790090342, 'loc': 16.690849400411103, 'scale': 118.71595997157786}}
Process finished with exit code 0
I have a couple of questions. Why is chi2, the best fitted distribution not plotted on the graph?
How do I plot these distributions on top of the histograms and not separately? The hist() function in fitter library can do that but there I don't get to control the bins and so I end up getting like 100 bins with some flat looking data.
How do I solve this issue? I need to plot the best-fit curve on top of a histogram that looks like the first image. Can I use any other module/package to do this in a similar way? This uses a least-squares fit, but I am OK with maximum likelihood or log-likelihood too.
A simple way of plotting things on top of each other (using some properties of the Fitter class):
import scipy.stats as st
import matplotlib.pyplot as plt
from fitter import Fitter, get_common_distributions
from scipy import stats
numberofpoints=50000
df = stats.norm.rvs( loc=1090, scale=500, size=numberofpoints)
fig, ax = plt.subplots(1, figsize=(16, 6))
n, bins, patches = ax.hist( df, bins=30, density=True)
f = Fitter(df, distributions=get_common_distributions())
f.fit()
errorlist = sorted(
    [
        [f._fitted_errors[dist], dist]
        for dist in get_common_distributions()
    ]
)[:4]
for err, dist in errorlist:
    ax.plot(f.x, f.fitted_pdf[dist])
plt.show()
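This works because the histogram is normalized with density=True, which puts it on the same scale as the fitted PDFs. To overlay the curves on a raw-count histogram, each PDF would have to be multiplied by the number of data points times the bin width.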
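To keep a raw-count histogram with your own bins, the fitted PDF can be rescaled by the number of observations times the bin width. The sketch below uses scipy.stats directly instead of fitter, with a normal distribution and synthetic data as stand-ins; both are assumptions made only for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(1)
data = rng.normal(loc=1090, scale=500, size=5000)   # synthetic stand-in for df['City1']

fig, ax = plt.subplots(figsize=(16, 6))
counts, edges, _ = ax.hist(data, bins=10)           # raw counts, 10 equal-width bins

# Maximum-likelihood fit of a normal distribution
loc, scale = st.norm.fit(data)

# Expected count per bin is roughly pdf * N * bin_width, so scale the curve accordingly
bin_width = edges[1] - edges[0]
x = np.linspace(edges[0], edges[-1], 400)
ax.plot(x, st.norm.pdf(x, loc, scale) * data.size * bin_width, lw=2)
plt.close(fig)
```

The same rescaling applies to any scipy.stats distribution, for example st.chi2 with the parameters returned by f.get_best().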

'numpy.ndarray' object has no attribute 'set_xlabel'

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#reading data
data = pd.read_csv('Malicious_or_criminal_attacks_breakdown-Top_five_industry_sectors_July-Dec-2019.csv',index_col=0,engine='python')
df = pd.DataFrame(data)
#df list for data
df.values.tolist()
#construction of group bar chart
labels = ('Cyber incident', 'Theft of paperwork or data storagedevice', 'Rogue employee', 'Social engineering / impersonation')
colors = ['red', 'yellow', 'blue', 'green']
data = df.values.tolist()
arr = np.array(data)
n_groups, n_colors = arr.shape
width = 0.2
x_pos = np.arange(n_colors)
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(14, 5), dpi=100)
for i in range(n_groups):
    plt.bar(x_pos + i*width, arr[i, :], width, align='center', label=labels[i], color=colors[i])
ax.set_xlabel("the top five industry sectors")
ax.set_ylabel("Number of attack")
ax.set_title("Type of attack by top five industry sectors")
ax.set_xticks(x_pos+width/2)
ax.set_xticklabels(colors)
ax.legend()
Can anyone tell me what I'm doing wrong here and why numpy isn't working as expected? I've looked at the documentation for hours and can't figure out what's wrong.
ax is an array of Axes objects because you created more than one subplot, so it has no set_xlabel method of its own. To set the labels and titles of the subplots, you need to iterate over them as well. You could fix this fairly easily like so:
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(14, 5), dpi=100)
for i in range(n_groups):
    # note: plt.bar draws on the current Axes, which is the last subplot created
    plt.bar(x_pos + i*width, arr[i, :], width, align='center', label=labels[i], color=colors[i])
for subplot in ax:
    subplot.set_xlabel("the top five industry sectors")
    subplot.set_ylabel("Number of attack")
    subplot.set_title("Type of attack by top five industry sectors")
    subplot.set_xticks(x_pos+width/2)
    subplot.set_xticklabels(colors)
    subplot.legend()
If you want to set different titles and labels for different subplots, you need to handle each subplot individually.
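A minimal sketch of handling each subplot individually: draw on each Axes through its own methods and pair it with its own title. The values and titles below are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

values = np.array([[5, 3, 4],
                   [2, 6, 1]])                 # one row of made-up bar heights per subplot
titles = ["First panel", "Second panel"]       # hypothetical titles
x_pos = np.arange(values.shape[1])

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(14, 5))
for ax, row, title in zip(axes, values, titles):
    ax.bar(x_pos, row, width=0.4)              # ax.bar targets this Axes, unlike plt.bar
    ax.set_title(title)
    ax.set_xticks(x_pos)

shown_titles = [ax.get_title() for ax in axes]
bars_in_first = len(axes[0].patches)
plt.close(fig)
```

With nrows=1 and ncols=1, plt.subplots returns a single Axes rather than an array; passing squeeze=False always yields a 2-D array if uniform handling is needed.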

Why doesn't this plot get bigger as expected?

From a previous question, I learned that plt.figure(figsize = 2 * np.array(plt.rcParams['figure.figsize'])) doubles the plot size. With the code below, I want to plot 4 subplots in a 2x2 grid.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
%config InlineBackend.figure_format = 'svg' # Change the image format to svg for better quality
don = pd.read_csv('https://raw.githubusercontent.com/leanhdung1994/Deep-Learning/main/donclassif.txt.gz', sep=';')
fig, ax = plt.subplots(nrows = 2, ncols = 2)
plt.figure(figsize = 2 * np.array(plt.rcParams['figure.figsize'])) # This is to have bigger plot
for row in ax:
    for col in row:
        kmeans = KMeans(n_clusters = 4)
        kmeans.fit(don)
        y_kmeans = kmeans.predict(don)
        col.scatter(don['V1'], don['V2'], c = y_kmeans, cmap = 'viridis')
        centers = kmeans.cluster_centers_
        col.scatter(centers[:, 0], centers[:, 1], c = 'red', s = 200, alpha = 0.5)
plt.show()
Could you please explain why plt.figure(figsize = 2 * np.array(plt.rcParams['figure.figsize'])) does not work in this case?
I am posting @JohanC's comment as an answer to remove this question from the unanswered list.
It could be written as fig, axes = plt.subplots(nrows=2, ncols=2, figsize=2 * np.array(plt.rcParams['figure.figsize'])). Calling plt.figure without storing the result just creates a new, empty figure. It does not change fig, and no axes are created on that new figure, so it won't have the desired result.
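A short sketch of the two ways to get the larger figure: pass figsize when the figure and axes are created, or resize the existing figure through its handle with set_size_inches. The variable names are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

default_size = np.array(plt.rcParams["figure.figsize"])

# Option 1: pass figsize when the figure and its axes are created
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=2 * default_size)

# Option 2: resize an existing figure through the handle you already hold
fig2, axes2 = plt.subplots(nrows=2, ncols=2)
fig2.set_size_inches(2 * default_size)

size1 = fig.get_size_inches()
size2 = fig2.get_size_inches()
plt.close("all")
```

In both cases the size change is applied to the figure that owns the four axes, which is what the stray plt.figure call in the question fails to do.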

How to draw vertical average lines for overlapping histograms in a loop

I'm trying to use matplotlib to draw a vertical line at the mean of each of two overlapping histograms, using a loop. I have managed to draw the first one, but I don't know how to draw the second one. I'm using two variables from a dataset to draw the histograms. One variable (feat) is categorical (0 or 1), and the other one (objective) is numerical. The code is the following:
for chas in df[feat].unique():
    plt.hist(df.loc[df[feat] == chas, objective], bins = 15, alpha = 0.5, density = True, label = chas)
plt.axvline(df[objective].mean(), linestyle = 'dashed', linewidth = 2)
plt.title(objective)
plt.legend(loc = 'upper right')
I also have to add to the legend the mean and standard deviation values for each histogram.
How can I do it? Thank you in advance.
I recommend using Axes methods to plot your figure. Please see the code below and the matplotlib artist tutorial.
import numpy as np
import matplotlib.pyplot as plt
# Fixing random state for reproducibility
np.random.seed(19680801)
mu1, sigma1 = 100, 8
mu2, sigma2 = 150, 15
x1 = mu1 + sigma1 * np.random.randn(10000)
x2 = mu2 + sigma2 * np.random.randn(10000)
fig, ax = plt.subplots(1, 1, figsize=(7.2, 7.2))
# the histogram of the data
lbs = ['a', 'b']
colors = ['r', 'g']
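for i, x in enumerate([x1, x2]):
    n, bins, patches = ax.hist(x, 50, density=True, facecolor=colors[i], alpha=0.75, label=lbs[i])
    # draw the line at the mean of the data (bins.mean() would only average the bin edges)
    ax.axvline(x.mean(), color=colors[i], linestyle='dashed', linewidth=2)
ax.legend()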
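The question also asks for the mean and standard deviation of each histogram in the legend. A minimal sketch of one way to do that is to compute both per group and put them in the label string. The group names, colors and data below are made up and stand in for df[feat] and df[objective].

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(19680801)
# Made-up groups standing in for the two categories of df[feat]
groups = {"0": rng.normal(20, 5, 500), "1": rng.normal(30, 8, 500)}
colors = ["tab:blue", "tab:orange"]

fig, ax = plt.subplots(figsize=(7.2, 7.2))
for color, (name, values) in zip(colors, groups.items()):
    mean, std = values.mean(), values.std(ddof=1)   # sample standard deviation
    label = f"{name}: mean={mean:.2f}, std={std:.2f}"
    ax.hist(values, bins=15, alpha=0.5, density=True, color=color, label=label)
    # One dashed line per group, drawn at that group's own mean
    ax.axvline(mean, color=color, linestyle="dashed", linewidth=2)
legend = ax.legend(loc="upper right")

legend_labels = [text.get_text() for text in legend.get_texts()]
line_positions = [line.get_xdata()[0] for line in ax.lines]
plt.close(fig)
```

With a DataFrame, the loop would iterate over df[feat].unique() and select df.loc[df[feat] == chas, objective] for each group, as in the question.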
