I have a dataset whose variables all share the same unit of measurement. Before running my PCA, I centered the data using sklearn.preprocessing.StandardScaler(with_std=False).
I don't understand why, but when I use sklearn.decomposition.PCA.fit_transform(<my_dataframe>) and display a correlation circle, I get two perfectly represented orthogonal variables, which suggests that they are uncorrelated, but they are not. The correlation matrix clearly shows that they are anti-correlated.
After some research I came across the "prince" package, which gives the correct coordinates for my centered but unscaled variables.
When I run my PCA with it, I can display the projection of my rows correctly. It also has the advantage of being able to draw ellipses. The only problem is that it has no function for a biplot.
I managed to display a correlation circle by using the column_correlations() method to get the coordinates of the variables. After some tinkering, here is what I managed to get:
When I try to put my two graphs together to form a biplot, the scatter plot is drawn on a scale that is far too large compared to the correlation circle.
I would just like to merge the two charts using this package.
Here is the code that produced the graph showing the row principal coordinates:
Note: to provide a reproducible example I use the iris dataset, which has a similar shape to my own dataset.
import pandas as pd
import prince
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as np
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
dataset = pd.read_csv(url, names=names)
dataset = dataset.set_index('Class')
sc = StandardScaler(with_std=False)
dataset = pd.DataFrame(sc.fit_transform(dataset),
index=dataset.index,
columns=dataset.columns)
prince_pca = prince.PCA(n_components=2,
n_iter=3,
rescale_with_mean=True,
rescale_with_std=False,
copy=True,
check_input=True,
engine='auto',
random_state=42)
prince_pca = prince_pca.fit(dataset)
ax = prince_pca.plot_row_coordinates(dataset,
ax=None,
figsize=(10, 10),
x_component=0,
y_component=1,
labels=None,
color_labels=dataset.index,
ellipse_outline=True,
ellipse_fill=True,
show_points=True)
plt.show()
Here's the one I tinkered with to get my circle of correlations:
pcs = prince_pca.column_correlations(dataset)
pcs_0=pcs[0].to_numpy()
pcs_1=pcs[1].to_numpy()
pcs_coord = np.concatenate((pcs_0, pcs_1))
fig, ax = plt.subplots(figsize=(10,10))
plt.xlim(-1,1)
plt.ylim(-1,1)
plt.quiver(np.zeros(pcs_0.shape[0]), np.zeros(pcs_1.shape[0]),
pcs_coord[:4], pcs_coord[4:], angles='xy', scale_units='xy', scale=1, color='r', width= 0.003)
for i, (x, y) in enumerate(zip(pcs_coord[:4], pcs_coord[4:])):
    plt.text(x, y, pcs.index[i], fontsize=12)
circle = plt.Circle((0,0), 1, facecolor='none', edgecolor='b')
plt.gca().add_artist(circle)
plt.plot([-1,1],[0,0],color='silver',linestyle='--',linewidth=1)
plt.plot([0,0],[-1,1],color='silver',linestyle='--',linewidth=1)
plt.title("Correlation circle of variable", fontsize=22)
plt.xlabel('F{} ({}%)'.format(1, round(100*prince_pca.explained_inertia_[0],1)),
fontsize=14)
plt.ylabel('F{} ({}%)'.format(2, round(100*prince_pca.explained_inertia_[1],1)),
fontsize=14)
plt.show()
And finally, here is the code that tries to combine the correlation circle with the row principal coordinates graph from the "prince" package:
pcs = prince_pca.column_correlations(dataset)
pcs_0 = pcs[0].to_numpy()
pcs_1 = pcs[1].to_numpy()
pcs_coord = np.concatenate((pcs_0, pcs_1))
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, aspect="equal")
plt.xlim(-1, 1)
plt.ylim(-1, 1)
plt.quiver(np.zeros(pcs_0.shape[0]),
np.zeros(pcs_1.shape[0]),
pcs_coord[:4],
pcs_coord[4:],
angles='xy',
scale_units='xy',
scale=1,
color='r',
width=0.003)
for i, (x, y) in enumerate(zip(pcs_coord[:4], pcs_coord[4:])):
    plt.text(x, y, pcs.index[i], fontsize=12)
plt.scatter(
x=prince_pca.row_coordinates(dataset)[0],
y=prince_pca.row_coordinates(dataset)[1])
circle = plt.Circle((0, 0), 1, facecolor='none', edgecolor='b')
plt.gca().add_artist(circle)
plt.plot([-1, 1], [0, 0], color='silver', linestyle='--', linewidth=1)
plt.plot([0, 0], [-1, 1], color='silver', linestyle='--', linewidth=1)
plt.title("Correlation circle of variable", fontsize=22)
plt.xlabel('F{} ({}%)'.format(1,
round(100 * prince_pca.explained_inertia_[0],
1)),
fontsize=14)
plt.ylabel('F{} ({}%)'.format(2,
round(100 * prince_pca.explained_inertia_[1],
1)),
fontsize=14)
plt.show()
Bonus question: why does the PCA class of sklearn not compute the correct coordinates for my variables when they are centered but not scaled? Is there a way to work around this?
Here is the circle of correlations obtained by creating the pca object with sklearn where the "length" and "margin_low" variables appear as orthogonal:
Here is the correlation matrix demonstrating the negative correlation between the "length" and "margin_low" variables:
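On the bonus question, one possible explanation (an assumption, since the sklearn plotting code is not shown): if the arrows are drawn directly from pca.components_, they are unit-length axis directions, not correlations. For centered but unscaled data, each loading has to be multiplied by the standard deviation of the component and divided by the standard deviation of the variable to get a correlation. Below is a minimal NumPy-only sketch of that formula; the function name is hypothetical, and the comments indicate which sklearn attributes play the same role.

```python
import numpy as np

def variable_component_correlations(X, n_components=2):
    """Correlation between each original variable and each principal component.

    Hypothetical helper: X is centered but NOT scaled, as in the question.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)  # center only
    # SVD of the centered data; the rows of Vt are the principal axes
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]  # same role as sklearn's pca.components_
    # Variance carried by each component (same role as pca.explained_variance_)
    explained_variance = S[:n_components] ** 2 / (X.shape[0] - 1)
    std = Xc.std(axis=0, ddof=1)  # per-variable standard deviation
    # corr(x_j, PC_k) = v_kj * sqrt(lambda_k) / sigma_j
    return components.T * np.sqrt(explained_variance) / std[:, None]
```

Each row of the returned array holds the coordinates of one variable in the correlation circle. With a fitted sklearn PCA, the same expression would be pca.components_.T * np.sqrt(pca.explained_variance_) / std[:, None].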
I managed to combine the two graphs.
Here is the code that displays the correlation circle together with the scatter plot of the rows:
import pandas as pd
import prince
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as np
# Import dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
# Preparing the dataset
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
dataset = pd.read_csv(url, names=names)
dataset = dataset.set_index('Class')
# Preprocessing: centered but not scaled
sc = StandardScaler(with_std=False)
dataset = pd.DataFrame(sc.fit_transform(dataset),
index=dataset.index,
columns=dataset.columns)
# PCA setting
prince_pca = prince.PCA(n_components=2,
n_iter=3,
rescale_with_mean=True,
rescale_with_std=False,
copy=True,
check_input=True,
engine='auto',
random_state=42)
# PCA fitting
prince_pca = prince_pca.fit(dataset)
# Component coordinates
pcs = prince_pca.column_correlations(dataset)
# Row coordinates
pca_row_coord = prince_pca.row_coordinates(dataset).to_numpy()
# Preparing the colors for parameter 'c'
colors = dataset.T
# Display row coordinates
ax = prince_pca.plot_row_coordinates(dataset,
figsize=(12, 12),
x_component=0,
y_component=1,
labels=None,
color_labels=dataset.index,
ellipse_outline=True,
ellipse_fill=True,
show_points=True)
# We plot the vectors
plt.quiver(np.zeros(pcs.to_numpy().shape[0]),
np.zeros(pcs.to_numpy().shape[0]),
pcs[0],
pcs[1],
angles='xy',
scale_units='xy',
scale=1,
color='r',
width=0.003)
# Display the names of the variables that fall inside the axis limits
# (assumption: xmin/xmax/ymin/ymax are meant to be the current limits of the plot)
xmin, xmax = ax.get_xlim()
ymin, ymax = ax.get_ylim()
for i, (x, y) in enumerate(zip(pcs[0], pcs[1])):
    if xmin <= x <= xmax and ymin <= y <= ymax:
        plt.text(x,
                 y,
                 pcs.index[i],
                 fontsize=16,
                 ha="center",
                 va="bottom",
                 color="red")
# Display a circle
circle = plt.Circle((0, 0),
1,
facecolor='none',
edgecolor='orange',
linewidth=1)
plt.gca().add_artist(circle)
# Title
plt.title("Row principal coordinates and circle of correlations", fontsize=22)
# Display the percentage of inertia on each axis
plt.xlabel('F{} ({}%)'.format(1,
round(100 * prince_pca.explained_inertia_[0],
1)),
fontsize=14)
plt.ylabel('F{} ({}%)'.format(2,
round(100 * prince_pca.explained_inertia_[1],
1)),
fontsize=14)
# Display the grid to better read the values of the circle of correlations
plt.grid(visible=True)
plt.show()
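The code above draws the arrows in the units of the row coordinates, which works when those coordinates are of the same order as 1. If the rows span a much larger range, as described in the question, one alternative is to shrink the row coordinates uniformly so that they fit inside the unit circle before plotting them with the arrows. This is only a sketch: the helper name and the margin parameter are hypothetical, and the rescaled positions no longer carry their original units.

```python
import numpy as np

def scale_rows_to_unit_circle(row_coords, margin=0.95):
    """Shrink row coordinates so the farthest point lies at radius `margin`.

    Returns the scaled coordinates and the factor that was applied.
    """
    row_coords = np.asarray(row_coords, dtype=float)
    # Distance of every row from the origin in the plane of the first two components
    radii = np.linalg.norm(row_coords[:, :2], axis=1)
    max_radius = radii.max()
    factor = margin / max_radius if max_radius > 0 else 1.0
    return row_coords * factor, factor
```

With the objects defined above, this could be used as scaled, factor = scale_rows_to_unit_circle(pca_row_coord), followed by plt.scatter(scaled[:, 0], scaled[:, 1]) on the same axes as the circle.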
I'm trying to use matplotlib to draw a vertical line at the mean of each of two overlapping histograms, using a loop. I have managed to draw the first line, but I don't know how to draw the second one. I'm using two variables from a dataset to draw the histograms. One variable (feat) is categorical (0 or 1), and the other one (objective) is numerical. The code is the following:
for chas in df[feat].unique():
    plt.hist(df.loc[df[feat] == chas, objective], bins = 15, alpha = 0.5, density = True, label = chas)
    plt.axvline(df[objective].mean(), linestyle = 'dashed', linewidth = 2)
plt.title(objective)
plt.legend(loc = 'upper right')
I also have to add to the legend the mean and standard deviation values for each histogram.
How can I do it? Thank you in advance.
I recommend using the Axes interface to plot your figure. Please see the code below and the artist tutorial here.
import numpy as np
import matplotlib.pyplot as plt
# Fixing random state for reproducibility
np.random.seed(19680801)
mu1, sigma1 = 100, 8
mu2, sigma2 = 150, 15
x1 = mu1 + sigma1 * np.random.randn(10000)
x2 = mu2 + sigma2 * np.random.randn(10000)
fig, ax = plt.subplots(1, 1, figsize=(7.2, 7.2))
# the histogram of the data
lbs = ['a', 'b']
colors = ['r', 'g']
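for i, x in enumerate([x1, x2]):
    n, bins, patches = ax.hist(x, 50, density=True, facecolor=colors[i], alpha=0.75, label=lbs[i])
    # Mark the sample mean of this group (bins.mean() would only give the midpoint of the bin edges)
    ax.axvline(x.mean(), color=colors[i], linestyle='dashed')
ax.legend()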
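The question also asks for the mean and standard deviation of each histogram in the legend. One way to do this, sketched below, is to build the legend label from the statistics of each group. The helper names (stats_label, plot_group_histograms) and the dictionary input are assumptions made for this illustration, not part of the original code.

```python
import numpy as np
import matplotlib.pyplot as plt

def stats_label(name, values):
    """Legend text with the mean and sample standard deviation of one group."""
    values = np.asarray(values, dtype=float)
    return f"{name}: mean={values.mean():.2f}, std={values.std(ddof=1):.2f}"

def plot_group_histograms(groups, bins=15):
    """Overlay one histogram per group and draw a dashed line at each group mean.

    `groups` maps a group name to its numerical values.
    """
    fig, ax = plt.subplots()
    for name, values in groups.items():
        # hist returns (counts, bin_edges, patches); reuse the bar color for the mean line
        _, _, patches = ax.hist(values, bins=bins, density=True, alpha=0.5,
                                label=stats_label(name, values))
        ax.axvline(np.mean(values), color=patches[0].get_facecolor(),
                   linestyle="dashed", linewidth=2)
    ax.legend(loc="upper right")
    return fig, ax
```

With the DataFrame from the question, the input could be built as {chas: df.loc[df[feat] == chas, objective] for chas in df[feat].unique()}; call plt.show() afterwards to display the figure.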
I am plotting the following data:
fig, ax = plt.subplots()
im = ax.scatter(std_sorted[:, [1]], std_sorted[:, [2]], s=5, c=std_sorted[:, [0]])
With the following result:
My question is: can I fill the space between the points in the plot by interpolating, and then color that interpolated space accordingly, so that I get a uniform plot without any visible points?
So basically I'm looking for this result (This is simply me "squeezing" the above picture to show the desired result and not dealing with the space between the points):
The simplest thing to do in this case is to use a short vertical line as a marker and set the marker size large enough that no white space is left.
Another option is to use tricontourf to create a filled image from x, y and z.
Note that neither a scatter plot nor tricontourf needs the points to be sorted in any order.
If your points are sorted into an ordered grid, plt.imshow should give the best result.
Here is some code to show what it could look like. First, some dummy data loosely resembling the example are generated. As x and y are random, they don't fill the complete space, which might leave some blank spots in the scatter plot. tricontourf interpolates these spots nicely, except possibly in the corners.
import numpy as np
import matplotlib.pyplot as plt
N = 50000
xmin = 0
xmax = 0.20
ymin = -0.01
ymax = 0.01
std_sorted = np.zeros((N, 3))
std_sorted[:,1] = np.random.uniform(xmin, xmax, N)
std_sorted[:,2] = np.random.choice(np.linspace(ymin, ymax, 80), N)
std_sorted[:,0] = np.cos(3*(std_sorted[:,1] - 0.04 - 100*std_sorted[:,2]**2))**10
fig, ax = plt.subplots(ncols=2)
# im = ax[0].scatter(std_sorted[:, 1], std_sorted[:, 2], s=20, c=std_sorted[:, 0], marker='|')
im = ax[0].scatter(std_sorted[:, 1], std_sorted[:, 2], s=5, c=std_sorted[:, 0], marker='.')
ax[0].set_xlim(xmin, xmax)
ax[0].set_ylim(ymin, ymax)
ax[0].set_title("scatter plot")
ax[1].tricontourf(std_sorted[:, 1], std_sorted[:, 2], std_sorted[:, 0], 256)
ax[1].set_title("tricontourf")
plt.tight_layout()
plt.show()
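For the imshow option mentioned above, the points first have to be arranged into a 2D array. The following is a minimal sketch that assumes the samples form a complete regular grid, with every (x, y) combination present exactly once; the function name is hypothetical. It does not apply to the random dummy data generated above, where x is drawn uniformly.

```python
import numpy as np

def grid_from_points(x, y, z):
    """Arrange samples of a complete regular grid into a 2D array for imshow.

    Returns the array (rows follow y, columns follow x) and an extent tuple.
    """
    x, y, z = (np.asarray(a, dtype=float) for a in (x, y, z))
    # Unique coordinate values and, for every sample, its column/row index
    xs, ix = np.unique(x, return_inverse=True)
    ys, iy = np.unique(y, return_inverse=True)
    if xs.size * ys.size != z.size:
        raise ValueError("points do not form a complete regular grid")
    grid = np.full((ys.size, xs.size), np.nan)
    grid[iy, ix] = z
    extent = (xs[0], xs[-1], ys[0], ys[-1])
    return grid, extent
```

The result can then be drawn with ax.imshow(grid, extent=extent, origin="lower", aspect="auto"). The extent used here runs from the first to the last sample, so the outer pixels are slightly compressed; widening it by half a grid step on each side would center the pixels on the samples.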
I have a dataset like
x = 3,4,6,77,3
y = 8,5,2,5,5
labels = "null","exit","power","smile","null"
Then I use
from matplotlib import pyplot as plt
plt.scatter(x,y)
colorbar = plt.colorbar(labels)
plt.show()
to make a scatter plot, but I cannot make the colorbar show the labels as its colors.
How can I do this?
I'm not sure whether this is a good idea for scatter plots in general (several data points share the same label, so a legend might be a better fit), but a specific solution to what you have in mind might be the following:
from matplotlib import pyplot as plt
# Data
x = [3, 4, 6, 77, 3]
y = [8, 5, 2, 5, 5]
labels = ('null', 'exit', 'power', 'smile', 'null')
# Customize colormap and scatter plot
cm = plt.get_cmap('hsv')  # plt.cm.get_cmap was removed in recent Matplotlib releases
sc = plt.scatter(x, y, c=range(5), cmap=cm)
cbar = plt.colorbar(sc, ticks=range(5))
cbar.ax.set_yticklabels(labels)
plt.show()
This will result in such an output:
The code combines this Matplotlib demo and this SO answer.
Hope that helps!
EDIT: Incorporating the comments, I can only think of some kind of label color dictionary, generating a custom colormap from the colors, and before plotting explicitly grabbing the proper color indices from the labels.
Here's the updated code (I added some additional colors and data points to check scalability):
from matplotlib import pyplot as plt
from matplotlib.colors import LinearSegmentedColormap
import numpy as np
# Color information; create custom colormap
label_color_dict = {'null': '#FF0000',
'exit': '#00FF00',
'power': '#0000FF',
'smile': '#FF00FF',
'addon': '#AAAAAA',
'addon2': '#444444'}
all_labels = list(label_color_dict.keys())
all_colors = list(label_color_dict.values())
n_colors = len(all_colors)
cm = LinearSegmentedColormap.from_list('custom_colormap', all_colors, N=n_colors)
# Data
x = [3, 4, 6, 77, 3, 10, 40]
y = [8, 5, 2, 5, 5, 4, 7]
labels = ('null', 'exit', 'power', 'smile', 'null', 'addon', 'addon2')
# Get indices from color list for given labels
color_idx = [all_colors.index(label_color_dict[label]) for label in labels]
# Customize colorbar and plot
sc = plt.scatter(x, y, c=color_idx, cmap=cm)
c_ticks = np.arange(n_colors) * (n_colors / (n_colors + 1)) + (2 / n_colors)
cbar = plt.colorbar(sc, ticks=c_ticks)
cbar.ax.set_yticklabels(all_labels)
plt.show()
And, the new output:
The computation of the midpoint of each color segment is still not quite right, but I'll leave this optimization to you.
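One way to avoid computing the segment midpoints by hand is to give every label its own integer bin with a BoundaryNorm, so that the ticks can simply be placed at the integers. This is a sketch of that alternative, not the code from the answer above; it swaps LinearSegmentedColormap for a ListedColormap and reuses the same example colors and data.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import BoundaryNorm, ListedColormap

# Same label-to-color mapping as in the example above
label_color_dict = {'null': '#FF0000',
                    'exit': '#00FF00',
                    'power': '#0000FF',
                    'smile': '#FF00FF',
                    'addon': '#AAAAAA',
                    'addon2': '#444444'}
all_labels = list(label_color_dict)
n_colors = len(all_labels)

# One discrete color per label; bin i covers the interval [i - 0.5, i + 0.5)
cmap = ListedColormap(list(label_color_dict.values()))
norm = BoundaryNorm(np.arange(n_colors + 1) - 0.5, n_colors)

# Data
x = [3, 4, 6, 77, 3, 10, 40]
y = [8, 5, 2, 5, 5, 4, 7]
labels = ('null', 'exit', 'power', 'smile', 'null', 'addon', 'addon2')

# Integer index of each point's label
color_idx = [all_labels.index(label) for label in labels]

fig, ax = plt.subplots()
sc = ax.scatter(x, y, c=color_idx, cmap=cmap, norm=norm)
# Integer ticks fall in the middle of each color segment
cbar = fig.colorbar(sc, ax=ax, ticks=np.arange(n_colors))
cbar.ax.set_yticklabels(all_labels)
```

Call plt.show() afterwards to display the figure. Because the bins are centered on the integers, the tick positions stay correct when labels are added to or removed from the dictionary.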
I am trying to set the size of the points in accordance with the value of a column that represents their labels but I am getting an empty plot.
Moreover I wonder how I can set the size of the points uniformly (i.e. regardless of the value of the third column).
For a reproducible example:
plot_data.to_json()
'{"x1":{"0":-0.2019455769,"1":0.1350610218,"2":-0.1128417956,"3":-0.1481016799,"4":0.1293273221,"5":-0.0266437776,"6":0.0100572041,"7":0.0037355635,"8":-0.0203400136,"9":0.1363267107},"x2":{"0":-0.1938001473,"1":-0.1353617432,"2":-0.0381057072,"3":-0.0874488661,"4":-0.2152329772,"5":0.0275324833,"6":-0.174604808,"7":-0.1872132566,"8":0.1172552524,"9":0.0166454137},"label":{"0":1,"1":0,"2":1,"3":0,"4":0,"5":1,"6":0,"7":0,"8":1,"9":0}}'
plt.figure(figsize = (20, 10))
sns.scatterplot(x ='x1', y='x2', hue = 'label', size = 'label', sizes = {0:1, 1:3} , data = plot_data)
plt.axis('equal')
plt.show()
Your code was quite close: the sizes were just too small to make the points easily visible. Here is the code with sizes=(40, 40), which makes the minimum and maximum size the same (see docs) and gives a uniform point size:
import io
import pandas as pd, seaborn as sns, matplotlib.pyplot as plt
# Wrap the JSON text in StringIO: recent pandas versions deprecate passing a literal JSON string
plot_data = pd.read_json(io.StringIO('{"x1":{"0":-0.2019455769,"1":0.1350610218,"2":-0.1128417956,"3":-0.1481016799,"4":0.1293273221,"5":-0.0266437776,"6":0.0100572041,"7":0.0037355635,"8":-0.0203400136,"9":0.1363267107},"x2":{"0":-0.1938001473,"1":-0.1353617432,"2":-0.0381057072,"3":-0.0874488661,"4":-0.2152329772,"5":0.0275324833,"6":-0.174604808,"7":-0.1872132566,"8":0.1172552524,"9":0.0166454137},"label":{"0":1,"1":0,"2":1,"3":0,"4":0,"5":1,"6":0,"7":0,"8":1,"9":0}}'))
plt.figure(figsize = (10, 5))
sns.scatterplot(x='x1', y='x2', hue='label', size='label', sizes=(40, 40),
data=plot_data)
plt.axis('equal')
plt.show()
Here is the result: