Get centroid of scatter plot - python-3.x

I have generated this scatter plot via the plotting of the first two PCA elements from a feature extraction...PCA1 and PCA2.
The plot shown above is for 3 classes, with PCA1 on the x-axis and PCA2 on the y-axis. I generated the plot as follows:
target_names = ['class_1', 'class_2', 'class_3']
plt.figure(figsize=(11, 8))
Xt = pca.fit_transform(X)
plot = plt.scatter(Xt[:,0], Xt[:,1], c=y, cmap=plt.cm.jet,
s=30, linewidths=0, alpha=0.7)
#centers = kmeans.cluster_centers_
#plt.scatter(centers[:, 0], centers[:, 1], c=['black', 'green', 'red'], marker='^', s=100, #alpha=0.5);
plt.legend(handles=plot.legend_elements()[0], labels=list(target_names))
plt.show()
I wanted to know how to correctly get the centroid of each of the classes from the plot.
Here are the first few rows of the data:
Xt1 Xt2 y
-107.988187 -23.70121 1
-128.578852 -20.222378 1
-124.522967 -25.298283 1
-96.222918 -25.028239 1
-95.152954 -23.94496 1
-113.275804 -26.563129 1
-101.803 -24.22359 1
-94.662469 -22.94211 1
-104.118882 -24.037226 1
439.765098 -101.532469 2
50.100362 -34.278841 2
-69.229603 62.178599 2
-60.915475 53.296491 2
64.797364 91.991527 2
-112.815192 0.263505 0
-91.287067 -25.207217 0
-74.181941 -2.457892 0
-83.273718 -0.608004 0
-100.881393 -22.387571 0
-107.861711 -15.848869 0
-85.866992 -18.79126 0
-53.96314 -28.885316 0
-59.195432 -3.373361 0
Any help will be greatly appreciated.

Assuming that y is an array of labels corresponding to the rows of X (and therefore Xt), we can create a data frame with Xt[:, :2] and y and then use groupby('y') to aggregate the mean values for Xt[:, 0] and Xt[:, 1] for each value of y:
import pandas as pd
df = pd.DataFrame(Xt[:, :2], columns=['Xt1', 'Xt2'])
df['y'] = y
df.groupby('y').mean()
This will produce the means of Xt[:, 0] and Xt[:, 1] for each label in y, which are the centroid coordinates of each label in y in the first two principal components of the data.
With the snippet of data that the OP provided, the following script computes the centroids and overlays them on the plot as 'X's of the same color as the data:
df = pd.DataFrame(Xt[:, :2], columns=['Xt1', 'Xt2'])
df['y'] = y
df_centroid = df.groupby('y').mean().reset_index()
target_names = ['class_1', 'class_2', 'class_3']
plt.figure(figsize=(11, 8))
plot = plt.scatter(Xt[:, 0], Xt[:, 1], c=y, cmap=plt.cm.jet,
s=30, linewidths=0, alpha=0.5)
# Overlays the centroids on the plot as 'X'
plt.scatter(df_centroid.Xt1, df_centroid.Xt2, marker='x', s=60,
c=df_centroid.y, cmap=plt.cm.jet)
plt.legend(handles=plot.legend_elements()[0], labels=list(target_names))
plt.show()
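If pandas is not otherwise needed, the same per-class means can be sketched with NumPy alone. The helper name class_centroids is hypothetical, and the array below holds only a handful of rows copied from the question's sample, so the numbers are illustrative rather than the real centroids of the full data set.

```python
import numpy as np

# A few rows copied from the question's sample (columns: Xt1, Xt2).
Xt = np.array([
    [-107.988187, -23.701210],
    [-128.578852, -20.222378],
    [439.765098, -101.532469],
    [50.100362, -34.278841],
    [-112.815192, 0.263505],
    [-91.287067, -25.207217],
])
y = np.array([1, 1, 2, 2, 0, 0])


def class_centroids(points, labels):
    """Return a dict mapping each label to the mean of its points (hypothetical helper)."""
    return {int(label): points[labels == label].mean(axis=0)
            for label in np.unique(labels)}


centroids = class_centroids(Xt, y)
for label, center in centroids.items():
    print(label, center)
```

Each value in the dictionary is a length-2 array (mean of Xt1, mean of Xt2), which can be passed to plt.scatter in the same way as df_centroid above.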

Related

How to draw vertical average lines for overlapping histograms in a loop

I'm trying to draw with matplotlib two average vertical line for every overlapping histograms using a loop. I have managed to draw the first one, but I don't know how to draw the second one. I'm using two variables from a dataset to draw the histograms. One variable (feat) is categorical (0 - 1), and the other one (objective) is numerical. The code is the following:
for chas in df[feat].unique():
    plt.hist(df.loc[df[feat] == chas, objective], bins = 15, alpha = 0.5, density = True, label = chas)
    plt.axvline(df[objective].mean(), linestyle = 'dashed', linewidth = 2)
plt.title(objective)
plt.legend(loc = 'upper right')
I also have to add to the legend the mean and standard deviation values for each histogram.
How can I do it? Thank you in advance.
I recommend using an Axes object to plot your figure. Please see the code below and the matplotlib artist tutorial.
import numpy as np
import matplotlib.pyplot as plt
# Fixing random state for reproducibility
np.random.seed(19680801)
mu1, sigma1 = 100, 8
mu2, sigma2 = 150, 15
x1 = mu1 + sigma1 * np.random.randn(10000)
x2 = mu2 + sigma2 * np.random.randn(10000)
fig, ax = plt.subplots(1, 1, figsize=(7.2, 7.2))
# the histogram of the data
lbs = ['a', 'b']
colors = ['r', 'g']
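for i, x in enumerate([x1, x2]):
    n, bins, patches = ax.hist(x, 50, density=True, facecolor=colors[i], alpha=0.75, label=lbs[i])
    # draw the mean of the data itself (bins.mean() would only give the midpoint of the bin edges)
    ax.axvline(x.mean(), color=colors[i], linestyle='dashed', linewidth=2)
ax.legend()
plt.show()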
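The question also asks for the mean and standard deviation of each histogram in the legend. One way to sketch this is to compute both per group and format them into the label passed to hist. The data frame below is a hypothetical stand-in for the asker's data (a 0/1 column named feat and a numeric column named objective); the Agg backend line is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical stand-in data: 'feat' is categorical (0/1), 'objective' is numeric.
df = pd.DataFrame({
    "feat": np.repeat([0, 1], 500),
    "objective": np.concatenate([rng.normal(100, 8, 500),
                                 rng.normal(150, 15, 500)]),
})

fig, ax = plt.subplots()
for i, (value, group) in enumerate(df.groupby("feat")["objective"]):
    mean, std = group.mean(), group.std()
    color = f"C{i}"  # same color for the histogram and its mean line
    ax.hist(group, bins=15, alpha=0.5, density=True, color=color,
            label=f"{value}: mean={mean:.1f}, std={std:.1f}")
    ax.axvline(mean, color=color, linestyle="dashed", linewidth=2)
ax.set_title("objective")
ax.legend(loc="upper right")
# Call plt.show() or fig.savefig(...) here to display or save the figure.
```

Computing the mean inside the loop is what gives one vertical line per histogram; the original loop used the mean of the whole column, so both lines landed on the same x position.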

Plot Histogram on different axes

I am reading CSV file:
Notation Level RFResult PRIResult PDResult Total Result
AAA 1 1.23 0 2 3.23
AAA 1 3.4 1 0 4.4
BBB 2 0.26 1 1.42 2.68
BBB 2 0.73 1 1.3 3.03
CCC 3 0.30 0 2.73 3.03
DDD 4 0.25 1 1.50 2.75
AAA 5 0.25 1 1.50 2.75
FFF 6 0.26 1 1.42 2.68
...
...
Here is the code
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('home\NewFiles\Files.csv')
Notation = df['Notation']
Level = df['Level']
RFResult = df['RFResult']
PRIResult = df['PRIResult']
PDResult = df['PDResult']
fig, axes = plt.subplots(nrows=7, ncols=1)
ax1, ax2, ax3, ax4, ax5, ax6, ax7 = axes.flatten()
n_bins = 13
ax1.hist(df['Total'], n_bins, histtype='bar') # Currently this shows all Total Results in one plot
plt.show()
I want to show the Total Result of each Level on its own axes, as follows:
ax1 will show Level 1 Total Result
ax2 will show Level 2 Total Result
ax3 will show Level 3 Total Result
ax4 will show Level 4 Total Result
ax5 will show Level 5 Total Result
ax6 will show Level 6 Total Result
ax7 will show Level 7 Total Result
You can select a filtered part of a dataframe just by indexing: df[df['Level'] == level]['Total']. You can loop through the axes using for ax in axes.flatten(). To also get the index, use for ind, ax in enumerate(axes.flatten()). Note that Python starts counting from 0, so adding 1 to the index gives the level.
Note that when you have backslashes in a string, you can avoid having to escape them by using a raw string (r-string): r'home\NewFiles\Files.csv'.
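As a small illustration of the raw-string note, the three spellings below describe the same relative path. The path itself is the one from the question; the comparison through pathlib is only there to show the equivalence.

```python
from pathlib import PureWindowsPath

raw = r'home\NewFiles\Files.csv'        # raw string: backslashes are kept literally
escaped = 'home\\NewFiles\\Files.csv'   # regular string: each backslash is doubled
forward = 'home/NewFiles/Files.csv'     # forward slashes, also understood as separators

print(raw == escaped)
print(PureWindowsPath(forward) == PureWindowsPath(raw))
```

Both comparisons print True. Without the r prefix or the doubled backslashes, Python would try to interpret sequences such as \N as escape codes.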
The default ylim is from 0 to the maximum bar height, plus some padding. This can be changed for each ax separately. In the example below a list of ymax values is used to show the principle.
ax.grid(True, axis='both') turns the grid on for that ax. Instead of 'both', 'x' or 'y' can be used to show grid lines for only that axis. A grid line is drawn for each tick value. (The example below tries to use little space, so only a few gridlines are visible.)
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
N = 1000
df = pd.DataFrame({'Level': np.random.randint(1, 6, N), 'Total': np.random.uniform(1, 5, N)})
fig, axes = plt.subplots(nrows=5, ncols=1, sharex=True)
ymax_per_level = [27, 29, 28, 26, 27]
for ind, (ax, lev_ymax) in enumerate(zip(axes.flatten(), ymax_per_level)):
    level = ind + 1
    n_bins = 13
    ax.hist(df[df['Level'] == level]['Total'], bins=n_bins, histtype='bar')
    ax.set_ylabel(f'TL={level}') # to add the level in the ylabel
    ax.set_ylim(0, lev_ymax)
    ax.grid(True, axis='both')
plt.show()
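As an alternative to filtering with df[df['Level'] == level] for every level, the split can be sketched with groupby, which yields one sub-frame per level. The random data mirrors the example above and is not the asker's CSV; the dictionary name totals_per_level is made up for this sketch.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({'Level': rng.integers(1, 6, 1000),
                   'Total': rng.uniform(1, 5, 1000)})

# groupby yields (level, sub-frame) pairs, sorted by level by default
totals_per_level = {level: group['Total'] for level, group in df.groupby('Level')}

for level, totals in totals_per_level.items():
    print(level, len(totals))
```

Each Series in totals_per_level can be passed straight to ax.hist for the axes that belongs to that level.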
PS: A stacked histogram with custom legend and custom vertical lines could be created as:
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
import pandas as pd
import numpy as np
N = 1000
df = pd.DataFrame({'Level': np.random.randint(1, 6, N),
'RFResult': np.random.uniform(1, 5, N),
'PRIResult': np.random.uniform(1, 5, N),
'PDResult': np.random.uniform(1, 5, N)})
df['Total'] = df['RFResult'] + df['PRIResult'] + df['PDResult']
fig, axes = plt.subplots(nrows=5, ncols=1, sharex=True)
colors = ['crimson', 'limegreen', 'dodgerblue']
column_names = ['RFResult', 'PRIResult', 'PDResult']
level_vertical_line = [1, 2, 3, 4, 5]
for level, (ax, vertical_line) in enumerate(zip(axes.flatten(), level_vertical_line), start=1):
    n_bins = 13
    level_data = df[df['Level'] == level][column_names].to_numpy()
    # vertical_line = level_data.mean()
    ax.hist(level_data, bins=n_bins,
            histtype='bar', stacked=True, color=colors)
    ax.axvline(vertical_line, color='gold', ls=':', lw=2)
    ax.set_ylabel(f'TL={level}') # to add the level in the ylabel
    ax.margins(x=0.01)
    ax.grid(True, axis='both')
legend_handles = [Patch(color=color) for color in colors]
axes[0].legend(legend_handles, column_names, ncol=len(column_names), loc='lower center', bbox_to_anchor=(0.5, 1.02))
plt.show()

Matplotlib - Scatter Plot, How to fill in the space between each individual point?

I am plotting the following data:
fig, ax = plt.subplots()
im = ax.scatter(std_sorted[:, [1]], std_sorted[:, [2]], s=5, c=std_sorted[:, [0]])
With the following result:
My question is: can I fill the space between each point in the plot by extrapolating and then coloring that extrapolated space accordingly, so I get a uniform plot without any points?
So basically I'm looking for this result (This is simply me "squeezing" the above picture to show the desired result and not dealing with the space between the points):
The simplest thing to do in this case is to use a short vertical line as a marker and set the marker size large enough that no white space is left.
Another option is to use tricontourf to create a filled image of x, y and z.
Note that neither a scatter plot nor tricontourf need the points to be sorted in any order.
If you do have your points sorted into an ordered grid, plt.imshow should give the best result.
Here is some code to show what it could look like. First, some dummy data roughly similar to the example are generated. As the x, y values are random, they don't fill the complete space. This might leave some blank spots in the scatter plot. The spots are nicely interpolated by tricontourf, except possibly in the corners.
import numpy as np
import matplotlib.pyplot as plt
N = 50000
xmin = 0
xmax = 0.20
ymin = -0.01
ymax = 0.01
std_sorted = np.zeros((N, 3))
std_sorted[:,1] = np.random.uniform(xmin, xmax, N)
std_sorted[:,2] = np.random.choice(np.linspace(ymin, ymax, 80), N)
std_sorted[:,0] = np.cos(3*(std_sorted[:,1] - 0.04 - 100*std_sorted[:,2]**2))**10
fig, ax = plt.subplots(ncols=2)
# im = ax[0].scatter(std_sorted[:, 1], std_sorted[:, 2], s=20, c=std_sorted[:, 0], marker='|')
im = ax[0].scatter(std_sorted[:, 1], std_sorted[:, 2], s=5, c=std_sorted[:, 0], marker='.')
ax[0].set_xlim(xmin, xmax)
ax[0].set_ylim(ymin, ymax)
ax[0].set_title("scatter plot")
ax[1].tricontourf(std_sorted[:, 1], std_sorted[:, 2], std_sorted[:, 0], 256)
ax[1].set_title("tricontourf")
plt.tight_layout()
plt.show()
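The imshow option mentioned above is not covered by that code. A minimal sketch, assuming the values really do lie on a regular grid, could look as follows. The grid sizes (200 by 80) and the test function are assumptions chosen to resemble the dummy data; the Agg backend line only makes the sketch runnable without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

xmin, xmax, ymin, ymax = 0, 0.20, -0.01, 0.01
x = np.linspace(xmin, xmax, 200)
y = np.linspace(ymin, ymax, 80)
X, Y = np.meshgrid(x, y)  # both have shape (len(y), len(x)): rows are y, columns are x
Z = np.cos(3 * (X - 0.04 - 100 * Y**2))**10

fig, ax = plt.subplots()
# extent maps array indices to data coordinates; origin='lower' puts row 0 (ymin) at the bottom
im = ax.imshow(Z, extent=[xmin, xmax, ymin, ymax], origin='lower',
               aspect='auto', interpolation='bilinear')
fig.colorbar(im, ax=ax)
ax.set_title("imshow")
```

If the data arrive as three flat columns, they would first have to be sorted and reshaped into such a 2D array, which only works when every grid cell has exactly one value.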

How to plot values on a Python Basemap?

Is it possible to plot values on a basemap?
Let's say I have 3 lists of data.
lat = [50.3, 62.1, 41.4, ...]
lon = [12.4, 14.3, 3.5, ...]
val = [3, 5.4, 7.4, ...]
I've created a simple basemap:
def create_map(ax=None, lllon=6.00, lllat=47.0, urlon=16.00, urlat=55.10):
    m = Basemap(llcrnrlon=lllon, llcrnrlat=lllat,
                urcrnrlon=urlon, urcrnrlat=urlat,
                resolution='h',
                projection='tmerc',
                lon_0=(lllon+urlon)/2, lat_0=(lllat+urlat)/2)
    m.drawcoastlines()
    m.drawcountries()
    m.drawrivers()
    return m
Now I want to plot the values of the "val" list on this map according to their coordinates:
m = create_map()
x, y = m(lon,lat)
m.scatter(x, y, val) # something like that
plt.show()
I have already figured out that Basemap is unable to plot 3D values, but is there another way to achieve this?
The short, sweet, and simple answer to your first question is yes, you can plot using basemap (here's the documentation for it).
If you're looking to plot in 3d, there is documentation that explains how to plot using Basemap. Here's a simple script to get you started:
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
plt.close('all')
fig = plt.figure()
ax = fig.add_subplot(projection='3d')  # fig.gca(projection='3d') no longer works in recent matplotlib
extent = [-127, -65, 25, 51]
# make the map and axis.
m = Basemap(llcrnrlon=extent[0], llcrnrlat=extent[2],
urcrnrlon=extent[1], urcrnrlat=extent[3],
projection='cyl', resolution='l', fix_aspect=False, ax=ax)
ax.add_collection3d(m.drawcoastlines(linewidth=0.25))
ax.add_collection3d(m.drawcountries(linewidth=0.25))
ax.add_collection3d(m.drawstates(linewidth=0.25))
ax.view_init(azim = 230, elev = 15)
ax.set_xlabel(u'Longitude (°E)', labelpad=10)
ax.set_ylabel(u'Latitude (°N)', labelpad=10)
ax.set_zlabel(u'Altitude (ft)', labelpad=20)
# values to plot - change as needed. Plots 2 dots, one at elevation 0 and another 100.
# also draws a line between the two.
x, y = m(-85.4808, 32.6099)
ax.plot3D([x, x], [y, y], [0, 100], color = 'green', lw = 0.5)
ax.scatter3D(x, y, 100, s = 5, c = 'k', zorder = 4)
ax.scatter3D(x, y, 0, s = 2, c = 'k', zorder = 4)
ax.set_zlim(0., 400.)
plt.show()
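If the goal is simply to show val at each location rather than a true 3D plot, one option is to encode the value as color and annotate each point. The sketch below uses a plain matplotlib Axes so that it runs without Basemap installed. As far as I know, Basemap's scatter forwards its keyword arguments to matplotlib's scatter, so on the question's map the equivalent call would be m.scatter(x, y, c=val, cmap='viridis') after x, y = m(lon, lat); treat that as an assumption to verify. Only the three values shown in the question are used.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; drop this line for interactive use
import matplotlib.pyplot as plt

lat = [50.3, 62.1, 41.4]  # only the three values shown in the question
lon = [12.4, 14.3, 3.5]
val = [3, 5.4, 7.4]

fig, ax = plt.subplots()
sc = ax.scatter(lon, lat, c=val, cmap='viridis', s=60)  # color encodes val
fig.colorbar(sc, ax=ax, label='val')
for x, y, v in zip(lon, lat, val):
    # write the value next to its point, offset by a few pixels
    ax.annotate(str(v), (x, y), textcoords='offset points', xytext=(5, 5))
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
```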

Why does contourf (matplotlib) switch x and y coordinates?

I am trying to get contourf to plot my stuff right, but it seems to switch the x and y coordinates. In the example below, I show this by evaluating a 2d Gaussian function that has different widths in x and y directions. With the values given, the width in y direction should be larger. Here is the script:
from numpy import *
from matplotlib.pyplot import *
xMax = 50
xNum = 100
w0x = 10
w0y = 15
dx = xMax/xNum
xGrid = linspace(-xMax/2+dx/2, xMax/2-dx/2, xNum, endpoint=True)
yGrid = xGrid
Int = zeros((xNum, xNum))
for idX in range(xNum):
    for idY in range(xNum):
        Int[idX, idY] = exp(-((xGrid[idX]/w0x)**2 + (yGrid[idY]/(w0y))**2))
fig = figure(6)
clf()
ax = subplot(2,1,1)
X, Y = meshgrid(xGrid, yGrid)
contour(X, Y, Int, colors='k')
plot(array([-xMax, xMax])/2, array([0, 0]), '-b')
plot(array([0, 0]), array([-xMax, xMax])/2, '-r')
ax.set_aspect('equal')
xlabel("x")
ylabel("y")
subplot(2,1,2)
plot(xGrid, Int[:, int(xNum/2)], '-b', label='I(x, y=max/2)')
plot(xGrid, Int[int(xNum/2), :], '-r', label='I(x=max/2, y)')
ax.set_aspect('equal')
legend()
xlabel(r"x or y")
ylabel(r"I(x or y)")
The resulting figure is this:
On top is the contour plot, which has the larger width in the x direction (not y). Below it, two slices are shown: one along the x direction (at constant y=0, blue) and the other along the y direction (at constant x=0, red). Here everything seems fine: the y direction is broader than the x direction. So why would I have to transpose the array in order to have it plotted as I want? This seems unintuitive to me and not in agreement with the documentation.
It helps if you think of a 2D array's shape not as (x, y) but as (rows, columns), because that is how most math routines interpret them - including matplotlib's 2D plotting functions. Therefore, the first dimension is vertical (which you call y) and the second dimension is horizontal (which you call x).
Note that this convention is very prominent, even in numpy. The function np.vstack, which concatenates arrays vertically, works along the first dimension, and np.hstack works horizontally along the second dimension.
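A quick shape check shows the stacking convention; the array a is just a placeholder with 2 rows and 3 columns.

```python
import numpy as np

a = np.zeros((2, 3))  # 2 rows, 3 columns

print(np.vstack([a, a]).shape)  # (4, 3): rows (first dimension) are stacked
print(np.hstack([a, a]).shape)  # (2, 6): columns (second dimension) are stacked
```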
To illustrate the point:
import numpy as np
import matplotlib.pyplot as plt
a = np.array([[0, 0, 1, 0, 0],
[0, 1, 1, 1, 0],
[1, 1, 1, 1, 1]])
a[:, 2] = 2 # set column
print(a)
plt.imshow(a)
plt.contour(a, colors='k')
This prints
[[0 0 2 0 0]
[0 1 2 1 0]
[1 1 2 1 1]]
and consistently plots
According to your convention that an array is (x, y) the command a[:, 2] = 2 should have assigned to the third row, but numpy and matplotlib both agree that it was the column :)
You can of course use your own convention how to interpret the dimensions of your arrays, but in the long run it will be more consistent to treat them as (y, x).
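Applied to the question's Gaussian, the difference between the two conventions can be sketched on a small grid. The grid sizes (5 points in x, 3 in y) are chosen only so the shapes are easy to tell apart; w0x and w0y are the widths from the question.

```python
import numpy as np

x = np.linspace(-2, 2, 5)  # 5 points along x
y = np.linspace(-1, 1, 3)  # 3 points along y
w0x, w0y = 10, 15

# Question's convention: first index is x, second is y -> shape (len(x), len(y))
Z_xy = np.exp(-((x[:, None] / w0x)**2 + (y[None, :] / w0y)**2))

# Matplotlib's convention: rows are y, columns are x -> shape (len(y), len(x))
X, Y = np.meshgrid(x, y)  # default indexing='xy'
Z_rows_cols = np.exp(-((X / w0x)**2 + (Y / w0y)**2))

print(Z_xy.shape, Z_rows_cols.shape)
print(np.allclose(Z_xy.T, Z_rows_cols))
```

The array filled as Int[idX, idY] in the question corresponds to Z_xy, so it matches what contour(X, Y, ...) expects only after transposing. Building the array directly from the meshgrid output avoids the transpose.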

Resources