I am reading a CSV file:
Notation Level RFResult PRIResult PDResult Total Result
AAA 1 1.23 0 2 3.23
AAA 1 3.4 1 0 4.4
BBB 2 0.26 1 1.42 2.68
BBB 2 0.73 1 1.3 3.03
CCC 3 0.30 0 2.73 3.03
DDD 4 0.25 1 1.50 2.75
AAA 5 0.25 1 1.50 2.75
FFF 6 0.26 1 1.42 2.68
...
...
Here is the code
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('home\NewFiles\Files.csv')
Notation = df['Notation']
Level = df['Level']
RFResult = df['RFResult']
PRIResult = df['PRIResult']
PDResult = df['PDResult']
fig, axes = plt.subplots(nrows=7, ncols=1)
ax1, ax2, ax3, ax4, ax5, ax6, ax7 = axes.flatten()
n_bins = 13
ax1.hist(df['Total'], n_bins, histtype='bar')  # Currently this shows all Total Results in one plot
plt.show()
I want to show the Total Result of each Level on its own axes, as follows:
ax1 will show Level 1 Total Result
ax2 will show Level 2 Total Result
ax3 will show Level 3 Total Result
ax4 will show Level 4 Total Result
ax5 will show Level 5 Total Result
ax6 will show Level 6 Total Result
ax7 will show Level 7 Total Result
You can select a filtered part of a dataframe just by indexing: df[df['Level'] == level]['Total']. You can loop through the axes using for ax in axes.flatten(). To also get the index, use for ind, ax in enumerate(axes.flatten()). Note that Python starts counting from 0, so adding 1 to the index gives the level number.
Note that when you have backslashes in a string, you can escape them using an r-string: r'home\NewFiles\Files.csv'.
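As a small illustration of the r-string point (a sketch only; the variable names are made up for the example): in a normal string literal, \N starts a named-Unicode escape, so 'home\NewFiles\Files.csv' does not even compile. A raw string, or doubled backslashes, keeps the path intact.

```python
from pathlib import PureWindowsPath

# A raw string keeps every backslash literally.
raw_path = r'home\NewFiles\Files.csv'

# The same path written with doubled (escaped) backslashes.
escaped_path = 'home\\NewFiles\\Files.csv'

# Both literals spell exactly the same characters.
print(raw_path == escaped_path)

# PureWindowsPath splits on backslashes without touching the file system.
parts = PureWindowsPath(raw_path).parts
print(parts)
```

Forward slashes also work with pandas on Windows, which avoids the escaping question altogether.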
The default ylim is from 0 to the maximum bar height, plus some padding. This can be changed for each ax separately. In the example below a list of ymax values is used to show the principle.
ax.grid(True, axis='both') sets the grid on for that ax. Instead of 'both', 'x' or 'y' can also be used to show grid lines for only that axis. A grid line is drawn for each tick value. (The example below tries to use little space, so only a few gridlines are visible.)
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
N = 1000
df = pd.DataFrame({'Level': np.random.randint(1, 6, N), 'Total': np.random.uniform(1, 5, N)})
fig, axes = plt.subplots(nrows=5, ncols=1, sharex=True)
ymax_per_level = [27, 29, 28, 26, 27]
for ind, (ax, lev_ymax) in enumerate(zip(axes.flatten(), ymax_per_level)):
    level = ind + 1
    n_bins = 13
    ax.hist(df[df['Level'] == level]['Total'], bins=n_bins, histtype='bar')
    ax.set_ylabel(f'TL={level}')  # to add the level in the ylabel
    ax.set_ylim(0, lev_ymax)
    ax.grid(True, axis='both')
plt.show()
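As an alternative to computing the level from the loop index, groupby can hand you each level together with its rows. This is only a sketch: the rows are copied from the sample in the question, and 'Total' is assumed to be the name of the last column.

```python
import pandas as pd

# Rows copied from the question's sample; the column name 'Total' is an assumption.
df = pd.DataFrame({
    'Level': [1, 1, 2, 2, 3, 4, 5, 6],
    'Total': [3.23, 4.4, 2.68, 3.03, 3.03, 2.75, 2.75, 2.68],
})

# Boolean-mask selection, the same idea as df[df['Level'] == level]['Total'].
level_2_totals = df.loc[df['Level'] == 2, 'Total']

# groupby yields (level, subset) pairs, so no index arithmetic is needed.
totals_by_level = {level: group['Total'].tolist()
                   for level, group in df.groupby('Level')}
print(totals_by_level)
```

In the plotting loop this would look like for ax, (level, group) in zip(axes.flatten(), df.groupby('Level')), with ax.hist(group['Total'], ...) inside. Note that a level with no rows is simply skipped by groupby, so the axes and levels only line up if every level is present.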
PS: A stacked histogram with custom legend and custom vertical lines could be created as:
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
import pandas as pd
import numpy as np
N = 1000
df = pd.DataFrame({'Level': np.random.randint(1, 6, N),
'RFResult': np.random.uniform(1, 5, N),
'PRIResult': np.random.uniform(1, 5, N),
'PDResult': np.random.uniform(1, 5, N)})
df['Total'] = df['RFResult'] + df['PRIResult'] + df['PDResult']
fig, axes = plt.subplots(nrows=5, ncols=1, sharex=True)
colors = ['crimson', 'limegreen', 'dodgerblue']
column_names = ['RFResult', 'PRIResult', 'PDResult']
level_vertical_line = [1, 2, 3, 4, 5]
for level, (ax, vertical_line) in enumerate(zip(axes.flatten(), level_vertical_line), start=1):
    n_bins = 13
    level_data = df[df['Level'] == level][column_names].to_numpy()
    # vertical_line = level_data.mean()
    ax.hist(level_data, bins=n_bins,
            histtype='bar', stacked=True, color=colors)
    ax.axvline(vertical_line, color='gold', ls=':', lw=2)
    ax.set_ylabel(f'TL={level}')  # to add the level in the ylabel
    ax.margins(x=0.01)
    ax.grid(True, axis='both')
legend_handles = [Patch(color=color) for color in colors]
axes[0].legend(legend_handles, column_names, ncol=len(column_names), loc='lower center', bbox_to_anchor=(0.5, 1.02))
plt.show()
I have generated this scatter plot via the plotting of the first two PCA elements from a feature extraction...PCA1 and PCA2.
The plot shown above is for 3 classes, with PCA1 (x-axis) vs PCA2 (y-axis). I generated the plot as follows:
target_names = ['class_1', 'class_2', 'class_3']
plt.figure(figsize=(11, 8))
Xt = pca.fit_transform(X)
plot = plt.scatter(Xt[:,0], Xt[:,1], c=y, cmap=plt.cm.jet,
s=30, linewidths=0, alpha=0.7)
#centers = kmeans.cluster_centers_
#plt.scatter(centers[:, 0], centers[:, 1], c=['black', 'green', 'red'], marker='^', s=100, #alpha=0.5);
plt.legend(handles=plot.legend_elements()[0], labels=list(target_names))
plt.show()
I wanted to know how to correctly get the centroid of each of the classes from the plot.
Here are the first few rows of the data:
Xt1 Xt2 y
-107.988187 -23.70121 1
-128.578852 -20.222378 1
-124.522967 -25.298283 1
-96.222918 -25.028239 1
-95.152954 -23.94496 1
-113.275804 -26.563129 1
-101.803 -24.22359 1
-94.662469 -22.94211 1
-104.118882 -24.037226 1
439.765098 -101.532469 2
50.100362 -34.278841 2
-69.229603 62.178599 2
-60.915475 53.296491 2
64.797364 91.991527 2
-112.815192 0.263505 0
-91.287067 -25.207217 0
-74.181941 -2.457892 0
-83.273718 -0.608004 0
-100.881393 -22.387571 0
-107.861711 -15.848869 0
-85.866992 -18.79126 0
-53.96314 -28.885316 0
-59.195432 -3.373361 0
Any help will be greatly appreciated.
Assuming that y is an array of labels corresponding to the rows of X (and therefore Xt), we can create a data frame with Xt[:, :2] and y and then use groupby('y') to aggregate the mean values for Xt[:, 0] and Xt[:, 1] for each value of y:
import pandas as pd
df = pd.DataFrame(Xt[:, :2], columns=['Xt1', 'Xt2'])
df['y'] = y
df.groupby('y').mean()
This will produce the means of Xt[:, 0] and Xt[:, 1] for each label in y, which are the centroid coordinates of each label in y in the first two principal components of the data.
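For a version that runs on its own, here is a sketch using a handful of rows from the sample in the question. Xt stands in for the output of pca.fit_transform(X), and the NumPy cross-check is only there to show that the groupby mean and a masked mean agree.

```python
import numpy as np
import pandas as pd

# A few rows from the question's sample; Xt stands in for pca.fit_transform(X).
Xt = np.array([
    [-107.988187, -23.701210],
    [-128.578852, -20.222378],
    [439.765098, -101.532469],
    [50.100362, -34.278841],
    [-112.815192, 0.263505],
    [-91.287067, -25.207217],
])
y = np.array([1, 1, 2, 2, 0, 0])

df = pd.DataFrame(Xt[:, :2], columns=['Xt1', 'Xt2'])
df['y'] = y

# One row per label; the columns hold the centroid coordinates.
centroids = df.groupby('y').mean()
print(centroids)

# Cross-check one class against a plain NumPy mean over the masked rows.
class_1_centroid = Xt[y == 1].mean(axis=0)
```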
With the snippet of data that the OP provided, the following script computes the centroids and overlays them on the plot as 'X's of the same color as the data:
df = pd.DataFrame(Xt[:, :2], columns=['Xt1', 'Xt2'])
df['y'] = y
df_centroid = df.groupby('y').mean().reset_index()
target_names = ['class_1', 'class_2', 'class_3']
plt.figure(figsize=(11, 8))
plot = plt.scatter(Xt[:, 0], Xt[:, 1], c=y, cmap=plt.cm.jet,
s=30, linewidths=0, alpha=0.5)
# Overlays the centroids on the plot as 'X'
plt.scatter(df_centroid.Xt1, df_centroid.Xt2, marker='x', s=60,
c=df_centroid.y, cmap=plt.cm.jet)
plt.legend(handles=plot.legend_elements()[0], labels=list(target_names))
plt.show()
I'm trying to draw with matplotlib two average vertical line for every overlapping histograms using a loop. I have managed to draw the first one, but I don't know how to draw the second one. I'm using two variables from a dataset to draw the histograms. One variable (feat) is categorical (0 - 1), and the other one (objective) is numerical. The code is the following:
for chas in df[feat].unique():
    plt.hist(df.loc[df[feat] == chas, objective], bins = 15, alpha = 0.5, density = True, label = chas)
plt.axvline(df[objective].mean(), linestyle = 'dashed', linewidth = 2)
plt.title(objective)
plt.legend(loc = 'upper right')
I also have to add to the legend the mean and standard deviation values for each histogram.
How can I do it? Thank you in advance.
I recommend using axes to plot your figure. Please see the code below and the artist tutorial.
import numpy as np
import matplotlib.pyplot as plt
# Fixing random state for reproducibility
np.random.seed(19680801)
mu1, sigma1 = 100, 8
mu2, sigma2 = 150, 15
x1 = mu1 + sigma1 * np.random.randn(10000)
x2 = mu2 + sigma2 * np.random.randn(10000)
fig, ax = plt.subplots(1, 1, figsize=(7.2, 7.2))
# the histogram of the data
lbs = ['a', 'b']
colors = ['r', 'g']
for i, x in enumerate([x1, x2]):
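    n, bins, patches = ax.hist(x, 50, density=True, facecolor=colors[i], alpha=0.75, label=lbs[i])
    # draw the line at the mean of the data (bins.mean() would only give the midpoint of the bin edges)
    ax.axvline(x.mean(), color=colors[i], linestyle='dashed', linewidth=2)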
ax.legend()
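The question also asks for the mean and standard deviation in the legend. One way to get there is to build the label text before calling hist; the sketch below assumes synthetic data like the example above, and the Agg backend line is only there so it runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: headless run; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
samples = {'a': rng.normal(100, 8, 10_000), 'b': rng.normal(150, 15, 10_000)}
colors = {'a': 'r', 'b': 'g'}

fig, ax = plt.subplots()
for name, x in samples.items():
    # Put the statistics into the label so they show up in the legend.
    label = f'{name}: mean={x.mean():.1f}, std={x.std():.1f}'
    ax.hist(x, bins=50, density=True, color=colors[name], alpha=0.5, label=label)
    # One dashed line per histogram, at that histogram's own mean.
    ax.axvline(x.mean(), color=colors[name], linestyle='dashed', linewidth=2)

legend = ax.legend(loc='upper right')
legend_texts = [t.get_text() for t in legend.get_texts()]
```

With the dataframe from the question, x would be df.loc[df[feat] == chas, objective] and name would be chas.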
I have a dataset that represents male and female in binary. Males are represented as 0 while females are represented as 1. What I hope to do is to change 0 to Male and 1 to Female in the plot legend. I tried to follow this post, but it didn't work out.
It gives me an error message that looks like this:
AttributeError Traceback (most recent call last)
<ipython-input-11-b3c99d4311ab> in <module>
23 # plot the legend
24 plt.legend()
---> 25 legend = g._legend
26 new_labels = ['Female', 'Male']
27 for t, l in zip(legend.texts, new_labels): t.set_text(l)
AttributeError: 'AxesSubplot' object has no attribute '_legend'
This is how my current code looks:
## store them in different variable names
X = salary['years']
y = salary['salary']
g = salary['gender']
# prepare the scatterplot
sns.set()
plt.figure(figsize=(10,10))
g = sns.scatterplot(x=salary.years, y=salary.salary, data=salary, hue='gender')
# equations of the models
model1 = 50 + 2.776962335386217*X
model2 = 60.019802 + 2.214645*X
model3_male = 60.014922 + 2.179305*X + 1.040140*1
model3_female = 60.014922 + 2.179305*X + 1.040140*0
# plot the scatterplots
plt.plot(X, model1, color='r', label='Model 1')
plt.plot(X, model2, color='g', label='Model 2')
plt.plot(X, model3_male, color='b', label='Model 3(Male)')
plt.plot(X, model3_female, color='y', label='Model 3(Female)')
# plot the legend
plt.legend()
legend = g._legend
new_labels = ['Female', 'Male']
for t, l in zip(legend.texts, new_labels): t.set_text(l)
# set the title
plt.title('Scatterplot of salary and model fits')
plt.show()
I don't have your data, so I generated some of my own:
gender salary years
male 40000 1
male 32000 2
male 45000 3
male 54000 4
female 72000 5
female 62000 6
female 92000 7
female 55000 8
female 35000 9
female 48000 10
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
salary = pd.read_csv("1.csv", sep=r"\s+")  # whitespace-separated; delim_whitespace is deprecated in recent pandas
print(salary)
X = salary['years']
y = salary['salary']
g = salary['gender']
# prepare the scatterplot
sns.set()
plt.figure(figsize=(10,10))
g = sns.scatterplot(x=salary.years, y=salary.salary, data=salary, hue='gender')
# equations of the models
model1 = 50 + 2.776962335386217*X
model2 = 60.019802 + 2.214645*X
model3_male = 60.014922 + 2.179305*X + 1.040140*1
model3_female = 60.014922 + 2.179305*X + 1.040140*0
# plot the scatterplots
plt.plot(X, model1, color='r', label='Model 1')
plt.plot(X, model2, color='g', label='Model 2')
plt.plot(X, model3_male, color='b', label='Model 3(Male)')
plt.plot(X, model3_female, color='y', label='Model 3(Female)')
# plot the legend
plt.legend()
# set the title
plt.title('Scatterplot of salary and model fits')
plt.show()
It works fine. So I guess values in your gender column are 0 or 1. In that case, you can do the following before g = salary['gender'] to replace 0 with male and 1 with female:
salary['gender'] = salary['gender'].map({1: 'female', 0: 'male'})
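A quick sketch of what that mapping does, on made-up rows (the values here are illustrative, not your data). Any code missing from the dictionary turns into NaN, so it is worth checking for that afterwards.

```python
import pandas as pd

# Made-up rows; gender is coded 0/1 as described in the question.
salary = pd.DataFrame({
    'gender': [0, 1, 1, 0],
    'years': [1, 2, 3, 4],
    'salary': [40000, 72000, 62000, 32000],
})

gender_names = {0: 'male', 1: 'female'}
salary['gender'] = salary['gender'].map(gender_names)

# Codes that are not keys of the dictionary would become NaN.
unmapped = int(salary['gender'].isna().sum())
print(salary)
```

Since hue='gender' now sees strings, seaborn uses them as legend labels directly and no legend text needs to be patched afterwards.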
Back to your error:
---> 25 legend = g._legend
26 new_labels = ['Female', 'Male']
27 for t, l in zip(legend.texts, new_labels): t.set_text(l)
AttributeError: 'AxesSubplot' object has no attribute '_legend'
g returned by sns.scatterplot is an instance of matplotlib.axes.Axes. To get the legend object from it, you need to use ax.get_legend() or ax.legend() rather than ax._legend. You can follow the official Legend guide documentation.
legend = g.legend()
new_labels = ['Female', 'Male']
for t, l in zip(legend.texts[-2:], new_labels): t.set_text(l)
I need to highlight a specific point in each boxplot. For example, I want to highlight the point where petal_width is 0.8 in a boxplot chart for petal_length for each species.
Here is the example:
iris = sns.load_dataset('iris')
##Create three points where petal_width is 0.8 for each species
iris_2 = pd.DataFrame({'sepal_length':Series([1,2,3],dtype='float32'), 'sepal_width':Series([1.1,2.1,3.1],dtype='float32'),
'petal_length':Series([1,2,3],dtype='float32'), 'petal_width':Series([0.8,0.8,0.8],dtype='float32'),
'species':Series(['setosa','versicolor','virginica'])})
iris_all = pd.concat([iris, iris_2]).reset_index(drop = True)
sns.boxplot(x='species', y = 'petal_length', data = iris_all)
sns.regplot(x= iris_all['species'][iris_all['petal_width'] == 0.8],
y= iris_all['petal_length'][iris_all['petal_width'] == 0.8], scatter=True, fit_reg=False, marker='o',
scatter_kws={"s": 100})
But the code doesn't work. I wonder how I can correct it. Thanks.
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
iris = sns.load_dataset('iris')
# Create three points where petal_width is 0.8 for each species
iris_2 = pd.DataFrame(
{'sepal_length': pd.Series([1, 2, 3], dtype='float32'), 'sepal_width': pd.Series([1.1, 2.1, 3.1], dtype='float32'),
'petal_length': pd.Series([1, 2, 3], dtype='float32'), 'petal_width': pd.Series([0.8, 0.8, 0.8], dtype='float32'),
'species': pd.Series(['setosa', 'versicolor', 'virginica'])})
iris_all = pd.concat([iris, iris_2]).reset_index(drop=True)
sns.boxplot(x='species', y='petal_length', data=iris_all)
sns.regplot(x=iris_all['species'][(iris_all['petal_width'] > 0.79) & (iris_all['petal_width'] < 0.81)],
y=iris_all['petal_length'][(iris_all['petal_width'] > 0.79) & (iris_all['petal_width'] < 0.81)],
color='blue',
scatter=True, fit_reg=False,
marker='+',
scatter_kws={"s": 100})
plt.show()
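One likely reason the code above compares against a narrow range instead of == 0.8: the extra rows are created as float32, and once they are concatenated with float64 data the stored value is no longer exactly 0.8. A minimal sketch of that effect (the series names are made up, and np.isclose is shown as one possible alternative to the range test):

```python
import numpy as np
import pandas as pd

widths32 = pd.Series([0.8, 0.8, 0.8], dtype='float32')
widths64 = pd.Series([0.2, 1.3, 2.5], dtype='float64')

# Concatenating float64 and float32 upcasts everything to float64,
# but the float32 rounding error of 0.8 is carried along.
all_widths = pd.concat([widths64, widths32], ignore_index=True)

# Exact comparison misses the upcast values.
exact_matches = int((all_widths == 0.8).sum())

# A tolerance-based comparison finds them.
close_matches = int(np.isclose(all_widths, 0.8, atol=1e-6).sum())

print(exact_matches, close_matches)
```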
I want to draw a simple choropleth map of NYC with binned # of yellow cab rides. My gpd.DataFrame looks like this:
bin cnt shape
0 15 1 POLYGON ((-74.25559 40.62194, -74.24448 40.621...
1 16 1 POLYGON ((-74.25559 40.63033, -74.24448 40.630...
2 25 1 POLYGON ((-74.25559 40.70582, -74.24448 40.705...
3 27 1 POLYGON ((-74.25559 40.72260, -74.24448 40.722...
4 32 12 POLYGON ((-74.25559 40.76454, -74.24448 40.764...
where bin is the region number, cnt is the target variable of my plot, and the shape column is a series of shapely rectangles that together cover the whole of New York.
Drawing NYC from shapefile:
usa = gpd.read_file('shapefiles/gadm36_USA_2.shp')[['NAME_1', 'NAME_2', 'geometry']]
nyc = usa[usa.NAME_1 == 'New York']
ax = plt.axes([0, 0, 2, 2], projection=ccrs.PlateCarree())
ax.set_extent([-74.25559, -73.70001, 40.49612, 40.91553], ccrs.Geodetic())
ax.add_geometries(nyc.geometry.values,
ccrs.PlateCarree(),
facecolor='#1A237E');
Drawing choropleth alone works fine:
gdf.plot(column='cnt',
cmap='inferno',
scheme='natural_breaks', k=10,
legend=True)
But if I put ax parameter:
gdf.plot(ax=ax, ...)
the output is
<Figure size 432x288 with 0 Axes>
EDIT:
Got it working with following code:
from matplotlib.colors import ListedColormap
cmap = plt.get_cmap('summer')
my_cmap = cmap(np.arange(cmap.N))
my_cmap[:,-1] = np.full((cmap.N, ), 0.75)
my_cmap = ListedColormap(my_cmap)
gax = gdf.plot(column='cnt',
cmap=my_cmap,
scheme='natural_breaks', k=10,
figsize=(16,10),
legend=True,
legend_kwds=dict(loc='best'))
gax.set_title('# of yellow cab rides in NYC', fontdict={'fontsize': 20}, loc='center');
nyc.plot(ax=gax,
color='#141414',
zorder=0)
gax.set_xlim(-74.25559, -73.70001)
gax.set_ylim(40.49612, 40.91553)
When this is done only with .plot calls from geopandas, it seems to work fine. I had to make up some data as I don't have yours. Let me know if this helps. The code example should work as is in IPython.
%matplotlib inline
import geopandas as gpd
import numpy as np
from shapely.geometry import Polygon
from random import random
crs = {'init': 'epsg:4326'}
num_squares = 10
# load natural earth shapes
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
# create random choropleth
minx, miny, maxx, maxy = world.geometry.total_bounds
x_coords = np.linspace(minx, maxx, num_squares+1)
y_coords = np.linspace(miny, maxy, num_squares+1)
polygons = [Polygon([[x_coords[i], y_coords[j]],
[x_coords[i+1], y_coords[j]],
[x_coords[i+1], y_coords[j+1]],
[x_coords[i], y_coords[j+1]]]) for i in
range(num_squares) for j in range(num_squares)]
vals = [random() for i in range(num_squares) for j in range(num_squares)]
choro_gdf = gpd.GeoDataFrame({'cnt' : vals, 'geometry' : polygons})
choro_gdf.crs = crs
# now plot both together
ax = choro_gdf.plot(column='cnt',
cmap='inferno',
scheme='natural_breaks', k=10,
#legend=True
)
world.plot(ax=ax)
This should give you something like the following
Edit: if you're worried about setting the correct limits (as you're doing with the boroughs), append the following to the end of the code, for example:
ax.set_xlim(0, 50)
ax.set_ylim(0, 25)
This should then give you the same plot cropped to those limits.